This test checks whether models follow conflicting instructions correctly: the system prompt asks for numbers only, while the user prompt requests words. Higher scores indicate better instruction following.
The score represents how well each model followed the system instruction to respond in numbers only.
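
The scoring method is not specified here, so the following is only a minimal sketch of one way a "numbers only" check could be scored. The function name `numeric_only_score` and the token-fraction metric are assumptions made for illustration, not the benchmark's actual implementation.

```python
import re

# Hypothetical pattern: an optionally signed number with optional
# decimal or thousands separators, e.g. "42", "-7", "1,000.5".
_NUMERIC = re.compile(r"[-+]?\d+(?:[.,]\d+)*")


def numeric_only_score(response: str) -> float:
    """Return the fraction of whitespace-separated tokens that are numeric.

    Assumed metric, for illustration only: 1.0 means every token is a
    number, 0.0 means none are (or the response is empty).
    """
    tokens = response.split()
    if not tokens:
        return 0.0
    # Strip punctuation at the token edges so "42." still counts as a number.
    hits = sum(1 for t in tokens if _NUMERIC.fullmatch(t.strip(".,;:")))
    return hits / len(tokens)
```

Under this assumed metric, a reply of `"42"` scores 1.0, `"forty two"` scores 0.0, and a mixed reply such as `"3 three"` scores 0.5.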
| Model | Score | Pass/Fail | Latency (ms) | Total Tokens |
|---|---|---|---|---|