Instruction Following Test Results

This test checks whether models follow a system instruction when the user prompt conflicts with it: the system prompt asks for numbers only, while the user prompt requests words. Higher scores indicate better instruction following.
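The page does not say how the score is computed, so the following is only a minimal sketch of one way a "numbers only" check could be scored. The function names `numbers_only_score` and `passes`, the token-fraction metric, and the default pass threshold of 1.0 are all assumptions made for illustration, not the method used to produce these results.

```python
import re

# Hypothetical scorer: the metric and threshold here are assumptions,
# not the scoring method behind the reported results.
NUMBER_TOKEN = re.compile(r"[+-]?\d+(?:[.,]\d+)*")


def numbers_only_score(response: str) -> float:
    """Return the fraction of whitespace-separated tokens that are numeric."""
    tokens = response.split()
    if not tokens:
        # An empty reply contains no numbers, so it scores zero.
        return 0.0
    numeric = sum(
        1
        for token in tokens
        # Ignore trailing or leading punctuation such as "3," or "4."
        if NUMBER_TOKEN.fullmatch(token.strip(".,;:!?"))
    )
    return numeric / len(tokens)


def passes(response: str, threshold: float = 1.0) -> bool:
    """Pass only if the score reaches the (assumed) threshold."""
    return numbers_only_score(response) >= threshold
```

Under this sketch, a reply that is entirely digits scores 1.0, a reply that spells the numbers out as words scores 0.0, and mixed replies fall in between. A stricter or more lenient grader (for example, one judged by another model) would be an equally plausible design.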

Model Performance

The score represents how well each model followed the system instruction to respond in numbers only.

Detailed Results
Model | Score | Pass/Fail | Latency (ms) | Total Tokens