Instruction Following Test Results

This test checks whether models follow a system instruction when the user prompt conflicts with it: the system prompt asks for numbers only, while the user prompt requests words. Higher scores indicate better instruction following.

Model Performance

The score represents how well each model followed the system instruction to respond in numbers only.
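The results here do not state how the score is computed. As an illustration only, one way such a score might be derived is the fraction of whitespace-separated tokens in a response that are purely numeric. The names numbers_only_score and passes, and the pass threshold of 1.0, are assumptions made for this sketch, not the method actually used.

```python
import re

# Assumption: a token counts as "numeric" if it is an optionally signed
# integer or a decimal/grouped number such as 3.14 or 1,000.
NUMERIC_TOKEN = re.compile(r"[+-]?\d+(?:[.,]\d+)*")


def numbers_only_score(response: str) -> float:
    """Return the fraction of whitespace-separated tokens that are numeric."""
    tokens = response.split()
    if not tokens:
        # An empty response contains no numbers, so it scores zero here.
        return 0.0
    numeric = sum(1 for token in tokens if NUMERIC_TOKEN.fullmatch(token))
    return numeric / len(tokens)


def passes(response: str, threshold: float = 1.0) -> bool:
    """Hypothetical pass rule: the score must reach the threshold."""
    return numbers_only_score(response) >= threshold
```

Under this sketch a response such as "1 2 3" would score 1.0 and pass, while "one two three" would score 0.0 and fail. A stricter or more lenient rule (for example, ignoring punctuation) would change the scores.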

Detailed Results

Model	Score	Pass/Fail	Latency (ms)	Total Tokens