| # | Model | Developer | Type | MMLU | HumanEval | MATH | MT-Bench | BBH | Overall |
|---|-------|-----------|------|------|-----------|------|----------|-----|---------|
| 🥇 | o3 (high compute) | OpenAI | Closed | 91.4 | 97.9 | 96.7 | 9.6 | 88.4 | 94.8 |
| 🥈 | Claude 3.5 Sonnet (2410) | Anthropic | Closed | 88.3 | 92.0 | 71.1 | 9.4 | 87.5 | 87.7 |
| 🥉 | GPT-4o (2024-11-20) | OpenAI | Closed | 88.7 | 90.2 | 76.6 | 9.1 | 84.1 | 85.9 |
| 4 | DeepSeek-R1 | DeepSeek | Open | 90.8 | 92.3 | 97.3 | 9.0 | 86.7 | 85.4 |
| 5 | Gemini 1.5 Ultra | Google DeepMind | Closed | 90.0 | 84.1 | 58.5 | 9.1 | 83.2 | 83.0 |
| 6 | Grok-2 | xAI | Closed | 87.5 | 88.4 | 76.1 | 8.9 | 82.0 | 82.4 |
| 7 | Llama 3.1 405B | Meta | Open | 87.3 | 89.0 | 73.8 | 8.9 | 81.3 | 80.9 |
| 8 | Mistral Large 2 | Mistral AI | Open | 84.0 | 92.1 | 69.9 | 8.6 | 80.1 | 79.5 |
| 9 | Qwen2.5 72B Instruct | Alibaba | Open | 86.1 | 86.2 | 83.1 | 8.8 | 79.4 | 78.8 |
| 10 | Apple AFM (Server) | Apple | Closed | N/A | N/A | N/A | N/A | N/A | Not publicly benchmarked |
| 11 | Command R+ | Cohere | Open | 75.7 | 69.3 | N/A | 8.4 | 74.2 | 72.4 |
| 12 | Apple AFM (On-Device 3B) | Apple | Closed | 60.9 | N/A | N/A | N/A | N/A | On-device: limited benchmarks |

⚠️ Scores reflect public evaluations as of March 2025. MT-Bench is scored out of 10; all other benchmarks are percentages. Some models (Apple AFM, certain Gemini variants) have not released full benchmark results. "Overall" is a weighted composite of the available scores. Raw JSON is available at /v2/models.
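
For anyone consuming the raw feed, here is a minimal sketch of pulling /v2/models and recomputing a composite locally. Only the /v2/models path comes from the note above; the base URL, response schema, field names, and weights are illustrative assumptions, not the leaderboard's actual weighting.

```python
# Hypothetical sketch: fetch the leaderboard's raw JSON and recompute a
# weighted composite. Base URL, schema, and weights are assumptions.
import requests

BASE_URL = "https://example-leaderboard.dev"  # assumed host for /v2/models

# Assumed per-benchmark weights; the real composite weighting is not published here.
WEIGHTS = {"mmlu": 0.25, "humaneval": 0.20, "math": 0.20, "mt_bench": 0.15, "bbh": 0.20}


def composite(scores: dict) -> float | None:
    """Weighted mean over whichever benchmarks a model actually reports.

    MT-Bench is on a 0-10 scale, so it is rescaled to 0-100 before mixing
    (an assumption about how scales are normalized in the composite).
    """
    total, weight_sum = 0.0, 0.0
    for key, weight in WEIGHTS.items():
        value = scores.get(key)
        if value is None:
            continue  # skip benchmarks the model has not reported (N/A rows)
        if key == "mt_bench":
            value *= 10  # e.g. 9.6 -> 96.0
        total += weight * value
        weight_sum += weight
    return round(total / weight_sum, 1) if weight_sum else None


resp = requests.get(f"{BASE_URL}/v2/models", timeout=10)
resp.raise_for_status()
for model in resp.json():  # assumed: list of {"name": ..., "scores": {...}} objects
    print(model["name"], composite(model.get("scores", {})))
```

Models with missing benchmarks (e.g. the Apple AFM rows) simply drop those terms and renormalize over the weights that remain, which mirrors how a composite "of available scores" would behave.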