# AI Model Leaderboard 2025
Ranked by composite score across MMLU, HumanEval, MATH, MT-Bench, and BBH. Scores are community-verified against official papers and trusted third-party evaluations.
🟢 Last updated: March 1, 2025 · 52 models tracked
| # | Model | Type | MMLU | HumanEval | MATH | MT-Bench | BBH | Overall |
|---|---|---|---|---|---|---|---|---|
| 🥇 | o3 (high compute) · OpenAI | Closed | 91.4 | 97.9 | 96.7 | 9.6 | 88.4 | |
| 🥈 | Claude 3.5 Sonnet (2410) · Anthropic | Closed | 88.3 | 92.0 | 71.1 | 9.4 | 87.5 | |
| 🥉 | GPT-4o (2024-11-20) · OpenAI | Closed | 88.7 | 90.2 | 76.6 | 9.1 | 84.1 | |
| 4 | DeepSeek-R1 · DeepSeek | Open | 90.8 | 92.3 | 97.3 | 9.0 | 86.7 | |
| 5 | Gemini 1.5 Ultra · Google DeepMind | Closed | 90.0 | 84.1 | 58.5 | 9.1 | 83.2 | |
| 6 | Grok-2 · xAI | Closed | 87.5 | 88.4 | 76.1 | 8.9 | 82.0 | |
| 7 | Llama 3.1 405B · Meta | Open | 87.3 | 89.0 | 73.8 | 8.9 | 81.3 | |
| 8 | Mistral Large 2 · Mistral AI | Open | 84.0 | 92.1 | 69.9 | 8.6 | 80.1 | |
| 9 | Qwen2.5 72B Instruct · Alibaba | Open | 86.1 | 86.2 | 83.1 | 8.8 | 79.4 | |
| 10 | Apple AFM (Server) · Apple | Closed | N/A | N/A | N/A | N/A | N/A | Not publicly benchmarked |
| 11 | Command R+ · Cohere | Open | 75.7 | 69.3 | N/A | 8.4 | 74.2 | |
| 12 | Apple AFM (On-Device 3B) · Apple | Closed | 60.9 | N/A | N/A | N/A | N/A | On-device: limited benchmarks |
⚠️ Scores reflect public evaluations as of March 2025. Some models (Apple AFM, certain Gemini variants) have not released full benchmark results. "Overall" is a weighted composite of available scores. Raw JSON available at /v2/models.