| # | Model | Developer | Type | MMLU | HumanEval | MATH | MT-Bench | BBH | Overall |
|---|-------|-----------|------|------|-----------|------|----------|-----|---------|
| 🥇 | o3 (high compute) | OpenAI | Closed | 91.4 | 97.9 | 96.7 | 9.6 | 88.4 | 94.8 |
| 🥈 | Claude 3.5 Sonnet (2410) | Anthropic | Closed | 88.3 | 92.0 | 71.1 | 9.4 | 87.5 | 87.7 |
| 🥉 | GPT-4o (2024-11-20) | OpenAI | Closed | 88.7 | 90.2 | 76.6 | 9.1 | 84.1 | 85.9 |
| 4 | DeepSeek-R1 | DeepSeek | Open | 90.8 | 92.3 | 97.3 | 9.0 | 86.7 | 85.4 |
| 5 | Gemini 1.5 Ultra | Google DeepMind | Closed | 90.0 | 84.1 | 58.5 | 9.1 | 83.2 | 83.0 |
| 6 | Grok-2 | xAI | Closed | 87.5 | 88.4 | 76.1 | 8.9 | 82.0 | 82.4 |
| 7 | Llama 3.1 405B | Meta | Open | 87.3 | 89.0 | 73.8 | 8.9 | 81.3 | 80.9 |
| 8 | Mistral Large 2 | Mistral AI | Open | 84.0 | 92.1 | 69.9 | 8.6 | 80.1 | 79.5 |
| 9 | Qwen2.5 72B Instruct | Alibaba | Open | 86.1 | 86.2 | 83.1 | 8.8 | 79.4 | 78.8 |
| 10 | Apple AFM (Server) | Apple | Closed | N/A | N/A | N/A | N/A | N/A | Not publicly benchmarked |
| 11 | Command R+ | Cohere | Open | 75.7 | 69.3 | N/A | 8.4 | 74.2 | 72.4 |
| 12 | Apple AFM (On-Device 3B) | Apple | Closed | 60.9 | N/A | N/A | N/A | N/A | On-device: limited benchmarks |

⚠️ Scores reflect public evaluations as of March 2025. MT-Bench is scored out of 10; all other benchmarks are percentages. Some models (Apple AFM, certain Gemini variants) have not released full benchmark results. "Overall" is a weighted composite of the available scores. Raw JSON is available at /v2/models.
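
For anyone consuming the raw feed, here is a minimal sketch of pulling /v2/models and recomputing a composite locally. Only the /v2/models path comes from the note above; the base URL, response schema, field names, and weights are illustrative assumptions, not the leaderboard's actual weighting.

```python
# Hypothetical sketch: fetch the leaderboard's raw JSON and recompute a
# weighted composite. Base URL, schema, and weights are assumptions.
import requests

BASE_URL = "https://example-leaderboard.dev"  # assumed host for /v2/models

# Assumed per-benchmark weights; the real composite weighting is not published here.
WEIGHTS = {"mmlu": 0.25, "humaneval": 0.20, "math": 0.20, "mt_bench": 0.15, "bbh": 0.20}


def composite(scores: dict) -> float | None:
    """Weighted mean over whichever benchmarks a model actually reports.

    MT-Bench is on a 0-10 scale, so it is rescaled to 0-100 before mixing
    (an assumption about how scales are normalized in the composite).
    """
    total, weight_sum = 0.0, 0.0
    for key, weight in WEIGHTS.items():
        value = scores.get(key)
        if value is None:
            continue  # skip benchmarks the model has not reported (N/A rows)
        if key == "mt_bench":
            value *= 10  # e.g. 9.6 -> 96.0
        total += weight * value
        weight_sum += weight
    return round(total / weight_sum, 1) if weight_sum else None


resp = requests.get(f"{BASE_URL}/v2/models", timeout=10)
resp.raise_for_status()
for model in resp.json():  # assumed: list of {"name": ..., "scores": {...}} objects
    print(model["name"], composite(model.get("scores", {})))
```

Models with missing benchmarks (e.g. the Apple AFM rows) simply drop those terms and renormalize over the weights that remain, which mirrors how a composite "of available scores" would behave.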