
Synthetic Data vs. Web Data: Which Makes Better LLMs?

Published December 20, 2024 · AI Megacity Research Team · 14 min read
Synthetic Data · Web Data · LLM Training · Phi-3 · Apple AFM · Data Quality

1. The Great Data Debate

The AI industry faces a fundamental tension: the best training data is expensive, legally complicated, and finite — while the demand for training data grows exponentially with each model generation. This tension has driven a seismic shift toward synthetic data: training data generated by AI models themselves rather than collected from the web.

Microsoft's Phi-3, Apple's AFM foundation models, and Google's Gemma all use synthetic data as a significant portion of their training corpora. But does AI-generated training data actually produce better models than the messy, diverse, organic data of the web? We synthesized findings from 14 published studies and 3 unreleased internal reports to build a comprehensive answer.

TL;DR

Synthetic data wins on reasoning, instruction following, and code generation. Web data wins on factual diversity, long-tail knowledge, and cultural/linguistic breadth. The best models use a carefully calibrated mix of both.

2. What Is Synthetic Training Data?

Synthetic data for LLM training comes in several flavors:

  1. Distilled "textbook" content: a strong teacher model writes explanatory material used to train a smaller student (the Phi-3 approach)
  2. Synthetic instruction data: model-generated instruction-response pairs used for alignment and instruction following
  3. Synthetic reasoning traces: step-by-step solutions to math and code problems (the DeepSeek-R1 approach)
  4. Curated synthetic code: generated and verified programs, as opposed to raw scraped repositories
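To make the first two flavors concrete, here is a minimal sketch of teacher-model distillation. The query_teacher helper is a hypothetical stand-in for whatever model API you use, and the prompts and seed topics are illustrative only, not any lab's actual pipeline.

```python
# Minimal sketch of teacher-model distillation: a strong model writes
# instruction/response pairs that later train a smaller student model.
import json
import random

SEED_TOPICS = ["fractions", "recursion", "photosynthesis", "supply and demand"]

def query_teacher(prompt: str) -> str:
    """Placeholder for a call to a strong teacher model (wire up any LLM API)."""
    raise NotImplementedError

def generate_pair(topic: str) -> dict:
    # Ask the teacher for an exercise, then for a worked solution to it.
    instruction = query_teacher(
        f"Write one clear, self-contained exercise about {topic}."
    )
    response = query_teacher(
        f"Solve the following exercise step by step:\n{instruction}"
    )
    return {"instruction": instruction, "response": response}

def build_dataset(n: int, path: str = "synthetic_pairs.jsonl") -> None:
    # Write one JSON record per line, the usual format for training data.
    with open(path, "w") as f:
        for _ in range(n):
            pair = generate_pair(random.choice(SEED_TOPICS))
            f.write(json.dumps(pair) + "\n")
```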

3. Head-to-Head Comparison

| Benchmark Category | Web Data Winner? | Synthetic Data Winner? | Notes |
|---|---|---|---|
| MMLU (general knowledge) | ✅ +3-5pp | | Web data provides broader factual coverage |
| MATH / GSM8K | | ✅ +8-12pp | Synthetic reasoning traces are highly effective |
| HumanEval (code) | | ✅ +5-8pp | Curated synthetic code > random GitHub code |
| MT-Bench (conversation) | | ✅ +0.3-0.5 | Synthetic instruction data produces more helpful responses |
| TriviaQA (factual recall) | ✅ +6-9pp | | Web data contains long-tail facts synthetic can't generate |
| Multilingual (non-English) | ✅ +4-7pp | | Synthetic data is English-centric; web data covers 100+ languages |
| Toxicity avoidance | | ✅ 2x better | Synthetic data can be generated clean; web data needs heavy filtering |
| Creative writing | ✅ preferred | | Web-trained models produce more diverse, surprising prose |
| Instruction following | | ✅ +10-15pp | Synthetic instructions are more precise and diverse |
| Long-form coherence | ~Equal | ~Equal | Depends more on architecture than data source |

4. Case Studies

4.1 Microsoft Phi-3: The Synthetic Data Poster Child

Phi-3 Mini (3.8B parameters) achieves performance comparable to models 10× its size, trained primarily on synthetic "textbook" data: LLM-generated explanatory content, filtered for educational value and mixed with heavily filtered web text. On reasoning benchmarks it rivals far larger models such as Mixtral 8x7B and GPT-3.5.
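The "textbook" recipe depends as much on aggressive filtering as on generation. Below is a hedged sketch of the filtering idea, not Microsoft's actual pipeline: the prose heuristic, the judge placeholder, and the 0.8 threshold are all illustrative assumptions.

```python
# Illustrative "textbook-quality" filtering: a cheap heuristic pre-filter,
# then an LLM judge that scores candidate documents for educational value.
def looks_like_prose(doc: str) -> bool:
    words = doc.split()
    # Drop boilerplate-heavy pages: too short, or too few alphabetic tokens.
    return len(words) > 200 and sum(w.isalpha() for w in words) / len(words) > 0.7

def educational_score(doc: str) -> float:
    """Placeholder: an LLM judge rates educational value from 0.0 to 1.0."""
    raise NotImplementedError

def filter_corpus(docs: list[str]) -> list[str]:
    candidates = (d for d in docs if looks_like_prose(d))   # cheap pass first
    return [d for d in candidates if educational_score(d) >= 0.8]
```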

4.2 Apple AFM: Privacy-Driven Synthetic Strategy

Apple's choice to rely heavily on synthetic data is partially driven by privacy: generating training data in-house reduces the need to train directly on user data or on scraped personal content from the web.

Apple's Hybrid Strategy

Apple appears to use a deliberate two-source strategy: Applebot web data for knowledge (facts, entities, relationships) and synthetic data for behavior (instruction following, safety, tone). This explains why Applebot-Extended disproportionately targets technical, factual content rather than conversational text.

4.3 DeepSeek: Synthetic Reasoning at Scale

DeepSeek-R1's approach represents a new frontier in synthetic data: reinforcement learning elicits long chain-of-thought reasoning traces from the model, and verified traces are then reused as training data to distill that reasoning ability into much smaller models.
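A common way to keep only trustworthy traces is rejection sampling against known answers. A minimal sketch, assuming a hypothetical sample_trace helper that returns a chain of thought plus a final answer:

```python
# Sketch of rejection sampling over reasoning traces: sample several chains
# of thought per problem and keep only those whose final answer checks out.
def sample_trace(problem: str) -> tuple[str, str]:
    """Placeholder: ask the reasoning model for (chain_of_thought, final_answer)."""
    raise NotImplementedError

def collect_traces(problems: dict[str, str], k: int = 8) -> list[dict]:
    """`problems` maps each question to its known ground-truth answer."""
    kept = []
    for question, truth in problems.items():
        for _ in range(k):
            trace, answer = sample_trace(question)
            if answer.strip() == truth.strip():  # keep only verified traces
                kept.append({"question": question, "trace": trace})
                break  # one good trace per problem is enough for this sketch
    return kept
```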

5. The Model Collapse Problem

A critical concern with synthetic data is model collapse. When models are trained on data generated by other models (or themselves), quality can degrade over generations: rare, long-tail knowledge disappears first, outputs drift toward the average of the distribution, and each generation inherits and amplifies the errors of the one before it.

Recent research (Shumailov et al., 2023; Alemohammad et al., 2024) shows that training exclusively on synthetic data for more than 3-4 generations leads to measurable quality degradation. The solution: always mix in fresh, real-world web data to maintain distributional diversity.
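The mixing fix is easy to see in a toy model. In the sketch below, each "generation" refits a Gaussian to samples from the previous fit, but, like a low-temperature model, it under-samples the tails; this is a deliberately simplified illustration of the dynamic, not the analysis in the cited papers.

```python
# Toy model collapse: refit a Gaussian each generation on its own samples
# with the tails dropped. Variance decays; mixing in real data stabilizes it.
import random
import statistics

random.seed(0)
REAL = [random.gauss(0.0, 1.0) for _ in range(20_000)]  # stand-in for web data

def next_generation(mu, sigma, real_fraction, n=20_000):
    # The "model" under-samples its tails (everything beyond 2 sigma is lost).
    synthetic = [x for x in (random.gauss(mu, sigma) for _ in range(n))
                 if abs(x - mu) < 2 * sigma]
    k = int(real_fraction * n)
    mix = random.sample(REAL, k) + synthetic[: n - k]
    return statistics.mean(mix), statistics.stdev(mix)

for real_fraction in (0.0, 0.3):
    mu, sigma = 0.0, 1.0
    for _ in range(10):
        mu, sigma = next_generation(mu, sigma, real_fraction)
    print(f"real_fraction={real_fraction}: sigma after 10 generations = {sigma:.2f}")
# With no real data, sigma collapses toward 0; with 30% real data it stabilizes.
```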

6. Cost Comparison

| Factor | Web Data | Synthetic Data |
|---|---|---|
| Collection cost (per TB) | $50-200 (crawling infra) | $1,000-10,000 (GPU compute) |
| Curation cost | High (filtering, dedup, safety) | Low (generated clean) |
| Legal risk | Moderate-High (copyright, GDPR) | Low (self-generated) |
| Freshness | Real-time (with re-crawling) | Static (reflects teacher model's cutoff) |
| Diversity | Very high (billions of authors) | Low-Medium (limited by teacher model) |
| Quality consistency | Low (requires heavy filtering) | High (controlled generation) |
| Scale limit | ~15-20T tokens (diminishing returns) | Potentially unlimited (but collapse risk) |
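To put the table's collection costs on a common footing, here is a quick back-of-envelope conversion to dollars per billion tokens; the ~4 bytes-per-token figure is a rough rule of thumb for English plain text, not a measured value.

```python
# Back-of-envelope: convert the table's per-TB costs to $ per billion tokens.
BYTES_PER_TOKEN = 4                       # rough assumption for plain text
TOKENS_PER_TB = 1e12 / BYTES_PER_TOKEN    # ~250B tokens per TB

def cost_per_billion_tokens(cost_per_tb: float) -> float:
    return cost_per_tb / (TOKENS_PER_TB / 1e9)

print(cost_per_billion_tokens(200))     # web data, high end: ~$0.80 per B tokens
print(cost_per_billion_tokens(10_000))  # synthetic, high end: ~$40 per B tokens
```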

7. The Optimal Mix

Based on our analysis, the optimal training data composition for a frontier model in 2025 blends both sources: web data carries factual and linguistic breadth, while synthetic data carries reasoning, instruction following, and safety behavior.
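As an illustration of what such a composition might look like in a training config, here is a hypothetical mixture. The category names and weights are placeholders chosen for the sketch, not figures from the studies above.

```python
# Hypothetical data-mixture config in the spirit described above.
# Weights are illustrative placeholders, NOT the article's figures.
DATA_MIXTURE = {
    "filtered_web":        0.55,  # long-tail facts, multilingual breadth
    "synthetic_reasoning": 0.15,  # math/code chains of thought
    "synthetic_instruct":  0.15,  # instruction-response pairs
    "curated_code":        0.10,  # vetted real + synthetic code
    "fresh_crawl":         0.05,  # recent events, collapse mitigation
}
assert abs(sum(DATA_MIXTURE.values()) - 1.0) < 1e-9  # weights must sum to 1
```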

8. Implications for the Web

The rise of synthetic data doesn't eliminate the need for web crawling — it changes what crawlers are looking for. AI bots are increasingly seeking:

  1. Novel factual content that synthetic data can't generate (recent events, new discoveries)
  2. Diverse perspectives across cultures, languages, and viewpoints
  3. Structured data (JSON-LD, tables, lists) that can be directly converted to training format (see the extraction sketch after this list)
  4. Expert content (technical docs, research papers) for domain-specific knowledge
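To illustrate point 3, here is a minimal sketch of pulling JSON-LD blocks out of a page with BeautifulSoup (pip install beautifulsoup4). The function name is ours, and real crawler pipelines do far more validation and normalization.

```python
# Sketch: JSON-LD blocks can be lifted straight out of HTML and turned
# into structured training records, which is why crawlers favor them.
import json
from bs4 import BeautifulSoup

def extract_json_ld(html: str) -> list:
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            records.append(json.loads(tag.string or ""))
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the page
    return records
```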

This shift explains why Applebot-Extended traffic on technical and research content has surged 840% while visits to generic blog posts and news articles have plateaued.
