Common Crawl 2024: Quality Analysis of 3.4 Petabytes of Web Data
1. Introduction
Common Crawl remains the single most important open data source for training large language models. From GPT-4 to Llama 3 to DeepSeek-V3, virtually every frontier model includes Common Crawl data in its pretraining corpus. But not all web data is created equal — in fact, the majority of raw Common Crawl data is noise, duplication, spam, or low-quality content that can actively harm model performance.
In this analysis, we processed the complete CC-2024-18 snapshot (3.4 petabytes of raw web data comprising approximately 3.15 billion web pages) through a multi-stage quality filtering pipeline. Our goal: to quantify exactly how much usable training data exists in the latest Common Crawl release, and to track how quality has evolved over time.
After full deduplication and quality filtering, only 23.4% of extracted CC-2024-18 tokens meet our quality threshold for LLM pretraining. This is nonetheless a 3.4 percentage-point improvement over CC-2023-14 (23.4% vs. 20.0%), driven primarily by the natural decline of low-quality spam sites and improved content diversity.
2. Methodology
2.1 Data Processing Pipeline
Our pipeline processes Common Crawl data in five sequential stages:
- Language Detection: FastText-based language identification, retaining documents in 42 languages with confidence > 0.65
- Boilerplate Removal: trafilatura + custom heuristics for removing navigation, ads, cookie banners, and repetitive page elements
- Near-Duplicate Detection: MinHash LSH with 128 hash functions and a Jaccard similarity threshold of 0.8 (see the sketch after this list)
- Quality Scoring: Linear classifier trained on 45,000 human-rated documents, scoring text coherence, information density, formatting quality, and educational value
- Safety Filtering: Toxic content detection (Perspective API scores), PII pattern matching, known malware domain exclusion
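To make the near-duplicate stage concrete, below is a minimal sketch of MinHash LSH with the parameters listed above (128 permutations, Jaccard threshold 0.8). It uses the off-the-shelf datasketch package rather than our production code, and the word 5-gram shingling is an illustrative choice, not a parameter of our pipeline.

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128    # number of hash permutations (as in our pipeline)
THRESHOLD = 0.8   # Jaccard similarity threshold (as in our pipeline)

def shingles(text: str, n: int = 5):
    """Word n-gram shingles; the shingle size here is illustrative."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    for s in shingles(text):
        m.update(s.encode("utf-8"))
    return m

def deduplicate(docs):
    """Keep the first copy of each near-duplicate cluster.

    docs: iterable of (doc_id, text) pairs. Returns the ids that survive.
    """
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for doc_id, text in docs:
        m = minhash(text)
        if lsh.query(m):      # any already-indexed doc above the threshold?
            continue          # near-duplicate: drop it
        lsh.insert(doc_id, m)
        kept.append(doc_id)
    return kept
```

In the production pipeline this step is sharded across nodes; the single-process version above is only meant to show the parameters in context.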
2.2 Infrastructure
Processing was distributed across 512 AMD EPYC nodes (128 cores each) on AWS, taking approximately 72 hours per complete pass. Total compute cost: ~$18,400 for the full pipeline run.
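For reference, that is roughly 512 × 72 ≈ 36,900 node-hours in total, or about $0.50 per node-hour.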
3. Results: CC-2024-18 at a Glance
| Metric | CC-2024-18 | CC-2023-14 | Change |
|---|---|---|---|
| Raw pages | 3.15B | 2.89B | +9% |
| Raw size (compressed) | 3.4 PB | 3.1 PB | +10% |
| Extracted text tokens | 3.8T | 3.4T | +12% |
| After language filter | 3.2T (84%) | 2.8T (82%) | +2pp |
| After dedup | 2.2T (58%) | 1.8T (53%) | +5pp |
| After quality filter | 890B (23.4%) | 680B (20.0%) | +3.4pp |
| After safety filter | 845B (22.2%) | 638B (18.8%) | +3.4pp |
| Near-duplicate rate | 31.2% | 34.8% | -3.6pp |
| Median quality score | 0.47 | 0.41 | +14.6% |
| Toxic content rate | 4.2% | 5.1% | -0.9pp |
4. Duplicate Analysis
Near-duplication remains one of the largest sources of waste in Common Crawl data. Our MinHash analysis found that 31.2% of extracted text has a near-duplicate elsewhere in the corpus, down from 34.8% the previous year.
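For intuition on the 0.8 threshold: two documents whose shingle sets each contain ten items and overlap in nine have a Jaccard similarity of 9 / 11 ≈ 0.82 and are flagged as near-duplicates, while an overlap of eight out of ten gives 8 / 12 ≈ 0.67 and passes through.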
4.1 Duplication by Domain Type
| Domain Category | Pages | Dup Rate | Top Source |
|---|---|---|---|
| E-commerce product pages | 480M | 62% | Amazon, eBay templated listings |
| News/media | 310M | 41% | Wire service syndication (AP, Reuters) |
| Forum/Q&A | 290M | 28% | Reddit mirrors, Stack Exchange scrapers |
| Technical documentation | 180M | 19% | Versioned docs (same content, different versions) |
| Academic/research | 95M | 12% | PDF extracts, preprint mirrors |
| Government/institutional | 65M | 8% | Template-based government reports |
5. Quality Distribution
We scored every document on a 0-1 quality scale using our classifier. The distribution reveals a clear bimodal pattern:
- Low-quality peak (0.1-0.2): Machine-generated SEO content, cookie-cutter product descriptions, autogenerated pages
- Mid-quality plateau (0.3-0.5): User-generated content (forums, reviews), short news articles, press releases
- High-quality peak (0.7-0.9): Long-form articles, academic content, technical documentation, well-edited Wikipedia-style text
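To make the 0-1 scoring concrete, below is a minimal sketch of a linear quality classifier in the spirit of the one described in Section 2.1. It is not our production model: it uses scikit-learn's hashed bag-of-words features and logistic regression as stand-ins for our feature set, and the training data shown is a placeholder for the 45,000 human-rated documents.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training data: (document text, human rating in {0, 1}).
# In the real pipeline this is the corpus of 45,000 human-rated documents.
train_texts = ["A long, well-edited article with coherent structure ...",
               "buy cheap pills now !!! click here click here"]
train_labels = [1, 0]

# Linear model over hashed bag-of-words features; predict_proba yields
# a probability in [0, 1] that we use directly as the quality score.
scorer = make_pipeline(
    HashingVectorizer(n_features=2**18, alternate_sign=False),
    LogisticRegression(max_iter=1000),
)
scorer.fit(train_texts, train_labels)

def quality_score(text: str) -> float:
    """Estimated probability that the document is high quality."""
    return float(scorer.predict_proba([text])[0, 1])
```

The distribution above comes from scoring every retained document with a model of this form and histogramming the outputs.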
Documents containing structured data (JSON-LD, schema.org markup, tables) score 34% higher on average than documents without structured data. This suggests a strong correlation between the technical sophistication of a webpage and the quality of its text content, a pattern actively exploited by AI-focused user agents such as Applebot-Extended and Google-Extended.
6. Language Distribution
| Language | Tokens (after filtering) | Share | YoY Change |
|---|---|---|---|
| English | 412B | 48.8% | -2.1pp |
| Chinese (Simplified) | 68B | 8.1% | +1.4pp |
| German | 42B | 5.0% | -0.2pp |
| Japanese | 38B | 4.5% | +0.3pp |
| French | 35B | 4.1% | -0.1pp |
| Russian | 33B | 3.9% | -0.5pp |
| Spanish | 31B | 3.7% | +0.2pp |
| Korean | 19B | 2.3% | +0.4pp |
| Portuguese | 18B | 2.1% | +0.1pp |
| Other (33 langs) | 149B | 17.6% | +0.5pp |
7. Open-Source Release
We release the following artifacts to the community:
- Quality scores for CC-2024-18: `curl https://www.aimegacity.xyz/v2/datasets?type=text`
- Quality filtering scripts (Python): `git clone https://github.com/ai-megacity/cc-quality-filter`
- Pre-computed MinHash signatures for deduplication: `wget https://data.aimegacity.xyz/cc2024/minhash-signatures.tar.gz`
8. Implications for LLM Training
Our findings have several practical implications for teams training language models:
- Quality > Quantity: Using only the top 23% of CC data by quality score can match or exceed the performance of using the full dataset, while reducing compute costs by 4x
- Deduplication is essential: Models trained on deduplicated data show 12-15% improvement on memorization benchmarks and better generalization
- Multilingual opportunity: Chinese and Korean web content quality improved fastest — teams building multilingual models should re-evaluate their CC filtering thresholds
- Structured data signal: The presence of schema.org markup is a reliable quality signal and can be used as a lightweight prefilter before expensive classifier-based scoring (see the sketch below)
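As an illustration of the last point, here is a minimal sketch of such a prefilter. The regular expressions and the routing logic are illustrative assumptions, not the heuristics used in our pipeline.

```python
import re

# Cheap structured-data check run before the quality classifier.
# Patterns are illustrative; pages typically embed schema.org data as
# JSON-LD <script> blocks or as microdata itemtype attributes.
JSONLD_RE = re.compile(
    r'<script[^>]+type=["\']application/ld\+json["\']', re.IGNORECASE
)
MICRODATA_RE = re.compile(r'itemtype=["\']https?://schema\.org/', re.IGNORECASE)

def has_structured_data(html: str) -> bool:
    """Return True if the page embeds JSON-LD or schema.org microdata."""
    return bool(JSONLD_RE.search(html) or MICRODATA_RE.search(html))

def route(docs):
    """Split documents into a fast path and a full-scoring path.

    docs: iterable of dicts with an "html" field. Pages carrying a
    structured-data signal go straight to classifier scoring; the rest
    can first pass through cheaper heuristic filters.
    """
    fast_path, slow_path = [], []
    for doc in docs:
        (fast_path if has_structured_data(doc["html"]) else slow_path).append(doc)
    return fast_path, slow_path
```

Because the check is two regex searches per page, it costs essentially nothing compared to classifier inference, which is what makes it useful as a first-pass gate.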