Common Crawl 2024: Quality Analysis of 3.4 Petabytes of Web Data
1. Introduction
Common Crawl remains the single most important open data source for training large language models. From GPT-4 to Llama 3 to DeepSeek-V3, virtually every frontier model includes Common Crawl data in its pretraining corpus. But not all web data is created equal — in fact, the majority of raw Common Crawl data is noise, duplication, spam, or low-quality content that can actively harm model performance.
In this analysis, we processed the complete CC-2024-18 snapshot (3.4 petabytes of raw web data comprising approximately 3.15 billion web pages) through a multi-stage quality filtering pipeline. Our goal: to quantify exactly how much usable training data exists in the latest Common Crawl release, and to track how quality has evolved over time.
After full deduplication and quality filtering, only 23.4% of extracted CC-2024-18 tokens meet our quality threshold for LLM pretraining. This is nonetheless a 3.4 percentage-point improvement over CC-2023-14 (23.4% vs. 20.0%), driven primarily by the natural decline of low-quality spam sites and improved content diversity.
2. Methodology
2.1 Data Processing Pipeline
Our pipeline processes Common Crawl data in five sequential stages:
- Language Detection: FastText-based language identification, retaining documents in 42 languages with confidence > 0.65
- Boilerplate Removal: trafilatura + custom heuristics for removing navigation, ads, cookie banners, and repetitive page elements
- Near-Duplicate Detection: MinHash LSH with 128 hash functions and a Jaccard similarity threshold of 0.8 (see the sketch after this list)
- Quality Scoring: Linear classifier trained on 45,000 human-rated documents, scoring text coherence, information density, formatting quality, and educational value
- Safety Filtering: Toxic content detection (Perspective API scores), PII pattern matching, known malware domain exclusion
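To make the near-duplicate stage concrete, below is a minimal sketch of MinHash LSH with the parameters listed above (128 permutations, Jaccard threshold 0.8). It uses the off-the-shelf datasketch package rather than our production code, and the word 5-gram shingling is an illustrative choice, not a parameter of our pipeline.

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128    # number of hash permutations (as in our pipeline)
THRESHOLD = 0.8   # Jaccard similarity threshold (as in our pipeline)

def shingles(text: str, n: int = 5):
    """Word n-gram shingles; the shingle size here is illustrative."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    for s in shingles(text):
        m.update(s.encode("utf-8"))
    return m

def deduplicate(docs):
    """Keep the first copy of each near-duplicate cluster.

    docs: iterable of (doc_id, text) pairs. Returns the ids that survive.
    """
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for doc_id, text in docs:
        m = minhash(text)
        if lsh.query(m):      # any already-indexed doc above the threshold?
            continue          # near-duplicate: drop it
        lsh.insert(doc_id, m)
        kept.append(doc_id)
    return kept
```

In the production pipeline this step is sharded across nodes; the single-process version above is only meant to show the parameters in context.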
2.2 Infrastructure
Processing was distributed across 512 AMD EPYC nodes (128 cores each) on AWS, taking approximately 72 hours per complete pass. Total compute cost: ~$18,400 for the full pipeline run.
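For reference, that is roughly 512 × 72 ≈ 36,900 node-hours in total, or about $0.50 per node-hour.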
3. Results: CC-2024-18 at a Glance
| Metric | CC-2024-18 | CC-2023-14 | Change |
|---|---|---|---|
| Raw pages | 3.15B | 2.89B | +9% |
| Raw size (compressed) | 3.4 PB | 3.1 PB | +10% |
| Extracted text tokens | 3.8T | 3.4T | +12% |
| After language filter | 3.2T (84%) | 2.8T (82%) | +2pp |
| After dedup | 2.2T (58%) | 1.8T (53%) | +5pp |
| After quality filter | 890B (23.4%) | 680B (20.0%) | +3.4pp |
| After safety filter | 845B (22.2%) | 638B (18.8%) | +3.4pp |
| Near-duplicate rate | 31.2% | 34.8% | -3.6pp |
| Median quality score | 0.47 | 0.41 | +14.6% |
| Toxic content rate | 4.2% | 5.1% | -0.9pp |
4. Duplicate Analysis
Near-duplication remains one of the largest sources of waste in Common Crawl data. Our MinHash analysis found that 31.2% of extracted text has a near-duplicate elsewhere in the corpus, down from 34.8% the previous year.
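For intuition on the 0.8 threshold: two documents whose shingle sets each contain ten items and overlap in nine have a Jaccard similarity of 9 / 11 ≈ 0.82 and are flagged as near-duplicates, while an overlap of eight out of ten gives 8 / 12 ≈ 0.67 and passes through.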
4.1 Duplication by Domain Type
| Domain Category | Pages | Dup Rate | Top Source |
|---|---|---|---|
| E-commerce product pages | 480M | 62% | Amazon, eBay templated listings |
| News/media | 310M | 41% | Wire service syndication (AP, Reuters) |
| Forum/Q&A | 290M | 28% | Reddit mirrors, Stack Exchange scrapers |
| Technical documentation | 180M | 19% | Versioned docs (same content, different versions) |
| Academic/research | 95M | 12% | PDF extracts, preprint mirrors |
| Government/institutional | 65M | 8% | Template-based government reports |
5. Quality Distribution
We scored every document on a 0-1 quality scale using our classifier. The distribution reveals a clear bimodal pattern:
- Low-quality peak (0.1-0.2): Machine-generated SEO content, cookie-cutter product descriptions, autogenerated pages
- Mid-quality plateau (0.3-0.5): User-generated content (forums, reviews), short news articles, press releases
- High-quality peak (0.7-0.9): Long-form articles, academic content, technical documentation, well-edited Wikipedia-style text
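To make the 0-1 scoring concrete, below is a minimal sketch of a linear quality classifier in the spirit of the one described in Section 2.1. It is not our production model: it uses scikit-learn's hashed bag-of-words features and logistic regression as stand-ins for our feature set, and the training data shown is a placeholder for the 45,000 human-rated documents.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training data: (document text, human rating in {0, 1}).
# In the real pipeline this is the corpus of 45,000 human-rated documents.
train_texts = ["A long, well-edited article with coherent structure ...",
               "buy cheap pills now !!! click here click here"]
train_labels = [1, 0]

# Linear model over hashed bag-of-words features; predict_proba yields
# a probability in [0, 1] that we use directly as the quality score.
scorer = make_pipeline(
    HashingVectorizer(n_features=2**18, alternate_sign=False),
    LogisticRegression(max_iter=1000),
)
scorer.fit(train_texts, train_labels)

def quality_score(text: str) -> float:
    """Estimated probability that the document is high quality."""
    return float(scorer.predict_proba([text])[0, 1])
```

The distribution above comes from scoring every retained document with a model of this form and histogramming the outputs.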
Documents containing structured data (JSON-LD, schema.org markup, tables) score 34% higher on average than documents without structured data. This suggests a strong correlation between the technical sophistication of a webpage and the quality of its text content, a pattern actively exploited by AI-focused user agents such as Applebot-Extended and Google-Extended.
6. Language Distribution
| Language | Tokens (after filtering) | Share | YoY Change |
|---|---|---|---|
| English | 412B | 48.8% | -2.1pp |
| Chinese (Simplified) | 68B | 8.1% | +1.4pp |
| German | 42B | 5.0% | -0.2pp |
| Japanese | 38B | 4.5% | +0.3pp |
| French | 35B | 4.1% | -0.1pp |
| Russian | 33B | 3.9% | -0.5pp |
| Spanish | 31B | 3.7% | +0.2pp |
| Korean | 19B | 2.3% | +0.4pp |
| Portuguese | 18B | 2.1% | +0.1pp |
| Other (33 langs) | 149B | 17.6% | +0.5pp |
7. Open-Source Release
We release the following artifacts to the community:
- Quality scores for CC-2024-18: `curl https://www.aimegacity.xyz/v2/datasets?type=text`
- Quality filtering scripts (Python): `git clone https://github.com/ai-megacity/cc-quality-filter`
- Pre-computed MinHash signatures for deduplication: `wget https://data.aimegacity.xyz/cc2024/minhash-signatures.tar.gz`
8. Implications for LLM Training
Our findings have several practical implications for teams training language models:
- Quality > Quantity: Using only the top 23% of CC data by quality score can match or exceed the performance of using the full dataset, while reducing compute costs by 4x
- Deduplication is essential: Models trained on deduplicated data show 12-15% improvement on memorization benchmarks and better generalization
- Multilingual opportunity: Chinese and Korean web content quality improved fastest — teams building multilingual models should re-evaluate their CC filtering thresholds
- Structured data signal: The presence of schema.org markup is a reliable quality signal and can be used as a lightweight prefilter before expensive classifier-based scoring (see the sketch below)
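As an illustration of the last point, here is a minimal sketch of such a prefilter. The regular expressions and the routing logic are illustrative assumptions, not the heuristics used in our pipeline.

```python
import re

# Cheap structured-data check run before the quality classifier.
# Patterns are illustrative; pages typically embed schema.org data as
# JSON-LD <script> blocks or as microdata itemtype attributes.
JSONLD_RE = re.compile(
    r'<script[^>]+type=["\']application/ld\+json["\']', re.IGNORECASE
)
MICRODATA_RE = re.compile(r'itemtype=["\']https?://schema\.org/', re.IGNORECASE)

def has_structured_data(html: str) -> bool:
    """Return True if the page embeds JSON-LD or schema.org microdata."""
    return bool(JSONLD_RE.search(html) or MICRODATA_RE.search(html))

def route(docs):
    """Split documents into a fast path and a full-scoring path.

    docs: iterable of dicts with an "html" field. Pages carrying a
    structured-data signal go straight to classifier scoring; the rest
    can first pass through cheaper heuristic filters.
    """
    fast_path, slow_path = [], []
    for doc in docs:
        (fast_path if has_structured_data(doc["html"]) else slow_path).append(doc)
    return fast_path, slow_path
```

Because the check is two regex searches per page, it costs essentially nothing compared to classifier inference, which is what makes it useful as a first-pass gate.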