[Figure: Data quality filtering pipeline visualization]

Common Crawl 2024: Quality Analysis of 3.4 Petabytes of Web Data

Published February 10, 2025 · Data Quality Team · 15 min read
Tags: Common Crawl · Data Quality · LLM Training · Deduplication · Open Data

1. Introduction

Common Crawl remains the single most important open data source for training large language models. From GPT-4 to Llama 3 to DeepSeek-V3, virtually every frontier model includes Common Crawl data in its pretraining corpus. But not all web data is created equal — in fact, the majority of raw Common Crawl data is noise, duplication, spam, or low-quality content that can actively harm model performance.

In this analysis, we processed the complete CC-2024-18 snapshot — 3.4 petabytes of raw web data comprising approximately 3.15 billion web pages — through a multi-stage quality filtering pipeline. Our goal: quantify exactly how much usable training data exists in the latest Common Crawl release, and how quality has evolved over time.

Key Finding

After full deduplication and quality filtering, only 23.4% of raw CC-2024-18 content meets our quality threshold for LLM pretraining. This is nonetheless a 3.4-percentage-point gain over CC-2023-14's 20.0% (roughly 17% in relative terms), driven primarily by the natural decline of low-quality spam sites and improved content diversity.

2. Methodology

2.1 Data Processing Pipeline

Our pipeline processes Common Crawl data in five sequential stages:

  1. Language Detection: FastText-based language identification, retaining documents in 42 languages with confidence > 0.65
  2. Boilerplate Removal: trafilatura + custom heuristics for removing navigation, ads, cookie banners, and repetitive page elements
  3. Near-Duplicate Detection: MinHash LSH with 128 hash functions and Jaccard similarity threshold of 0.8
  4. Quality Scoring: Linear classifier trained on 45,000 human-rated documents, scoring text coherence, information density, formatting quality, and educational value
  5. Safety Filtering: Toxic content detection (Perspective API scores), PII pattern matching, known malware domain exclusion
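To make the flow of the five stages concrete, here is a minimal sketch of the pipeline's control flow. The real stages use fastText, trafilatura, MinHash LSH, a trained classifier, and the Perspective API; the stand-in functions, field names (`lang_conf`, `text`), and thresholds below are illustrative only, and the deduplication and safety stages are omitted for brevity.

```python
import re

def language_filter(doc, threshold=0.65):
    # Stage 1 stand-in: keep documents whose language confidence passes the cutoff
    return doc.get("lang_conf", 0.0) > threshold

def strip_boilerplate(doc):
    # Stage 2 stand-in: drop lines matching common boilerplate phrases
    noise = re.compile(r"(cookie|subscribe|sign in|©)", re.I)
    lines = [ln for ln in doc["text"].splitlines() if not noise.search(ln)]
    doc["text"] = "\n".join(lines)
    return doc

def quality_score(doc):
    # Stage 4 stand-in: a crude information-density proxy (unique-word ratio)
    words = doc["text"].split()
    return len(set(words)) / max(len(words), 1)

def run_pipeline(docs, quality_threshold=0.5):
    # Stages run sequentially; a document must survive each one to be kept
    kept = []
    for doc in docs:
        if not language_filter(doc):
            continue
        doc = strip_boilerplate(doc)
        if quality_score(doc) >= quality_threshold:
            kept.append(doc)
    return kept
```

The sequential structure matters: boilerplate removal runs before quality scoring so that navigation text and cookie banners cannot drag down an otherwise good document's score.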

2.2 Infrastructure

Processing was distributed across 512 AMD EPYC nodes (128 cores each) on AWS, taking approximately 72 hours per complete pass. Total compute cost: ~$18,400 for the full pipeline run.
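As a back-of-envelope sanity check on those figures (all taken from this section; decimal units assumed), the implied per-node throughput and unit cost work out as follows:

```python
# Figures from Section 2.2: 512 nodes, 72 hours, 3.4 PB, ~$18,400
nodes, hours = 512, 72
petabytes = 3.4
cost_usd = 18_400

node_hours = nodes * hours                            # 36,864 node-hours
gb_total = petabytes * 1e6                            # 3.4e6 GB (decimal)
gb_per_node_hour = gb_total / node_hours              # ~92 GB per node-hour
mb_per_sec_per_node = gb_per_node_hour * 1000 / 3600  # ~26 MB/s per node
usd_per_node_hour = cost_usd / node_hours             # ~$0.50 per node-hour
```

About 26 MB/s of raw compressed input per 128-core node suggests the pipeline is compute-bound (extraction, hashing, classification) rather than I/O-bound.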

3. Results: CC-2024-18 at a Glance

Metric                  | CC-2024-18   | CC-2023-14   | Change
Raw pages               | 3.15B        | 2.89B        | +9%
Raw size (compressed)   | 3.4 PB       | 3.1 PB       | +10%
Extracted text tokens   | 3.8T         | 3.4T         | +12%
After language filter   | 3.2T (84%)   | 2.8T (82%)   | +2pp
After dedup             | 2.2T (58%)   | 1.8T (53%)   | +5pp
After quality filter    | 890B (23.4%) | 680B (20.0%) | +3.4pp
After safety filter     | 845B (22.2%) | 638B (18.8%) | +3.4pp
Near-duplicate rate     | 31.2%        | 34.8%        | -3.6pp
Median quality score    | 0.47         | 0.41         | +14.6%
Toxic content rate      | 4.2%         | 5.1%         | -0.9pp
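The retention percentages in the funnel are all expressed against the 3.8T extracted tokens, which is easy to verify directly:

```python
# Token counts from the CC-2024-18 column of the table above
raw = 3.8e12  # extracted text tokens
stages = {
    "language": 3.2e12,
    "dedup": 2.2e12,
    "quality": 0.890e12,
    "safety": 0.845e12,
}
retention = {name: round(100 * tokens / raw, 1) for name, tokens in stages.items()}
# {'language': 84.2, 'dedup': 57.9, 'quality': 23.4, 'safety': 22.2}
```

Each percentage matches the table, confirming the funnel uses extracted tokens (not raw pages or bytes) as its denominator.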

4. Duplicate Analysis

Near-duplication remains the biggest source of waste in Common Crawl data. Our MinHash analysis found that 31.2% of extracted text has a near-duplicate elsewhere in the corpus — slightly improved from 34.8% the previous year.
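The mechanics of MinHash-based near-duplicate detection are worth spelling out. Below is a minimal, self-contained sketch (not our production code, which uses LSH banding to avoid all-pairs comparison): each document gets a 128-slot signature, and the fraction of matching slots between two signatures is an unbiased estimate of the Jaccard similarity of their shingle sets. Pairs estimated above the 0.8 threshold are flagged as near-duplicates. The shingle size and hashing scheme here are illustrative choices.

```python
import hashlib

NUM_HASHES = 128  # matches the 128 hash functions described above

def shingles(text, k=5):
    """Character k-shingles of a whitespace-normalized document."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash(text):
    """Signature: for each seeded hash function, the minimum hash over all shingles."""
    sig = []
    for seed in range(NUM_HASHES):
        salt = seed.to_bytes(8, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots ~ Jaccard similarity of shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES
```

With 128 hash functions the standard error of the estimate is about sqrt(J(1-J)/128), under 0.05 for any true similarity J, which is tight enough to apply a 0.8 cutoff reliably.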

4.1 Duplication by Domain Type

Domain Category          | Pages | Dup Rate | Top Source
E-commerce product pages | 480M  | 62%      | Amazon, eBay templated listings
News/media               | 310M  | 41%      | Wire service syndication (AP, Reuters)
Forum/Q&A                | 290M  | 28%      | Reddit mirrors, Stack Exchange scrapers
Technical documentation  | 180M  | 19%      | Versioned docs (same content, different versions)
Academic/research        | 95M   | 12%      | PDF extracts, preprint mirrors
Government/institutional | 65M   | 8%       | Template-based government reports

5. Quality Distribution

We scored every document on a 0-1 quality scale using our classifier. The distribution reveals a clear bimodal pattern.
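The scoring model is a linear classifier over document-level features (the four feature names come from Section 2.1; the weights, bias, and sigmoid squashing below are a hypothetical illustration, not the coefficients of our trained model, which was fit on 45,000 human-rated documents):

```python
import math

# Illustrative weights only -- not the trained coefficients
WEIGHTS = {
    "coherence": 1.8,
    "info_density": 1.4,
    "formatting": 0.6,
    "educational_value": 1.2,
}
BIAS = -2.5

def quality_score(features):
    """Map feature values in [0, 1] to a quality score in (0, 1) via a sigmoid."""
    z = BIAS + sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))
```

A linear model keeps scoring cheap enough to run over trillions of tokens, and its weights are directly interpretable, which helps when auditing why a document was kept or dropped.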

Interesting Pattern

Documents containing structured data (JSON-LD, schema.org markup, tables) score 34% higher on average than documents without structured data. This suggests a strong correlation between the technical sophistication of a webpage and the quality of its text content — a pattern actively exploited by crawlers like Applebot-Extended and Google-Extended.
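Detecting the structured-data signals named above is cheap. A rough sketch (regex-based for speed; a production system would more likely inspect a parsed DOM):

```python
import re

# Patterns for the three signals mentioned above: JSON-LD script blocks,
# schema.org microdata attributes, and HTML tables
STRUCTURED_PATTERNS = [
    re.compile(r'<script[^>]+type=["\']application/ld\+json["\']', re.I),
    re.compile(r'itemtype=["\']https?://schema\.org/', re.I),
    re.compile(r"<table\b", re.I),
]

def has_structured_data(html):
    """True if the raw HTML contains any structured-data signal."""
    return any(p.search(html) for p in STRUCTURED_PATTERNS)
```

Because the check runs on raw HTML before boilerplate removal, it can be used as an auxiliary feature during quality scoring at near-zero extra cost.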

6. Language Distribution

Language             | Tokens (after filtering) | Share | YoY Change
English              | 412B                     | 48.8% | -2.1pp
Chinese (Simplified) | 68B                      | 8.1%  | +1.4pp
German               | 42B                      | 5.0%  | -0.2pp
Japanese             | 38B                      | 4.5%  | +0.3pp
French               | 35B                      | 4.1%  | -0.1pp
Russian              | 33B                      | 3.9%  | -0.5pp
Spanish              | 31B                      | 3.7%  | +0.2pp
Korean               | 19B                      | 2.3%  | +0.4pp
Portuguese           | 18B                      | 2.1%  | +0.1pp
Other (33 langs)     | 149B                     | 17.6% | +0.5pp

7. Open-Source Release

We release the following artifacts to the community:

# Download quality scores for CC-2024-18
curl "https://www.aimegacity.xyz/v2/datasets?type=text"

# Quality filtering scripts (Python)
git clone https://github.com/ai-megacity/cc-quality-filter

# Pre-computed MinHash signatures for deduplication
wget https://data.aimegacity.xyz/cc2024/minhash-signatures.tar.gz

8. Implications for LLM Training

Our findings have several practical implications for teams training language models:
