
DeepSeek-V3 Training Data: What 14.8 Trillion Tokens Looks Like

Published January 28, 2025 · ML Systems Lab · 11 min read
Tags: DeepSeek-V3 · Training Data · 671B MoE · Chinese AI · Open Source

1. The DeepSeek Phenomenon

DeepSeek has emerged as one of the most impressive AI labs globally, producing frontier-quality models at a fraction of the cost reported by Western labs. DeepSeek-V3, their 671B Mixture-of-Experts model released in December 2024, was trained on just 14.8 trillion tokens — and its total training cost was reportedly under $6 million in compute, versus hundreds of millions for comparable Western models.

This report analyzes the composition of DeepSeek-V3's training corpus based on their published technical report, cross-referenced with our own analysis of publicly detectable data collection patterns.

Cost Efficiency

DeepSeek-V3's training cost of ~$5.6M for a 671B MoE model represents approximately 1/30th the estimated cost of training GPT-4. This was achieved through hardware-level optimizations on NVIDIA H800 GPUs (the China-export-restricted variant of H100s with reduced interconnect bandwidth), custom FP8 mixed-precision training, and a multi-token prediction objective.
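
To make the FP8 piece concrete, here is a minimal PyTorch-style sketch of per-tensor FP8 (E4M3) weight quantization with a scale factor. This illustrates the general technique only, not DeepSeek's implementation, which uses fine-grained block-wise scaling and native FP8 GEMM kernels; the helper names are ours and the snippet assumes a PyTorch build with float8 dtypes (2.1 or later).

# Sketch: per-tensor FP8 (E4M3) weight quantization with a scale factor.
# Illustrative only; real FP8 training uses hardware FP8 GEMMs with
# higher-precision accumulation and finer-grained scaling.
import torch

def quantize_fp8(x: torch.Tensor):
    # E4M3 tops out around 448, so scale the tensor into that range.
    scale = x.abs().max().clamp(min=1e-12) / 448.0
    return (x / scale).to(torch.float8_e4m3fn), scale

def fp8_linear(x, w_fp8, w_scale):
    # Dequantize to bf16 for the matmul in this simulation.
    w = w_fp8.to(torch.bfloat16) * w_scale.to(torch.bfloat16)
    return x.to(torch.bfloat16) @ w.t()

w = torch.randn(4096, 4096)
w_fp8, s = quantize_fp8(w)
y = fp8_linear(torch.randn(8, 4096), w_fp8, s)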

2. Training Data Composition

Based on the DeepSeek-V3 technical report and our reverse analysis of Bytespider crawling patterns, we estimate the following training data breakdown:

Data Category | Est. Tokens | Share | Key Sources
Web text (general) | 8.88T | ~60% | Common Crawl, proprietary Chinese web crawl
Code | 2.22T | ~15% | GitHub (public repos), GitCode, StackOverflow
Books & long-form | 1.48T | ~10% | Public domain books, Chinese literature databases
Mathematics | 1.18T | ~8% | OpenWebMath, synthetic proofs, textbook exercises
Scientific papers | 740B | ~5% | arXiv, Semantic Scholar, CNKI (Chinese academic DB)
Other (instruction, reasoning) | 296B | ~2% | Synthetic SFT data, reasoning traces
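
The per-category token counts follow directly from applying each share to the 14.8T total; a quick Python check of the arithmetic (rounding accounts for the small differences from the table):

# Back-of-envelope check: category shares applied to the 14.8T-token total.
TOTAL_TOKENS_T = 14.8  # trillions
shares = {
    "Web text (general)": 0.60,
    "Code": 0.15,
    "Books & long-form": 0.10,
    "Mathematics": 0.08,
    "Scientific papers": 0.05,
    "Other (instruction, reasoning)": 0.02,
}
for category, share in shares.items():
    print(f"{category:32s} {TOTAL_TOKENS_T * share:5.2f}T")
# 8.88T, 2.22T, 1.48T, 1.18T, 0.74T, 0.30T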

3. Architecture: Mixture of Experts

DeepSeek-V3 uses a massive MoE architecture with several unusual design choices: 671B total parameters with only ~37B activated per token, fine-grained routed experts plus a shared expert, Multi-head Latent Attention (MLA) to compress the KV cache, an auxiliary-loss-free load-balancing strategy, and a multi-token prediction training objective.
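
As a rough illustration of how token-level routing works in any top-k MoE layer, the sketch below scores each token against a router matrix and dispatches it to its k highest-scoring experts. This is the generic pattern, not DeepSeek's implementation (V3 adds the shared expert, fine-grained experts, and auxiliary-loss-free balancing on top), and all function and variable names are illustrative.

# Sketch: generic top-k expert routing in a Mixture-of-Experts layer.
import torch
import torch.nn.functional as F

def moe_forward(x, experts, router, k=8):
    # x: (tokens, d_model); router: (d_model, n_experts) scoring matrix.
    scores = F.softmax(x @ router, dim=-1)          # routing probabilities
    topk_scores, topk_idx = scores.topk(k, dim=-1)  # k experts per token
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = topk_idx[:, slot] == e           # tokens routed to expert e
            if mask.any():
                out[mask] += topk_scores[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out

d, n_experts = 64, 16
experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]
router = torch.randn(d, n_experts)
y = moe_forward(torch.randn(32, d), experts, router, k=4)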

3.1 Training Infrastructure

Spec | DeepSeek-V3 | Llama 3 405B (comparison)
GPUs | 2,048× H800 | 16,384× H100
Training duration | ~2 months | ~3 months
Tokens processed | 14.8T | 15T
Training cost | ~$5.6M | ~$150M+ (est.)
Precision | FP8 (primary) | BF16
Context length | 128K | 128K
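
The headline cost figure can be reproduced from the GPU-hour numbers in the technical report, which cites roughly 2.788M H800 GPU-hours priced at an assumed $2 per GPU-hour; the short check below also derives the implied GPU-hours from the cluster size and duration in the table.

# Reproduce the ~$5.6M estimate from reported GPU-hours.
gpu_hours = 2.788e6         # total H800 GPU-hours cited in the technical report
price_per_gpu_hour = 2.0    # assumed rental price in USD
print(f"Estimated cost: ${gpu_hours * price_per_gpu_hour / 1e6:.2f}M")   # ~$5.58M

# Cross-check against the cluster size and duration above.
gpus, days = 2048, 60       # 2,048 H800s for roughly two months
print(f"Implied GPU-hours: {gpus * days * 24 / 1e6:.2f}M")               # ~2.95M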

4. Benchmark Performance

Despite using 8× fewer GPUs and 1/30th the compute budget, DeepSeek-V3 matches or exceeds Llama 3 405B on most benchmarks:

Benchmark | DeepSeek-V3 | Llama 3 405B | GPT-4o | Claude 3.5 Sonnet
MMLU | 88.5 | 87.3 | 88.7 | 88.3
HumanEval | 89.9 | 89.0 | 90.2 | 92.0
MATH | 90.2 | 73.8 | 76.6 | 71.1
MT-Bench | 8.9 | 8.9 | 9.1 | 9.4
CLUEWSC (Chinese) | 90.9 | 70.1 | 81.4 | 78.6

Math Dominance

DeepSeek-V3's MATH score of 90.2 (vs GPT-4o's 76.6) reflects their heavy investment in mathematical training data (~8% of corpus) and the multi-token prediction objective, which particularly benefits chain-of-thought reasoning tasks.
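
As a sketch of what a multi-token prediction objective adds on top of standard next-token training, the snippet below attaches a second head that predicts the token two positions ahead and folds it into the loss with a small weight. This is a simplified stand-in for V3's MTP module, which chains sequential prediction depths; the 0.3 weight and all names here are illustrative.

# Sketch: next-token loss plus an auxiliary head predicting two tokens ahead.
import torch
import torch.nn.functional as F

def mtp_loss(hidden, head_main, head_mtp, targets):
    # hidden: (batch, seq, d); targets: (batch, seq) token ids.
    logits_1 = head_main(hidden[:, :-1])   # predict token t+1
    logits_2 = head_mtp(hidden[:, :-2])    # predict token t+2
    loss_1 = F.cross_entropy(logits_1.flatten(0, 1), targets[:, 1:].flatten())
    loss_2 = F.cross_entropy(logits_2.flatten(0, 1), targets[:, 2:].flatten())
    return loss_1 + 0.3 * loss_2           # illustrative auxiliary weight

d, vocab = 64, 1000
head_main, head_mtp = torch.nn.Linear(d, vocab), torch.nn.Linear(d, vocab)
hidden = torch.randn(2, 16, d)
targets = torch.randint(0, vocab, (2, 16))
print(mtp_loss(hidden, head_main, head_mtp, targets))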

5. Chinese Web Data Advantage

One factor often overlooked in Western analysis is that DeepSeek has access to Chinese web data sources that are largely invisible to Western crawlers, from Chinese-language platforms and forums to academic databases such as CNKI.

This gives DeepSeek access to a high-quality Chinese text corpus estimated at 2-3× the size of what Common Crawl captures from Chinese domains, potentially explaining their Chinese benchmark dominance.

6. Data Collection via Bytespider

ByteDance's Bytespider crawler has one of the highest daily crawl rates among AI bots and gives a sense of the scale at which the Chinese web is crawled for AI training data:

Metric | Bytespider | GPTBot (comparison)
Daily requests (global est.) | 180M | 65M
robots.txt compliance | 62.1% | 98.2%
Avg pages/domain/day | 8,400 | 2,100
Content type preference | All (aggressive) | Text-focused
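
For site operators, the declared-policy side of this is easy to inspect: the sketch below parses a site's robots.txt for the Bytespider and GPTBot user agents with Python's standard robotparser. Note that the compliance figures above describe observed crawler behavior, which robots.txt alone cannot enforce; example.com is a placeholder.

# Check whether a site's robots.txt disallows the Bytespider and GPTBot
# user agents (declared policy only; actual crawler compliance can differ).
from urllib.robotparser import RobotFileParser

def crawl_allowed(site: str, agent: str, path: str = "/") -> bool:
    rp = RobotFileParser()
    rp.set_url(f"{site}/robots.txt")
    rp.read()
    return rp.can_fetch(agent, f"{site}{path}")

for agent in ("Bytespider", "GPTBot"):
    print(agent, crawl_allowed("https://example.com", agent))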

7. API Access

# Get DeepSeek-V3 model card
curl https://www.aimegacity.xyz/v2/models/deepseek-v3

# Compare with other models
curl https://www.aimegacity.xyz/v2/models

# Search related papers
curl "https://www.aimegacity.xyz/v2/papers?q=deepseek"
