Weekly analysis of AI model releases, training data, and web crawling patterns.
Web Crawling
Between Q3 2023 and Q4 2024, Applebot traffic across a monitored network of 1,200
domains increased by 840%. This post analyzes the crawl patterns, most-targeted content types, and
what Apple's AI roadmap suggests about their data collection strategy. We find that Applebot
disproportionately targets scientific text, product descriptions, and multilingual content —
consistent with training data for Apple Intelligence's on-device and server-side models. Apple's
robots.txt compliance rate is 99.1%, the highest among major AI crawlers.
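Compliance rates like these can be estimated by replaying a crawler's requested paths against each site's robots.txt. A minimal sketch using Python's standard-library `robotparser` (the robots.txt body and log entries here are made-up illustrations, not data from our monitored network):

```python
from urllib import robotparser

# Hypothetical robots.txt for an example site, not a real policy.
ROBOTS_TXT = """\
User-agent: Applebot
Disallow: /private/
Disallow: /drafts/
"""

def compliance_rate(agent, robots_txt, requested_paths, host="https://example.com"):
    """Fraction of a crawler's requests that robots.txt permits."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    allowed = sum(rp.can_fetch(agent, host + p) for p in requested_paths)
    return allowed / len(requested_paths)

# Toy access log: one of four requests hits a disallowed path.
log = ["/index.html", "/products/a", "/private/x", "/blog/post-1"]
print(f"{compliance_rate('Applebot', ROBOTS_TXT, log):.1%}")  # 75.0%
```

In practice the rate is computed per site against the robots.txt that was live at request time, then aggregated across the network.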
March 1, 2025 · 12 min read · AI Megacity Research Team · 3,400 views
Crawlers
OpenAI's GPTBot and Anthropic's ClaudeBot have distinct crawling philosophies.
GPTBot crawls broadly but shallowly, prioritizing coverage of many pages over depth on any one site, while ClaudeBot tends to re-crawl high-quality pages more frequently. Our analysis of six months of honeypot data reveals that GPTBot is 4.3× more likely to crawl API documentation pages, whereas ClaudeBot shows a strong preference for long-form articles over 2,000 words. Both respect robots.txt in over 97% of observed cases.
February 22, 2025 · 9 min read · AI Megacity Research Team · 2,100 views
Datasets
Common Crawl remains the backbone of most LLM pretraining pipelines. We ran the
CC-2024-18 snapshot through a quality filtering pipeline and found: 31% of raw text is
near-duplicate, 12% contains significant toxic content, and the median document quality score has
improved 18% year-over-year due to spam site attrition. We release our filtering scripts and quality
scores as an open dataset for the community.
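The core idea behind near-duplicate detection can be sketched with word shingles and Jaccard similarity. This brute-force version is illustrative only (the threshold is an assumption, and a production pipeline would typically use MinHash/LSH rather than pairwise comparison at snapshot scale):

```python
def shingles(text, k=5):
    """Set of overlapping k-word shingles from a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def is_near_duplicate(doc_a, doc_b, threshold=0.8):
    """Flag document pairs whose shingle overlap exceeds the threshold."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold
```

Two documents that share most of their 5-word sequences score near 1.0 and are flagged; unrelated text shares no shingles and scores 0.0.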
February 10, 2025 · 15 min read · Data Quality Team · 4,800 views
Models
DeepSeek-V3's technical report reveals training on 14.8T tokens across a diverse
corpus. We break down the estimated composition: web data (~60%), code (~15%), books (~10%), math
(~8%), scientific papers (~5%), and other (~2%). Cross-referencing with their reported benchmark
scores, we analyze how this unconventional mix contributed to their state-of-the-art performance on
coding and math benchmarks while matching frontier models on general text tasks.
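The estimated mix above translates into absolute token counts as follows. The shares are our estimates, not official figures from the report; only the 14.8T total is reported:

```python
TOTAL_TOKENS = 14.8e12  # DeepSeek-V3's reported pretraining token count

# Estimated corpus shares (our breakdown, not from the technical report).
mix = {"web": 0.60, "code": 0.15, "books": 0.10,
       "math": 0.08, "scientific": 0.05, "other": 0.02}

assert abs(sum(mix.values()) - 1.0) < 1e-9  # shares should cover the corpus

for source, share in mix.items():
    print(f"{source:10s} ~{share * TOTAL_TOKENS / 1e12:.2f}T tokens")
```

Under these assumptions, web data alone contributes roughly 8.9T tokens and code about 2.2T.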
January 28, 2025 · 11 min read · ML Systems Lab · 6,200 views
AI Policy
Following landmark court rulings in the US and EU in late 2024, the legal status of
web scraping for AI training has shifted significantly. This post summarizes the current state:
which types of content are clearly fair use, which are contested, and what opt-out mechanisms
(beyond robots.txt) are gaining traction. We also analyze compliance rates for Applebot, GPTBot,
CCBot, and Google-Extended across 50,000 sites with explicit AI training opt-outs.
January 12, 2025 · 18 min read · Policy Team · 9,100 views
Research
The shift toward synthetic data in LLM training — pioneered by Phi-3 and carried
forward by Apple Intelligence's AFM models — raises a fundamental question: can models trained
primarily on synthetic data match those trained on raw web text at scale? We synthesize results from
14 published studies and 3 unreleased internal reports to build a comprehensive picture. The answer
is nuanced: synthetic data wins on reasoning and instruction following, web data wins on factual
diversity and long-tail knowledge.
December 20, 2024 · 14 min read · AI Megacity Research Team · 5,500 views