Weekly analysis of AI model releases, training data, and web crawling patterns.
Web Crawling
Between Q3 2023 and Q4 2024, Applebot traffic across a monitored network of 1,200
domains increased by 840%. This post analyzes the crawl patterns, most-targeted content types, and
what Apple's AI roadmap suggests about their data collection strategy. We find that Applebot
disproportionately targets scientific text, product descriptions, and multilingual content —
consistent with training data for Apple Intelligence's on-device and server-side models. Apple's
robots.txt compliance rate is 99.1%, the highest among major AI crawlers.
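Compliance rates like these can be estimated by replaying a crawler's requested paths against each site's robots.txt. A minimal sketch using Python's standard-library `robotparser` (the robots.txt body and log entries here are made-up illustrations, not data from our monitored network):

```python
from urllib import robotparser

# Hypothetical robots.txt for an example site, not a real policy.
ROBOTS_TXT = """\
User-agent: Applebot
Disallow: /private/
Disallow: /drafts/
"""

def compliance_rate(agent, robots_txt, requested_paths, host="https://example.com"):
    """Fraction of a crawler's requests that robots.txt permits."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    allowed = sum(rp.can_fetch(agent, host + p) for p in requested_paths)
    return allowed / len(requested_paths)

# Toy access log: one of four requests hits a disallowed path.
log = ["/index.html", "/products/a", "/private/x", "/blog/post-1"]
print(f"{compliance_rate('Applebot', ROBOTS_TXT, log):.1%}")  # 75.0%
```

In practice the rate is computed per site against the robots.txt that was live at request time, then aggregated across the network.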
March 1, 2025 · 12 min read · AI Megacity Research Team · 3,400 views
Crawlers
OpenAI's GPTBot and Anthropic's ClaudeBot have distinct crawling philosophies.
GPTBot crawls broadly but shallowly, prioritizing coverage of many pages over depth on any one site, while ClaudeBot tends to re-crawl high-quality pages more frequently. Our analysis of six months of honeypot data reveals that GPTBot is 4.3× more likely to crawl API documentation pages, whereas ClaudeBot shows a strong preference for long-form articles over 2,000 words. Both respect robots.txt in over 97% of observed cases.
February 22, 2025 · 9 min read · AI Megacity Research Team · 2,100 views
Datasets
Common Crawl remains the backbone of most LLM pretraining pipelines. We ran the
CC-2024-18 snapshot through a quality filtering pipeline and found: 31% of raw text is
near-duplicate, 12% contains significant toxic content, and the median document quality score has
improved 18% year-over-year due to spam site attrition. We release our filtering scripts and quality
scores as an open dataset for the community.
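The core idea behind near-duplicate detection can be sketched with word shingles and Jaccard similarity. This brute-force version is illustrative only (the threshold is an assumption, and a production pipeline would typically use MinHash/LSH rather than pairwise comparison at snapshot scale):

```python
def shingles(text, k=5):
    """Set of overlapping k-word shingles from a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def is_near_duplicate(doc_a, doc_b, threshold=0.8):
    """Flag document pairs whose shingle overlap exceeds the threshold."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold
```

Two documents that share most of their 5-word sequences score near 1.0 and are flagged; unrelated text shares no shingles and scores 0.0.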
February 10, 2025 · 15 min read · Data Quality Team · 4,800 views
Models
DeepSeek-V3's technical report reveals training on 14.8T tokens across a diverse
corpus. We break down the estimated composition: web data (~60%), code (~15%), books (~10%), math
(~8%), scientific papers (~5%), and other (~2%). Cross-referencing with their reported benchmark
scores, we analyze how this unconventional mix contributed to their state-of-the-art performance on
coding and math benchmarks while matching frontier models on general text tasks.
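The estimated mix above translates into absolute token counts as follows. The shares are our estimates, not official figures from the report; only the 14.8T total is reported:

```python
TOTAL_TOKENS = 14.8e12  # DeepSeek-V3's reported pretraining token count

# Estimated corpus shares (our breakdown, not from the technical report).
mix = {"web": 0.60, "code": 0.15, "books": 0.10,
       "math": 0.08, "scientific": 0.05, "other": 0.02}

assert abs(sum(mix.values()) - 1.0) < 1e-9  # shares should cover the corpus

for source, share in mix.items():
    print(f"{source:10s} ~{share * TOTAL_TOKENS / 1e12:.2f}T tokens")
```

Under these assumptions, web data alone contributes roughly 8.9T tokens and code about 2.2T.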
January 28, 2025 · 11 min read · ML Systems Lab · 6,200 views
AI Policy
Following landmark court rulings in the US and EU in late 2024, the legal status of
web scraping for AI training has shifted significantly. This post summarizes the current state:
which types of content are clearly fair use, which are contested, and what opt-out mechanisms
(beyond robots.txt) are gaining traction. We also analyze compliance rates for Applebot, GPTBot,
CCBot, and Google-Extended across 50,000 sites with explicit AI training opt-outs.
January 12, 2025 · 18 min read · Policy Team · 9,100 views
Research
The shift toward synthetic data in LLM training — pioneered by Phi-3 and carried
forward by Apple Intelligence's AFM models — raises a fundamental question: can models trained
primarily on synthetic data match those trained on raw web text at scale? We synthesize results from
14 published studies and 3 unreleased internal reports to build a comprehensive picture. The answer
is nuanced: synthetic data wins on reasoning and instruction following, web data wins on factual
diversity and long-tail knowledge.
December 20, 2024 · 14 min read · AI Megacity Research Team · 5,500 views