Blog

Weekly analysis of AI model releases, training data, and web crawling patterns.

Apple's Applebot Surge: What Is Apple Crawling and Why?

Between Q3 2023 and Q4 2024, Applebot traffic across a monitored network of 1,200 domains increased by 840%. This post analyzes the crawl patterns, most-targeted content types, and what Apple's AI roadmap suggests about its data collection strategy. We find that Applebot disproportionately targets scientific text, product descriptions, and multilingual content — consistent with training data for Apple Intelligence's on-device and server-side models. Apple's robots.txt compliance rate is 99.1%, the highest among major AI crawlers.
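For site owners weighing a response, Apple documents a separate user-agent token, Applebot-Extended, that controls whether crawled content may be used for AI training without affecting Applebot's search indexing. A minimal robots.txt expressing that split might look like this (the blanket Disallow is illustrative; scope it to taste):

```
# Allow Applebot to index the site for Apple search features
User-agent: Applebot
Allow: /

# Opt the whole site out of use for Apple AI training
User-agent: Applebot-Extended
Disallow: /
```

Because Applebot-Extended governs data use rather than crawling itself, sites can opt out of training without losing visibility in Apple's search surfaces.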

GPTBot vs ClaudeBot: How OpenAI and Anthropic Crawl the Web Differently

OpenAI's GPTBot and Anthropic's ClaudeBot follow distinct crawling philosophies. GPTBot prioritizes breadth, touching many domains shallowly, while ClaudeBot re-crawls high-quality pages more frequently. Our analysis of six months of honeypot data reveals that GPTBot is 4.3× more likely to crawl API documentation pages, whereas ClaudeBot shows a strong preference for long-form articles over 2,000 words. Both respect robots.txt in over 97% of observed cases.

Common Crawl 2024: Quality Analysis of 3.4 Petabytes of Web Data

Common Crawl remains the backbone of most LLM pretraining pipelines. We ran the CC-2024-18 snapshot through a quality filtering pipeline and found: 31% of raw text is near-duplicate, 12% contains significant toxic content, and the median document quality score has improved 18% year-over-year due to spam site attrition. We release our filtering scripts and quality scores as an open dataset for the community.
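The near-duplicate figure comes from shingle-based similarity, which can be sketched in a few lines (the shingle size and threshold here are illustrative defaults, not the exact parameters of our pipeline):

```python
def shingles(text: str, k: int = 5) -> set[tuple[str, ...]]:
    """k-word shingles of a whitespace-normalized, lowercased document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(0, len(words) - k + 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicate(doc1: str, doc2: str, threshold: float = 0.8) -> bool:
    """Flag document pairs whose shingle overlap exceeds the threshold."""
    return jaccard(shingles(doc1), shingles(doc2)) >= threshold
```

At corpus scale, exact pairwise Jaccard is infeasible; pipelines like ours approximate it with MinHash signatures and locality-sensitive hashing so candidate pairs can be found without comparing every pair of documents.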

DeepSeek-V3 Training Data: What 14.8 Trillion Tokens Looks Like

DeepSeek-V3's technical report reveals training on 14.8T tokens across a diverse corpus. We break down the estimated composition: web data (~60%), code (~15%), books (~10%), math (~8%), scientific papers (~5%), and other (~2%). Cross-referencing with their reported benchmark scores, we analyze how this unconventional mix contributed to their state-of-the-art performance on coding and math benchmarks while matching frontier models on general text tasks.
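The estimated shares translate into absolute token counts with straightforward arithmetic (the percentages are our estimates from this post, so the resulting counts inherit that uncertainty; only the 14.8T total comes from the technical report):

```python
TOTAL_TOKENS = 14.8e12  # 14.8T tokens, per the DeepSeek-V3 technical report

# Estimated corpus shares (fractions of the total) from our breakdown.
composition = {
    "web": 0.60,
    "code": 0.15,
    "books": 0.10,
    "math": 0.08,
    "scientific papers": 0.05,
    "other": 0.02,
}
assert abs(sum(composition.values()) - 1.0) < 1e-9  # shares must cover the corpus

token_counts = {k: share * TOTAL_TOKENS for k, share in composition.items()}
for category, tokens in token_counts.items():
    print(f"{category}: {tokens / 1e12:.2f}T tokens")
```

Under these assumptions, the code slice alone is roughly 2.2T tokens, which is larger than the entire pretraining corpus of many earlier open models and helps explain the coding-benchmark results.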

The Legal Landscape of AI Web Scraping in 2025

Following landmark court rulings in the US and EU in late 2024, the legal status of web scraping for AI training has shifted significantly. This post summarizes the current state: which types of content are clearly fair use, which are contested, and what opt-out mechanisms (beyond robots.txt) are gaining traction. We also analyze compliance rates for Applebot, GPTBot, CCBot, and Google-Extended across 50,000 sites with explicit AI training opt-outs.

Synthetic Data vs. Web Data: Which Makes Better LLMs?

The shift toward synthetic data in LLM training — pioneered by Phi-3 and carried forward by Apple Intelligence's AFM models — raises a fundamental question: can models trained primarily on synthetic data match those trained on raw web text at scale? We synthesize results from 14 published studies and 3 unreleased internal reports to build a comprehensive picture. The answer is nuanced: synthetic data wins on reasoning and instruction following, web data wins on factual diversity and long-tail knowledge.