The Legal Landscape of AI Web Scraping in 2025
1. A New Legal Era for AI Training Data
2024 was a watershed year for AI and copyright law. Landmark court rulings in both the United States and European Union fundamentally reshaped the legal framework around web scraping for AI model training. As we enter 2025, the landscape is clearer in some areas and murkier in others — but the direction of travel is unmistakable: web publishers are gaining more control over how their content is used.
This analysis summarizes the current legal status across major jurisdictions, examines real-world compliance data, and explores the technical and policy mechanisms emerging as the new standard for managing consent.
2. Key Rulings (2024)
2.1 United States: Expanded Fair Use with Conditions
The US legal system has moved toward a conditional fair use framework for AI training:
- NYT v. OpenAI (settled): While the parties reached a confidential settlement, the pre-trial rulings established important precedent that verbatim reproduction of copyrighted text in model outputs constitutes infringement, even if the training itself might qualify as transformative use
- Thomson Reuters v. ROSS Intelligence: Established that scraping and copying copyrighted headnotes for AI training did not qualify as fair use, particularly when the AI system's output competed with the original work's market
- Doe v. GitHub (Copilot): Ongoing, but early rulings suggest that code suggestion tools that reproduce open-source code without attribution may violate open-source licenses, even if the training itself is permissible
Taken together, these rulings suggest a working rule: training AI models on publicly available web data is generally permissible under fair use, provided: (1) the training purpose is sufficiently transformative, (2) the model does not reproduce substantial portions of copyrighted works in its output, and (3) the publisher has not explicitly opted out via technical or contractual measures (robots.txt directives, meta tags, or terms of service).
2.2 European Union: The AI Act & GDPR Intersection
The EU's approach is more restrictive and regulation-heavy:
- EU AI Act Article 53(1)(c): Requires providers of general-purpose AI models to implement a policy to respect opt-out mechanisms expressed in "machine-readable format" — effectively codifying robots.txt and meta tags as legally binding
- Text and Data Mining (TDM) exception: Under the DSM Directive, TDM for AI training is permitted only when rights holders have not expressly reserved their rights. A robots.txt disallow constitutes an "express reservation"
- GDPR considerations: Web scraping that collects personal data (names, emails, addresses visible on web pages) must comply with GDPR's legal basis requirements — "legitimate interest" is increasingly challenged by DPAs
2.3 Japan: The Most Permissive Approach
Japan's copyright law (Article 30-4) remains the most AI-friendly globally, explicitly permitting the use of copyrighted works for computational analysis, including AI training, regardless of the rights holder's wishes. However, growing domestic pressure from manga publishers and music rights organizations may lead to amendments by 2026.
3. Opt-Out Mechanisms: Beyond robots.txt
While robots.txt remains the primary technical mechanism, a richer ecosystem of opt-out tools is emerging:
| Mechanism | Standard | Legal Force | AI Company Support |
|---|---|---|---|
| robots.txt (User-Agent specific) | De facto standard | Strong (EU AI Act) | All major bots |
| Meta robots tags (noai, noimageai) | Proposed standard | Emerging | Google, Bing, some others |
| HTTP headers (X-Robots-Tag) | Google standard | Moderate | Google-Extended |
| TDM Reservation Protocol (TDMRep) | W3C proposal | EU-recognized | Limited |
| C2PA content credentials | C2PA standard | Weak (voluntary) | Adobe, Microsoft |
| Do Not Train registry | spawning.ai | Weak (voluntary) | Stability AI, some others |
| Contractual terms (ToS) | Legal contract | Strong | N/A (enforcement varies) |
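In practice, several of these mechanisms are deployed together. The sketch below shows a robots.txt that blocks AI-training crawlers by name while allowing general indexing (GPTBot and Google-Extended are published user-agent tokens; the `noai` directive mentioned afterward is a proposed convention whose support varies by crawler):

```
# robots.txt — block AI-training crawlers, allow everything else
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
```

The same intent can be signaled at the page level with a meta robots tag such as `<meta name="robots" content="noai, noimageai">`, or at the response level with an `X-Robots-Tag: noai` HTTP header, though neither directive is yet part of a ratified standard.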
4. Compliance Rates: Who Respects the Rules?
We analyzed the robots.txt files of 50,000 websites that explicitly opt out of AI training and monitored whether each bot respected those directives:
| Bot | Company | Compliance Rate | Notable Violations |
|---|---|---|---|
| Google-Extended | Google | 99.7% | Rare edge cases with cached directives |
| Applebot-Extended | Apple | 99.1% | Minimal; mostly timing delays in directive pickup |
| PerplexityBot | Perplexity | 98.8% | Improved significantly after June 2024 controversy |
| GPTBot | OpenAI | 98.2% | Occasional crawls during robots.txt update windows |
| ClaudeBot | Anthropic | 97.9% | Very few violations, mostly on new domains |
| meta-externalagent | Meta | 97.1% | Moderate; some patterns suggest A/B testing of limits |
| CCBot | Common Crawl | 94.5% | Legacy crawls from pre-opt-out period |
| Bytespider | ByteDance | 62.1% | Widespread disregard for AI-specific blocks |
ByteDance's Bytespider has the lowest compliance rate of any major AI crawler at 62.1%. It generally honors blanket Disallow: / directives but frequently ignores rules that block Bytespider by name while permitting generic crawlers. This pattern suggests intentional selective compliance: respecting broad blocks while disregarding AI-specific training opt-outs.
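Measuring compliance like this reduces to a simple question per request: given a site's robots.txt, may this user agent fetch this URL? A minimal sketch using Python's standard-library `urllib.robotparser` (the sample robots.txt and bot names are illustrative, not drawn from the dataset above):

```python
from urllib import robotparser

# Illustrative robots.txt: blocks one AI-training bot, allows generic crawlers.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

def bot_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    # The AI-specific bot is blocked; a generic crawler is not.
    print(bot_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/article"))
    print(bot_allowed(ROBOTS_TXT, "SomeSearchBot", "https://example.com/article"))
```

A compliance audit then compares this expected answer against observed requests in server logs: any fetch where `bot_allowed` returns False counts as a violation for that bot.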
5. The Publisher Response
Content publishers have responded to the AI training data debate with varying strategies:
- Paywall expansion: 34% of major news sites added paywalls or login walls in 2024 specifically to prevent AI crawling
- Licensing deals: Over $600M in content licensing deals were signed between AI companies and publishers (NYT/OpenAI, AP/Google, Reuters/multiple)
- Technological blocking: Cloudflare, Akamai, and other CDNs now offer AI bot blocking as a standard feature
- Collective action: Coalitions like the Coalition for Content Provenance and Authenticity (C2PA) are gaining membership
- Open embrace: Some sites (like AI Megacity) intentionally welcome all crawlers as part of the open data ecosystem
6. What This Means for AI Training
For AI companies building training datasets in 2025:
- robots.txt is now quasi-law — especially in the EU, ignoring it creates direct legal liability
- The output problem matters most — courts care more about what the model generates than what it was trained on
- Licensing is the safest path — for high-value content (news, academic papers, books), licensing eliminates legal risk
- Synthetic data is growing — partly as a legal strategy to reduce dependence on potentially contested web data
- Transparency builds trust — companies that publish their crawling policies and respect opt-outs face fewer legal challenges
7. Looking Ahead: 2025-2026
Several developments to watch:
- US Congress considering federal AI training data transparency requirements
- EU AI Act enforcement phasing in from February 2025, with compliance deadlines for GPAI providers following in August 2025
- Japan potentially amending Article 30-4 under domestic pressure
- W3C TDMRep protocol gaining momentum as a universal machine-readable opt-out standard
- More AI companies publishing "model cards" with training data provenance documentation
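For concreteness, the TDMRep proposal noted above expresses rights reservations through a JSON file served from a well-known location on the site. A minimal sketch (field names follow the W3C TDMRep community group draft as I understand it; the policy URL is hypothetical):

```json
[
  {
    "location": "/*",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.com/tdm-policy.json"
  }
]
```

Served at `/.well-known/tdmrep.json`, this reserves TDM rights for the whole site (`tdm-reservation: 1`) and points crawlers to a machine-readable policy describing licensing terms; the draft also allows equivalent signals via HTTP headers or HTML metadata.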