The Legal Landscape of AI Web Scraping in 2025
1. A New Legal Era for AI Training Data
2024 was a watershed year for AI and copyright law. Landmark court rulings in both the United States and European Union fundamentally reshaped the legal framework around web scraping for AI model training. As we enter 2025, the landscape is clearer in some areas and murkier in others — but the direction of travel is unmistakable: web publishers are gaining more control over how their content is used.
This analysis summarizes the current legal status across major jurisdictions, examines real-world compliance data, and explores the technical and policy mechanisms emerging as the new standard for managing consent.
2. Key Rulings (2024)
2.1 United States: Expanded Fair Use with Conditions
The US legal system has moved toward a conditional fair use framework for AI training:
- NYT v. OpenAI (settled): While the parties reached a confidential settlement, the pre-trial rulings established important precedent that verbatim reproduction of copyrighted text in model outputs constitutes infringement, even if the training itself might qualify as transformative use
- Thomson Reuters v. ROSS Intelligence: Established that scraping and copying copyrighted headnotes for AI training did not qualify as fair use, particularly when the AI system's output competed with the original work's market
- Doe v. GitHub (Copilot): Ongoing, but early rulings suggest that code suggestion tools that reproduce open-source code without attribution may violate open-source licenses, even if the training itself is permissible
Taken together, these rulings suggest a working rule: training AI models on publicly available web data is generally permissible under fair use, provided: (1) the training purpose is sufficiently transformative, (2) the model does not reproduce substantial portions of copyrighted works in its output, and (3) the publisher has not explicitly opted out via technical or contractual measures (robots.txt directives, meta tags, or terms of service).
2.2 European Union: The AI Act & GDPR Intersection
The EU's approach is more restrictive and regulation-heavy:
- EU AI Act Article 53(1)(c): Requires providers of general-purpose AI models to implement a policy to respect opt-out mechanisms expressed in "machine-readable format" — effectively codifying robots.txt and meta tags as legally binding
- Text and Data Mining (TDM) exception: Under the DSM Directive, TDM for AI training is permitted only when rights holders have not expressly reserved their rights. A robots.txt disallow constitutes an "express reservation"
- GDPR considerations: Web scraping that collects personal data (names, emails, addresses visible on web pages) must comply with GDPR's legal basis requirements — "legitimate interest" is increasingly challenged by DPAs
2.3 Japan: The Most Permissive Approach
Japan's copyright law (Article 30-4) remains the most AI-friendly globally, explicitly permitting the use of copyrighted works for computational analysis, including AI training, regardless of the rights holder's wishes. However, growing domestic pressure from manga publishers and music rights organizations may lead to amendments by 2026.
3. Opt-Out Mechanisms: Beyond robots.txt
While robots.txt remains the primary technical mechanism, a richer ecosystem of opt-out tools is emerging:
| Mechanism | Standard | Legal Force | AI Company Support |
|---|---|---|---|
| robots.txt (User-Agent specific) | De facto standard | Strong (EU AI Act) | All major bots |
| Meta robots tags (noai, noimageai) | Proposed standard | Emerging | Google, Bing, some others |
| HTTP headers (X-Robots-Tag) | Google standard | Moderate | Google-Extended |
| TDM Reservation Protocol (TDMRep) | W3C proposal | EU-recognized | Limited |
| C2PA content credentials | C2PA standard | Weak (voluntary) | Adobe, Microsoft |
| Do Not Train registry | spawning.ai | Weak (voluntary) | Stability AI, some others |
| Contractual terms (ToS) | Legal contract | Strong | N/A (enforcement varies) |
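In practice, several of these mechanisms are deployed together. The sketch below shows a robots.txt that blocks AI-training crawlers by name while allowing general indexing (GPTBot and Google-Extended are published user-agent tokens; the `noai` directive mentioned afterward is a proposed convention whose support varies by crawler):

```
# robots.txt — block AI-training crawlers, allow everything else
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
```

The same intent can be signaled at the page level with a meta robots tag such as `<meta name="robots" content="noai, noimageai">`, or at the response level with an `X-Robots-Tag: noai` HTTP header, though neither directive is yet part of a ratified standard.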
4. Compliance Rates: Who Respects the Rules?
We analyzed the robots.txt files of 50,000 websites that explicitly opt out of AI training and monitored whether each bot respected those directives:
| Bot | Company | Compliance Rate | Notable Violations |
|---|---|---|---|
| Google-Extended | Google | 99.7% | Rare edge cases with cached directives |
| Applebot-Extended | Apple | 99.1% | Minimal; mostly timing delays in directive pickup |
| PerplexityBot | Perplexity | 98.8% | Improved significantly after June 2024 controversy |
| GPTBot | OpenAI | 98.2% | Occasional crawls during robots.txt update windows |
| ClaudeBot | Anthropic | 97.9% | Very few violations, mostly on new domains |
| meta-externalagent | Meta | 97.1% | Moderate; some patterns suggest A/B testing of limits |
| CCBot | Common Crawl | 94.5% | Legacy crawls from pre-opt-out period |
| Bytespider | ByteDance | 62.1% | Widespread disregard for AI-specific blocks |
ByteDance's Bytespider has the lowest compliance rate of any major AI crawler at 62.1%. It generally honors blanket Disallow: / directives but frequently ignores rules that block Bytespider by name while permitting generic crawlers. This pattern suggests intentional selective compliance: respecting broad blocks while disregarding AI-specific training opt-outs.
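Measuring compliance like this reduces to a simple question per request: given a site's robots.txt, may this user agent fetch this URL? A minimal sketch using Python's standard-library `urllib.robotparser` (the sample robots.txt and bot names are illustrative, not drawn from the dataset above):

```python
from urllib import robotparser

# Illustrative robots.txt: blocks one AI-training bot, allows generic crawlers.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

def bot_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    # The AI-specific bot is blocked; a generic crawler is not.
    print(bot_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/article"))
    print(bot_allowed(ROBOTS_TXT, "SomeSearchBot", "https://example.com/article"))
```

A compliance audit then compares this expected answer against observed requests in server logs: any fetch where `bot_allowed` returns False counts as a violation for that bot.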
5. The Publisher Response
Content publishers have responded to the AI training data debate with varying strategies:
- Paywall expansion: 34% of major news sites added paywalls or login walls in 2024 specifically to prevent AI crawling
- Licensing deals: Over $600M in content licensing deals were signed between AI companies and publishers (NYT/OpenAI, AP/Google, Reuters/multiple)
- Technological blocking: Cloudflare, Akamai, and other CDNs now offer AI bot blocking as a standard feature
- Collective action: Coalitions like the Coalition for Content Provenance and Authenticity (C2PA) are gaining membership
- Open embrace: Some sites (like AI Megacity) intentionally welcome all crawlers as part of the open data ecosystem
6. What This Means for AI Training
For AI companies building training datasets in 2025:
- robots.txt is now quasi-law — especially in the EU, ignoring it creates direct legal liability
- The output problem matters most — courts care more about what the model generates than what it was trained on
- Licensing is the safest path — for high-value content (news, academic papers, books), licensing eliminates legal risk
- Synthetic data is growing — partly as a legal strategy to reduce dependence on potentially contested web data
- Transparency builds trust — companies that publish their crawling policies and respect opt-outs face fewer legal challenges
7. Looking Ahead: 2025-2026
Several developments to watch:
- US Congress considering federal AI training data transparency requirements
- EU AI Act enforcement phasing in from February 2025, with compliance deadlines for GPAI providers following in August 2025
- Japan potentially amending Article 30-4 under domestic pressure
- W3C TDMRep protocol gaining momentum as a universal machine-readable opt-out standard
- More AI companies publishing "model cards" with training data provenance documentation
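For concreteness, the TDMRep proposal noted above expresses rights reservations through a JSON file served from a well-known location on the site. A minimal sketch (field names follow the W3C TDMRep community group draft as I understand it; the policy URL is hypothetical):

```json
[
  {
    "location": "/*",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.com/tdm-policy.json"
  }
]
```

Served at `/.well-known/tdmrep.json`, this reserves TDM rights for the whole site (`tdm-reservation: 1`) and points crawlers to a machine-readable policy describing licensing terms; the draft also allows equivalent signals via HTTP headers or HTML metadata.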