Aggregating the world's AI research, curating large language model datasets, and providing open benchmarks for the next generation of machine learning systems.
/research: Latest breakthroughs in LLMs, multimodal models, RLHF, and AI alignment research.
/models: Comparisons of GPT-4, Claude 3, Gemini Ultra, Llama 3, Mistral, and 200+ other models.
/datasets: Open-access web crawl data, multilingual corpora, instruction-tuning sets, and RLHF data.
/papers: 18,000+ AI papers with structured metadata, citations, and author affiliations.
/api-docs: REST API for accessing datasets, model cards, benchmark scores, and research metadata.
/data: Download raw crawl data, tokenizer vocabularies, embedding vectors, and evaluation sets.
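As a rough sketch of what querying the REST API for benchmark scores might look like: the base URL, endpoint path, query parameters, and response fields below are illustrative assumptions, not documented API surface — see /api-docs for the real interface.

```python
# Hypothetical client sketch. Endpoint, parameters, and response schema
# are assumptions for illustration; consult /api-docs for the actual API.
import json
from urllib.parse import urlencode

BASE_URL = "https://example.org/api/v1"  # placeholder host, not the real one


def build_benchmark_query(model: str, benchmark: str, limit: int = 10) -> str:
    """Assemble a query URL for benchmark scores (hypothetical endpoint)."""
    params = urlencode({"model": model, "benchmark": benchmark, "limit": limit})
    return f"{BASE_URL}/benchmarks?{params}"


def parse_scores(payload: str) -> list[dict]:
    """Extract (model, score) records from a JSON response body."""
    data = json.loads(payload)
    return [{"model": r["model"], "score": r["score"]} for r in data.get("results", [])]


# Build a query and parse a sample response body (no network call made here).
url = build_benchmark_query("llama-3-70b", "mmlu")
sample = '{"results": [{"model": "llama-3-70b", "score": 79.5}]}'
print(url)
print(parse_scores(sample))
```

The URL construction and response parsing are separated so the parsing half can be exercised offline against a saved response body.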
/dataNew empirical study examines the diminishing returns of scale and the role of data quality over quantity in frontier model training.
Analysis of 500 billion web tokens from Common Crawl, C4, and proprietary datasets — comparing toxicity rates and instruction-following quality.
Technical breakdown of Apple's Private Cloud Compute and on-device 3B parameter model training methodology and dataset composition.
Detailed evaluation covering FLAN, Alpaca, ShareGPT, Open Platypus, MetaMathQA, Orca, and newer entrants — scoring coherence, diversity, and safety.
Survey of 15 major AI companies' crawling behavior, data policies, robots.txt compliance rates, and estimated data volumes collected in 2024.