Aggregating the world's AI research, curating large language model datasets, and providing open benchmarks for the next generation of machine learning systems.
/research: Latest breakthroughs in LLMs, multimodal models, RLHF, and AI alignment research.
/models: Comparisons of GPT-4, Claude 3, Gemini Ultra, Llama 3, Mistral, and 200+ other models.
/datasets: Open-access web crawl data, multilingual corpora, instruction-tuning sets, and RLHF data.
/papers: 18,000+ AI papers with structured metadata, citations, and author affiliations.
/api-docs: REST API for accessing datasets, model cards, benchmark scores, and research metadata.
/data: Download raw crawl data, tokenizer vocabularies, embedding vectors, and evaluation sets.
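As a rough sketch of what querying the REST API for benchmark scores might look like: the base URL, endpoint path, query parameters, and response fields below are illustrative assumptions, not documented API surface — see /api-docs for the real interface.

```python
# Hypothetical client sketch. Endpoint, parameters, and response schema
# are assumptions for illustration; consult /api-docs for the actual API.
import json
from urllib.parse import urlencode

BASE_URL = "https://example.org/api/v1"  # placeholder host, not the real one


def build_benchmark_query(model: str, benchmark: str, limit: int = 10) -> str:
    """Assemble a query URL for benchmark scores (hypothetical endpoint)."""
    params = urlencode({"model": model, "benchmark": benchmark, "limit": limit})
    return f"{BASE_URL}/benchmarks?{params}"


def parse_scores(payload: str) -> list[dict]:
    """Extract (model, score) records from a JSON response body."""
    data = json.loads(payload)
    return [{"model": r["model"], "score": r["score"]} for r in data.get("results", [])]


# Build a query and parse a sample response body (no network call made here).
url = build_benchmark_query("llama-3-70b", "mmlu")
sample = '{"results": [{"model": "llama-3-70b", "score": 79.5}]}'
print(url)
print(parse_scores(sample))
```

The URL construction and response parsing are separated so the parsing half can be exercised offline against a saved response body.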
/dataNew empirical study examines the diminishing returns of scale and the role of data quality over quantity in frontier model training.
Analysis of 500 billion web tokens from Common Crawl, C4, and proprietary datasets — comparing toxicity rates and instruction-following quality.
Technical breakdown of Apple's Private Cloud Compute and on-device 3B parameter model training methodology and dataset composition.
Detailed evaluation covering FLAN, Alpaca, ShareGPT, Open Platypus, MetaMathQA, Orca, and newer entrants — scoring coherence, diversity, and safety.
Survey of 15 major AI companies' crawling behavior, data policies, robots.txt compliance rates, and estimated data volumes collected in 2024.