Unsupervised Topic Models are Data Mixers for Pre-training Language Models Paper • 2502.16802 • Published Feb 24, 2025
AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser Paper • 2511.16397 • Published Nov 2025 • 7
Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining Paper • 2410.08102 • Published Oct 10, 2024 • 21