Too Good to be Bad: On the Failure of LLMs to Role-Play Villains Paper • 2511.04962 • Published about 1 month ago • 52
WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents Paper • 2509.06501 • Published Sep 8 • 78
Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving Paper • 2507.06229 • Published Jul 8 • 75
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows Paper • 2505.19897 • Published May 26 • 104
Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models Paper • 2505.02847 • Published May 1 • 28
DeepMath Collection A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning • 5 items • Updated May 22 • 4
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks Paper • 2504.15521 • Published Apr 22 • 64
DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning Paper • 2504.11456 • Published Apr 15 • 12
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs Paper • 2412.21187 • Published Dec 30, 2024 • 40
Model Merging Collection Model merging is a very popular technique for LLMs nowadays. Here is a chronological list of papers in this space to help you get started! • 30 items • Updated Jun 12, 2024 • 248