MARS: Enabling Autoregressive Models Multi-Token Generation Paper • 2604.07023 • Published Apr 8 • 38
MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome Paper • 2603.28407 • Published Mar 30 • 70
From Perception to Action: An Interactive Benchmark for Vision Reasoning Paper • 2602.21015 • Published Feb 24 • 24
DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation Paper • 2601.09688 • Published Jan 14 • 127
Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning Paper • 2506.04755 • Published Jun 5, 2025 • 37
What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models Paper • 2503.24235 • Published Mar 31, 2025 • 55
MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs Paper • 2410.04698 • Published Oct 7, 2024 • 13