Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks Paper • 2511.15065 • Published 25 days ago • 74
TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models Paper • 2511.13704 • Published 27 days ago • 42
P1: Mastering Physics Olympiads with Reinforcement Learning Paper • 2511.13612 • Published 27 days ago • 132
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm Paper • 2511.04570 • Published Nov 6 • 208
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data Paper • 2509.15221 • Published Sep 18 • 111
Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Delibration Paper • 2509.14760 • Published Sep 18 • 53
HiPhO: How Far Are (M)LLMs from Humans in the Latest High School Physics Olympiad Benchmark? Paper • 2509.07894 • Published Sep 9 • 31
CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics Paper • 2508.18124 • Published Aug 25 • 48