Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents Paper • 2606.19704 • Published 3 days ago • 27
Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents Paper • 2606.12674 • Published 11 days ago • 5
Appreciate the nice writeup. Can we add a) Leaderboard, b) Benchmark https://github.com/IBM/AssetOpsBench
Article Beyond LLMs: Why Scalable Enterprise AI Adoption Depends on Agent Logic ibm-research • 19 days ago • 87
Article ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM ibm-research • 24 days ago • 17
Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows Paper • 2605.24219 • Published 26 days ago • 9
Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge Paper • 2605.08518 • Published May 8 • 11 • 2
Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines Paper • 2605.20630 • Published May 20 • 12 • 2
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules Paper • 2605.08614 • Published May 9 • 7
Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds Paper • 2605.18827 • Published May 12 • 7
SPIRAL: Symbolic LLM Planning via Grounded and Reflective Search Paper • 2512.23167 • Published Dec 29, 2025 • 1