Controlled study: AI operational experience improves performance by 1.07 SD (open data + code)

#115
by Rushnur - opened

Hi everyone,

We just published a controlled experiment measuring the effect of accumulated operational experience on AI assistant performance.

Quick summary:

  • An AI assistant (ARIA) that has been operating for months, accumulating experience fragments and operational memory, was compared against the same base model (Claude Opus 4.6) without that accumulated experience
  • 50 real-world questions, 1,200 blind judgments from 3 independent judges
  • Result: Cohen's d = 1.07, Friedman p < 10^-25
  • The effect is domain-specific — strong on operational tasks, near zero on algorithmic controls
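For readers less familiar with the two statistics quoted above, here is a minimal sketch of how an effect size and a Friedman statistic can be computed from per-question scores. It is not the study's analysis code: the scores are made up for illustration, the pooled-SD form of Cohen's d is an assumption (the post does not say which variant was used), and the function names are hypothetical. The Friedman statistic here omits the tie correction and the p-value.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(treated, control):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    n1, n2 = len(treated), len(control)
    s1, s2 = stdev(treated), stdev(control)
    pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(treated) - mean(control)) / pooled


def average_ranks(values):
    """Rank values ascending from 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def friedman_statistic(rows):
    """Friedman chi-square statistic, without tie correction.

    rows: one list per question, holding one score per condition.
    """
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Hypothetical judge scores, NOT data from the study.
with_experience = [8, 9, 7, 8, 9, 8]
without_experience = [6, 7, 7, 6, 8, 6]

print("Cohen's d:", cohens_d(with_experience, without_experience))
print("Friedman statistic:",
      friedman_statistic([list(pair) for pair in zip(with_experience, without_experience)]))
```

A d near 1 means the two score distributions are separated by about one pooled standard deviation, which is how the "1.07 SD" in the title should be read.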

This builds on work by ExpeL, MemGPT, Generative Agents, and Reflexion — but measures experience effects in a production system rather than a sandbox.

Everything is open:

Would love feedback from this community. I'm also seeking an arXiv cs.AI endorser, if anyone here is qualified. The endorsement code is MJLELZ:

https://arxiv.org/auth/endorse?x=MJLELZ

Thanks!
Ravshan Nuraliev, PaTech Labs
