WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking

Published in arXiv preprint, 2026

Existing large language model (LLM) evaluations use fixed-difficulty benchmarks that cannot adapt as models improve, and they rarely isolate specific cognitive processes. We introduce Working Memory Fidelity-Active Manipulation (WMF-AM), a probe of cumulative state tracking: the ability to maintain and update intermediate results across K sequential operations within a single query, without a scratchpad. Unlike multi-step agent benchmarks that stress task orchestration, WMF-AM isolates within-pass cumulative load by parameterizing the depth K.
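To make the idea of a depth-parameterized probe concrete, here is a minimal sketch of how one item might be generated. It is an illustration only, not the authors' implementation: the operation set, the prompt wording, and the names `OPS`, `make_probe`, and `score` are all assumptions, and the actual task format in the WMF-AM repository may differ.

```python
import random

# Hypothetical operation set; the paper's actual operations may differ.
OPS = {
    "add": lambda state, n: state + n,
    "subtract": lambda state, n: state - n,
    "multiply by": lambda state, n: state * n,
}


def make_probe(depth_k, seed=0):
    """Build one depth-K item: a prompt and its ground-truth final state."""
    rng = random.Random(seed)
    state = rng.randint(1, 9)
    steps = [f"Start with {state}."]
    for _ in range(depth_k):
        name = rng.choice(sorted(OPS))   # pick one operation per step
        operand = rng.randint(2, 5)
        state = OPS[name](state, operand)  # update the running state
        steps.append(f"Then {name} {operand}.")
    # The model must track the state internally; no scratchpad is requested.
    prompt = " ".join(steps) + " Reply with the final value only."
    return prompt, state


def score(model_answer, expected):
    """Exact-match scoring of the model's final answer (an assumed metric)."""
    return model_answer.strip() == str(expected)
```

Raising `depth_k` lengthens the chain of updates while each individual step stays trivial, which is the sense in which difficulty comes from cumulative load rather than from any single operation.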

Across 20 open-weight models from 13 families, ranging from 0.5B to 35B parameters, our probe predicts agent performance with a correlation of 0.612 (p < 0.001). Our findings suggest that cumulative state tracking under load, rather than single-step arithmetic or entity tracking alone, is the primary source of difficulty. Ablation experiments support this interpretation, and our calibration approach maintains discriminative power where prior benchmarks plateau.

Code: github.com/dengzhe-hou/WMF-AM

Recommended citation: Hou, D., Jiang, L., Li, D., Li, Z., Lin, F., & Yamada, K. D. (2026). WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking. arXiv preprint arXiv:2603.27343.