Problem
Most benchmarks test retention, but real environments change - stored cues become stale and must be updated.
Paper / AAMAS 2026
In dynamic POMDP settings, it’s not enough to store information - an agent must rewrite memory: forget outdated information and integrate new evidence.
The paper isolates and measures an RL agent’s ability to rewrite memory under partial observability - not just retain it.
Two task families - Endless T‑Maze and Color‑Cubes - require continual replacement of outdated information.
Architectures with explicit, learnable forgetting (e.g., LSTM) are more robust at rewriting and generalize better.
Effective decision-making in the real world depends on memory that is both stable and adaptive: environments change over time, and agents must retain relevant information over long horizons while also updating or overwriting outdated content when circumstances shift.
Existing Reinforcement Learning (RL) benchmarks and memory-augmented agents focus primarily on retention, leaving the equally critical ability of memory rewriting largely unexplored.
To address this gap, we introduce a benchmark that explicitly tests continual memory updating under partial observability, and use it to compare recurrent, transformer-based, and structured memory architectures.
Our experiments reveal that classic recurrent models are more flexible and robust on memory rewriting tasks than either modern structured memories, which succeed only under narrow conditions, or transformer-based agents, which often fail beyond trivial retention cases.
These findings expose a fundamental limitation of current approaches and emphasize the necessity of memory mechanisms that balance stable retention with adaptive updating.
Two task families capture different rewriting regimes: from a minimal diagnostic setup to a stress test with selective updates and uncertainty.
Endless T‑Maze: an infinite chain of corridors and junctions. At the start of each corridor the agent receives a new binary cue (left/right) that completely invalidates the previous one.
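To make the mechanics concrete, here is a minimal sketch of such an environment in a gym-like style. This is our own illustration, not the paper's code: the class name, corridor length, reward scheme, and interface are all assumptions.

```python
import random

class EndlessTMaze:
    """Toy Endless T-Maze sketch: fixed-length corridors, junction at the end."""

    def __init__(self, corridor_len=5, seed=0):
        self.corridor_len = corridor_len
        self.rng = random.Random(seed)

    def reset(self):
        self.pos = 0
        self.cue = self.rng.choice([0, 1])   # 0 = left, 1 = right
        return self._obs(show_cue=True)

    def _obs(self, show_cue=False):
        # Partial observability: the cue is visible only at a corridor start.
        return {"cue": self.cue if show_cue else None, "pos": self.pos}

    def step(self, action):
        # Old-gym-style (obs, reward, done, info) tuple, for brevity.
        if self.pos < self.corridor_len:      # still walking the corridor
            self.pos += 1
            return self._obs(), 0.0, False, {}
        # At the junction: reward iff the action matches the *latest* cue.
        reward = 1.0 if action == self.cue else 0.0
        self.pos = 0
        self.cue = self.rng.choice([0, 1])    # fresh cue invalidates the old one
        return self._obs(show_cue=True), reward, False, {}  # endless: never done
```

The key property is that only the most recent cue is ever rewarded, so a pure retention strategy (remember the first cue forever) scores at chance.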
Color‑Cubes: a grid world with N uniquely colored cubes. The agent is given a target color and must find the matching cube; non-target cubes may teleport, and full state snapshots appear only after such events.
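The rewriting pressure here is selective: only some memory entries go stale at a time. A hypothetical sketch of the event-driven update structure (class name, grid size, and teleport probability are our assumptions, not the paper's):

```python
import random

class ColorCubes:
    """Toy Color-Cubes dynamics: snapshots arrive only after teleport events."""

    def __init__(self, n_cubes=4, grid_size=8, teleport_p=0.1, seed=0):
        self.n, self.grid, self.p = n_cubes, grid_size, teleport_p
        self.rng = random.Random(seed)

    def _rand_cell(self):
        return (self.rng.randrange(self.grid), self.rng.randrange(self.grid))

    def reset(self):
        self.positions = {c: self._rand_cell() for c in range(self.n)}
        self.target = self.rng.randrange(self.n)
        # One initial full snapshot; the agent must keep it current itself.
        return {"target": self.target, "snapshot": dict(self.positions)}

    def maybe_teleport(self):
        """Move some non-target cubes; emit a full snapshot only if anything moved."""
        moved = False
        for c in self.positions:
            if c != self.target and self.rng.random() < self.p:
                self.positions[c] = self._rand_cell()
                moved = True
        # Selective rewriting: stale entries must be overwritten, the rest kept.
        return dict(self.positions) if moved else None
```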
The paper groups results by four research questions (RQ1–RQ4).
In Endless T‑Maze with n = 1 (the classic T‑Maze), PPO‑LSTM, SHM, and FFM reach 1.00±0.00, while GTrXL performs at chance (~50%, i.e., MLP level).
In Trivial Color‑Cubes the picture flips: PPO‑LSTM reaches only 0.52±0.10, while FFM, GTrXL, and SHM all reach 1.00±0.00.
PPO‑LSTM is most robust when cues continually become obsolete. FFM and SHM succeed in predictable Fixed settings but lose flexibility under Uniform. GTrXL and MLP fail to produce stable results when rewriting is critical.
PPO‑LSTM interpolates across all lengths in non-trivial Uniform configurations and shows partially successful extrapolation to longer corridors. FFM generalizes within Fixed (both interpolation and extrapolation), SHM is mostly limited to interpolation; GTrXL sometimes interpolates better than SHM but is unstable.
In settings where the agent must not just “forget” but re-rank and maintain an up-to-date map (especially Extreme), all baselines show 0.00±0.00.
| Color‑Cubes (success rate) | PPO‑LSTM | GTrXL | SHM | FFM | MLP |
|---|---|---|---|---|---|
| Trivial | 0.52±0.10 | 1.00±0.00 | 1.00±0.00 | 1.00±0.00 | 0.00±0.00 |
| Medium | 0.00±0.00 | 0.00±0.00 | 0.00±0.00 | 0.00±0.00 | 0.00±0.00 |
| Extreme | 0.00±0.00 | 0.00±0.00 | 0.00±0.00 | 0.00±0.00 | 0.00±0.00 |
The paper’s hypothesis: PPO‑LSTM succeeds thanks to its gating mechanisms, especially the adaptive forget gate (in contrast to vanilla RNN and GRU baselines).
Observation: gates strongly improve rewriting success; LSTM’s advantage over GRU highlights the importance of adaptive forgetting.
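For reference, the standard LSTM cell equations, which decouple forgetting from writing; in a GRU a single update gate z_t couples the two, which is one plausible reading of why LSTM's separate forget gate helps here:

```latex
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \quad \text{(forget gate: how much old memory to erase)}
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \quad \text{(input gate: how much new evidence to write)}
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)
% GRU, by contrast, ties erase and write to a single gate:
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
```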
Retention is necessary but not sufficient. The core challenge is controlled transformation of memory over time: forgetting, rewriting, and reallocating representational capacity.
Per the paper, the overall ranking on retention, rewriting, and generalization is PPO‑LSTM → FFM → SHM → GTrXL → MLP.
Explicit forgetting mechanisms and their adaptivity (forget gates) correlate with rewriting competence.
Treat memory as an active belief state, not a passive history buffer.
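As a toy illustration of that framing (ours, not the paper's): a belief-state agent does not append observations to a buffer; it maintains a distribution over the latent variable and overwrites it when decisive new evidence arrives.

```python
def update_belief(belief, likelihood):
    """One Bayes step: posterior is proportional to prior * likelihood."""
    posterior = [b * l for b, l in zip(belief, likelihood)]
    z = sum(posterior)
    return [p / z for p in posterior]

belief = [0.5, 0.5]                         # prior over cue in {left, right}
belief = update_belief(belief, [0.9, 0.1])  # evidence favoring "left"
print(belief)                               # -> [0.9, 0.1]

# A fresh cue at the next corridor is not averaged into history:
# the old belief is discarded outright, i.e., memory is rewritten.
belief = update_belief([0.5, 0.5], [0.1, 0.9])  # restart from the prior
```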