Memory Rewriting: Retention Is Not Enough

Paper / AAMAS 2026

Memory Retention Is Not Enough to Master Memory Tasks in Reinforcement Learning

In dynamic POMDP settings, it’s not enough to store information - an agent must rewrite memory: forget outdated information and integrate new evidence.

Authors
Oleg Shchendrigin · Egor Cherepanov · Alexey K. Kovalev · Aleksandr I. Panov
Illustration: memory rewriting vs retention
Memory rewriting vs retention. A new cue must overwrite the old one; otherwise the agent acts incorrectly.

About

The paper isolates and measures an RL agent’s ability to rewrite memory under partial observability - not just retain it.

Problem

Most benchmarks test retention, but real environments change - stored cues become stale and must be updated.

Approach

Two task families - Endless T‑Maze and Color‑Cubes - require continual replacement of outdated information.

Observation

Architectures with explicit, learnable forgetting (e.g., LSTM) are more robust at rewriting and generalize better.

Abstract

Effective decision-making in the real world depends on memory that is both stable and adaptive: environments change over time, and agents must retain relevant information over long horizons while also updating or overwriting outdated content when circumstances shift.

Existing Reinforcement Learning (RL) benchmarks and memory-augmented agents focus primarily on retention, leaving the equally critical ability of memory rewriting largely unexplored.

To address this gap, we introduce a benchmark that explicitly tests continual memory updating under partial observability, and use it to compare recurrent, transformer-based, and structured memory architectures.

Our experiments reveal that classic recurrent models demonstrate greater flexibility and robustness in memory rewriting tasks than modern structured memories, which succeed only under narrow conditions, and transformer-based agents, which often fail beyond trivial retention cases.

These findings expose a fundamental limitation of current approaches and emphasize the necessity of memory mechanisms that balance stable retention with adaptive updating.

Benchmarks

Two task families capture different rewriting regimes: from a minimal diagnostic setup to a stress test with selective updates and uncertainty.

Endless T‑Maze

An infinite chain of corridors and junctions. At the start of each corridor the agent receives a new binary cue (left/right) that completely invalidates the previous one.

  • Tests: pure overwrite (no accumulation).
  • Regimes: Fixed (constant length) and Uniform (random length).
  • Reward: +1 correct, −1 incorrect, small penalty for delay.
T‑Maze vs Endless T‑Maze
T‑Maze vs Endless T‑Maze. In Endless, the task is to continually replace the cue.
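To make the overwrite requirement concrete, below is a minimal sketch of the corridor/junction loop, assuming a gym-style reset/step interface. The class name EndlessTMaze, the action encoding, and the per-step penalty value are illustrative assumptions based on the description above, not the paper's released environment.

    import random

    class EndlessTMaze:
        """Toy sketch: each corridor starts with a fresh binary cue (0 = left,
        1 = right) that invalidates the previous one (illustrative, not the
        paper's implementation)."""

        def __init__(self, corridor_length=5, n_corridors=None, step_penalty=0.01):
            self.corridor_length = corridor_length   # Fixed regime; resample per corridor for Uniform
            self.n_corridors = n_corridors           # None -> endless evaluation
            self.step_penalty = step_penalty         # assumed value for the "small penalty for delay"
            self.reset()

        def reset(self):
            self.corridor, self.pos = 0, 0
            self.cue = random.randint(0, 1)          # cue is visible only at the corridor start
            return (self.cue, self.pos)

        def step(self, action):
            """action: 0 = forward, 1 = turn left, 2 = turn right."""
            if self.pos < self.corridor_length:      # still inside the corridor
                if action == 0:
                    self.pos += 1
                return (-1, self.pos), -self.step_penalty, False   # cue masked (-1)
            # Junction: the turn is scored against the latest cue only.
            correct = (action == 1 and self.cue == 0) or (action == 2 and self.cue == 1)
            reward = 1.0 if correct else -1.0
            done = self.n_corridors is not None and self.corridor + 1 >= self.n_corridors
            # A new corridor begins with a new cue, overwriting the old one.
            self.corridor, self.pos = self.corridor + 1, 0
            self.cue = random.randint(0, 1)
            return (self.cue, self.pos), reward, done

Because the junction is scored only against the most recent cue, memory of earlier cues is useless and must be discarded rather than accumulated.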

Color‑Cubes

A grid world with N uniquely colored cubes. The agent receives a target color and must find the matching cube; non-target cubes may teleport, and a full state update is observed only after such events.

  • Tests: selective updates and a consistent internal map.
  • Teleportation: random moves break stale memory.
  • Extreme: colors are hidden on updates → infer mappings.
Color‑Cubes environment diagram
Color‑Cubes. Initialization, teleportations, and a successful interaction with the target cube.
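The selective-update requirement can be sketched in the same style. The grid size, teleport probability, and observation format below are illustrative assumptions, and hide_colors stands in for the Extreme regime where colors are hidden in updates.

    import random

    class ColorCubes:
        """Toy sketch of Color-Cubes: N uniquely colored cubes on a grid; the agent
        must reach the cube matching the target color while non-target cubes
        occasionally teleport (illustrative, not the paper's implementation)."""

        def __init__(self, grid=7, n_cubes=4, teleport_prob=0.1, hide_colors=False):
            self.grid = grid
            self.n_cubes = n_cubes
            self.teleport_prob = teleport_prob   # assumed per-step chance of a teleport event
            self.hide_colors = hide_colors       # Extreme regime: colors hidden in updates
            self.reset()

        def _free_cell(self):
            while True:
                cell = (random.randrange(self.grid), random.randrange(self.grid))
                if cell != self.agent and cell not in self.cubes.values():
                    return cell

        def reset(self):
            self.agent = (0, 0)
            self.cubes = {}
            for color in range(self.n_cubes):
                self.cubes[color] = self._free_cell()
            self.target = random.randrange(self.n_cubes)
            # Initialization: the full map and the target color are shown once.
            return {"target": self.target, "cubes": dict(self.cubes)}

        def step(self, move):
            dx, dy = move
            self.agent = (min(max(self.agent[0] + dx, 0), self.grid - 1),
                          min(max(self.agent[1] + dy, 0), self.grid - 1))
            obs = {"target": self.target, "cubes": None}    # no map between events
            if random.random() < self.teleport_prob:
                # A non-target cube teleports; only then is a state update emitted.
                color = random.choice([c for c in self.cubes if c != self.target])
                self.cubes[color] = self._free_cell()
                update = dict(self.cubes)
                if self.hide_colors:
                    update = sorted(update.values())        # positions only: infer the mapping
                obs["cubes"] = update
            if self.agent == self.cubes[self.target]:
                return obs, 1.0, True                       # reached the target cube
            return obs, 0.0, False

The point of the design is that only the teleported cube's entry in the agent's internal map should change: wholesale forgetting loses the target, while wholesale retention keeps stale positions.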

Results

The paper groups results by four research questions (RQ1–RQ4).

RQ1 - Retention (trivial cases)

In Endless T‑Maze with n = 1 (the classic T‑Maze), PPO‑LSTM, SHM, and FFM reach 1.00±0.00, while GTrXL stays near chance (~50%, MLP-level).

In Trivial Color‑Cubes the picture differs: PPO‑LSTM ≈ 0.52±0.10, while FFM/GTrXL/SHM reach 1.00±0.00.

RQ2 - Rewriting (required)

PPO‑LSTM is most robust when cues continually become obsolete. FFM and SHM succeed in predictable Fixed settings but lose flexibility under Uniform. GTrXL and MLP fail to produce stable results when rewriting is critical.

Intermediate progress of SHM and GTrXL in Endless T‑Maze
Intermediate progress. Even at low success rates, some models pass a few corridors before failing.
Baseline comparison under interpolation and extrapolation in Endless T‑Maze
Generalization. Interpolation/extrapolation across corridor counts under fixed regimes.

RQ3 - Generalization

PPO‑LSTM interpolates across all lengths in non-trivial Uniform configurations and shows partially successful extrapolation to longer corridors. FFM generalizes within Fixed (both interpolation and extrapolation); SHM is mostly limited to interpolation; GTrXL sometimes interpolates better than SHM but is unstable.

RQ4 - Prioritization (Color‑Cubes Medium/Extreme)

In settings where the agent must not just “forget” but re-rank and maintain an up-to-date map (especially Extreme), all baselines show 0.00±0.00.

Color‑Cubes   PPO‑LSTM    GTrXL       SHM         FFM         MLP
Trivial       0.52±0.10   1.00±0.00   1.00±0.00   1.00±0.00   0.00±0.00
Medium        0.00±0.00   0.00±0.00   0.00±0.00   0.00±0.00   0.00±0.00
Extreme       0.00±0.00   0.00±0.00   0.00±0.00   0.00±0.00   0.00±0.00

Ablation

The paper’s hypothesis: PPO‑LSTM succeeds thanks to gating mechanisms, especially an adaptive forget gate (vs. RNN and GRU).

What was compared

  • PPO‑RNN: recurrent layer without gates.
  • PPO‑GRU: simplified gates (reset/update).
  • PPO‑LSTM: dedicated gates (forget/input/output).

Observation: gates strongly improve rewriting success; LSTM’s advantage over GRU highlights the importance of adaptive forgetting.
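As a reference for what "adaptive forgetting" means mechanically, here is a minimal LSTM cell written out gate by gate; TinyLSTMCell is an illustrative PyTorch sketch, not the paper's PPO‑LSTM agent.

    import torch
    import torch.nn as nn

    class TinyLSTMCell(nn.Module):
        """Minimal LSTM cell with the gates written out explicitly."""

        def __init__(self, input_size, hidden_size):
            super().__init__()
            # One linear map produces all four gate pre-activations.
            self.lin = nn.Linear(input_size + hidden_size, 4 * hidden_size)

        def forward(self, x, state):
            h, c = state
            i, f, g, o = self.lin(torch.cat([x, h], dim=-1)).chunk(4, dim=-1)
            f = torch.sigmoid(f)   # forget gate: how much of the old memory to erase
            i = torch.sigmoid(i)   # input gate: how much of the new evidence to write
            g = torch.tanh(g)      # candidate content (e.g., the newest cue)
            o = torch.sigmoid(o)   # output gate
            c = f * c + i * g      # rewrite: stale content scaled down, new content added
            h = o * torch.tanh(c)
            return h, (h, c)

A gate-free RNN updates h = tanh(W[x, h]) with fixed mixing weights, so erasing a stale cue competes with everything else stored in h; a GRU couples erasing and writing through a single update gate. The separate, input-dependent forget and input gates above are the kind of adaptive forgetting the ablation credits for PPO‑LSTM's rewriting advantage.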

Ablation: RNN/GRU/LSTM comparison on Endless T‑Maze
Ablation. The role of gates in interpolation/extrapolation.

Conclusion (short)

Retention is necessary but not sufficient. The core challenge is controlled transformation of memory over time: forgetting, rewriting, and reallocating representational capacity.

Hierarchy

Per the paper, the overall ranking across retention, rewriting, and generalization is: PPO‑LSTM → FFM → SHM → GTrXL → MLP.

Why

Explicit forgetting mechanisms and their adaptivity (forget gates) correlate with rewriting competence.

Next

Treat memory as an active belief state, not a passive history buffer.