Problem
Most benchmarks test retention, but real environments change - stored cues become stale and must be updated.
Paper / AAMAS 2026
In dynamic POMDP settings, it’s not enough to store information - an agent must rewrite memory: forget outdated information and integrate new evidence.
The paper isolates and measures an RL agent’s ability to rewrite memory under partial observability - not just retain it.
Two task families - Endless T‑Maze and Color‑Cubes - require continual replacement of outdated information.
Architectures with explicit, learnable forgetting (e.g., LSTM) are more robust at rewriting and generalize better.
Effective decision-making in the real world depends on memory that is both stable and adaptive: environments change over time, and agents must retain relevant information over long horizons while also updating or overwriting outdated content when circumstances shift.
Existing Reinforcement Learning (RL) benchmarks and memory-augmented agents focus primarily on retention, leaving the equally critical ability of memory rewriting largely unexplored.
To address this gap, we introduce a benchmark that explicitly tests continual memory updating under partial observability, and use it to compare recurrent, transformer-based, and structured memory architectures.
Our experiments reveal that classic recurrent models are more flexible and robust on memory rewriting tasks than either modern structured memories, which succeed only under narrow conditions, or transformer-based agents, which often fail beyond trivial retention cases.
These findings expose a fundamental limitation of current approaches and emphasize the necessity of memory mechanisms that balance stable retention with adaptive updating.
Two task families capture different rewriting regimes: from a minimal diagnostic setup to a stress test with selective updates and uncertainty.
Endless T‑Maze: an infinite chain of corridors and junctions. At the start of each corridor the agent receives a new binary cue (left/right) that completely invalidates the previous one.
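To make the mechanics concrete, here is a minimal sketch of such an environment in a gym-like style. This is our own illustration, not the paper's code: the class name, corridor length, reward scheme, and interface are all assumptions.

```python
import random

class EndlessTMaze:
    """Toy Endless T-Maze sketch: fixed-length corridors, junction at the end."""

    def __init__(self, corridor_len=5, seed=0):
        self.corridor_len = corridor_len
        self.rng = random.Random(seed)

    def reset(self):
        self.pos = 0
        self.cue = self.rng.choice([0, 1])   # 0 = left, 1 = right
        return self._obs(show_cue=True)

    def _obs(self, show_cue=False):
        # Partial observability: the cue is visible only at a corridor start.
        return {"cue": self.cue if show_cue else None, "pos": self.pos}

    def step(self, action):
        # Old-gym-style (obs, reward, done, info) tuple, for brevity.
        if self.pos < self.corridor_len:      # still walking the corridor
            self.pos += 1
            return self._obs(), 0.0, False, {}
        # At the junction: reward iff the action matches the *latest* cue.
        reward = 1.0 if action == self.cue else 0.0
        self.pos = 0
        self.cue = self.rng.choice([0, 1])    # fresh cue invalidates the old one
        return self._obs(show_cue=True), reward, False, {}  # endless: never done
```

The key property is that only the most recent cue is ever rewarded, so a pure retention strategy (remember the first cue forever) scores at chance.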
Color‑Cubes: a grid world with N uniquely colored cubes. The agent is given a target color and must find the matching cube; non-target cubes may teleport, and full state snapshots appear only after such events.
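The rewriting pressure here is selective: only some memory entries go stale at a time. A hypothetical sketch of the event-driven update structure (class name, grid size, and teleport probability are our assumptions, not the paper's):

```python
import random

class ColorCubes:
    """Toy Color-Cubes dynamics: snapshots arrive only after teleport events."""

    def __init__(self, n_cubes=4, grid_size=8, teleport_p=0.1, seed=0):
        self.n, self.grid, self.p = n_cubes, grid_size, teleport_p
        self.rng = random.Random(seed)

    def _rand_cell(self):
        return (self.rng.randrange(self.grid), self.rng.randrange(self.grid))

    def reset(self):
        self.positions = {c: self._rand_cell() for c in range(self.n)}
        self.target = self.rng.randrange(self.n)
        # One initial full snapshot; the agent must keep it current itself.
        return {"target": self.target, "snapshot": dict(self.positions)}

    def maybe_teleport(self):
        """Move some non-target cubes; emit a full snapshot only if anything moved."""
        moved = False
        for c in self.positions:
            if c != self.target and self.rng.random() < self.p:
                self.positions[c] = self._rand_cell()
                moved = True
        # Selective rewriting: stale entries must be overwritten, the rest kept.
        return dict(self.positions) if moved else None
```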
The paper groups results by four research questions (RQ1–RQ4).
In Endless T‑Maze with n = 1 (the classic T‑Maze), PPO‑LSTM, SHM, and FFM reach 1.00±0.00, while GTrXL performs at chance (~50%, i.e., MLP level).
In Trivial Color‑Cubes the picture flips: PPO‑LSTM reaches only 0.52±0.10, while FFM, GTrXL, and SHM all reach 1.00±0.00.
PPO‑LSTM is most robust when cues continually become obsolete. FFM and SHM succeed in predictable Fixed settings but lose flexibility under Uniform. GTrXL and MLP fail to produce stable results when rewriting is critical.
PPO‑LSTM interpolates across all lengths in non-trivial Uniform configurations and shows partially successful extrapolation to longer corridors. FFM generalizes within Fixed (both interpolation and extrapolation), SHM is mostly limited to interpolation; GTrXL sometimes interpolates better than SHM but is unstable.
In settings where the agent must not just “forget” but re-rank and maintain an up-to-date map (especially Extreme), all baselines show 0.00±0.00.
| Color‑Cubes (success rate) | PPO‑LSTM | GTrXL | SHM | FFM | MLP |
|---|---|---|---|---|---|
| Trivial | 0.52±0.10 | 1.00±0.00 | 1.00±0.00 | 1.00±0.00 | 0.00±0.00 |
| Medium | 0.00±0.00 | 0.00±0.00 | 0.00±0.00 | 0.00±0.00 | 0.00±0.00 |
| Extreme | 0.00±0.00 | 0.00±0.00 | 0.00±0.00 | 0.00±0.00 | 0.00±0.00 |
The paper’s hypothesis: PPO‑LSTM succeeds thanks to its gating mechanisms, especially the adaptive forget gate (in contrast to vanilla RNN and GRU baselines).
Observation: gates strongly improve rewriting success; LSTM’s advantage over GRU highlights the importance of adaptive forgetting.
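For reference, the standard LSTM cell equations, which decouple forgetting from writing; in a GRU a single update gate z_t couples the two, which is one plausible reading of why LSTM's separate forget gate helps here:

```latex
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \quad \text{(forget gate: how much old memory to erase)}
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \quad \text{(input gate: how much new evidence to write)}
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)
% GRU, by contrast, ties erase and write to a single gate:
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
```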
Retention is necessary but not sufficient. The core challenge is controlled transformation of memory over time: forgetting, rewriting, and reallocating representational capacity.
Per the paper, the overall ranking on retention, rewriting, and generalization is PPO‑LSTM → FFM → SHM → GTrXL → MLP.
Explicit forgetting mechanisms and their adaptivity (forget gates) correlate with rewriting competence.
Treat memory as an active belief state, not a passive history buffer.
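As a toy illustration of that framing (ours, not the paper's): a belief-state agent does not append observations to a buffer; it maintains a distribution over the latent variable and overwrites it when decisive new evidence arrives.

```python
def update_belief(belief, likelihood):
    """One Bayes step: posterior is proportional to prior * likelihood."""
    posterior = [b * l for b, l in zip(belief, likelihood)]
    z = sum(posterior)
    return [p / z for p in posterior]

belief = [0.5, 0.5]                         # prior over cue in {left, right}
belief = update_belief(belief, [0.9, 0.1])  # evidence favoring "left"
print(belief)                               # -> [0.9, 0.1]

# A fresh cue at the next corridor is not averaged into history:
# the old belief is discarded outright, i.e., memory is rewritten.
belief = update_belief([0.5, 0.5], [0.1, 0.9])  # restart from the prior
```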