Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

By Admin

Apr 28, 2026

arXiv:2604.21999v2 Announce Type: replace-cross
Abstract: We study learned memory tokens as a computational scratchpad for a single-block Universal Transformer (UT) with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. We find that memory tokens are empirically necessary: across all configurations tested (3 seeds, multiple token counts, two initialization schemes, ACT and fixed-depth processing), no configuration without memory tokens achieves non-trivial performance. The optimal token count T exhibits a sharp lower threshold (T=0 always fails, T=4 is borderline, T=8 reliably succeeds for 81-cell puzzles), followed by a stable plateau (T=8 to 32, 57.4% +/- 0.7% exact-match) and a collapse from attention dilution at T=64.
During experimentation, we identify a router initialization trap that causes >70% of training runs to fail: both the default zero-bias initialization (initial halting probability p ~ 0.5) and Graves’ recommended positive bias (p ~ 0.73) cause tokens to halt after ~2 steps at initialization, settling into a shallow equilibrium (halt ~ 5-7) that the model cannot escape. Inverting the bias to -3 (“deep start,” p ~ 0.05) eliminates this failure mode. We confirm through ablation that the trap is inherent to ACT initialization, not an artifact of our architecture choices.
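The arithmetic behind the "~2 steps at initialization" claim can be checked with a short sketch. It assumes the standard ACT halting rule from Graves (a token halts at the first step where its cumulative halting probability reaches 1 - eps, with eps = 0.01) and that the router's output at initialization is approximately sigmoid(bias), i.e. the weighted input contributes roughly zero. The function names and the `max_steps` cap of 32 are hypothetical and not taken from the paper's code.

```python
import math


def halt_probability(bias: float) -> float:
    """Sigmoid of the router bias: the per-step halting probability
    when the router's weighted input is approximately zero."""
    return 1.0 / (1.0 + math.exp(-bias))


def steps_until_halt(bias: float, eps: float = 0.01, max_steps: int = 32) -> int:
    """First step at which the cumulative halting probability reaches 1 - eps.

    Assumes a constant per-step probability, which only approximates
    the state of the router at initialization. max_steps is a hypothetical cap.
    """
    p = halt_probability(bias)
    cumulative = 0.0
    for step in range(1, max_steps + 1):
        cumulative += p
        if cumulative >= 1.0 - eps:
            return step
    return max_steps


if __name__ == "__main__":
    for name, bias in [("zero bias", 0.0), ("positive bias", 1.0), ("deep start", -3.0)]:
        p = halt_probability(bias)
        print(f"{name:>13}: bias={bias:+.1f}  p={p:.3f}  halts at step {steps_until_halt(bias)}")
```

Under these assumptions, biases of 0 and +1 both halt at step 2, while a bias of -3 keeps a token running for 21 steps, which is consistent with the shallow-start versus deep-start behaviour described above.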
With reliable training established, we show that (1) ACT provides more consistent results than fixed-depth processing (56.9% +/- 0.7% vs 53.4% +/- 9.3% across 3 seeds); (2) ACT with lambda warmup achieves matching accuracy (57.0% +/- 1.1%) using 34% fewer ponder steps; and (3) attention heads specialize into memory readers, constraint propagators, and integrators across recursive depth. Code is available at https://github.com/che-shr-cat/utm-jax.
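The abstract does not specify the shape of the lambda warmup, so the following is only an illustrative sketch of one common choice: a linear ramp of the ponder-cost weight from 0 to its final value, added to the task loss. The names `ponder_weight` and `act_loss`, the 1000-step warmup, and the final weight of 0.01 are all assumptions made for illustration, not values from the paper.

```python
def ponder_weight(step: int, warmup_steps: int = 1000, lambda_max: float = 0.01) -> float:
    """Linearly ramp the ponder-cost weight from 0 to lambda_max.

    warmup_steps and lambda_max are hypothetical values for illustration.
    """
    if warmup_steps <= 0:
        return lambda_max
    fraction = min(max(step / warmup_steps, 0.0), 1.0)
    return lambda_max * fraction


def act_loss(task_loss: float, ponder_cost: float, step: int) -> float:
    """Total loss: task loss plus the warmed-up penalty on ponder steps."""
    return task_loss + ponder_weight(step) * ponder_cost


if __name__ == "__main__":
    for step in (0, 250, 500, 1000, 5000):
        print(step, ponder_weight(step), act_loss(task_loss=1.0, ponder_cost=12.0, step=step))
```

The intent of such a schedule is that the model is free to use many recursive steps early in training, and is pushed toward fewer ponder steps only once the penalty has ramped up.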
