Recursive Self-Evolving Agents via Held-Out Selection

By Admin

Jun 30, 2026

arXiv:2606.28374v1
Abstract: LLM agents are increasingly improved without weight updates by evolving a natural-language artifact, such as reflections, workflows, playbooks, cheatsheets, or optimized prompts, that conditions a frozen policy. Such methods are typically reported as wins on the single benchmark where they help. We study them apples-to-apples and surface a sharper picture. We introduce RSEA, a Recursive Self-Evolving Agent that carries a compact three-layer natural-language state: an imperative strategy, reusable skills, and a procedural playbook. Across generations, RSEA rewrites all three layers from its own trajectories and commits a candidate only if it does not regress on a disjoint held-out split, using a strict keep-better gate.
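The commit rule described above can be illustrated with a minimal sketch. Everything named here (`AgentState`, `evolve`, `propose`, `score`) is hypothetical and not taken from the paper; the sketch assumes that `score` evaluates a state on the disjoint held-out split and that "strict keep-better" means a candidate is committed only when its held-out score is strictly higher than the incumbent's.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class AgentState:
    """Hypothetical three-layer natural-language state (names are assumptions)."""
    strategy: str = ""                # imperative strategy
    skills: Tuple[str, ...] = ()      # reusable skills
    playbook: str = ""                # procedural playbook


def evolve(
    state: AgentState,
    propose: Callable[[AgentState], AgentState],
    score: Callable[[AgentState], float],
    generations: int,
) -> Tuple[AgentState, List[float]]:
    """Run a keep-better loop gated on a held-out score.

    `propose` stands in for rewriting the state from the agent's own
    trajectories; `score` stands in for evaluation on the held-out split.
    """
    best = state
    best_score = score(best)
    history = [best_score]
    for _ in range(generations):
        candidate = propose(best)
        candidate_score = score(candidate)
        # Strict gate (assumed reading): ties and regressions are rejected.
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
        history.append(best_score)
    return best, history
```

Under these assumptions the held-out score of the committed state can never decrease across generations, and if no candidate ever passes the gate, the initial state is returned unchanged.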
Across four diverse benchmarks, ALFWorld, GAIA, τ-bench, and WebShop, and six faithful baselines, ReAct, Reflexion, GEPA, AWM, ACE, and Dynamic Cheatsheet, all evaluated on one shared local backbone, we find three main results. First, no artifact universally wins. RSEA is the strongest single-pass method on ALFWorld, reaching 69.3% compared with 64.6% for ReAct (McNemar test, p = 0.015), and reaches 79.4% with retry, the best overall result. However, concrete-workflow induction, represented by AWM, is best on the strong-backbone tool-use tasks. Second, unguarded context evolution is high-variance and unsafe. Dynamic Cheatsheet, which curates context online without a held-out gate, is near-best on ALFWorld at 70.7%, yet collapses on WebShop, with a score of 0.14 compared with 0.43 for ReAct. Third, RSEA’s strict held-out selection is what makes recursive self-evolution monotone-safe: it never significantly underperforms the base agent on any benchmark and falls back to vanilla ReAct when evolved context would hurt.
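For readers unfamiliar with the McNemar test cited above, the following is a minimal sketch of its exact (binomial) two-sided form for two agents run on the same tasks. The abstract does not say which variant was used and does not report the discordant-pair counts, so the function name `mcnemar_exact` and the counts in the usage note are purely illustrative assumptions.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: tasks solved by agent A but not by agent B
    c: tasks solved by agent B but not by agent A
    Tasks where both agents agree do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null, each discordant pair favors either agent with prob. 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With made-up counts, `mcnemar_exact(1, 9)` evaluates to 22/1024, roughly 0.021. The test depends only on the tasks where the two agents disagree, which is why a paired comparison can be significant even when the gap in overall accuracy is a few points.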
