In the rapidly developing era of large language models, "memory" is becoming key to whether agents can truly possess long-term intelligence.
Even with GPT-4.1 supporting a context window of up to a million tokens, costs and latency still climb sharply as interactions accumulate. This has driven the rise of external memory systems. However, most existing solutions rely on manual rules and prompt instructions, so the model never truly "understands" when to remember, what to remember, or how to update.
Mem-α was created to resolve this dilemma. Completed by Yu Wang of UC San Diego during an internship at Anuttacon, this work is the first to introduce reinforcement learning into the memory management of large models, allowing the model to autonomously learn how to use tools to store, update, and organize memories.
Paper Title: Mem-α: Learning Memory Construction via Reinforcement Learning
Paper Link: https://arxiv.org/abs/2509.25911
Code Repository: https://github.com/wangyu-ustc/Mem-alpha
Open Source Model: https://huggingface.co/YuWangX/Memalpha-4B
Training Dataset: https://huggingface.co/datasets/YuWangX/Memalpha
Test Dataset: https://huggingface.co/datasets/YuWangX/Memalpha-Memoryagentbench
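For readers who want to try the released checkpoint directly, the following is a minimal sketch of loading it with Hugging Face transformers. It assumes the repository hosts a standard causal-LM checkpoint (consistent with its Qwen3-4B base); the exact memory-tool prompt format is defined in the GitHub repository, not here.

```python
# Minimal sketch: loading the released Mem-alpha checkpoint with transformers.
# Assumes a standard causal-LM checkpoint (Qwen3-4B base); see the GitHub
# README for the actual memory-tool prompt format used by the agent.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "YuWangX/Memalpha-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "New information chunk: Alice moved to Berlin in March.\n"
    "Decide which memory operations to perform."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Print only the newly generated tokens (the model's proposed memory operations).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```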
Memory Bottleneck: The End of Manual Rules
Existing memory-augmented agents (e.g., MIRIX, MemGPT) typically rely on predefined instruction templates designed by developers to guide memory operations. However, in complex interactive environments, models often face three major challenges:
Not knowing which information is worth retaining long-term;
Uncertainty about when to update old memories;
Inability to allocate resources effectively across multiple types of memory.
The result is frequent mis-remembering and forgetting: as shown in the figure, before reinforcement learning optimization, the Qwen3-4B model failed to update core memory, and its semantic memory stored only fragmented information, ultimately leading to incorrect answers. After training with Mem-α, the model began to exhibit "active learning" behavior: it could identify key events and write them into Core Memory, Episodic Memory, and Semantic Memory as appropriate, retaining information comprehensively while compressing it.
From Rules to Learning: Mem-α's Core Mechanism
Mem-α's core contribution lies in transforming the memory construction problem into a sequential decision-making problem that can be optimized through reinforcement learning. Unlike previous methods relying on supervised learning or handcrafted rules, Mem-α allows agents to autonomously explore optimal memory management strategies during information flow processing and receive direct feedback from downstream task performance. This end-to-end optimization enables the model to learn truly effective memory construction strategies.
Task Setup
As shown in the figure above, Mem-α models memory construction as a sequential decision-making process. The agent processes information blocks sequentially, decides which memory operations to perform, and then uses the constructed memory system to answer questions. During training, it receives feedback from multiple reward signals. The trained agent (🔥) focuses on learning memory management strategies, while the frozen large language model (❄️) answers questions based on the constructed memory.
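To make this concrete, below is a minimal sketch of the rollout loop described above. It is illustrative only: the helper names memory_agent_step and frozen_llm_answer, the operation schema, and the chunking are assumptions for exposition, not the paper's actual interface.

```python
# Illustrative sketch of Mem-alpha's rollout, using assumed helper names.
# The trained agent (memory_agent_step) reads information chunks one by one and
# emits memory operations; a frozen LLM answers questions from the built memory.

def run_episode(chunks, questions, memory_agent_step, frozen_llm_answer):
    memory = {"core": [], "episodic": [], "semantic": []}  # three-layer memory store

    # Sequential decision process: one step per incoming information chunk.
    for chunk in chunks:
        operations = memory_agent_step(chunk, memory)  # e.g. insert/update tool calls
        for op in operations:
            if op["action"] == "insert":
                memory[op["type"]].append(op["content"])
            elif op["action"] == "update":
                memory[op["type"]][op["index"]] = op["content"]

    # The frozen LLM answers downstream questions using only the constructed memory;
    # answer accuracy later feeds back into the agent's reward.
    answers = [frozen_llm_answer(q, memory) for q in questions]
    return memory, answers
```

The key design point is the separation of roles: only the memory-construction policy is trained, while the answering model stays frozen and simply consumes whatever memory the policy has built.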
Reward Function Design
Mem-α employs a multi-dimensional reward function to optimize memory construction:
Question Answering Accuracy (r_acc): the most important signal, directly measuring how accurately questions are answered from the constructed memory.
Tool Call Format (r_format): ensures the agent calls memory-operation tools in the correct format.
Memory Compression (r_comp): encourages efficient use of memory space.
Content Validity (r_valid): an LLM evaluator assesses the quality of the stored memory content.
Final Reward: a weighted combination of the four signals above; the paper's experiments identify the weighting that works best (a toy combination is sketched below).
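The toy sketch below shows how such signals could be folded into a single scalar reward. The weights are placeholders, since the actual weighting is determined by the paper's experiments.

```python
# Sketch of combining the four reward signals into one scalar reward.
# The relative weights below are placeholders, not the paper's values.

def total_reward(r_acc, r_format, r_comp, r_valid,
                 w_format=0.1, w_comp=0.1, w_valid=0.1):
    """QA accuracy dominates; format, compression, and validity act as shaping terms."""
    return r_acc + w_format * r_format + w_comp * r_comp + w_valid * r_valid

# Example: a rollout that answers correctly, formats its tool calls correctly,
# compresses memory reasonably well, and passes the LLM validity check.
print(total_reward(r_acc=1.0, r_format=1.0, r_comp=0.8, r_valid=1.0))
```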
Three-Layer Memory System Inspired by the Human Brain
Mem-α's architecture references memory classification theories from cognitive science, building a three-layer memory system:
Core Memory: Stores the user's long-term identity, goals, and preferences;
Episodic Memory: Records specific events with a timeline;
Semantic Memory: Stores structured knowledge and facts.
The agent needs to decide at each timestep which memory type to call and whether to perform insertion or update operations. Through reinforcement learning optimization, the model learns to "flexibly invoke different memory systems" like humans.
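As a rough illustration, a three-layer memory store with insert and update operations might look like the sketch below. The class and method names are hypothetical and do not correspond to the repository's actual tool schema.

```python
# Illustrative three-layer memory store with insert/update operations.
# Names are hypothetical; the actual tool definitions live in the Mem-alpha repo.
from dataclasses import dataclass, field

@dataclass
class MemorySystem:
    core: list = field(default_factory=list)      # long-term identity, goals, preferences
    episodic: list = field(default_factory=list)  # time-stamped events
    semantic: list = field(default_factory=list)  # structured knowledge and facts

    def insert(self, memory_type: str, content: str) -> None:
        getattr(self, memory_type).append(content)

    def update(self, memory_type: str, index: int, content: str) -> None:
        getattr(self, memory_type)[index] = content

mem = MemorySystem()
mem.insert("core", "User is a PhD student working on memory agents.")
mem.insert("episodic", "2024-05-02: user asked about RL reward shaping.")
mem.update("core", 0, "User is a PhD student at UCSD working on memory agents.")
```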
Training Dataset Construction
Mem-α's training dataset is built around the four dimensions defined in MemoryAgentBench:
1. Accurate Retrieval: Extracts correct information from historical data to answer queries, covering single-hop and multi-hop retrieval scenarios.
2. Test-Time Learning: Acquires new behaviors or capabilities during deployment.
3. Long-Range Understanding: Integrates information distributed across multiple segments to answer queries requiring comprehensive sequential analysis.
4. Conflict Resolution: Revises, overwrites, or deletes previously stored information when encountering contradictory evidence.
This study focuses on the first three dimensions, excluding the conflict resolution dimension. This is because there is currently a lack of realistic evaluation benchmarks—existing conflict resolution datasets are primarily synthetic and fail to fully capture real-world complexity. The research team collected and organized eight datasets from different sources, processed them into a unified paradigm, and finally constructed a comprehensive dataset that ensures no overlap with the MemoryAgentBench test set, covering the aforementioned three dimensions for training.
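As an illustration of what such a unified paradigm could look like, here is a hypothetical training-sample schema. The field names are assumptions made for exposition; the released YuWangX/Memalpha dataset defines the actual format.

```python
# Hypothetical unified training-sample schema; the real fields are defined by
# the released YuWangX/Memalpha dataset, not by this sketch.
from dataclasses import dataclass

@dataclass
class MemalphaSample:
    chunks: list[str]     # the information stream, split into sequential blocks
    questions: list[str]  # downstream queries answered from the constructed memory
    answers: list[str]    # gold answers used for the accuracy reward
    dimension: str        # "accurate_retrieval" | "test_time_learning" | "long_range"
    source: str           # which of the eight source datasets the sample came from

sample = MemalphaSample(
    chunks=["Chapter 1 ...", "Chapter 2 ..."],
    questions=["What goal does the protagonist state in Chapter 1?"],
    answers=["To return home."],
    dimension="long_range",
    source="booksum",
)
```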
Experimental Results
Main Experiment: Performance and Generalization Capability
Mem-α was trained on samples of under 30k tokens. Its performance on the validation set (also under 30k tokens) is as follows:
Its performance on the test set is as follows:
Four Key Findings:
1. Comprehensive Outperformance of Existing Methods: Mem-α significantly outperforms baseline models in all evaluation tasks. It shows particularly outstanding performance in the Accurate Retrieval and Long-Range Understanding dimensions of MemoryAgentBench, demonstrating strong generalization capability to unseen distributions—proving that the memory strategy trained by reinforcement learning not only "learns well" but also "transfers widely."
2. Memory Compression with Both Efficiency and Performance: Compared to Long-Context and RAG-Top2, Mem-α achieves higher performance while reducing memory footprint by nearly 50%. In long-text understanding tasks such as BookSum and InfBench-Sum, the advantage of the semantic compression mechanism is further amplified, proving that it achieves an ideal balance between "fidelity" and "storage efficiency."
3. Decisive Role of Structured Memory: Experiments show that flat memory baselines (MEM1, MemAgent) using single paragraph representations are limited in complex tasks. In contrast, Mem-α's hierarchical memory architecture allows the model to distinguish between core, episodic, and semantic information layers, coupled with reinforcement learning optimization strategies, significantly enhancing the organization and retrieval capabilities of complex information.
4. Extremely Strong Length Extrapolation Capability: Although trained only on samples with an average length of less than 30K tokens, Mem-α can reliably generalize to ultra-long documents exceeding 400K tokens (MemoryAgentBench reaches up to 474K tokens). This means the model not only learned "how to remember" but also possesses reasoning robustness for extremely long sequences—achieving true length extrapolation for the first time in the field of memory modeling.
Ablation Study: From "Unable to Use Memory" to "Learning to Manage Memory"
In the ablation study, the research team compared the performance of Qwen3-4B before and after reinforcement learning training. The results showed that before the introduction of Mem-α, although the model had complete memory modules, it barely knew how to use them correctly—with an average accuracy of only 38.9%, frequent tool call errors, and chaotic updates to core and semantic memories. After Mem-α training, the model's performance underwent a qualitative change: accuracy jumped to 64.2%, and it could actively select appropriate memory types and operation sequences, achieving truly "autonomous memory management." This result proves that reinforcement learning not only improved task performance but also endowed the model with the ability to understand and optimize its own memory behavior.
From Engineering to Learning: The Future of Agent Memory
Mem-α shows us an important trend: "Memory management is no longer an engineering problem, but a learnable problem."
Through reinforcement learning signals, the model no longer relies on manually designed rules but evolves effective memory strategies through interaction. This research opens new directions for memory-augmented agents—in the future, similar mechanisms might extend to multimodal memory (images, audio), personalized memory strategies, and even multi-agent collaborative memory systems. As the paper's authors state, the significance of Mem-α lies in enabling agents to truly understand their own memory for the first time.