Imagine a smart assistant that doesn’t just answer your questions, but remembers every chat, every choice, and every bit of help it’s given over time. That is the power of memory-driven agents. These agents can recall past events, learn from them, provide more informed responses, and ultimately deliver a better user experience.
Introduction
LLM agents are autonomous systems powered by Large Language Models (LLMs) that perform tasks in an external environment. An agent can decide the control flow of the application, choose which task to perform, and so on, with or without human intervention. Among the many components of an LLM agent, memory plays an important role.
Memory in Agents
When you chat with assistants like ChatGPT, Claude, or Gemini, the conversation typically belongs to a specific session. You might have also noticed that each session has a title (usually a short description covering the main or initial topic of the session). For every new query you ask in a session, all the previous chats (user–AI message pairs) are passed to the LLM as context so it can generate a better response.
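To make this concrete, here is a minimal sketch of how a session's history is typically assembled into the context for each new query. The `call_llm` function and message format are placeholders for illustration, not any specific provider's API.

```python
# A minimal sketch of session-level context: each new query is answered with
# the session's previous user-AI pairs prepended as context.
# `call_llm` is a placeholder for whatever chat-completion API you use.

session_history = []  # messages for one session, as {"role": ..., "content": ...}

def ask(user_query: str) -> str:
    messages = (
        [{"role": "system", "content": "You are a helpful assistant."}]
        + session_history                                  # all previous turns
        + [{"role": "user", "content": user_query}]
    )
    answer = call_llm(messages)                            # placeholder LLM call

    # The new pair becomes part of the context for the next query in this session.
    session_history.append({"role": "user", "content": user_query})
    session_history.append({"role": "assistant", "content": answer})
    return answer
```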
Why is context important?
Context memory is essential for maintaining a continuous conversation with your LLMs/agents. Today's large language models can handle very large context windows, which means you can have lengthy sessions and carry a conversation throughout the session. However, as large as it might seem, a fixed context window has drawbacks too.
- As the conversation continues for a long time, you are more likely to hit a token-limit error. Of course, you can start a new session and chat with the model, but the entire context will be lost.
- Even though the model can support up to, say, 128k tokens, the longer the context grows, the more the model struggles to find relationships between sentences and the more likely it is to fail to generate the most appropriate answer.

Memory management in LLM agents helps the user have continuous conversations without worrying about the context-length limit. It helps agents recall user preferences, previous conversations, experience, etc., and provide more helpful and personalised responses. There are various memory architectures in use today.
Before moving forward and discussing various solutions, there are certain types of memory that we need to be aware of.
1. Short-term memory or working memory

This is what we call the context window: the data or context that the LLM uses to generate any response. We can call this the active information. If you are simply chatting with the LLM, the context can be your query plus all the previous chats. But as discussed earlier, as the conversation becomes too long, the LLM might start to skip or forget some of the context.
2. Procedural memory
This includes the set of rules or guidelines that are passed to the LLM along with the context memory when generating a response. It is added to the LLM in the form of code, configuration, or prompts. You might have seen prompts (usually called system prompts) that are passed along with the user query, setting some baseline rules for the LLM. For example: "You are a helpful assistant."
3. Long-term memory
Long-term memory is responsible for storing information for an extended period of time. This includes the basic facts, key events, etc., about a user that need to be stored and recalled whenever needed. Information like the user's birthday and other basic details comes under long-term memory. Long-term memory allows an LLM agent to maintain meaningful context across conversations, grounding its answers in previous discussions and user preferences, which results in more context-aware and relevant responses.
This again comes in two different types:
3.1 Episodic memory
Episodic memory includes past events, experiences, etc. It is important for an LLM agent to remember past interactions with a particular user for a particular task. This is usually stored in an external data source rather than kept in the active context.
3.2 Semantic memory
This includes key facts and their relationships with other information, which help the agent ground its responses. In a personalised agent, this can include facts and preferences about the user and other details like location. This information helps the agent ground its responses and prevents it from hallucinating, much as humans use world knowledge as the base of information for daily tasks and responses. Like episodic memory, this is often stored as vector embeddings or as a knowledge graph in external data sources.
Adding memory to LLM Agents
Now it is clear that we cannot rely solely on the context window for an agent to perform prolonged conversations or tasks, even though this works well enough for chatbots, where conversations are contained within sessions and each session holds only a limited number of exchanges.
There are different ways to manage memory in LLM agents. Let's see each one in detail.
Improving the short-term memory storage
1. Message buffer with recent conversation

The message buffer stores the most recent messages in a conversation and uses them as the context for response generation.
2. Compression with Summarization
One way to solve the finite context-length issue is to use a summary of all the previous context instead of the entire context. The summarization module can run independently and refresh periodically, so the most up-to-date summary is always available when needed. A minimal sketch of both techniques follows below.
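Here is a minimal sketch of these two short-term-memory techniques: a rolling message buffer plus a periodically refreshed summary. The `summarize` helper stands in for an LLM summarization call and is an assumption for illustration.

```python
from collections import deque

MAX_BUFFER = 10  # keep only the most recent messages in the raw buffer

message_buffer = deque(maxlen=MAX_BUFFER)   # technique 1: message buffer
running_summary = ""                        # technique 2: compressed summary

def summarize(summary: str, dropped: list[dict]) -> str:
    # Placeholder: in practice this would be an LLM call that folds the
    # dropped messages into the existing summary.
    return (summary + " " + " ".join(m["content"] for m in dropped)).strip()

def add_message(role: str, content: str) -> None:
    global running_summary
    if len(message_buffer) == message_buffer.maxlen:
        # The oldest message is about to fall out of the buffer, so fold it
        # into the running summary instead of losing it entirely.
        running_summary = summarize(running_summary, [message_buffer[0]])
    message_buffer.append({"role": role, "content": content})

def build_context(user_query: str) -> list[dict]:
    # Context = compressed summary of older turns + recent raw turns + new query.
    return (
        [{"role": "system", "content": f"Conversation so far: {running_summary}"}]
        + list(message_buffer)
        + [{"role": "user", "content": user_query}]
    )
```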
Both of the above techniques work well in scenarios where the conversation length is limited and lives within a single session. But LLM agents that need to persist information across multiple sessions will mostly fail to provide the desired user experience with them. For an LLM agent to establish relationships between conversations, it needs a mechanism to dynamically organize long-term and short-term memory, a better representation of the stored memory, and a way to periodically add, update, and delete stored memories.
Foundation Approaches to Memory Management
1. RAG-based memory management

RAG (Retrieval-Augmented Generation) is the foundation for many memory management architectures today. RAG allows AI agents to fetch relevant information from external sources based on the user query.
Here, short-term memory and long-term memory are stored separately. Short-term memory, including recent queries and AI responses, system prompts, etc, will be stored in active memory. Long-term memory includes the entire conversation history, including user facts, preferences, events that occurred, and other information that the agent acquired over time. The long-term memory will be stored in external databases like a vector data store as embeddings.
When the user sends a query, the query is first converted into an embedding vector and passed to the retriever module, which fetches the most similar memories from the vector database using semantic similarity search. The retrieved long-term memories are then concatenated with the in-memory context (the recent "m" messages, system prompt, and the new query) and passed to the LLM to generate the response.
However, retrieval and generation alone are not enough for managing memory in an agent. The agent's memory should continuously evolve and update based on past interactions. This means the system should also update the memory module with the details of each new interaction and, if any conflicts with existing memories are detected, resolve them.
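Below is a minimal sketch of this flow, assuming a generic vector store with `embed`, `search`, and `add` operations; `vector_store`, `embed`, and `call_llm` are placeholders, not a specific library's API.

```python
def answer_with_memory(user_query: str,
                       recent_messages: list[dict],
                       system_prompt: str,
                       top_k: int = 5) -> str:
    # 1. Embed the query and retrieve the most similar long-term memories.
    query_vec = embed(user_query)                        # placeholder embedder
    memories = vector_store.search(query_vec, k=top_k)   # semantic similarity search

    # 2. Merge long-term memories with the short-term (in-context) state.
    memory_block = "\n".join(m.text for m in memories)
    messages = (
        [{"role": "system", "content": system_prompt},
         {"role": "system", "content": f"Relevant memories:\n{memory_block}"}]
        + recent_messages
        + [{"role": "user", "content": user_query}]
    )
    answer = call_llm(messages)                          # placeholder LLM call

    # 3. Write the new interaction back so the memory keeps evolving.
    vector_store.add(embed(f"user: {user_query}\nassistant: {answer}"),
                     metadata={"type": "conversation"})
    return answer
```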
2. RAG with Knowledge Graph
A knowledge graph uses nodes (entities) and edges to represent data and the relationships between them. Nodes have labels that are used to group them; a relationship has a type and a direction. Both nodes and edges can store data as well.
A simple representation of a knowledge graph

This stores information in a structured format rather than as dense vectors. The advantage is that, for any piece of data, we can also retrieve all the other data related to it, providing more contextual information.
This also enables faster retrieval of memories for a query. Here too, to store the data we first convert it into embeddings, store the embedded data as nodes in the graph, and then create edges that establish the various relationships among the nodes. Once the knowledge graph is built, it can be used as a retriever for similarity search. One popular database for storing such a knowledge graph is Neo4j.
Similar to conventional RAG-based memory management, here too we can treat short-term memory and long-term memory separately and merge them during inference to generate the LLM response.
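As a rough sketch of the idea (not any specific product's API), the snippet below stores memories as a labelled graph with networkx and retrieves a node's neighbourhood to enrich the context; the entity and relation extraction that a real system would delegate to an LLM is hard-coded here.

```python
import networkx as nx

# Each node is an entity with a label; each edge is a typed, directed relationship.
graph = nx.MultiDiGraph()
graph.add_node("Alice", label="Person")
graph.add_node("San_Francisco", label="Place")
graph.add_node("Acme_Corp", label="Organization")
graph.add_edge("Alice", "San_Francisco", type="lives_in")
graph.add_edge("Alice", "Acme_Corp", type="works_at")

def neighborhood(entity: str) -> list[str]:
    """Collect facts connected to an entity (both outgoing and incoming edges)."""
    facts = []
    for _, dst, data in graph.out_edges(entity, data=True):
        facts.append(f"{entity} {data['type']} {dst}")
    for src, _, data in graph.in_edges(entity, data=True):
        facts.append(f"{src} {data['type']} {entity}")
    return facts

# Facts related to "Alice" can now be pulled in as extra context for the LLM.
print(neighborhood("Alice"))
# ['Alice lives_in San_Francisco', 'Alice works_at Acme_Corp']
```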
Improving long-term memory storage
1. MemGPT
MemGPT introduces a hierarchical memory system for LLM agents inspired by traditional operating system (OS) memory management architectures. MemGPT distinguishes between two primary memory layers: the Main Memory, analogous to the RAM or physical memory accessible during computations, and an External Memory, comparable to disk storage, which holds a larger volume of out-of-context or archival information.
For every query, the LLM agent dynamically decides which contextual information needs to be loaded into the Main Memory to stay within the model’s fixed context window. This decision-making is managed autonomously through specialized functions that govern the searching, organizing, updating, and accessing of memory content based on temporal factors, relevance, semantic similarity, and conversational priority. This self-directed memory management enables the agent to mimic OS-like paging and swapping operations within its memory, optimizing the use of a limited context size while maintaining extensive knowledge outside the immediate context.
The key aspects of MemGPT’s memory model include:
- Hierarchical Memory Architecture: The system segments memory such that the fast-access Main Memory contains a focused, limited context window, while the External Memory serves as slower, larger storage for long-term or less immediately relevant knowledge.
- Function-Based Memory Control: MemGPT empowers the LLM itself to invoke memory management functions, effectively performing self-directed read/write operations and memory transfers between Main and External memory regions. This includes updating stored memories or recalling archival data when needed during processing (a sketch follows after this list).
- Autonomous Memory Editing and Retrieval: The LLM autonomously decides when to move conversation history or other data from External Memory into the Main Memory based on the tasks and goals, maintaining an up-to-date and dynamic working context.
- Context Window Extension: By leveraging this hierarchical and dynamic memory movement, MemGPT overcomes the fixed context window limitations, allowing conversations and reasoning over substantially extended interaction histories or large documents.
- Inspiration from OS Concepts: MemGPT borrows OS event loop, interrupt handling, and function chaining principles to create an AI system that integrates memory management tightly with inference cycles, enabling efficient handling of memory-related operations alongside user interactions.
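To illustrate the function-based control idea, here is a minimal sketch of memory-management tools exposed to the LLM as callable functions. The function names, naive keyword search, and dispatch loop are illustrative assumptions, not MemGPT's actual API.

```python
# Illustrative sketch of function-based memory control in a MemGPT-style agent.

main_memory = {"system": "You are a helpful assistant.", "working_context": []}
external_memory = []   # stands in for archival, disk-like storage

def archival_insert(text: str) -> str:
    """Move information out of the context window into external storage."""
    external_memory.append(text)
    return "stored"

def archival_search(query: str, k: int = 3) -> list[str]:
    """Naive keyword search over external memory (a real system would use embeddings)."""
    hits = [t for t in external_memory if query.lower() in t.lower()]
    return hits[:k]

def working_context_append(fact: str) -> str:
    """Pin an important fact into the always-in-context working memory."""
    main_memory["working_context"].append(fact)
    return "pinned"

TOOLS = {
    "archival_insert": archival_insert,
    "archival_search": archival_search,
    "working_context_append": working_context_append,
}

def dispatch(tool_call: dict):
    # The LLM decides which memory function to call; the runtime executes it
    # and feeds the result back into the next inference step.
    return TOOLS[tool_call["name"]](**tool_call["arguments"])
```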
2. Langmem
Langmem is an advanced memory management system within the Langchain ecosystem, empowering agents with persistent long-term memory across sessions and tasks. Its foundational feature is a core memory API, which allows integration with any storage backend and agentic framework, facilitating flexible deployment and interoperability with popular systems.
The components include two specialized tools:
- create_manage_memory_tool: Enables agents to add, update, and delete long-term memories in real time.
- create_search_memory_tool: Allows agents to search existing long-term memory for relevant past information during active conversations, using semantic or metadata-based queries.
Langmem’s memory management operates “in the hot path,” recording and retrieving critical information during the conversation, ensuring agents remain context-aware and responsive without waiting for session completion. The memory manager employs LLMs to extract, summarize, and update long-term memory entries based on conversation importance and user preferences.
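A minimal usage sketch with these two tools, following the typical LangGraph + Langmem setup, might look like the following; the model name and embedding configuration are placeholders, and the exact signatures may differ across Langmem versions, so treat this as an assumption to check against the docs.

```python
# Sketch of wiring Langmem's memory tools into a LangGraph agent.
from langgraph.prebuilt import create_react_agent
from langgraph.store.memory import InMemoryStore
from langmem import create_manage_memory_tool, create_search_memory_tool

# A store with an embedding index backs the long-term memory.
store = InMemoryStore(
    index={"dims": 1536, "embed": "openai:text-embedding-3-small"}
)

agent = create_react_agent(
    "anthropic:claude-3-5-sonnet-latest",   # any chat model supported by LangGraph
    tools=[
        create_manage_memory_tool(namespace=("memories",)),   # add/update/delete memories
        create_search_memory_tool(namespace=("memories",)),   # semantic search over memories
    ],
    store=store,
)

agent.invoke({"messages": [{"role": "user", "content": "Remember that I prefer dark mode."}]})
```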
Langmem also has other features like:
- Background Memory Manager: Consolidates and enriches agent knowledge asynchronously, supporting scalable updates and knowledge integration.
- Prompt Optimizer: Uses prior conversation history and advanced optimization methods, such as meta-prompting and evolutionary gradients, to refine the agent’s system prompt and improve alignment over time.
- Native LangGraph Integration: Enables default persistence and hierarchical namespace management for organizing agent, user, or organizational memory across contexts.
- Support for Multiple Memory Types: Langmem formally supports semantic, episodic, and procedural memory, facilitating tailored recall and reasoning depending on application needs.
- Rich Structured Storage: Stores information with metadata and enables flexible retrieval modes, including direct, semantic, and contextual filtering.

By combining "hot path" responsiveness, background enrichment, and powerful prompt optimization methods, Langmem makes LLM agents more consistent, personalised, and increasingly human-like in their conversational abilities.
3. Mem0 and Mem0g
Mem0 is another architecture that dynamically extracts, evaluates, updates, and stores information from conversations using dedicated memory modules. Mem0g is a graph-based memory representation built on the foundations of Mem0, which helps capture more complex relationships and improves retrieval over the conversations.
3.1 Mem0
The Mem0 architecture mainly consists of two phases. The first is the extraction phase. As the name suggests, this part is responsible for extracting the most relevant context information from the conversation history for a given user query.
When a new user query arrives, the extraction phase first prepares the content required for information extraction. This content includes:
- The latest message pair (m_{t-1}, m_t), where m_t is the current user message and m_{t-1} is the previous message
- A sequence of the past n recent messages, {m_{t-n}, m_{t-n+1}, …, m_{t-2}}, where n is a hyperparameter
- The conversation-history summary S, retrieved from the database, which represents the semantic content of the entire conversation history. This summary is generated independently by an asynchronous summary-generation module that runs periodically without interfering with the main pipeline.

This content is used to prepare a comprehensive prompt:

P = (S, {m_{t-n}, m_{t-n+1}, …, m_{t-2}}, m_{t-1}, m_t)
Finally, this prompt is given to an extraction module powered by an LLM to extract the salient memories Ω, where,
Ω = {w_1, w_2, w_3, …, w_n}
The next phase is the update phase, which takes these extracted salient memories Ω and evaluates them. For each salient memory w_i, the top "s" semantically similar memories are retrieved from the database. The salient memory and the similar memories are then given to an LLM, which evaluates them against the existing memories. Based on this evaluation, the LLM calls one of its available tools to execute the next step. The tools include:
- ADD for creating new memories when no similar memories are retrieved. This means the new information should be added to the memory database.
- UPDATE for changing existing memories. This happens when the new information complements the existing ones; the memory should be updated with the recent information.
- DELETE for removing memories that are contradicted by the new information.
- NOOP when no operation is required.

For storing the memories, a vector database with dense representations is utilised.
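As a rough sketch of this update phase (with placeholder `vector_store`, `embed`, and `call_llm` helpers, not Mem0's actual API), the loop below retrieves similar memories for each extracted fact and lets an LLM pick one of the four operations.

```python
import json

OPERATIONS = {"ADD", "UPDATE", "DELETE", "NOOP"}

def update_memory(salient_memories: list[str], top_s: int = 5) -> None:
    for fact in salient_memories:
        # Retrieve the top-s existing memories most similar to the new fact.
        similar = vector_store.search(embed(fact), k=top_s)

        # Ask an LLM to compare the new fact against existing memories and
        # decide which operation to apply. Expected reply (illustrative):
        # {"op": "UPDATE", "target_id": "...", "text": "..."}
        decision = json.loads(call_llm(
            f"New fact: {fact}\n"
            f"Existing memories: {[m.text for m in similar]}\n"
            f"Reply with JSON choosing one of {sorted(OPERATIONS)}."
        ))

        if decision["op"] == "ADD":
            vector_store.add(embed(fact), text=fact)
        elif decision["op"] == "UPDATE":
            vector_store.update(decision["target_id"], text=decision["text"])
        elif decision["op"] == "DELETE":
            vector_store.delete(decision["target_id"])
        # NOOP: nothing to do
```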
3.2 Mem0g
Mem0g uses a graph-based memory architecture that represents data as a labelled graph G = (V, E, L), where V represents the nodes (Alice, San_Francisco, etc.), E represents the edges or relationships between the nodes (lives_in, works_at, studies_in, etc.), and L represents the labels that give semantic meaning to the nodes (Person, Place, etc.).
Here, an entity-extractor module first processes the input message from the user and extracts the set of entities along with their labels, using an LLM. Then another LLM-driven module, a relationship generator, derives meaningful relationships between the entities from the information in the message. Once the triplets are formed, Mem0g uses a hybrid approach to retrieve similar information from the database: an entity-centric method and a dense semantic-search method. In the entity-centric method, similar entities or nodes are first found using similarity metrics, and then a subgraph around these nodes is extracted, including both their incoming and outgoing relationships. In the dense similarity-search method, a dense embedding vector is calculated for the query and matched against the embedding vectors of each relationship triplet in the graph database. For the graph database, Mem0g utilizes Neo4j.
In the second phase, the update phase, a conflict-detection mechanism identifies any conflicting relationships within the graph as new data comes in. After that, an LLM-based update resolver determines whether any relationships need to be changed.
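Here is a rough sketch of the extraction step, assuming a placeholder `call_llm` that returns labelled triplets and using the same networkx graph idea from the earlier sketch in place of a real graph database.

```python
import json
import networkx as nx

graph = nx.MultiDiGraph()

def ingest_message(message: str) -> None:
    # 1. Entity and relationship extraction, delegated to an LLM (placeholder call).
    #    Expected reply (illustrative): a JSON list of objects like
    #    {"subject": ..., "subject_label": ..., "relation": ..., "object": ..., "object_label": ...}
    triplets = json.loads(call_llm(
        "Extract (subject, relation, object) triplets with entity labels "
        f"from this message as JSON:\n{message}"
    ))

    # 2. Write the triplets into the labelled graph G = (V, E, L).
    for t in triplets:
        graph.add_node(t["subject"], label=t["subject_label"])
        graph.add_node(t["object"], label=t["object_label"])
        graph.add_edge(t["subject"], t["object"], type=t["relation"])
```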
4. A-Mem
Although we have seen cases where graph databases are used for memory storage to provide a structural organization of data, many of these rely on predefined schemas and relationships. This rigidity makes them less adaptable and less able to represent a wide range of data formats. Consider a maths problem, for example, where the problem statement, the various solution steps, intermediate results, and related concepts may not fit neatly into a fixed graph schema.
The difficulty in generalizing the data representation across multiple environments was the main motivation for A-Mem. A-Mem can represent data in dynamic structures without relying on static, predetermined schemas.
A-Mem also works in multiple phases.
a. Note construction
Whenever a new message comes in, an LLM-driven module constructs a structured memory note, represented as m_i = {c_i, t_i, K_i, G_i, X_i, e_i, L_i}. All these memory notes are stored in a collection M.
For each m_i:
- c_i – the current interaction message
- t_i – the current timestamp
- K_i – keywords generated for the message by the LLM
- G_i – tags generated for the message by the LLM, used to categorize the content
- X_i – an LLM-generated contextual description
- e_i – a dense vector representation of the concatenation of c_i, K_i, G_i, and X_i
- L_i – the set of memories linked to the current note based on semantic similarity

b. Link Generation
When a new memory note is created, the top similar notes are first retrieved based on a semantic similarity search. Then, an LLM is prompted to construct meaningful connections between similar memories. The links are created more flexibly without any predefined rules or structures.
c. Memory evolution and memory retrieval
Once the links are created for the new memory, A-Mem then checks whether it has to update the similar memories based on the new information and connections, ensuring that the most up-to-date memories are stored in the database. The final stage is to prepare the context-aware memory for the user query so the LLM can generate the response. This is done by retrieving the most similar memories using similarity search on the dense embeddings: the top k relevant and updated memories are retrieved for the user query to construct the final prompt.
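As an illustrative sketch (not A-Mem's actual implementation), a memory note can be modelled as a small dataclass, with linking done by cosine similarity over the note embeddings; the embeddings and the LLM-generated fields are assumed to be produced elsewhere.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class MemoryNote:
    content: str                      # c_i: the interaction message
    timestamp: str                    # t_i
    keywords: list[str]               # K_i: LLM-generated keywords
    tags: list[str]                   # G_i: LLM-generated tags
    context: str                      # X_i: LLM-generated contextual description
    embedding: np.ndarray             # e_i: dense vector of c_i + K_i + G_i + X_i
    links: list[int] = field(default_factory=list)   # L_i: indices of linked notes

memory_collection: list[MemoryNote] = []   # the collection M

def link_new_note(note: MemoryNote, top_k: int = 5) -> None:
    """Attach the new note to its most semantically similar existing notes."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    ranked = sorted(
        range(len(memory_collection)),
        key=lambda i: cosine(note.embedding, memory_collection[i].embedding),
        reverse=True,
    )
    note.links = ranked[:top_k]
    memory_collection.append(note)
    # In A-Mem, an LLM would then refine these links and possibly evolve
    # (update) the neighbouring notes in light of the new information.
```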
Conclusion
The diverse approaches in memory management architectures such as MemGPT, Mem0, Mem0g, Langmem, and others highlight the evolving landscape of agentic memory for large language models. These architectures pave the way for more capable and context-aware LLM agents by enhancing long-term agentic memory and unlocking new potential in AI applications.
Thank you for reading this post! Let me know if you liked it, have questions, or spotted an error. Please feel free to contact or follow me through LinkedIn, Twitter, or Medium.
References
MemGPT: Towards LLMs as Operating Systems
How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior
A-MEM: Agentic Memory for LLM Agents
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory