AI Agents: Memory, Persistence and Checkpointing - Part 2
This is fully written by me without any help from AI, except for learning the concepts. This blog covers everything everyone should be aware of in the AI world, and it took me weeks to learn it so I could make it easy for you.
To understand the basics of agents and how to build one, check out Part 1 of the Agentic AI series here.
Overview
In Part 1, we read about the following:
- What are LLMs and how do they work
- How LLMs transform into agents to do real-world work
- Limitations of LLMs and AI Agents
In this blog, we will be targeting one of the major limitations of LLM-based agents - memory, persistence, and checkpointing.
LLMs are stateless, so they don't remember their past state and chats whenever they are called(through APIs), yet tools like OpenAI ChatGPT, Claude Code, Codex, etc. remember:
- The whole chat context from the very first message to the current message, even after days or weeks
- History of all the threads you have worked on with the AI tool
- The personalized facts about you
- The ability to ask you a question back, wait for you indefinitely, and still remember the chat when you come back with a reply
In this blog, you will read about how all these cases are handled in Agentic AI tools at a large scale.
Persistence
As we know, REST APIs are stateless, yet we maintain all the state in DBs or local filesystems to serve the data in APIs.
In a similar way, we use different types of persistent stores, including DBs and local filesystems, to serve various use-cases that require agent "memory".
It is mainly on 3 levels:
- Session State(Active Conversation)
- Thread Persistent(History)
- Semantic Memory(Personalized information) aka "smart" memory
1. Session State(Active Conversation)
This is the working memory of the conversation you are having with an active agent and is only applicable for short-term, ranging from seconds to minutes.
This only stores the current input at the time you are typing or loading it, including environment variables like user name, user's permissions, etc., before they are sent to LLM APIs.
The session state is mostly stored in RAM or in-memory caches like Redis, depending on the use case and size, and is persisted in the next layer(thread persistence) as soon as you send it as input to LLM.
2. Thread Persistent(History)
Whole threads, including user inputs, LLM outputs, tool results, etc., are stored for the longer term, from hours to months, for history maintenance, continuity, and crash recovery.
As soon as the input is fully processed and sent to the LLM, or the output comes as a response from the LLM API or different tools, it is persisted for long-term use.
We can persist multiple threads in the same way, not just a single thread. This is how Claude Code and Codex offer resume functionality.
Whenever someone resumes a session, it is loaded back in the memory and used as the current context for future LLM calls.
We can use local files to store these sessions, like Claude Code does, or use MySQL and NoSQL DBs like SQLite, MongoDB, mongo DB etc, which are used by ChatGPT-like tools, which are hosted services in nature.
One tricky thing to handle is stale data.
There can be a case where you ask: "Should I take an umbrella for shopping?" to LLM, it will call the "get current weather" tool for your location, the tool returns the current weather conditions, and using that, LLM figures out if you require an umbrella or not.
This conversation is stored in the persistent layer and can be resumed at any time in the future.
When this thread is resumed after a few days, the LLM knows that you need the response of the "get current weather" tool to figure out the answer, and it also sees that the tool was called earlier, so it knows the answer; thus, there is no need to call that tool again.
But along with the tool call history, it also maintains the timestamp of the conversation.
So, when a session is resumed, along with the past conversation, we also send a special prompt to LLM to identify a stale tool that needs to be "re-called".
With the above method, the past history/active context is updated, and the session is resumed.
Human in the loop(HITL)
Another tricky thing to handle is when LLM hits a sensitive tool that requires human intervention to continue, which can take minutes to hours, and it doesn't want the server to be blocked until it gets the approval back.
In case the LLM hits a sensitive tool for user approval, it persists the session and marks it "waiting for approval", sends a notification(on Slack, or UI popup, etc.) along with the session ID and required approval message.
Once the user approves, it triggers the webhook to initiate the session back, load it from the persistent store, and start the execution again.
3. Semantic Memory(Personalized information) aka "smart" memory
This is the smart memory that the agent creates for its future use for personalization.
An example would be how ChatGPT Pro works, which has context of your past conversations, and even if you create a new context window, it will still know "what you do", "what you want in life", "what you dislike about your current job", etc., if you have talked about these earlier.
A naive way to do this is to store all past conversations as they are, and load everything back in the current context whenever the user creates a new chat window.
The problem here is cost and performance when the scale is large, and users have 10s of thousands of conversations about all the different topics that can exist in the world.
Loading all these will not just be costly, but it will pollute the fresh context with information that is not relevant to the current window. Also, it will fill up the limits of the context window easily.
Another naive way to do this will be when someone asks, "how they should progress in their career", the agent will query the DB with keywords like "career", "progress", etc., to find what might not be available in their history, but things like "I am a software engineer" might be, which will be missed in the keyword based search.
How to solve this scalably?
Vector DB
Vector DB is the basic building block of memory in agents.
This is also the basic building block, RAG(retrieval augmented generation), that we will explore more about in the next blog.
Unlike traditional DBs, which store and index on textual and numerical data, a vector DB works on arrays of floating points called embeddings.
An embedding represents the coordinates of a word or text in an N-dimensional space, where N is the length of the array.
Any vector that is closer in space will have a similar semantic meaning as compared to a vector that is farther in space.
The distance is calculated using a simple formula for cosine similarity.
Ingestion in DB
There are embedding models available, like "text-embedding-3-small," which can convert any text to an array like [1.1, 73, 61, 44.2, 12.11].
Once converted, this is persisted in the vector DB along with metadata like the text corresponding to it, timestamp, etc.
Querying from DB
We use the same embedding model to convert user input to a vector and query the DB for the closest K neighbours using the cosine similarity formula, extract their text and other metadata, and return it.
How LLM-based agents use vector DB?
Factual data about the user is derived from their conversations.
For example, if a user has a chat, "I am a software developer, how should I proceed in my career?"
The agent extracts {"work": "software developer"} and persists it in the vector DB as a fact about the user.
The persistence can happen in multiple ways.
For agents that run on-demand locally, such as Claude Code, which only runs when a user runs the "claude" command on the terminal, these are generally loaded at the time SessionStarts hooks are run and persisted in a lightweight background process using hooks like SessionEnd or PostToolUse, which gets triggered once the history is persisted to analyse the session.
For agents that run on the server side, such as ChatGPT, a scheduled cron-like process runs in the background to go through the conversations, find out the facts, and persist those in a vector DB for future use cases.
This is how, if you ask ChatGPT, "What is the best time to go to my hometown in the next 3 months?", it itself computes from its own memory and different tools:
- Where is your hometown?
- What weather do you like more?
- What is your current location?
- What job do you do, and are there any upcoming holidays for you?
- Is your birthday coming and do you like to celebrate it with your family?
- Use weather tools to get the weather for the next 3 months.
- Use the flight tools to get prices from your current location to your hometown.
and suggests to you, based on all of the above and much more, the best time for you to visit your hometown.
Comments
Post a Comment