From Chatbot to Agent: The Engineer's Guide to LLM Memory Systems
2026-02-21

7 min read · AI Development · Tutorials · Engineering · Python · AI Agents · LLM Memory · RAG · System Architecture · Vector Databases

A technical deep dive into building memory systems for AI agents. Learn the difference between short-term buffering and long-term vector storage, see actual implementation logic, and discover how to prevent memory drift and poisoning in production.

The Goldfish Problem

Here is the hard truth about the LLM API you are calling: it doesn't care about you, and it certainly doesn't remember you. Every time you send a request to GPT-4 or Claude, you are meeting for the first time. The model is stateless.

If you are building a simple chatbot, you can fake memory by sending the chat history back and forth. But if you are building an autonomous agent—something that needs to learn from past interactions, reference user preferences established weeks ago, or maintain a consistent persona over thousands of steps—the "pass the whole history" strategy breaks down immediately.

Context windows are expensive, and more importantly, they are finite. You cannot shove a gigabyte of user logs into a 128k window and expect coherence. You need a memory architecture.

In this post, we are moving past the basics. We are going to build a mental model for Agent Memory, look at the code patterns to implement it, and discuss the edge cases that break these systems in production.


The Anatomy of Agent Memory

To architect an agent, we borrow heavily from human cognitive science. We generally categorize agent memory into three distinct buckets:

1. Short-Term / Working Memory

This is the context window. It contains the immediate conversation history, the system prompt, and the scratchpad (intermediate reasoning steps). It is high-fidelity but ephemeral. If it slides out of the context window, it is gone forever.

2. Long-Term / Episodic Memory

This is your database. It stores events, facts, and interactions from the past. In modern AI stacks, this is almost exclusively handled by Vector Databases (RAG). By embedding text into vectors, we can perform semantic search to retrieve memories relevant to the current task, rather than just the most recent ones.
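
To make the retrieval idea concrete, here is a toy episodic store that keeps (text, embedding) pairs in a list and ranks them by cosine similarity. The embed callable is a stand-in for whatever embedding model you use; in production, this whole class is replaced by a real vector database.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class EpisodicMemory:
    def __init__(self, embed):
        self.embed = embed    # any function mapping text -> np.ndarray
        self.entries = []     # list of (text, vector) pairs

    def add(self, text):
        self.entries.append((text, self.embed(text)))

    def search(self, query, k=3):
        # Rank stored memories by semantic closeness to the query, not by recency
        q = self.embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine_similarity(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]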

3. Procedural Memory

This is often overlooked. Procedural memory is the agent's knowledge of how to do things. In code, this manifests as few-shot examples embedded in the prompt or specific tool definitions. It's static memory about capabilities.
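
A minimal sketch of what that static, "how-to" memory looks like in practice: a hard-coded tool list and a few-shot example compiled into the system prompt. The tool name and example below are purely illustrative.

TOOLS = [
    {"name": "search_tickets", "description": "Look up support tickets by keyword."},
]

FEW_SHOT_EXAMPLES = [
    ("How do I reset my API key?", "Go to Settings -> API Keys -> Regenerate."),
]

def build_procedural_prompt():
    # Capabilities and worked examples are compiled in every time; they do not change per user
    tool_lines = "\n".join(f"- {t['name']}: {t['description']}" for t in TOOLS)
    example_lines = "\n".join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT_EXAMPLES)
    return f"You can call these tools:\n{tool_lines}\n\nExamples:\n{example_lines}"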


Implementation Patterns: The Build

Let's look at how we actually code this. We will focus on a hybrid architecture: a Conversational Buffer for the immediate now, backed by a Vector Store for the long term.

Pattern 1: The Rolling Window (Short-Term)

The simplest form of memory is a deque (double-ended queue). You keep the last N turns of conversation. When turn N+1 arrives, the oldest turn drops off.

from collections import deque

class ShortTermMemory:
    def __init__(self, max_turns=5):
        # Stores tuples of (User, AI)
        self.history = deque(maxlen=max_turns)

    def add_interaction(self, user_input, ai_response):
        self.history.append((user_input, ai_response))

    def get_context_string(self):
        return "\n".join([f"User: {u}\nAI: {a}" for u, a in self.history])

The Limitation: This creates the "Goldfish" effect. As soon as a crucial piece of information (like a user's name mentioned 6 turns ago) drops off the queue, the agent forgets it entirely.

Pattern 2: The Summary Layer

To mitigate the hard cutoff of the rolling window, we add a summarization step. Before a conversation turn drops out of the short-term window, we ask an LLM to condense it into a running summary.

Prompt Logic:
"Here is the current summary of the conversation. Here is the new interaction. Update the summary to include new key details while keeping it concise."

This allows the agent to maintain high-level context (e.g., "The user is debugging a Python script") without wasting tokens on the verbatim code blocks discussed ten minutes ago.
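
Here is a rough sketch of how that summary layer can sit on top of the ShortTermMemory class from Pattern 1. The llm object and its generate() method are placeholders for whatever client you are using, just like in the Pattern 3 code below.

class SummarizingMemory:
    def __init__(self, short_term, llm):
        self.short_term = short_term   # ShortTermMemory instance from Pattern 1
        self.llm = llm                 # placeholder LLM client
        self.summary = ""

    def add_interaction(self, user_input, ai_response):
        # Before the oldest turn falls off the deque, fold it into the running summary
        if len(self.short_term.history) == self.short_term.history.maxlen:
            old_user, old_ai = self.short_term.history[0]
            prompt = (
                f"Here is the current summary of the conversation:\n{self.summary}\n\n"
                f"Here is the new interaction:\nUser: {old_user}\nAI: {old_ai}\n\n"
                "Update the summary to include new key details while keeping it concise."
            )
            self.summary = self.llm.generate(prompt)
        self.short_term.add_interaction(user_input, ai_response)

    def get_context_string(self):
        return f"SUMMARY SO FAR:\n{self.summary}\n\nRECENT TURNS:\n{self.short_term.get_context_string()}"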

Pattern 3: The Hybrid Store (Long-Term Implementation)

This is where things get interesting. We need to store facts indefinitely. For this, we use a retrieval workflow. Here is the logic flow for a production-grade memory system:

  1. User Input: The user sends a message.
  2. Query Generation: The agent (or a lightweight model) generates a search query based on the input. (e.g., User: "Update that file." -> Query: "What file was the user working on recently?")
  3. Retrieval: We search the Vector DB for semantically similar chunks from the past.
  4. Context Injection: We inject the retrieved memories into the System Prompt.
  5. Generation: The LLM generates a response.
  6. Storage: We save the current interaction into the Vector DB for future retrieval.

Here is a simplified Pythonic representation of this loop:

def agent_cycle(user_input, llm, vector_store, chat_history):

    # 1. Retrieval: Check long-term memory
    # We search for memories relevant to the CURRENT input
    relevant_docs = vector_store.similarity_search(user_input, k=2)
    context_str = "\n".join([d.page_content for d in relevant_docs])

    # 2. Construct Prompt
    system_msg = f"""
    You are a helpful assistant.

    LONG TERM MEMORY:
    {context_str}

    RECENT CONVERSATION:
    {chat_history.get_context_string()}
    """

    # 3. Call LLM (llm is whatever client you use; generate() is a placeholder)
    response = llm.generate(system_msg, user_input)

    # 4. Write to Memory (The crucial step)
    # Save to short-term buffer
    chat_history.add_interaction(user_input, response)

    # Save to long-term vector store (usually done async)
    vector_store.add_texts([f"User asked: {user_input}. AI Answered: {response}"])

    return response

The Dark Side: Memory Pitfalls

Implementing the code above is easy. Making it work reliably is hard. Here are the three main issues you will face when building custom memory architectures.

1. Memory Drift

Over time, summaries of summaries become vague. If you summarize a summary 50 times, "The user wants a Python script to scrape LinkedIn" eventually drifts into "The user wants code."

The Fix: Keep "pristine" entities separate. Extract specific facts (Names, Dates, Tech Stack) and store them in a structured JSON object or SQL database alongside the unstructured vector memory. Always inject these hard facts raw, without summarization.
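
One way to sketch this, assuming a plain dict of extracted entities is enough for your use case (the keys below are illustrative):

import json

class FactStore:
    """Pristine, structured facts kept separate from summarized / vectorized memory."""
    def __init__(self):
        self.facts = {}   # e.g. {"name": "Priya", "tech_stack": "Python", "deadline": "2026-03-01"}

    def update(self, key, value):
        # Hard facts are overwritten in place, never summarized
        self.facts[key] = value

    def render(self):
        # Injected raw into the system prompt, bypassing summarization entirely
        return "KNOWN FACTS:\n" + json.dumps(self.facts, indent=2)

Populating the store is an extraction problem: either a cheap LLM call that pulls out names, dates, and stack details, or explicit application logic.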

2. Context Poisoning

Retrieval isn't magic. Sometimes, a vector search returns irrelevant information because the keywords match, even if the context is different. If the user asks "Why is the server down?" and your RAG retrieves a memory about a server crash from 2021, the LLM might hallucinate that the 2021 solution applies today.

The Fix: Use Recency Weighting. When scoring vector matches, apply a decay function to the score based on the timestamp. A memory from 5 minutes ago should rank higher than a memory from 5 months ago, assuming semantic similarity is roughly equal.
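
A possible shape for that decay, assuming each stored memory carries a creation timestamp and a precomputed similarity score (field names illustrative):

import time

def recency_weighted_score(similarity, created_at, half_life_days=30.0):
    # Exponential decay: a memory loses half its weight every `half_life_days`
    age_days = (time.time() - created_at) / 86400
    return similarity * (0.5 ** (age_days / half_life_days))

# Re-rank retrieved candidates by the combined score:
# ranked = sorted(candidates,
#                 key=lambda d: recency_weighted_score(d.score, d.metadata["created_at"]),
#                 reverse=True)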

3. The "I Know That" Loop

If you store every interaction in the vector database, the database fills up with repetitive conversational filler ("Hi", "Thanks", "Okay"). Retrieving this junk dilutes the context window.

The Fix: Implement a Salience Filter. Before saving an interaction to long-term memory, ask a cheap model (like GPT-4o-mini): "Does this interaction contain facts worth remembering?" If not, discard it. Only save high-value interactions.
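
A sketch of that gate, using the same placeholder llm client as above:

def is_worth_remembering(user_input, ai_response, llm):
    # Cheap classification call: only durable facts survive into long-term memory
    prompt = (
        "Does the following exchange contain facts worth remembering long term "
        "(names, preferences, decisions, technical details)? Answer YES or NO.\n\n"
        f"User: {user_input}\nAI: {ai_response}"
    )
    verdict = llm.generate(prompt)
    return verdict.strip().upper().startswith("YES")

In agent_cycle above, this becomes a guard in front of the vector_store.add_texts(...) call.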


Conclusion: Building for Continuity

Memory is the difference between a tool and a teammate. A tool executes a command. A teammate remembers that you prefer TypeScript over Python, that you are currently migrating a legacy database, and that you hate verbose comments.

Start simple. Build a rolling window. Then, add a vector store for specific facts. Do not try to build a digital human brain on day one. Watch how the context window fills up, monitor the costs, and iterate.

Now, go build something that remembers.
