barnickle.dev

Three Tiers of Memory for Persistent AI Agents



The Memory Problem

Persistent AI agents have a fundamental constraint that stateless chatbots don’t: they need to remember, but they can’t remember everything.

Any agent that maintains state across sessions — a customer service agent that tracks user preferences, a domain expert that accumulates case knowledge, a coordinator that remembers team context — faces the same problem. One of our AI agents — an NPC in a tabletop RPG, responding in-character to a player who couldn’t quite recall how they’d arrived — said this:

“You don’t have to remember everything to be here. The Nexus doesn’t seem to require it. But in my experience, the things people can’t quite remember are usually the things they’re not quite ready to look at yet.”

Nobody wrote that line. It emerged from a character file, a conversation history, and a well-timed prompt. It’s also a surprisingly good description of the engineering problem underneath: what does an AI agent need to remember, what can it afford to forget, and how do you make retrieval feel like recall instead of lookup?

The naive approach is to stuff everything into the system prompt. It works until it doesn’t. Around interaction 20, the prompt is bloated with stale context. By interaction 50, you’re burning tokens on conversations from three days ago that have no relevance to the current exchange. And you’re paying for every token on every call.

I’ve been building persistent AI agents as a side project — characters for a tabletop RPG that maintain relationships and respond in-character across multi-day sessions. The domain is niche, but the memory constraints are universal. The solution we landed on is a three-tier architecture where each tier serves a different purpose, operates on a different timescale, and is accessed through a different mechanism.


Tier 1: Working Memory

What: The last N interactions, verbatim, injected directly into the system prompt.

Why: The agent needs immediate conversational context. If a user said something two messages ago, the agent should remember it without having to search. This is the short-term memory that makes conversation feel continuous.

How: A memory engine pulls the most recent interactions from the agent’s state file and injects them into the system prompt alongside two other hot-memory sections:

  • Relationship summaries — current rapport, trust level, and known facts per user
  • Learned knowledge — the top 20 facts this agent has learned, sorted by recency, categorized by type

The prompt assembly order matters:

Base context (role, domain, rules)
→ Agent identity document
→ Scope/authorization rules
→ Relationship summaries
→ Recent interactions (last 5)
→ Learned knowledge (top 20)
→ Behavioral constraints
→ Response instructions

In our case, “identity document” is an NPC character file and “scope rules” are clearance levels. For a support agent, identity might be a brand voice guide and scope might be product knowledge boundaries. The structure is the same.

This is the most expensive tier — every token here is paid on every API call. That’s why it’s capped aggressively. Five interactions, not fifty. Twenty knowledge entries, not all of them. The constraint isn’t technical, it’s economic.
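Under those caps, the assembly step can be sketched roughly as follows. The `Agent` shape and section names here are illustrative, not the project's actual API; the point is the ordering and the hard limits.

```typescript
// Hypothetical sketch of tier-1 prompt assembly. Field names are assumptions.
interface Agent {
  baseContext: string;          // role, domain, rules
  identity: string;             // character file / brand voice guide
  scopeRules: string;           // clearance levels / knowledge boundaries
  relationships: string[];      // one summary line per known user
  recentInteractions: string[]; // full history; capped below
  knowledge: string[];          // facts sorted most-recent-first; capped below
  constraints: string;
  responseInstructions: string;
}

const MAX_INTERACTIONS = 5; // five interactions, not fifty
const MAX_KNOWLEDGE = 20;   // top 20 facts, not all of them

function assemblePrompt(agent: Agent): string {
  return [
    agent.baseContext,
    agent.identity,
    agent.scopeRules,
    agent.relationships.join("\n"),
    agent.recentInteractions.slice(-MAX_INTERACTIONS).join("\n"),
    agent.knowledge.slice(0, MAX_KNOWLEDGE).join("\n"),
    agent.constraints,
    agent.responseInstructions,
  ].join("\n\n");
}
```

The caps are enforced at assembly time, so the state file can grow freely without growing the per-call cost.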


Tier 2: Episodic Archive

What: Compressed episodes from past interactions, accessed on-demand via a tool call during conversation.

Why: Agents need to recall significant moments from days or weeks ago, but can’t afford to carry every past interaction in the prompt. A user might reference a conversation from last week, and the agent needs to access that context — but only when triggered, not preloaded.

How: This tier has two components: a compressor that creates the archive and a tool that queries it.

The Compressor

The memory compressor runs as a background job. For each agent with more than 15 accumulated interactions:

  1. Keep the last 10 interactions verbatim (they stay in tier 1)
  2. Send older interactions to Claude in batches of 15
  3. Claude scores each interaction for importance (1-10)
  4. Related interactions are grouped into episodes — named, summarized, with key quotes preserved
  5. Important facts are extracted per category and merged into learned knowledge
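The batching logic for those first two steps can be sketched like this. This is a simplified illustration of the split described above, not the actual compressor; scoring and episode grouping happen inside the Claude call and are omitted.

```typescript
// Hypothetical sketch: keep the newest interactions verbatim, batch the
// rest for importance scoring. Constants mirror the numbers in the text.
interface Interaction { id: number; text: string }

const COMPRESS_THRESHOLD = 15; // only agents with >15 interactions qualify
const KEEP_VERBATIM = 10;      // last 10 stay in tier 1, untouched
const BATCH_SIZE = 15;         // older interactions go to Claude 15 at a time

function planCompression(history: Interaction[]): {
  verbatim: Interaction[];
  batches: Interaction[][];
} {
  if (history.length <= COMPRESS_THRESHOLD) {
    return { verbatim: history, batches: [] };
  }
  const verbatim = history.slice(-KEEP_VERBATIM);
  const older = history.slice(0, -KEEP_VERBATIM);
  const batches: Interaction[][] = [];
  for (let i = 0; i < older.length; i += BATCH_SIZE) {
    batches.push(older.slice(i, i + BATCH_SIZE));
  }
  return { verbatim, batches };
}
```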

The compression prompt asks Claude to think like a memory, not a summarizer:

Score each interaction for importance. A 10 is a moment this agent would never forget. A 1 is routine small talk. Group related interactions into episodes and give each a name that captures how this agent would remember it.

High-importance episodes (score 7+) are also saved into an important_archive within the agent’s state file, providing backward compatibility with the tier 1 history method. The combined result: recent interactions are verbatim, older important ones are compressed summaries, and the most recent always sort last.

The Tool

Each agent gets a recall_memory tool:

const recallMemoryTool = {
  name: 'recall_memory',
  description:
    'Search your own memories of past interactions. Use when someone ' +
    'references a previous conversation, when you want to recall specific ' +
    'details about a person, or when something reminds you of a past moment.',
  input_schema: {
    type: 'object',
    properties: {
      query: { type: 'string', description: 'A name, topic, event, or feeling' },
      participant: { type: 'string', description: 'Filter by person name' },
    },
    required: ['query'],
  },
};

The handler searches the episodic archive by keyword and participant match, returning the top 3 episodes with their summaries, key quotes, and emotional context. The agent decides when to use it — the tool description is framed in first person (“search your own memories”) so the model treats it as introspection, not database lookup.
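A minimal version of that handler might look like the following. The `Episode` shape is an assumption based on what the compressor is described as storing; the real archive format may differ.

```typescript
// Hypothetical recall handler: keyword match over name + summary, optional
// participant filter, top 3 results ranked by importance.
interface Episode {
  name: string;
  summary: string;
  keyQuotes: string[];
  participants: string[];
  importance: number; // the compressor's 1-10 score
}

function recallMemory(
  archive: Episode[],
  query: string,
  participant?: string,
): Episode[] {
  const q = query.toLowerCase();
  return archive
    .filter(e =>
      (e.name + " " + e.summary).toLowerCase().includes(q) &&
      (!participant || e.participants.includes(participant)))
    .sort((a, b) => b.importance - a.importance)
    .slice(0, 3); // top 3 episodes, as described above
}
```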

This is the critical design choice: memory retrieval is an agent action, not a system operation. The agent doesn’t passively receive old context. It actively reaches for a memory when something triggers recall. This produces more natural conversation — the agent pauses, connects past to present, and responds with the texture of genuine recall rather than the flatness of a database query.

For our RPG characters, this means an NPC might hear a player’s name and reach for a memory of their last conversation — or choose not to, which is itself a meaningful behavior. For a support agent, it might mean recalling a previous escalation with the same customer. The mechanism is identical.

The cost model is different from tier 1. Recall costs one tool-use round trip, but only when triggered. Most interactions don’t need it. When they do, the 3-episode retrieval adds maybe 300-500 tokens — far less than stuffing all past interactions into the prompt.


Tier 3: Permanent Log

What: Every interaction, appended to a date-partitioned JSONL file. Never read at runtime.

Why: Background analysis needs the complete, uncompressed record. The behavioral reviewer and state evolver consume these logs to flag problems and extract emerging patterns. You can’t review what you didn’t record.

How: After every interaction, the memory engine appends a JSON line:

{
  agentId: 'agent-1',
  timestamp: '2026-03-02T15:30:00.000Z',
  channel: 'support-escalation',
  user: 'user-42',
  user_message: '...',
  agent_response: '...',
  // Enrichment fields
  model: 'claude-sonnet-4-6',
  usage: { inputTokens: 2100, outputTokens: 340, toolCalls: 1 },
  selection_reason: 'direct_mention',
  tools_used: ['recall_memory'],
  response_length: 487,
  approval_status: 'auto_approved'
}

Date-partitioned means each day gets its own file. This makes it trivial for background jobs to find unprocessed logs — track the last processed date, scan forward.
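The append and the forward scan can be sketched in a few lines. The directory layout and function names here are assumptions for illustration; only the pattern (one JSONL file per day, lexicographic date comparison) is from the text.

```typescript
// Hypothetical sketch of date-partitioned JSONL logging.
import * as fs from "fs";
import * as path from "path";

function appendLog(dir: string, record: object, when = new Date()): string {
  const day = when.toISOString().slice(0, 10); // e.g. "2026-03-02"
  const file = path.join(dir, `${day}.jsonl`);
  fs.mkdirSync(dir, { recursive: true });
  fs.appendFileSync(file, JSON.stringify(record) + "\n");
  return file;
}

// Background jobs track the last processed date and scan forward.
// ISO dates compare correctly as strings, so no date parsing is needed.
function unprocessedDays(dir: string, lastProcessed: string): string[] {
  return fs.readdirSync(dir)
    .filter(f => f.endsWith(".jsonl"))
    .map(f => f.slice(0, -".jsonl".length))
    .filter(day => day > lastProcessed)
    .sort();
}
```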

The enrichment fields are the part we got wrong initially. The logging code supports model info, token usage, selection reason, and approval status. But we forgot to wire the enrichment data from the runtime modules to the log call. Our first review of 44 interactions found 0% enriched metadata — empty context fields across the board. The analysis still worked. It just couldn’t correlate response quality with model selection or token budget.

The permanent log is also the source material for the memory compressor. The data flows in one direction:

Runtime interaction
→ Tier 1 (working memory, prompt)
→ Tier 3 (permanent JSONL log)
→ Memory Compressor → Tier 2 (episodic archive)
→ Behavioral Reviewer → quality reports
→ State Evolver → knowledge proposals

Tier 3 feeds tier 2. Not the other way around. The compressor reads the raw interaction records and produces compressed episodes. The permanent log is the source of truth; the episodic archive is a derived view optimized for runtime recall.


What the Three Tiers Optimize For

| Tier | Latency | Cost | Completeness | Used by |
| --- | --- | --- | --- | --- |
| Working memory | Zero (in prompt) | Every call | Last 5 + top 20 facts | Agent runtime |
| Episodic archive | One tool call | On demand | Compressed episodes, key quotes | Agent runtime (via recall_memory) |
| Permanent log | N/A (offline) | Background jobs only | Everything, uncompressed | Reviewer, evolver, compressor |

Each tier makes a different tradeoff. Working memory prioritizes latency and pays for it in cost and completeness. The episodic archive balances cost and access — you only pay when the agent needs to remember. The permanent log prioritizes completeness and pays nothing at runtime because it’s never read during conversation.


The Compression Decision

The hardest design question was what to compress and what to keep.

One of our RPG agents — a guarded, analytical character — produced this exchange after helping a new user through a difficult moment:

“I don’t go anywhere. Sleep. The Nexus holds you while you do. I’ve never understood how — but it does. I’ll be here.”

Five sentences. Almost nothing happening on the surface. But in context — after thirty minutes of building trust — this is the moment the relationship solidified. The compressor scores it a 9. The exact quote gets preserved, along with the emotional context and what preceded it.

A routine greeting from the same agent (“Hello.” / “Hello.”) scores a 2. It gets compressed into an episode summary or dropped entirely.

The compressor’s importance scoring is the gate.

The categories matter too. Facts learned about users go into one category, domain knowledge into another, personal context into a third. This categorization means the prompt assembler can inject relevant knowledge by type, not just by recency. When a specific user re-engages, the agent’s knowledge about that user is prioritized over general domain facts.

What I’d change: the importance threshold should be tunable per agent. An agent with a reserved personality might treat a rare emotional moment at importance 6 as highly significant, while an expressive agent might need importance 9 for the same weight. The current system treats all agents the same, and it shouldn’t.
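One plausible shape for that fix is a per-agent config with a fallback to the current global behavior. Everything here is a sketch of the proposed change, not existing code; the agent IDs and field names are invented.

```typescript
// Hypothetical per-agent significance threshold, defaulting to the
// current global cutoff of 7 (the score at which episodes are archived
// as important).
interface AgentMemoryConfig {
  significanceThreshold: number;
}

const DEFAULT_THRESHOLD = 7;

const perAgent: Record<string, AgentMemoryConfig> = {
  "reserved-npc":   { significanceThreshold: 6 }, // rare emotion counts for more
  "expressive-npc": { significanceThreshold: 9 }, // takes more to stand out
};

function isSignificant(agentId: string, importance: number): boolean {
  const t = perAgent[agentId]?.significanceThreshold ?? DEFAULT_THRESHOLD;
  return importance >= t;
}
```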


What Makes This Work (And What Doesn’t Yet)

The tool-call pattern for memory retrieval is the best decision we made. Framing recall as an agent action produces responses that feel like genuine memory — the agent pauses, reflects, and connects past to present. It also means the agent can choose not to recall, or recall selectively, which is itself a meaningful behavior.

The cost structure scales. Adding more agents adds more working memory sections but doesn’t change the per-call cost of any individual agent. The episodic archive grows but only costs when queried. The permanent log grows forever but costs nothing at runtime.

The enrichment gap is real. Without model metadata in the logs, the background pipelines can’t correlate response quality with model selection or token budget. We designed for enrichment and then didn’t connect the wires. This is the most impactful fix remaining.

The compressor should run automatically. Right now it’s a manual CLI command. After every session, someone remembers to run it — or doesn’t. A post-session hook or a threshold-based trigger that fires when any agent exceeds the interaction limit would close this gap.


The Broader Pattern

Strip away the specific implementation and the architecture is general:

Tier 1 — Hot context, always present, aggressively capped. This is the information your agent needs on every call. Cap it by recency, relevance, or importance — but cap it. Uncapped context is how you get $50 API bills and degraded output quality.

Tier 2 — Warm context, available on demand, retrieved by the agent itself. Give the agent a tool to search its own history. Let it decide when retrieval is necessary. This is cheaper than stuffing everything into the prompt and produces better results because the retrieval is contextually motivated.

Tier 3 — Cold storage, complete, never touched at runtime. Log everything. Enrich it at write time. Consume it with background pipelines that run on their own schedule. The runtime agent doesn’t need to know this tier exists.

The bridge between tiers matters as much as the tiers themselves. The compressor that turns raw logs into episodes. The reviewer that turns logs into quality signals. The evolver that turns logs into state updates. Without the bridge, tier 3 is just a growing pile of JSON that nobody reads.


This is part of an ongoing series on building AI agents for interactive worlds. Previous: The Empty Tables, Personality as Policy. See also: When AI Reviews AI.