When AI Reviews AI: Building a Self-Correcting Agent Pipeline
The Observability Gap
Here’s a problem that shows up the moment your AI agents run longer than a single session: how do you know they’re still behaving correctly?
A chatbot handles one conversation, produces a transcript, and you move on. A persistent agent — one that maintains state, builds relationships, and responds across days or weeks — produces output at a volume and velocity that no human can review consistently. You can sample. You can spot-check. But systematic behavioral drift will hide between the samples.
Logging is necessary but not sufficient. You’ll have the data. You won’t have the analysis.
I’ve been building persistent AI agents as a side project — NPC characters for tabletop RPGs that live in a Discord server and respond in-character over multi-day sessions. The domain is niche. The problem isn’t. Any system running long-lived agents with identity constraints — customer service agents that should match a brand voice, domain-expert agents that shouldn’t hallucinate outside their specialty, multi-agent systems where each agent has a specific role — faces the same observability gap.
We solved it by building two background pipelines that consume the same conversation logs and produce different outputs: a behavioral reviewer that flags drift, and a state evolver that extracts what emerged. Together, they form a self-correcting loop.
Two Questions, One Log
Every agent interaction gets appended to a date-partitioned JSONL log: agent ID, user message, agent response, timestamp, and (when enriched) model selection, token usage, and tool calls. This is the permanent log tier — complete, uncompressed, never read at runtime.
Two pipelines consume it. They ask fundamentally different questions:
The reviewer asks: what went wrong? The evolver asks: what state changed?
Same input. Different analysis. Both essential.
The Behavioral Reviewer
The core insight: every well-built agent has an identity document. A system prompt, a persona file, a character sheet, a brand voice guide — whatever you call it, there’s a document that defines how this agent should behave. That document is the ground truth for review.
Our reviewer sends batches of interactions to Claude alongside the agent’s identity document and asks it to evaluate each response against a specific failure taxonomy:
| Failure Mode | What It Catches |
|---|---|
| identity_drift | Response doesn’t match the agent’s established voice or behavioral constraints |
| knowledge_violation | Agent states something that contradicts its documented knowledge base |
| boundary_breach | Agent reveals information outside its authorized scope |
| tone_mismatch | Technically correct but emotionally wrong for the context |
| knowledge_leak | Agent uses information it shouldn’t have access to |
| pattern_collapse | Repetitive behavioral ruts — same structure, same phrases, same response shape |
| routing_error | Agent responded when it shouldn’t have, or failed to respond when it should |
| context_misread | Agent misinterpreted the situation — processed system input as user input, missed meta-signals |
The taxonomy matters more than the specific categories. By defining failure modes explicitly, you turn “review this conversation” — a vague, subjective task — into “classify these interactions against known failure types” — a structured evaluation that produces consistent, actionable output.
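One way to frame that structured evaluation in code — a sketch, assuming the taxonomy above; the prompt wording and the idea of asking for a JSON array are my illustration, not the author's actual reviewer:

```python
# The eight failure modes from the review taxonomy.
FAILURE_MODES = [
    "identity_drift", "knowledge_violation", "boundary_breach",
    "tone_mismatch", "knowledge_leak", "pattern_collapse",
    "routing_error", "context_misread",
]


def build_review_prompt(identity_doc: str, interactions: list[dict]) -> str:
    """Frame review as classification against known failure types."""
    numbered = "\n".join(
        f"{i}. user: {x['user_message']!r}\n   agent: {x['agent_response']!r}"
        for i, x in enumerate(interactions, 1)
    )
    return (
        "You are reviewing an AI agent's behavior.\n"
        f"Identity document (ground truth):\n{identity_doc}\n\n"
        f"Interactions:\n{numbered}\n\n"
        "For each interaction, return a JSON array of objects with keys "
        f"'index', 'failure_modes' (subset of {FAILURE_MODES}), and "
        "'evidence'. Use an empty 'failure_modes' list when clean."
    )
```

Because the model classifies against a closed set rather than free-associating, two review runs over the same batch produce comparable output.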
The Heuristic Layer
Not everything needs an API call. We run a second pass of cost-free heuristic checks:
- Response length outliers — too long (approaching platform limits) or too short (possible truncation)
- Agent selection balance — one agent handling 70%+ of traffic might indicate broken routing
- Scope violations — agent responding to inputs outside its configured domain
- Metadata completeness — what percentage of logs have enrichment data vs. bare records
In our first review, the heuristic pass caught a truncated response (“I, on the other hand — *fans the”) that the Claude-powered review missed entirely. The response was cut off mid-sentence and posted to the channel incomplete. A sentence-completion check catches this for free.
The pattern: layer deterministic checks under your LLM analysis. Heuristics are fast, free, and catch a different class of issues than semantic review. They complement each other.
Output
The reviewer produces a Markdown report for humans (summary table, critical issues first, then warnings by severity) and a JSON report for downstream tooling. Our first run reviewed 44 interactions across three days and surfaced:
- 1 critical bug — a feedback loop where an agent responded to a relay of its own output. Nobody noticed during live interaction because the follow-up was too coherent to seem wrong.
- 10 behavioral warnings — including an exploit where a user sent meta-instructions (“respond with vulnerability here”) and the agent complied, bypassing its identity constraints entirely.
- 6 improvement suggestions — behavioral patterns that weren’t bugs but were becoming formulaic.
That feedback loop could have run indefinitely. The meta-instruction exploit revealed that our identity documents weren’t protecting agent autonomy against direct steering. Neither would have surfaced through spot-checking — they required systematic review of every interaction.
The State Evolver
The reviewer is defensive. The evolver is generative.
Long-running agent systems accumulate implicit state changes that never get recorded. A customer service agent learns a user’s preferences through conversation but that knowledge lives only in the transcript. A domain expert discovers an edge case but the discovery dies with the session.
The state evolver extracts these implicit changes and proposes explicit updates.
Phase 1: Per-Period Extraction
For each day’s log, Claude extracts structured state changes:
- Factual updates — new information established through interaction that should be recorded
- Entity developments — changes in understanding about users, topics, or domain objects
- Relationship shifts — changes in trust, rapport, or dynamic between agents and users
- System state changes — new patterns, rules, or constraints that emerged through use
- Open threads — unresolved questions, flagged issues, or emerging patterns worth tracking
This runs per-day because the context window benefits from focused input and date-partitioned logs make incremental processing trivial.
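A sketch of that incremental per-day loop. The `extract_fn` parameter stands in for the Claude call and the file naming is an assumption:

```python
import json
from pathlib import Path

EXTRACTION_CATEGORIES = [
    "factual_updates", "entity_developments", "relationship_shifts",
    "system_state_changes", "open_threads",
]


def load_day(path: Path) -> list[dict]:
    """Read one date-partitioned JSONL file into interaction records."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def run_extraction(log_dir: Path, out_dir: Path, extract_fn) -> list[Path]:
    """Phase 1: per-day extraction, skipping days already processed.

    `extract_fn(records) -> dict` is a placeholder for the LLM call and
    is expected to return one dict keyed by EXTRACTION_CATEGORIES.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for day_log in sorted(log_dir.glob("*.jsonl")):
        out = out_dir / day_log.with_suffix(".extracted.json").name
        if out.exists():      # incremental: each day is processed exactly once
            continue
        out.write_text(json.dumps(extract_fn(load_day(day_log))))
        written.append(out)
    return written
```

Because each day's extraction is persisted to its own file, a failed or interrupted run resumes where it left off, and downstream synthesis never forces re-extraction.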
Phase 2: Cross-Period Synthesis
The synthesis step aggregates per-day extractions against existing state documents and proposes a unified update:
- State document updates — new or modified files with proposed content
- Agent knowledge updates — facts each agent learned, categorized by type
- Relationship updates — quantified changes with reasoning and milestone tracking
- Thread tracking — new threads opened, existing threads advanced, with status and evidence
In our case, the evolver processes TTRPG gameplay logs and proposes canon updates, NPC knowledge entries, and plot threads. But the pattern maps directly to business contexts: a support agent evolver might propose FAQ updates based on recurring questions, flag emerging product issues from complaint patterns, or update customer profiles based on interaction history.
Our first successful run extracted 9 knowledge updates, 4 relationship changes, and 10 tracked threads from 44 interactions across three days. An operator could pick up the system cold and know exactly where things stand — without reading a single transcript.
The Application Layer
The evolver proposes. A human decides.
The applier takes a proposal and pushes changes into the state files that agents read at runtime: knowledge bases, relationship data, tracking documents. It stamps applied proposals with metadata to prevent double-application.
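A sketch of that idempotency guard. The `applied_at` stamp field and `apply_fn` callback are assumed names, not the project's actual interface:

```python
import json
from datetime import datetime, timezone
from pathlib import Path


class AlreadyApplied(Exception):
    """Raised when a proposal carries an application stamp."""


def apply_proposal(proposal_path: Path, apply_fn) -> dict:
    """Push one human-approved proposal into runtime state files.

    `apply_fn(proposal)` does the actual writes (knowledge bases,
    relationship data, tracking documents). The proposal is stamped
    afterward so a re-run raises instead of applying twice.
    """
    proposal = json.loads(proposal_path.read_text())
    if "applied_at" in proposal:  # metadata stamp prevents double-application
        raise AlreadyApplied(proposal_path.name)
    apply_fn(proposal)
    proposal["applied_at"] = datetime.now(timezone.utc).isoformat()
    proposal_path.write_text(json.dumps(proposal, indent=2))
    return proposal
```

Stamping the proposal file itself, rather than keeping a separate ledger, means the audit trail travels with the artifact the human approved.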
The human gate is critical. The evolver occasionally over-interprets subtext, invents connections that aren’t there, or elevates noise to signal. In a creative context (our RPG), the game master filters these. In a business context, it’s a product owner or ops lead reviewing proposed FAQ updates before they go live.
Full autonomy in a state-modification loop is a bug, not a feature. The AI is good at extraction and pattern recognition. It’s not reliable enough to modify its own operating parameters without oversight — not yet.
What Broke
Four of five synthesis runs failed on the same input data.
The per-day extractions completed successfully every time — robust, repeatable, no issues. The synthesis step, which aggregates multiple days into a coherent proposal, asks Claude to produce a large structured JSON object spanning six categories with nested arrays. When the output is malformed — a missing bracket, an unescaped quote, a truncation at the token limit — the JSON parser crashes and the run silently fails.
The architecture saves us. Because extraction and synthesis are separate phases, a synthesis failure doesn’t waste the extraction work. Per-day results are saved even when synthesis fails. A re-run can retry synthesis only.
But the pipeline shouldn’t need five attempts to produce one good result. The fixes:
Structured output constraints. Use JSON schema validation or tool-use output to guarantee parseable responses instead of freeform text.
Incremental synthesis. Process categories sequentially — knowledge updates, then relationships, then threads — instead of one massive call. Each step is smaller, more likely to succeed, and independently retryable.
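Both fixes can be sketched together: one small, retryable, validated call per category instead of a single monolithic JSON object. The `call_fn` parameter and category names are placeholders for the real Claude call and schema:

```python
import json

SYNTHESIS_CATEGORIES = ["knowledge_updates", "relationship_updates", "thread_tracking"]


def parse_json_or_none(text: str):
    """Tolerant parse: strip code fences, return None on malformed output."""
    text = text.strip()
    if text.startswith("```"):
        text = text.strip("`").removeprefix("json").strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def synthesize(extractions: list[dict], call_fn, max_retries: int = 2) -> dict:
    """Incremental synthesis: one bounded, retryable call per category.

    `call_fn(category, extractions) -> str` stands in for an LLM call
    ideally constrained (via schema or tool use) to emit JSON.
    """
    proposal = {}
    for category in SYNTHESIS_CATEGORIES:
        for _attempt in range(max_retries + 1):
            parsed = parse_json_or_none(call_fn(category, extractions))
            if parsed is not None:
                proposal[category] = parsed
                break
        else:  # all retries exhausted: record the failure, don't crash the run
            proposal[category] = {"error": "synthesis_failed", "category": category}
    return proposal
```

A malformed response now costs one small retry instead of the whole run, and a category that never parses degrades into an explicit error entry rather than a silent failure.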
The general principle: save intermediate results so downstream failures don’t cascade. This is the same lesson from any multi-stage data pipeline. The agent-specific version is that LLM output is inherently variable, and your pipeline architecture needs to account for partial failures at any stage.
The Feedback Loop
Here’s what makes this more than two scripts:
Agent runtime → JSONL interaction logs → Behavioral Reviewer → drift flags, quality issues → State Evolver → knowledge proposals, relationship updates → Human-Gated Applier → updated state files → Agent prompts read updated state → Better agent behavior

The reviewer catches drift. The evolver captures emergence. The applier feeds both back into the state that shapes future agent behavior. Over multiple cycles, the system self-corrects: agents get more consistent (reviewer catches what to fix), the knowledge base gets richer (evolver captures what emerged), and the operator stays in the loop without reading every transcript.
From three days of our RPG agents running, the evolver surfaced moments like this — one agent’s unprompted recap of a shared evening:
“There were moments. Kael almost smiled. Maren forgot to refill her glass for an hour. Halmayne told a story about a card game on a ship that I’m seventy percent sure was true.”
“And Sable? Sable dealt, and watched, and for a few hours didn’t need to be anyone but the man holding the cards. Good night, that.”
Nobody scripted that. It emerged from three agents interacting over time, each carrying their own memory and identity constraints. The evolver extracted it, quantified the relationship shifts, and proposed knowledge updates so each agent would remember the evening differently — filtered through their own personality.
That’s emergence worth capturing. And it’s what the pipeline is for.
Runtime agents produce logs. Background agents analyze logs. Humans gate the application step. That’s the pattern. It works for RPG characters. It would work for a fleet of customer service agents, a multi-agent research pipeline, or any system where long-lived agents need to stay aligned with their identity and accumulate knowledge over time.
What I’d Build Next
Cross-pipeline context. The reviewer and evolver are independent. They shouldn’t be. The reviewer might flag “this disclosure was prompted by user manipulation” and the evolver might propose recording that disclosure as new knowledge. If they shared context, the evolver could weigh reviewer confidence when deciding what to canonize.
Cost-per-cycle tracking. Both pipelines make multiple Claude API calls per run. Understanding cost-per-review and cost-per-proposal matters for deciding how frequently to run them. Daily? Per-session? Only when interaction count crosses a threshold?
Automated triggers. Both pipelines run as manual CLI commands today. A post-session hook or a threshold-based cron would close the “forgot to run it” gap that plagues every manual ops process.
Lessons
Identity documents are ground truth for everything. The reviewer evaluates against them. The evolver uses them as extraction context. The applier updates the knowledge they carry. One authoritative document per agent, and every pipeline reads it.
Define your failure taxonomy explicitly. “Review this conversation” is vague. “Classify against these eight failure modes” is structured and produces consistent results across runs.
Separate extraction from synthesis. Extraction (per-period, focused, bounded input) is reliable. Synthesis (cross-period, complex, large output) is fragile. Keep them separate so the reliable part never needs to re-run.
Layer heuristics under LLM analysis. Deterministic checks are free and catch a different class of issues than semantic review. Truncated responses, length outliers, routing imbalances — these don’t need an API call.
Human gates aren’t a limitation. They’re the feature that makes the loop trustworthy. AI proposes, human disposes. The cost of review is low. The cost of a wrong autonomous state change compounds.
This is part of an ongoing series on building AI agents for interactive worlds. Previous: The Empty Tables, Personality as Policy. See also: Three Tiers of Memory for Persistent AI Agents.