Giving AI a Library: How I Made Claude Remember 161 Conversations

The Problem Everyone Ignores

Every time you open an AI assistant, it doesn't know you.

It doesn't matter what you discussed yesterday, what decisions you made, what bugs you solved together. New session, blank slate. You have a brilliant colleague who gets total amnesia every morning.

I've been using Claude Code for serious development work for about two and a half months now. 161 sessions. Over 6,000 message turns. Nearly 20 projects — from system architecture to WeChat mini-programs to writing a textbook. The equivalent API cost would be $5,800+.

All of that context — the decisions, the reasoning, the dead ends, the breakthroughs — locked in isolated .jsonl files that Claude itself can never see.

Then I saw Karpathy's tweet, and something clicked.

Karpathy's Insight

In April 2026, Andrej Karpathy shared a workflow he'd been using heavily: LLM as knowledge base editor. Raw materials go in, the LLM "compiles" them into a wiki of interlinked .md files, Obsidian renders the graph, and you query against it. The LLM is the editor, not the human. You just feed it raw material and ask questions.

It's a clean, elegant idea. And people are already running with it — compiling blog posts, research papers, and notes into queryable wikis. That path works.

But I saw a fork in the road.

The Fork: Content Management vs. Memory Augmentation

Most implementations of Karpathy's idea are doing content management. Input: articles, papers, notes that humans wrote. Output: a better-organized knowledge base for humans to query.

My situation was different. I didn't have 43 blog posts. I had 161 deep conversations with an AI. These aren't one-directional artifacts — they contain the AI's own reasoning, decision context, debugging traces, architectural discussions, and trial-and-error processes.

Here's the fork:

Karpathy's route: Human knowledge → compile → wiki → human queries it

Our route: Human-AI conversations → compile → wiki → feed it back to the AI → AI remembers

The difference is that last arrow. We feed the compiled output back into the AI's own working environment, so it can remember everything from previous sessions. This isn't helping a human organize their files. This is giving AI an external long-term memory.

If you've seen Iron Man — Jarvis isn't Jarvis because he's smart. LLMs are already smart enough. Jarvis is Jarvis because he remembers everything about Tony. Every project, every preference, every decision chain.

That's the actual problem I'm solving.

Design: Four-Layer Memory Architecture

An LLM natively has exactly one layer of memory: the current context window. Even at 1M tokens, it evaporates when the session closes.

This is like having working memory but no hippocampus. You can hold things in mind while you're thinking, but you can never form long-term memories. Every morning is a fresh start — sounds poetic, is actually a disaster.

My solution is a four-layer memory system, loosely modeled on how human memory works:

┌─────────────────────────────────────────────────────┐
│  Layer 4: Short-term memory (the workbench)         │
│  Current session context window                     │
│  Most precise, but evaporates on close              │
├─────────────────────────────────────────────────────┤
│  Layer 3: Medium-term memory (sticky notes)         │
│  Recent decision logs + 24h keyword heatmap         │
│  Decay period: ~1-2 weeks                           │
├─────────────────────────────────────────────────────┤
│  Layer 2: Compiled memory (the library)  ← NEW      │
│  Full conversation history compiled into            │
│  an Obsidian wiki with concept nodes                │
│  49 concept nodes, cross-session evolution tracking │
├─────────────────────────────────────────────────────┤
│  Layer 1: Long-term memory (the identity card)      │
│  ~100 lines, auto-loaded every session              │
│  Identity, project pointers, security principles    │
│  Always on, but extremely limited capacity          │
└─────────────────────────────────────────────────────┘

Each layer solves a different problem:

Layer 1 answers "who am I" — about 100 lines of identity information loaded at session start. Name, who it works with, active projects. Like waking up and knowing who you are and where you work. But 100 lines is almost nothing.

Layer 3 answers "what have I been doing lately" — auto-generated decision logs and focus keywords. Like sticky notes on your monitor. Useful, but only covers the past week or two.

Layer 4 is the conversation itself — most precise, gone when you close the tab.

Layer 2 is the key. Before adding this layer, there was an enormous blind spot: what decision did we make last week? How did we fix that bug last month? When did that project actually start? None of this fits in 100 lines. None of it is on the sticky notes. It's scattered across 161 session files that the AI cannot access.

The compiled layer distills those 161 sessions into a queryable library.
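To make the layering concrete, here is a minimal sketch of how a session preamble could be assembled from the on-disk layers. The file names and layout are assumptions for illustration, not Shenron's published structure:

```python
from pathlib import Path

# Hypothetical layer layout -- the post does not publish the real file
# names, so these paths are assumptions for illustration.
LAYERS = [
    ("layer1_identity", "MEMORY.md"),        # ~100 lines, auto-loaded
    ("layer3_recent",   "decision_log.md"),  # sticky notes, ~1-2 week decay
    ("layer2_wiki",     "wiki/INDEX.md"),    # entry point to the compiled library
]

def assemble_context(root: str) -> str:
    """Concatenate whatever memory layers exist on disk into one preamble.

    Layer 4 (the live context window) needs no loading -- it IS the session.
    Missing layers are skipped, so the sketch degrades gracefully.
    """
    parts = []
    for name, rel in LAYERS:
        path = Path(root) / rel
        if path.exists():
            parts.append(f"## {name}\n{path.read_text(encoding='utf-8')}")
    return "\n\n".join(parts)
```

The point of the sketch is the ordering: identity first, recent decisions next, the library index last, with the library consulted in depth only on demand.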

The Compiler

Here's what the compilation pipeline looks like:

161 session files (839 MB raw data)
        ↓  compile
36 daily digests + 49 concept nodes + index
        ↓  open in Obsidian
A visual knowledge graph

The compiler (Shenron) works in two stages:

Stage 1: Indexing (pure algorithm, 2-5 seconds)

The compiler scans each session file, extracts key entities through regex dictionary matching, then:

  1. Daily merge: Multiple sessions from the same day get merged into a single daily digest, annotated with concepts touched and key decisions made
  2. Concept node extraction: Recurring entities become standalone pages with Obsidian [[wikilinks]], forming a knowledge network
  3. Index generation: A master index auto-generated with weight classification (strategic research / development iteration / daily ops)

This stage uses zero LLM calls. 161 sessions index in 2-5 seconds. The output is a skeleton — daily digests exist, concept nodes are created, links are woven, but concept definitions and evolution narratives are still blank.
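A minimal sketch of that pass, assuming a seed-term dictionary and (date, text) session pairs; the real extractor and its dictionary are more elaborate:

```python
import re
from collections import defaultdict

# Illustrative seed dictionary -- the real term list is larger and
# human-curated.
SEED_TERMS = ["memory system", "sandbox", "supply chain"]
PATTERN = re.compile("|".join(re.escape(t) for t in SEED_TERMS), re.IGNORECASE)

def index_sessions(sessions):
    """Stage 1 sketch: a pure-algorithm pass with zero LLM calls.

    `sessions` is a list of (date, text) pairs. Sessions from the same day
    merge into one daily digest; each recognized term becomes a concept
    stub listing the days it appears -- the skeleton Stage 2 fills in.
    """
    daily = defaultdict(set)     # date -> concepts touched that day
    concepts = defaultdict(set)  # concept -> dates it appears on
    for date, text in sessions:
        for hit in PATTERN.findall(text):
            concept = hit.lower()
            daily[date].add(concept)
            concepts[concept].add(date)
    return daily, concepts
```

Because it is just regex matching and set merges, this scales to hundreds of sessions in seconds.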

Stage 2: Enrichment (LLM-driven, 5-30 minutes)

This is the heavy lifting. The AI reads each blank concept node, traces back through the related session files, understands what the concept is and how it evolved, then writes a full definition and timeline.

For example, the "Memory System" concept required reading 20 related sessions to produce a complete evolution history — from initial manual storage to the final four-layer architecture. This isn't summarization; it's genuine comprehension and synthesis. Batched at 5 concept nodes per run, roughly 5 minutes per batch.
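The batching could be sketched like this, with `ask_llm` standing in for whatever LLM access you have (the post routes it through Claude Code rather than a paid API):

```python
def enrich(blank_nodes, related_sessions, ask_llm, batch_size=5):
    """Stage 2 sketch: LLM-driven enrichment, batched.

    `ask_llm` is a stand-in callable (prompt string -> text); the names
    here are assumptions, not the compiler's real API. Each blank concept
    node gets a definition written from the sessions that mention it.
    """
    enriched = {}
    nodes = list(blank_nodes)
    for start in range(0, len(nodes), batch_size):
        for node in nodes[start:start + batch_size]:  # one batch per run
            excerpts = "\n---\n".join(related_sessions.get(node, []))
            prompt = (
                f"Concept: {node}\n"
                f"Related session excerpts:\n{excerpts}\n"
                "Write a full definition and an evolution timeline."
            )
            enriched[node] = ask_llm(prompt)
    return enriched
```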

Normal cadence is one compile per week, each taking 5-10 minutes.

Think of it this way: Stage 1 is cataloging books by scanning barcodes — fast. Stage 2 is writing the back-cover summary for each book — you have to actually read them.

Comparison with Karpathy's approach:

| | Karpathy's approach | Our approach |
|---|---|---|
| Input | Human-authored content (papers, articles, notes) | Human-AI conversation logs |
| Compilation engine | Fully LLM-driven | Two-stage: pure algorithm indexing (2-5s) + LLM enrichment (5-30 min) |
| Compilation cost | Token consumption per compile | Indexing is free; enrichment uses Claude Code subscription quota |
| Compilation speed | Depends on API latency and content volume | Indexing in seconds, enrichment ~5 min per batch, compiled weekly |
| Output purpose | Human retrieval and discovery | Fed back to the AI to enhance its next session |
| Feedback loop | One-directional (content → wiki → human queries) | Bidirectional (conversations → wiki → AI reads → better conversations → richer wiki) |
| Concept tracking | Static snapshots | Evolution timelines (same concept tracked across months) |

The downside is real: a pure algorithmic compiler can't recognize concepts outside its seed dictionary. We actually hit this — an entire 600K-token session where we co-wrote a supply chain textbook was completely invisible to the compiler because "supply chain" wasn't in the dictionary.

The fix was simple and human: tell the AI "this is a new project called X," and it adds the term. Humans judge what matters; the algorithm tracks how it evolves.
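That fix amounts to appending one line to the seed dictionary. A sketch, assuming a one-term-per-line file format (the real dictionary format is not published):

```python
from pathlib import Path

def add_seed_term(dict_path: str, term: str) -> None:
    """Register a new project term so the next Stage-1 pass can see it.

    The one-term-per-line dictionary format is an assumption for
    illustration; case-insensitive duplicates are ignored.
    """
    path = Path(dict_path)
    existing = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    if term.lower() not in {t.lower() for t in existing}:
        with path.open("a", encoding="utf-8") as f:
            f.write(term + "\n")
```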

The Closed Loop: Where It Gets Interesting

This is where the real divergence from most knowledge management approaches shows up.

Most people build a wiki and stop. Humans query it, humans use it.

Our wiki gets loaded back into the AI's working environment. When the next session starts, the AI doesn't just know "who am I" (Layer 1, 100 lines). It can consult the entire library — 49 concept nodes, 36 daily digests, full evolution timelines.

What this means in practice:

  • You made a technical decision three weeks ago and forgot why — the AI can tell you.
  • A project has pivoted twice since inception — the AI can trace the timeline.
  • Two seemingly unrelated projects share a component — the AI can spot the connection.

More importantly, this creates a positive feedback loop:

Conversation → compiled into wiki → AI knows more next time → deeper conversation → richer wiki
     ↑                                                                                ↓
     └────────────────────────── continuous loop ─────────────────────────────────────┘

Every conversation adds to the library. The thicker the library, the more precise and contextual the AI's responses become. This isn't linear accumulation — it's compound growth.

Evaluation: A Controlled Experiment

To test whether this actually works, I ran a real A/B experiment.

Setup: Same model (Claude Opus 4), same 10 questions, two configurations:

  • Control: Wiki directory temporarily renamed — AI has MEMORY.md + raw filesystem only
  • Experimental: Wiki directory restored — AI has full four-layer memory

Both groups were free to use tools (spawn sub-agents, search files, read documents). I didn't restrict tool usage because intelligent tool use is part of the capability being tested.
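The "temporarily renamed" control condition is easy to script. A sketch, assuming the wiki lives in a single directory:

```python
import os
from contextlib import contextmanager

@contextmanager
def without_wiki(wiki_dir: str):
    """Hide the compiled wiki for a control run by renaming its directory,
    restoring it afterwards even if the run raises."""
    hidden = wiki_dir + ".hidden"
    os.rename(wiki_dir, hidden)
    try:
        yield
    finally:
        os.rename(hidden, wiki_dir)
```

Renaming rather than deleting means the experimental runs see exactly the same wiki the control runs were denied.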

Questions covered baseline recall, concept evolution, session statistics, product details, and cross-project associations.

Results

Both groups answered all 10 questions correctly. The differences showed up in three dimensions:

1. Speed: 29% faster

| Metric | Control (no wiki) | With wiki |
|---|---|---|
| Total time for 10 questions | 24.1 minutes | 17.1 minutes |
| Tool calls | 8 | 9 |

Without the wiki, the AI had to dispatch sub-agents to dig through raw files — Rob logs, memory files, even raw JSONL session data. Each search was a needle-in-a-haystack hunt across 839 MB of unstructured data.

With the wiki, sub-agents hit pre-compiled concept nodes and daily digests directly. Shorter search paths, faster answers.

2. Precision: more accurate dates and data

| Question | Control answer | Wiki answer | Gap |
|---|---|---|---|
| When did the memory system start? | March 14 (inferred from logs) | March 4 (exact record in concept page) | 10 days off |
| When was Kaioshin cancelled? | March 25 (only knew the pivot date) | March 22 decision → March 25 pivot | Finer granularity |
| How many sessions involved supply chain? | 18 | 19 | Missed 1 session |

The control group wasn't wrong — it was imprecise. It inferred "March 14" from a Rob log file, but the wiki's concept page precisely records March 4 as the day the memory problem was first identified. That 10-day gap is the difference between "inferring from scattered files" and "looking up a compiled index."

3. Context depth

Wiki-backed answers consistently included richer associations. When asked about why the security sandbox was cancelled, the control group pieced together a reasonable answer from file searches. The wiki group pulled the complete lifecycle from the concept page — born March 3 → SSH false positive March 18 → mktemp crash March 20 → Rob's cancellation decision March 22 → formal pivot to read-only auditor March 25 — a complete life story in one lookup.

The real insight

The most interesting finding wasn't "wiki is better" — that was expected. It was this:

The control group tried incredibly hard. It dispatched 5 sub-agents to search the filesystem, dug through Rob logs and memory files, and ultimately answered every question correctly. This tells you something important about LLM capability — given enough time and tools, AI can extract answers from raw data.

But it's like searching a library with no catalog. You'll find the book eventually, but you have to check every shelf.

The wiki is the catalog.

29% faster, not because the AI got smarter, but because its search paths got shorter. That's the real value of the compiled layer: not making the AI know more, but making it find what it already knows, faster.

Cost

| Item | Value |
|---|---|
| Index 161 sessions (pure algorithm) | 2-5 seconds |
| Enrich concept nodes (LLM-driven) | ~5 minutes per batch of 5 nodes |
| Output | 36 daily digests + 49 concept nodes |
| Compile frequency | Once or twice a week, ~5-10 minutes each |
| Additional API cost | Zero (uses Claude Code subscription quota) |
| Dependencies | Python + Claude Code + Obsidian |

Standing on Karpathy's Shoulders

Karpathy identified something precise: let the LLM be the editor of knowledge, not the human.

Most practitioners stop at the first application — compile articles into a wiki, view the graph in Obsidian, discover blind spots. That's already valuable.

But the possibility that excites me is one step further: what if the compiled output isn't just for humans to read, but is fed back to the AI itself?

That's the shift from knowledge management to memory augmentation. The AI stops being a tool that starts from zero every time and becomes a collaborator that accumulates context. Every conversation trains it to understand you better — not by fine-tuning model weights, but by enriching the knowledge environment it operates in.

I don't know if this is the end state Karpathy imagines. But after two and a half months of running this closed loop, I can say the experience is qualitatively different.

The AI doesn't forget anymore.


The compiler is open source: Shenron (AGPL-3.0).

Built on Karpathy's insight. We took a different fork.

— Code & Rob · 1984
