What I learned building a self-improving Obsidian-vault in 5 hours#

TL;DR: Over a single ~5-hour evening, three Claude Code sessions and ~50× subagent fanouts turned a static Obsidian vault into a working "agentic OS"; two more days of follow-up landed it as a public release: 274 wikis, 8,913 entity-graph concepts (post -30.2% cleanup), 19,215 graph edges, 126 audits, 23 cron jobs, all running on one VPS at $0 marginal cost. The actual lessons aren't about agents being magic — they're about the silent failure modes that almost killed it: 1262 graph writes rolled back without raising a single exception, a "bias-mitigated" LLM judge that quietly discarded 47% of correct learnings, a $0-cost subagent pattern that hits a hard wall the moment you try to nest it — and the counter-intuitive finding that a fetch-pool sweep on LongMemEval-S showed K=5 is a sweet-spot (76.77% Recall@5) on a monotonically decreasing curve. This is the essay I wish I'd had three weeks ago.

Repo: github.com/MyForgeLabs/myforge-vault-1111 (MIT, 274 wikis, 71 EN translations, v1.0.9)
Wiki site: myforgelabs.github.io/myforge-vault-1111


1. Why I built this — from 15K-token aggressive load to a 5K lean compile#

In April 2026 Andrej Karpathy published a short gist describing what he called an "LLM-friendly wiki": an Obsidian-shaped knowledge base structured into raw/ (immutable sources), wiki/ (distilled, evergreen knowledge written in your own words), and an agent scratchpad. The pitch was that instead of doing classical RAG — embed everything, vector search, top-k chunks — the LLM incrementally compiles knowledge into a structured wiki that compounds over time. The index file is the map; the semantic structure lives in the wiki pages themselves.

I read this and felt the way you feel when you find your own scattered habits already described in someone else's writeup. I'd been using Obsidian as the shared brain across three CLI agents — Claude Code, Codex, Gemini — for about a month, and the design had drifted into something that looked vaguely Karpathy-shaped but wasn't honest about it. Sessions, daily notes, audits, and decisions all lived in the same flat namespace. The agents had a CLAUDE.md system prompt that said "load context aggressively at session start: ~15-20K tokens, all the projects, all the recent sessions, all the relevant ADRs." That worked but it was wasteful — most of those tokens never got used in any given session.

By mid-May I had reorganized the vault around a strict Johnny Decimal folder-prefix + Karpathy three-layer pattern: 00-Meta/ for vault rules, 02-Projects/ for active project files, 10-raw/ for immutable external content (firecrawl scrapes, Gmail dumps, transcripts), 11-wiki/ for distilled lessons in my own words. See Karpathy-LLM-Wiki-pattern for the long-form version.

But the real shift was deciding the vault should improve itself. Not in the AGI/Gödel-Agent sense — I'm not that crazy. In the much more boring "stop manually copy-pasting lessons from session logs into wiki files" sense. Every session that ends should distill itself. Every learning should propose where it belongs. Every cron-job should regenerate an audit. The vault should be a compounding asset, not a write-once-forget pile of markdown.

This essay is what I learned in the 5 hours it took to actually get there.


2. The 8-axis architecture, in 50 words per axis#

Eight evolutionary axes, each shipped as a 2-week sprint, each scaffolded in a single Day-0 commit. The full table of contents:

| # | Axis | One-line shape |
|---|---|---|
| B-1 | Crystallization automation (sv-05-crystallization-automation) | /11.11stop hook → agent writes Summary + Learnings + Next, proposes propagation targets, user batch-confirms |
| B-2 | Memory architecture (sv-01-memory-architecture) | 15-20K aggressive context-load → ~5K lean (top-K KO-DB facts + semantic on-demand) |
| B-3 | Continuous evaluation | G-Eval LLM-as-judge for every Learning; see § 3.2, where I learned the hardest lesson of the week |
| B-4 | Tool composition (sv-04-tool-composition) | MCP-bridges; the three CLI agents share one skill registry |
| B-5 | NotebookLM cognitive layer (sv-08-notebooklm-cognitive-layer) | Google NotebookLM as a CLI-driven deep-research subroutine |
| B-6 | Multi-agent orchestration (sv-03-multi-agent-orchestration) | Claude / Codex / Gemini share the vault via symlinks, AGENT= env-var stamps commits |
| B-7 | World-model knowledge graph (sv-07-entity-graph) | Memgraph 3.9 CE, native vector-index, 8,913 nodes, 19,215 edges (post-cleanup -30.2% noise) |
| B-8 | Recursive self-improvement (sv-02-recursive-self-improvement) | 4-layer safety-gate (multi-layer-safety-gate); VAULT_CRYSTALLIZE_REAL=1 + sandbox branch + forbidden-target list |

After the 5-hour burst plus two follow-up days: 274 wikis (from ~120), KO-DB 13,800+ facts (from 604), Memgraph 19,215 edges post-cleanup (peaked at 24,606 before noise-prune), $0 marginal inference cost.

Most of those numbers don't matter. What matters is that 5 specific things almost broke the project, and four of them were silent. Here they are.


3. Five hard lessons#

3.1 Silent failures — mgclient autocommit, or how I lost 1262 writes without an error#

The most expensive bug I hit was a one-line missing assignment.

I was bulk-typing 8997 entities in Memgraph — classifying each node as :Concept, :Decision, :Skill, :Pattern, etc. via a subagent fanout. The wrapper script reported "1262 entities typed, exit 0, all OK." Audit query showed the same number before and after: zero changes.

I spent the next 40 minutes assuming a query bug. Wrong parameter binding? Cypher syntax? Wrong label set? All clean. Then I noticed something subtle in the Python:

import mgclient

conn = mgclient.connect(host="localhost", port=7687)
cur = conn.cursor()
cur.execute("MATCH (n:Entity {name: $name}) SET n:Concept", {"name": name})
# ... batch of 1262 SETs
conn.close()

pymgclient (the official Memgraph Python driver) follows the DB-API convention: connect() returns a connection with autocommit off. If you don't set conn.autocommit = True, every write is queued in an implicitly opened transaction that only an explicit conn.commit() makes durable. conn.close() does not commit; it silently drops the transaction. No exception, no warning, no log line. cur.fetchall() returns the row count just fine. Only a MATCH (n:Concept) RETURN count(n) against the DB reveals the truth.

The fix was one line:

conn = mgclient.connect(host="localhost", port=7687)
conn.autocommit = True   # ← MANDATORY, first statement after connect

Typing coverage after the second run: 28.9% → 72.8% on the same batch. Same script. Same data. One missing line.

Two things were interesting about this. First, this is not a Memgraph bug. It's the default behavior of psycopg2, mariadb, cx_Oracle, pyodbc, and most other DB-API 2.0 drivers. The driver is doing exactly what its docs say. The bug is in my mental model: I assumed "connection closes → writes flush" because that's how the sqlite3 shell and a thousand other autocommit-by-default tools work. Second, the symptom is invisible to all the obvious detection methods. Exit code: 0. Stdout: success. Cursor: row count returned. Audit log: looks fine. The only way to catch it is to diff the DB state before and after: a count query in, a count query out, assert delta > 0.

The general lesson I added to my playbook is brutal: "no error" is not evidence of success. Every batch write needs an out-of-band verification that operates on the destination, not the script. I've started adding assert count_after > count_before to anything that mutates persistent state, and the number of bugs that pattern catches is embarrassing.

Full writeup: mgclient-autocommit-silent-rollback.

3.2 LLM-as-judge bias-mitigation is symmetric — losing 47% of good learnings while fixing self-enhancement#

This one was painful because the fix worked exactly as I designed it, and it was still wrong.

Crystallization works like this: at end of session, the agent generates a Summary + a list of Learning bullets. Each bullet gets scored by a "G-Eval" scorer (an LLM-as-judge prompt that rates the bullet on novelty, specificity, actionability, durability). If confidence > threshold, the bullet is auto-promoted into the wiki / glossary / infra layer. Below threshold, it goes to a batch-preview where I confirm one-by-one.

The catch: my scorer is Claude. My generator is also Claude. Claude scoring Claude suffers from documented self-enhancement bias — published measurements put it at +25% inflated scores when generator and judge come from the same model family. Verbosity bias, position bias, halo bias all stack on top.

So I wrote a v0.3 of the G-Eval prompt with 4 explicit bias blocks (self-enhancement, verbosity, position, halo), plus calibration anchors (a "bad-but-verbose" example, a "good-and-terse" example) and a forced CoT bias-self-check. On a 10-bullet paired sample, it worked beautifully:

| Metric | v0.2 (baseline) | v0.3 (bias-mitigated) | Δ |
|---|---|---|---|
| Mean confidence | 0.880 | 0.760 | −13.6% |
| Auto-promotion rate | 10/10 | 6/10 | −40% |

Looked great. Confidence shrinks (more honest), auto-prop tightens (less self-flattery). I almost shipped it as the new default.

Then I ran a 30-sample paired calibration with explicit ground-truth labels (15 known-good Learnings, 15 synthetic Fails covering 7 failure modes). The catch was crisp:

  • Fail-class: confidence dropped from 0.83 → 0.45 (good — false positives caught)
  • Pass-class: confidence dropped from 0.93 → 0.68 (bad — 7 of 15 good bullets fell below the 0.95 threshold)
  • Pass-recall loss: 47%

The bias-mitigation prompt was symmetric. It tightened scoring on both classes (−0.38 on the Fail class, −0.25 on the Pass class) instead of only on the bad one. I'd cut false positives, but I'd also cut my true positives by nearly half. In production that means: of every 15 genuine learnings the agent generates, 7 would silently get discarded.

The lesson is short: most bias-mitigation literature reports the mean shift but not the per-class asymmetry. If your defense lowers scores equally across good and bad, you haven't fixed the bias — you've just added noise that disproportionately hurts your recall.
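One way to make the per-class asymmetry visible is to report the score shift and the above-threshold rate separately for each ground-truth class. The sketch below is illustrative only: per_class_report is a hypothetical helper, and the eight scores are made-up numbers, not the 30-sample calibration data.

```python
def per_class_report(labels, before, after, threshold):
    """Mean score shift and above-threshold rate per ground-truth class."""
    report = {}
    for cls in sorted(set(labels)):
        idx = [i for i, label in enumerate(labels) if label == cls]
        mean_before = sum(before[i] for i in idx) / len(idx)
        mean_after = sum(after[i] for i in idx) / len(idx)
        report[cls] = {
            "mean_shift": mean_after - mean_before,
            # Strict ">" mirrors the "confidence > threshold" promotion rule.
            "rate_before": sum(before[i] > threshold for i in idx) / len(idx),
            "rate_after": sum(after[i] > threshold for i in idx) / len(idx),
        }
    return report


# Hypothetical scores: four known-good bullets, four known-bad ones.
labels = ["pass"] * 4 + ["fail"] * 4
before = [0.97, 0.96, 0.98, 0.99, 0.96, 0.70, 0.97, 0.60]
after = [0.96, 0.80, 0.97, 0.70, 0.50, 0.40, 0.55, 0.30]

report = per_class_report(labels, before, after, threshold=0.95)
for cls, stats in report.items():
    print(cls, stats)
```

A mean-only summary would show both classes moving down and call it a win; the per-class rates show the pass class losing half of its auto-promotions in this toy data.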

What I ended up shipping: v0.3 is opt-in only, behind VAULT_GEVAL_VERSION=v03. The default stays v0.2 with the threshold at 1.0 (full shadow mode — auto-prop disabled, everything goes to batch-preview). The "right" fix is a multi-judge ensemble (Claude + GPT-4 + Gemini, majority vote), but that's a different budget conversation. For now: opt-in, document the trade-off, measure asymmetry, never assume symmetric tightening = honest tightening.
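The shadow-mode behavior falls out of the promotion rule itself. A small sketch, assuming the strict "confidence > threshold" comparison described earlier and assuming confidences are capped at 1.0; route and select_prompt_version are hypothetical helpers, and the "v02" default label is an assumption (only VAULT_GEVAL_VERSION=v03 appears in the text).

```python
def route(confidence, threshold):
    """Auto-promote only when confidence is strictly above the threshold."""
    return "auto-promote" if confidence > threshold else "batch-preview"


def select_prompt_version(env):
    """Opt in to the bias-mitigated prompt only when explicitly requested."""
    return "v03" if env.get("VAULT_GEVAL_VERSION") == "v03" else "v02"


# With the threshold at 1.0 no score in [0, 1] can exceed it,
# so every bullet lands in the batch-preview queue.
decisions = [route(score, 1.0) for score in (0.50, 0.95, 1.0)]
print(decisions)
print(select_prompt_version({}), select_prompt_version({"VAULT_GEVAL_VERSION": "v03"}))
```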

Full writeup: g-eval-bias-mitigation-pattern.

3.3 Subagent-fanout — $0 bulk LLM mutation, and the wall you hit at depth 2#

This one is the closest the vault gets to a free lunch, but it has a sharp edge.

Watch it live: 3-min terminal demo on the docs-site — six real CLI commands including a fanout run, recorded with asciinema.

I needed to mutate 267 wiki files in a single pass — add a description: field to the frontmatter, generate tags: and trigger_keywords: fields based on the body content. Classic per-file independent LLM-aided mutation. Naive estimate: 267 files × ~10K tokens each × $3/M Sonnet input = ~$8, plus ~$10 for output. Twenty bucks, an hour of API time, manageable.

But I already had a Claude Code subscription. Could I just do this inside the subscription?

Turns out yes. Claude Code's Task tool lets you spawn general-purpose subagents in parallel. Each subagent runs in its own context, can read files, write files, and report back. I batched the 267 files into 9 groups of ~30, spawned 8 in parallel (one held in reserve), and let them run.

                        ┌── Agent2 (30 files) ──┐
                        ├── Agent3 (30 files) ──┤
   You (parent)         ├── Agent4 (30 files) ──┤
   ────────────────────►├── ...                ──┤── parallel ───► Aggregate report
                        ├── Agent8 (30 files) ──┤
                        └── Agent9 (27 files) ──┘
                              all background

The transcript looks roughly like this (four of the eight branches elided for length):

$ vault-fanout-mutate --batch wiki-frontmatter --files 267 --agents 8
[parent] partitioning 267 files into 9 batches of 30, 1 reserve
[parent] spawning 8 subagents (general-purpose)…
[12:04:11] agent#2  ← 30 files queued
[12:04:11] agent#3  ← 30 files queued
[12:04:11] agent#4  ← 30 files queued
[12:05:38] agent#2  ✓ 30/30 written (87s)  yaml-valid:30/30
[12:05:42] agent#3  ✓ 30/30 written (91s)  yaml-valid:30/30
[12:05:51] agent#4  ✓ 30/30 written (100s) yaml-valid:30/30
[12:06:14] agent#9  ✓ 27/27 written (123s) yaml-valid:27/27
[parent] all branches returned. total:267/267 yaml-valid, audit:534/534 PASS
[parent] wall-clock: 4m 53s · marginal cost: $0.00 · subagent calls: 8

Results:

  • 5 minutes wall-clock (vs ~1h for sequential API)
  • 267/267 YAML-valid output, 534/534 audit-compliant
  • $0 marginal cost (within the existing Claude Code subscription)
  • Sweet spot: 30 files/agent, 8 agents in parallel, ~80-100 sec per agent

Over 7 production iterations of this pattern across the project, I've made 174 subagent calls and generated ~12,300 new KO-DB triplets, for $0 total. This is genuinely useful for any bulk-mutation task where the per-file work is independent.
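The partition-then-fan-out shape can be sketched in plain Python. This is not how Claude Code's Task tool is invoked; a ThreadPoolExecutor stands in for the parallel subagents, and partition and process_batch are hypothetical names. What carries over is the structure: independent batches, one flat level of workers, one aggregate step.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, batch_size):
    """Split items into consecutive batches; the last one may be shorter."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def process_batch(batch):
    """Hypothetical worker: stands in for one subagent handling its files."""
    return {"files": len(batch), "yaml_valid": len(batch)}


files = [f"wiki-{n:03d}.md" for n in range(267)]
batches = partition(files, 30)

# Flat, single-level fanout: the parent is the only thing that spawns workers.
with ThreadPoolExecutor(max_workers=8) as pool:
    reports = list(pool.map(process_batch, batches))

total = sum(r["files"] for r in reports)
print(len(batches), len(batches[-1]), total)
```

Because no batch reads another batch's output, the aggregate step is a plain sum over the returned reports.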

The catch is a hard limit you only discover by hitting it: a subagent cannot spawn its own subagents. The fanout is single-level. I tried to build a recursive entity-typing pipeline (parent → 8 type-classifiers → each spawns 4 alias-resolvers) and the inner spawns failed with cryptic errors. The fanout tree has to be flat.

Three other limits worth knowing:

  1. No cross-document reasoning — each agent sees only its batch. If your task needs Agent3 to know what Agent5 found, fanout is wrong; you need a single sequential agent.
  2. Not for CPU-bound inference — bge-m3 embeddings, Whisper transcription, CLIP scoring all have model-loading overhead that dominates. Fanout helps zero.
  3. Subscription rate-limits exist — they're generous, but if you fanout-after-fanout you'll find them. I hit a soft-throttle around the 50th subagent call in a 30-minute window.

The general lesson: whenever you have N independent items that need ~1K tokens of LLM-aided work each, check whether your agent has a Task tool before opening your API console. Sometimes the cheapest API call is the one you don't make.

Full writeup: claude-code-subagent-fanout.

3.4 Memgraph CE 3.9 native vector-index — 280× speedup, and "verify before workaround" as a discipline#

Six weeks before the burst-session I'd built a numpy-cosine workaround for semantic search over the vault. The reasoning was: "Memgraph CE doesn't have a vector-index, that's an Enterprise feature, I need free, so I'll roll my own." Result: a 300-line Python wrapper that loaded all embeddings into memory at query-time and computed cosine similarity in numpy. Worked. Slow. Mean latency: ~280ms for a top-K=10 search over ~3300 chunks. Acceptable for batch jobs, painful for interactive.

In the burst-session I checked Memgraph's release notes for an unrelated reason and noticed CE 3.9.0 had quietly shipped a native vector-index:

CREATE VECTOR INDEX chunk_emb ON :Chunk(embedding)
  WITH CONFIG {"dimension": 1024, "capacity": 2048, "metric": "cos"};
CALL vector_search.search("chunk_emb", 10, $query_vec) YIELD node, distance;

After the migration:

| Metric | numpy-cosine workaround | Memgraph CE 3.9 native | Δ |
|---|---|---|---|
| Mean search latency | 280ms | 1ms | 280× |
| p95 search latency | 412ms | 2.6ms | 158× |
| Memory overhead | ~400MB (all embeddings in Python) | ~80MB (Memgraph-internal) | 5× lower |
| Code surface | ~300 LOC Python | ~15 LOC Cypher | 20× smaller |

Multi-namespace also works out of the box: 3 separate vector-indices (vault Chunk 2829 nodes, SkillChunk 462 nodes, entity Concept 8997 nodes), zero cross-namespace interference, all in Community Edition. No Enterprise license needed.

The lesson here is methodological, not technical: before building a workaround, re-verify what the vendor actually ships today. My "Memgraph CE doesn't have vector-index" assumption was correct at the time I checked (mid-2025). It was wrong six months later. Workarounds rot at the speed of the upstream release cycle, and OSS vector DBs are currently shipping features every quarter.

I've added a discipline to my playbook called vendor-feature-verify-before-workaround: any workaround that exceeds 100 LOC must have a # Re-verify upstream after YYYY-MM-DD comment with a 6-month horizon. Cron-job to grep for stale comments. The 6-week-old numpy workaround had no such comment, and I almost shipped a third revision of it before noticing the native feature.
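The stale-comment check is simple enough to sketch. This is an illustration under assumptions, not the vault's cron script: stale_markers is a hypothetical helper, the marker is assumed to be written exactly as "# Re-verify upstream after YYYY-MM-DD", and the sample source is invented.

```python
import re
from datetime import date

REVERIFY = re.compile(r"#\s*Re-verify upstream after (\d{4})-(\d{2})-(\d{2})")


def stale_markers(source, today):
    """Return (line_number, due_date) for every marker whose date has passed."""
    stale = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = REVERIFY.search(line)
        if match:
            due = date(*map(int, match.groups()))
            if due < today:
                stale.append((lineno, due))
    return stale


sample = """\
# Re-verify upstream after 2025-12-01
def cosine_search(query, vectors):
    return []
# Re-verify upstream after 2026-11-20
def other_workaround():
    return None
"""

overdue = stale_markers(sample, today=date(2026, 5, 18))
print(overdue)
```

Passing today in as an argument instead of calling date.today() inside the function keeps the check deterministic and easy to test.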

Full writeup: memgraph-ce-feature-limits.

3.5 Cypher-direct >> subagent nested-loop — when graph-mutation gets stuck, leave the LLM out#

This is the most boring lesson and possibly the most valuable one.

I was running a B-7 alias-dedup pass: find pairs of :Concept nodes whose names are aliases (GEPA / gepa / Gepa Optimizer), merge them, transfer edges. ~500 candidate concepts × ~500 candidates = 250K pair-comparisons. I wrote it as a subagent fanout: each subagent gets a slice of the candidate space, does fuzzy matching, and for each pair calls vault-graph-query to fetch context, then proposes a merge.

After 6+ minutes, the subagents started timing out. Of the eight, three returned partial results, two errored, three hadn't reported. I killed them, looked at the logs, and realized the design was wrong: I was doing a 500×500 nested loop inside an LLM agent, with each inner iteration calling out to a Python tool that itself opened a Memgraph connection. The LLM was orchestrating the loop body when the loop body was deterministic.

I rewrote it as a single Cypher query:

MATCH (a:Concept), (b:Concept)
WHERE id(a) < id(b)
  AND toLower(a.name) = toLower(b.name)
RETURN a.name, b.name, id(a), id(b);

…plus a Python filter for the fuzzy cases (Levenshtein distance ≤ 2). Total runtime: ~50 seconds. 6+ minutes → 50s, with a clearer error model and a deterministic output.
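A sketch of what such a deterministic fuzzy filter can look like, in pure standard-library Python. The function names are hypothetical and the original filter may well use a library; the technique is the textbook dynamic-programming edit distance with a cutoff of 2.

```python
from itertools import combinations


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # deletion
                current[j - 1] + 1,                # insertion
                previous[j - 1] + (ch_a != ch_b),  # substitution
            ))
        previous = current
    return previous[-1]


def fuzzy_alias_pairs(names, max_distance=2):
    """Pairs that differ by 1..max_distance edits after lowercasing."""
    pairs = []
    for a, b in combinations(sorted(names), 2):
        distance = levenshtein(a.lower(), b.lower())
        # Distance 0 is skipped: exact case-insensitive matches are
        # already covered by the Cypher query.
        if 0 < distance <= max_distance:
            pairs.append((a, b, distance))
    return pairs


candidates = fuzzy_alias_pairs(["Memgraph", "Memgrph", "LongMemEval", "Obsidian"])
print(candidates)
```

On short names a cutoff of 2 will pair unrelated concepts, which is one reason to let a judge look at the survivors instead of merging them blindly.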

The lesson is uncomfortable for anyone who likes agents: LLM agents are bad at orchestrating deterministic loops. They're great at the loop body (judging, summarizing, classifying) and terrible at the loop control. For graph mutation specifically — NER passes, alias-dedup, relation-extraction, edge-inference — the right shape is almost always:

  1. One Cypher query to materialize the candidate set
  2. One Python filter to deterministically narrow it
  3. One LLM call (optionally fanout) to judge the survivors
  4. One Cypher query to apply the merge

Not: LLM orchestrates the whole thing because "the agent is smart enough to figure it out." It is smart enough. It's also 7× slower and 10× more expensive.

The general rule I now follow: if the operation has a closed-form database query, write the query. The LLM goes inside the judging step, not around it.

3.6 LongMemEval K-sweep — the curve is monotonically decreasing, K=5 wins#

This is the "huh, really?" finding from the follow-up day. We had a hybrid BM25 + bge-m3 + RRF retrieval pipeline locked at 67.68% Recall@5 (the v0.2 baseline). I figured plugging in a BGE-reranker-v2-m3 cross-encoder on the fused pool would lift it a few points; what I didn't expect was that fetch-pool size K is the dominant lever, and the curve goes the wrong direction.

| Fetch-pool K | Recall@5 | Δ vs K=20 |
|---|---|---|
| 5 | 76.77% | +5.05pp |
| 10 | 74.75% | +3.03pp |
| 20 (default) | 71.72% | (reference) |
| 50 | 67.68% (v0.2 hybrid baseline) | -4.04pp |

Stacking the BGE-reranker on top of K=20 took us from 71.72% → 73.74% — a real +2.02pp, but less than half of the lift from just shrinking K. The reranker cost: 942.9 s vs 15.2 s, a 62× wall-clock tax for half the gain.

Mechanism (hypothesis): RRF score 1/(60 + rank) is bounded by the worst rank in the fused pool. A bigger pool pulls in low-quality lexical-only matches that get fused near the top, polluting the top-5. Small pool = more selective fusion = better top-5. The counter-intuitive part is that "more candidates = better" is the default mental model from most BM25-only retrieval work, and it's backwards here.
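To make the pool-size effect concrete, here is a toy Reciprocal Rank Fusion sketch with the 1/(60 + rank) score. The two rankings are invented and rrf_fuse is a hypothetical helper; it does not reproduce the LongMemEval numbers, it only shows that widening the pool can let documents that are mediocre in both lists overtake a document that one retriever ranked highly.

```python
def rrf_fuse(rankings, pool_size, k=60):
    """Reciprocal Rank Fusion over the top pool_size hits of each ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking[:pool_size], start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Invented rankings from a lexical and a dense retriever.
bm25 = ["a", "b", "c", "x", "y"]
dense = ["c", "a", "d", "y", "x"]

small_pool = rrf_fuse([bm25, dense], pool_size=2)
large_pool = rrf_fuse([bm25, dense], pool_size=5)
print(small_pool)
print(large_pool)
```

In the small pool, "b" (rank 2 for BM25) sits in the fused top 3. In the large pool, "x" and "y" collect two small contributions each and push it down to fifth place.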

What I shipped: v0.3-A (RRF, K=5) as the new default. The reranker stays opt-in behind VAULT_RERANK=v2-m3 for cases where reranker-budget makes sense (~16 min for 99-Q is fine for nightly eval, not for interactive search).

The wider lesson: sweep your hyperparameters even when the documented default seems sensible. The Atlan RAG-eval literature treats fetch-pool K as "set it to 20 and move on"; on our vault, K=20 sits 5.05pp below the best point on the curve.

Full readout: ../06-Audits/2026-05-19 LongMemEval v0.3 sweep results.


4. The cost#

This is the part where I'm supposed to either say "I spent $5000 on Anthropic" or "I did the whole thing for free." Honest answer is the latter, with an asterisk.

Actual marginal cost: $0. Every LLM call was inside my existing Claude Code subscription (Max plan, ~$200/mo, which I was paying anyway). No Anthropic API key was loaded. No fanouts to GPT-4 or Gemini. The full ~50× subagent fanouts, 174 subagent calls, 13,800+ KO-DB facts, 8,913 typed concepts (post-cleanup), and 19,215 graph edges were all $0 marginal.

Hypothetical cost if I'd done the same work via direct API:

| Pipeline | Direct API estimate (Sonnet) | Direct API estimate (Opus) |
|---|---|---|
| 174 subagent calls × ~30K input + ~5K output | ~$0.30 | ~$1.53 |
| 13,675 fact extractions × ~2K input + ~500 output | ~$0.27 | ~$1.37 |
| 8,913 entity classifications × ~1K input + ~200 output (post-cleanup) | ~$0.10 | ~$0.51 |
| G-Eval scoring 600+ Learning bullets × ~3K input + ~800 output | ~$0.05 | ~$0.24 |
| Total | ~$0.72 | ~$3.65 |

So even at "I have an Anthropic key and I'm just shipping it" pricing, this whole thing is under five dollars. The point isn't that LLM inference is cheap — it's that the unit economics of self-improving vaults are not the bottleneck. The bottleneck is engineering taste: what to crystallize, what to discard, what to verify, what to leave to the human.

A second cost worth being honest about: human time, ~5 hours active, ~10 hours including the meta-thinking before and after. The 5 hours is the dense burst — three sessions where I pushed the actual code and ran the actual fanouts. The other 5 hours is the part nobody writes about: re-reading the Karpathy gist, redoing the directory layout twice because I got Johnny-Decimal wrong on the first attempt, fixing a vault-cleanup script that was reading its own output (the self-referential loop in the audit pass — that one was funny).


5. What's next#

Three things are queued for the next burst:

Recursive self-improvement Tier-2 (sv-02-recursive-self-improvement). Right now the vault crystallizes learnings into the wiki. The next step is for the wiki to crystallize itself — a meta-pass that reads the 274 wikis, finds taxonomic gaps and duplications, and proposes consolidations. The B-8 sandbox is built, the 4-layer safety gate works (env-flag + script-gate + git-hook + Critic-review). The piece I haven't shipped is the GEPA-style evolutionary prompt optimizer that would let the crystallization prompt itself improve session-over-session. Day-0 scaffolding exists, full loop is Q3 work.

BMAD integration (bmad-vault-integration-pattern). BMAD ("BMad Method") is a structured agent-skill suite I've been using for project planning — PRD creation, architecture decisions, code review, retrospectives. Right now it lives parallel to the vault. The plan is to wire BMAD agents to write directly into 02-Projects/, 07-Decisions/, and 08-Sessions/ using the same crystallization protocol, so that every PRD becomes a queryable KO-DB row and every architectural decision auto-generates an ADR file. Half-shipped — three projects already use it (MAPESZ, KGC-4, Boulium); the rest of the BMAD skill suite needs migration.

Public dashboard — currently the vault state is visible only to me (Tailscale-only access). The plan is a read-only public dashboard at a vault.myforge.labs subdomain: live wiki list, recent crystallizations, knowledge-graph topology view. Mostly an excuse to dogfood the Chase-AI-style command-center UI patterns I cherry-picked from JoeyBream/command-centre. Next.js 16 + React 19 + Tailwind 4, design-system already in place.


6. Open source#

The whole thing is MIT-licensed: github.com/MyForgeLabs/myforge-vault-1111.

What's in there:

  • 274 wiki files (Karpathy-style distilled lessons, lang-tagged HU + 71 EN translations)
  • 126 audits (snapshot reports, regenerated weekly by cron)
  • 23 cron jobs (vault-autosave every 10 minutes, vault-cleanup weekly, vault-ko-conflicts-audit weekly, threshold-ramp-monitor weekly, etc.)
  • The 11.11* session-orchestration scripts (11.11start, 11.11stop, 11.11focus, 11.11note, 11.11ls, 11.11crystallize)
  • .vault-ko/ — KO-DB schema, SQLite facts.db skeleton, G-Eval prompt templates, safety-gate scripts
  • The 8-axis SV roadmap — all ADRs, all sprint plans, all the Day-0 commits

What's not in there: my actual KO-DB content (it has client-confidential triples in it), my session logs (same reason), and the 05-Memory/ files (user-specific). The README explains how to bootstrap your own vault from the schema + scripts. The Karpathy-LLM-Wiki pattern is documented end-to-end (Karpathy-LLM-Wiki-pattern). The 5 lessons in this essay each have their own deep-dive wiki page with the production-incident detail I cut from this longform.

If you take one thing from this: the value of an LLM-friendly knowledge base is not in the tooling, it's in the discipline of writing in your own words and compounding it weekly. The 274 wikis are not a benchmark. They're 274 things I now don't have to re-learn. The agents are the lever, but the lever rests on the fact that the knowledge is written down, in my voice, in one searchable place, with created: and updated: and a lang-tag and a tag-taxonomy.

The five hard lessons were the price of entry. The compounding is the prize.


Cross-references#

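The deep-dive wikis for each lesson, plus the architectural foundations: mgclient-autocommit-silent-rollback, g-eval-bias-mitigation-pattern, claude-code-subagent-fanout, memgraph-ce-feature-limits, Karpathy-LLM-Wiki-pattern, multi-layer-safety-gate.

Repo + wiki site: github.com/MyForgeLabs/myforge-vault-1111, myforgelabs.github.io/myforge-vault-1111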


Epilogue — v1.0.10 (2026-05-20, +14 days)#

Two weeks after the original launch. The vault didn't stop growing — the 274 wikis / 8,913 entities / 19,215 edges baseline drifted to 283 wikis / ~9,517 entities / ~20K edges. That's not why I'm writing this epilogue. One concrete operational problem — a single schema-migration with 15 silent downstream-victims — was painful enough to deserve its own writeup. Plus four other things that landed in the last 24 hours and represent real production-flips.

1. The 15 silent downstream-victims of one schema-migration#

Yesterday, 2026-05-19, I shipped a fact-hash-refactor migration on the KO-DB. The hash key changed from (source, predicate, object) to (s, p, o)-only, and the facts.provenance column got moved out of the main table into a fact_provenance 1:N side-table. The migration itself ran in 190ms, the ADR got filed, and the Critic-review came back green.

What didn't run: the 15 other files in the codebase that read or wrote facts.provenance. Yesterday I patched two of them, because vault-ko-query --stats and the weekly conflicts-audit threw exceptions — two visible failures, two same-day fixes. The remaining 13 kept running quietly. No exception. No log entry. They just got back an empty column, or the SELECT returned successfully and the downstream logic silently fell back to a default value.

This morning I ran a systematic grep-pass across every site that mentions facts.provenance, plus every site that should now be hitting the fact_provenance side-table but is still using the old pattern. 15 victims: 12 READERs and 1 WRITER, plus the 2 already-patched. The WRITER was particularly nasty: vault-ko-ingest.upsert_fact — exactly the function every new fact-insertion calls — was silently broken for 17 hours. New triplets went into the KO-DB with malformed provenance, and the weekly cross-source-corroboration ranker would have been running half-blind by the time the next audit fired.

The worst part: the entire MCP-tool stack was affected. Six different MCP-tools were reading via the old pattern, and the Critic-review didn't catch a single one of them. Why? Because none of them threw errors. The Critic-review is asking "could this change break things?", not "could this change have unexpected effects elsewhere?" Those are two different reviews.

The lesson — and this stings, because I've had three months of silent-failure incidents that all rhyme with this one — is that schema-migrations need a dedicated downstream-grep playbook, not just unit-tests on the migration script itself. "No error" once again failed as evidence-of-success (mgclient-autocommit-silent-rollback was the same shape, different layer). DB-level: clean migration. Application-level: 15 sites quietly malfunctioning.
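The first pass of such a downstream-grep playbook can be sketched as a scan for qualified column references. This is an illustration, not the author's tool: find_references is a hypothetical helper, and the in-memory files dictionary with its file names stands in for walking the real codebase.

```python
import re


def find_references(files, table, column):
    """Return (path, line_number, line) for each qualified table.column mention."""
    pattern = re.compile(rf"\b{re.escape(table)}\.{re.escape(column)}\b")
    hits = []
    for path, text in sorted(files.items()):
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((path, lineno, line.strip()))
    return hits


# Hypothetical snippets: one qualified read, one unqualified write, one note.
files = {
    "reader.py": 'cur.execute("SELECT facts.provenance FROM facts")\n',
    "writer.py": 'cur.execute("UPDATE facts SET provenance = ?", (value,))\n',
    "notes.md": "fact_provenance is the new side-table\n",
}

hits = find_references(files, "facts", "provenance")
print(hits)
```

Note what the scan misses: the unqualified write in writer.py never mentions facts.provenance as one token, so a text-only grep reports a clean bill of health for exactly the kind of site that hurts most.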

2. vault-schema-migration-victim-audit — a ~1000 LOC CLI built for this exact failure mode#

After the 15-victim incident, manual grepping wasn't going to scale. Today I shipped vault-schema-migration-victim-audit to /usr/local/bin/, roughly 993 lines of Python. Four functional layers:

  1. ADR-frontmatter scanner — reads every schema-touching ADR in 07-Decisions/, parses the schema_diff: block (drop-column, rename-table, add-NOT-NULL, etc.).
  2. Qualified-SQL grep — walks every Python, Bash, and Markdown file in the codebase, collecting every qualified reference of the form facts.provenance.
  3. AST per-branch classifier — for Python files, parses to AST and classifies: read vs. write? Which branch? Default value? (This was the longest part — an AST-walker keyed on node.attr == "provenance" that distinguishes row["provenance"] from row.provenance, and handles getattr indirection.)
  4. --apply-patch mode — dry-run, smoke-test, auto-revert on smoke-test failure. This actually shipped, not just the design.
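The read-versus-write part of layer 3 can be sketched with the standard-library ast module. This is a simplified illustration, not the shipped classifier: classify_accesses is a hypothetical name, it assumes Python 3.9+ (where a subscript's slice is the expression node itself), and it does not handle getattr indirection or per-branch analysis.

```python
import ast


def classify_accesses(source, field="provenance"):
    """Label each row.field / row["field"] access as 'read' or 'write'."""
    results = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Attribute) and node.attr == field:
            kind = "attribute"
        elif (
            isinstance(node, ast.Subscript)
            and isinstance(node.slice, ast.Constant)
            and node.slice.value == field
        ):
            kind = "subscript"
        else:
            continue
        # Store and Del contexts mutate the field; Load only reads it.
        mode = "write" if isinstance(node.ctx, (ast.Store, ast.Del)) else "read"
        results.append((node.lineno, kind, mode))
    return sorted(results)


sample = (
    'value = row["provenance"]\n'
    'row.provenance = "session-log"\n'
    'print(row.provenance)\n'
)

accesses = classify_accesses(sample)
print(accesses)
```

The expression context (Load, Store, Del) that the parser attaches to every attribute and subscript node is what separates readers from writers without any string heuristics.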

The tool runs in three places: (a) weekly cron Monday 05:00 UTC, dropping a report into 06-Audits/; (b) git pre-commit hook on any ADR landing in 07-Decisions/; (c) a vault-schema-migration-victim-audit --watch mode tracking tables listed in .vault-config/schema-watch.json.

The interesting part isn't the tool itself — at a surface level it's a "grep + AST + apply" pipeline. The interesting part is that this was post-incident. Two weeks ago this tool didn't exist and wouldn't have justified existing. The vault's evolution is incident-driven: pain-points generate the next tooling-layer, not the roadmap.

3. B-8 RSI Critic production-flip — Cohen's κ=0.708, and why the real number is 0.85+#

The B-8 (Recursive Self-Improvement) axis has been blocked for almost three weeks on a single question: can we trust the Critic-review enough to let it gate crystallize --apply in auto-mode? The Critic is an LLM-as-judge prompt that scans every auto-prop candidate and emits a pass/fail decision. The question: how well do the Critic's decisions agree with ground-truth labels?

Today the 100-bullet clean baseline calibration eval ran. 60 pass-expected bullets (real session-output learnings) plus 40 fail-expected (synthetic-noise, generic, or non-actionable). The ground-truth for the fail-set came from a content-classifier — a regex-based rule system targeting classic anti-patterns: "today" word, HH:MM timestamp, IP-fragment, generic claim without specificity.

Cohen's κ: 0.708. Per Landis-Koch that's "substantial" agreement. Above the 0.61 threshold that was the gate for the auto-mode flip. The flip is technically authorized.
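For reference, Cohen's κ for a binary judge can be computed from the four cells of the agreement table. The sketch below uses illustrative counts: the 60/40 split comes from the eval description, but the false-accept and false-reject counts are assumptions, so the result is close to, not equal to, the reported 0.708.

```python
def cohens_kappa(tp, fn, fp, tn):
    """Cohen's kappa for two binary raters from a 2x2 agreement table."""
    total = tp + fn + fp + tn
    observed = (tp + tn) / total
    truth_pass = (tp + fn) / total
    judge_pass = (tp + fp) / total
    # Agreement expected by chance from the two marginal pass rates.
    expected = truth_pass * judge_pass + (1 - truth_pass) * (1 - judge_pass)
    return (observed - expected) / (1 - expected)


# Assumed cells: 60 pass-expected (57 accepted, 3 rejected),
# 40 fail-expected (10 accepted, 30 rejected).
kappa = cohens_kappa(tp=57, fn=3, fp=10, tn=30)
print(round(kappa, 3))
```

κ discounts the agreement two raters would reach by chance given their marginal pass rates, which is why it sits well below the raw 87% agreement in this example.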

But — and this is the interesting part — I manually inspected all 10 false-accept cases (cases where Critic said pass but ground-truth said fail). All 10 were content-classifier over-trigger. Concretely:

  • 4 cases: HH:MM timestamp-regex matched an architecture bullet that contained "5:1 ratio" or similar perfectly-valid ratio notation.
  • 3 cases: the "today" rule, implemented as a match on Hungarian "mai", fired on the fragment "mai" inside a longer word, not on a today-reference.
  • 3 cases: IP-fragment regex matched on a version number (1.0.10) rather than an IP address.

Not one false-accept was an actual Critic failure — in all 10 cases the Critic was correctly saying pass on a bullet that was actually valuable. The ground-truth was noisy, not the Critic. The effective FA rate is ≈ 0%, and if you fix the content-classifier rules and re-run, κ jumps to roughly 0.85+.
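The three over-trigger classes are easy to reproduce with loose patterns. Everything in this sketch is hypothetical: the NAIVE rules are guesses at the kind of regex that misfires this way, the STRICT rules are one possible tightening, and neither set is the vault's actual content-classifier.

```python
import re

# Hypothetical loose rules of the kind that over-trigger.
NAIVE = {
    "timestamp": re.compile(r"\d{1,2}:\d{1,2}"),
    "today": re.compile(r"mai"),
    "ip_fragment": re.compile(r"\d+\.\d+\.\d+"),
}

# Hypothetical tightened versions: two-digit minutes, whole-word match,
# and four dotted groups for an address.
STRICT = {
    "timestamp": re.compile(r"\b(?:[01]?\d|2[0-3]):[0-5]\d\b"),
    "today": re.compile(r"\bmai\b"),
    "ip_fragment": re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
}


def triggered(rules, text):
    """Names of the rules whose pattern occurs anywhere in text."""
    return sorted(name for name, pattern in rules.items() if pattern.search(text))


valid_bullets = [
    "keep a 5:1 ratio of reads to writes",
    "formai hiba a migration scriptben",
    "shipped in release 1.0.10",
]
noisy_bullet = "deploy finished at 14:05 from 10.0.0.7, mai build"

naive_hits = [triggered(NAIVE, text) for text in valid_bullets]
strict_hits = [triggered(STRICT, text) for text in valid_bullets]
print(naive_hits, strict_hits, triggered(STRICT, noisy_bullet))
```

The tightened rules stop firing on the three valid bullets while still catching the genuinely session-specific one.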

Decision: B-8 production-flip ratified. W23 (2026-06-01..07) will tell us whether real-world auto-apply matches this calibration or drifts. Until then the Critic stays in shadow-mode, logging every decision against the no-apply baseline for comparison.

The meta-lesson: measure the quality of your ground-truth labels with the same rigor you measure the model's decisions. A κ of 0.708 looks "good but not great" at first glance; the 10/10 manual inspection revealed actual agreement was much higher, and what looked like disagreement was label-noise.

4. Subagent-fanout scale-up — 31 subagents / 5 waves / 6 hours / $0#

In the original essay (§3.3) I described the subagent-fanout pattern as "$0 marginal cost, hard limit at depth-2." The numbers there: 174 subagent calls, ~$8 of hypothetical API savings. In the two weeks since, those numbers grew — a single 6-hour session today ran 31 subagents across five waves:

| Wave | Subagent count | Task |
|---|---|---|
| Phase-3 fanout | | New triplet-extraction, +2034 facts, Memgraph 9517 entities post-reset |
| B-8 50-bullet eval | | Critic calibration, first pass |
| B-8 100-bullet eval | | Critic calibration, second pass, κ computation |
| Wave-A grep | | Systematic schema-victim audit |
| Mining | | Cross-corpus alias dedup |
| Wave-D fanout | | 4 parallel wiki-distillation jobs |
| Total | 31 | 22 LANDED tasks in 6 hours |

Direct-API hypothetical cost on Sonnet: roughly $0.50. On Opus: $2.50. Both negligible. The pattern scales — the depth-2 cap is still a hard limit, but 31 parallel depth-1 fanouts do enough work that the 22-tasks/6-hours ratio is already pressing against the human-bottleneck on the other side. The question now isn't whether the agent-flotilla can produce; it's whether I can review at the rate it ships.

5. Option-B tree-sitter pre-pass — bypassing the Jaccard structural-limit#

The last milestone is a long-standing debt. The vault's knowledge-graph has two feeds: Memgraph LLM-extracted entities (currently ~9,517) and the graphify-tool Tier-2 deterministic node-set (5,846 nodes, content-filtered). For weeks we've been measuring the Jaccard structural overlap between the two graphs — a simple metric for how much the two methods are surfacing the same concepts.

Target: ≥0.05. Measured: 0.0071. An order of magnitude off.

For weeks I assumed this was a recall problem — too few entities being found on one side or the other. Today I realized it's an orthogonal-vocabulary problem. The two graphs aren't building the same conceptual schema:

  • The Memgraph side extracts prose-level concepts (Crystallization, Self-improvement, Karpathy pattern).
  • The graphify side surfaces code-level identifiers (def crystallize_session, class GEval, import vault_ko_query).

One vocabulary is natural-language, the other syntactic. The Jaccard measure is meaningless if the two sets can't structurally overlap by construction.
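A minimal sketch of why the score bottoms out, using the example labels quoted above. The `jaccard` and `normalize` helpers are illustrative assumptions, not the vault's actual metric code.

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def normalize(label: str) -> str:
    """Lowercase and strip non-alphanumerics (a common pre-matching step)."""
    return "".join(ch for ch in label.lower() if ch.isalnum())


prose_labels = {"Crystallization", "Self-improvement", "Karpathy pattern"}
code_labels = {"def crystallize_session", "class GEval", "import vault_ko_query"}

raw = jaccard(prose_labels, code_labels)
normalized = jaccard(
    {normalize(x) for x in prose_labels},
    {normalize(x) for x in code_labels},
)
print(raw, normalized)  # 0.0 0.0
```

Normalization does not help here: "crystallization" and "defcrystallizesession" share a stem but are still different set elements, so exact-label overlap stays at zero.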

The fix (Option-B, skeleton landed this afternoon) is a tree-sitter pre-pass: from the graphify Tier-2 deterministic nodes, emit defines_* triplets (def crystallize_session → defines_function:crystallize_session) that structurally match the Memgraph LLM-extracted side. The two vocabularies don't get merged; we add a bridge of defines_* triplets that lets the metric span both.
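To make the shape of those triplets concrete, here is a small sketch. It is not the Option-B code: it swaps tree-sitter for Python's standard-library `ast` module so that it runs without extra dependencies, which also means it only handles Python sources. The `defines_class:` prefix and the sample source are assumptions; only `defines_function:` appears in the text.

```python
import ast


def defines_triplets(source: str) -> list:
    """Emit defines_* labels for every function and class in a Python source."""
    labels = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            labels.append(f"defines_function:{node.name}")
        elif isinstance(node, ast.ClassDef):
            # Hypothetical prefix, chosen by analogy with defines_function.
            labels.append(f"defines_class:{node.name}")
    return sorted(labels)


# Hypothetical sample module using the identifiers quoted in the text.
SAMPLE = '''
class GEval:
    def score(self, bullet):
        return 0.0

def crystallize_session(path):
    return path
'''

print(defines_triplets(SAMPLE))
# ['defines_class:GEval', 'defines_function:crystallize_session', 'defines_function:score']
```

A tree-sitter version would walk the concrete syntax tree in the same way, with the advantage of covering languages other than Python.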

Targeting Sprint-2 integration, ETA ~4-5 hours. Expected post-bridge Jaccard: 0.05-0.08 (hypothesis, not measurement — validation gated to W23).

Postscript (2026-05-21) — Option-B empirically refuted, Path-Z reframing landed#

The Option-B premise above was wrong, and the empirical refutation came next-day. Reporting the resolution because this is the kind of mid-sprint pivot that's hard to find in public-build writeups.

What happened: I inspected the actual graphify-out/graph.json instead of the package metadata. graphify had parsed zero Python source files in the vault input (367 .md + 2 .json + 1 .sh). Its labels were markdown-section-paths (11_wiki_foxxi_design_system_page_level, Cél, Findings) — not code symbols like def crystallize_session. The 156 Python-symbol triplets the tree-sitter pre-pass would have emitted would have zero graphify-overlap. Jaccard would have moved 0.0069 → 0.0068 (regression — union grows, intersection doesn't).

The deeper finding: Jaccard label-overlap is a misformed proxy for "agreement" between two extractors with disjoint-by-design vocabularies. The two systems extract from the same corpus at orthogonal layers (LLM mines narrative concepts in prose; graphify mines markdown structure). Two orthogonal extractors should have low Jaccard — high Jaccard would mean vocabulary collapse, which would be a worse outcome. The ≥0.05 target wasn't a recall problem or an algorithm problem; it was a target-design problem.

Path-Z (the actual fix): drop Jaccard as an acceptance gate, replace it with three file-level complementarity metrics:

  • FCA (File-Coverage Agreement): % of corpus files where BOTH extractors find ≥1 entity. Target ≥0.95.
  • CD (Co-occurrence Density): mean, over the files where both extract, of the min/max ratio of per-file entity counts. By-design plateau in [0.35, 0.50]; target revised from ≥0.50 to ≥0.35 after the empirical density-asymmetry showed up.
  • XR (Cross-Reference Rate): % of one tier's entities/nodes anchored to a file the other tier also surfaces. Bidirectional. Targets: ≥0.95 for Tier-1 (XR_T1), ≥0.80 for Tier-2 (XR_T2).
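The three definitions can be sketched in a few lines. This is one reading of them, under stated assumptions: each tier is represented as a hypothetical mapping from file path to entity (or node) count, XR is weighted by entity count, and the example corpus is invented for illustration.

```python
from statistics import mean


def complementarity(t1: dict, t2: dict, corpus: list) -> dict:
    """Compute FCA, CD and bidirectional XR from per-file entity counts.

    t1 / t2 map file path -> number of entities (or nodes) found in that file.
    """
    if not corpus:
        raise ValueError("corpus must not be empty")
    # Files where BOTH tiers found at least one entity.
    both = [f for f in corpus if t1.get(f, 0) > 0 and t2.get(f, 0) > 0]
    fca = len(both) / len(corpus)
    # Per-file min/max ratio, averaged over the co-covered files only.
    cd = mean(min(t1[f], t2[f]) / max(t1[f], t2[f]) for f in both) if both else 0.0

    def xr(own: dict, other: dict) -> float:
        # Share of this tier's entities that sit in a file the other tier also covers.
        total = sum(own.values())
        anchored = sum(n for f, n in own.items() if other.get(f, 0) > 0)
        return anchored / total if total else 0.0

    return {"FCA": fca, "CD": cd, "XR_T1": xr(t1, t2), "XR_T2": xr(t2, t1)}


# Invented four-file corpus; Tier-1 misses d.md entirely.
corpus = ["a.md", "b.md", "c.md", "d.md"]
tier1 = {"a.md": 4, "b.md": 2, "c.md": 1}
tier2 = {"a.md": 8, "b.md": 2, "c.md": 5, "d.md": 3}

for name, value in complementarity(tier1, tier2, corpus).items():
    print(f"{name}: {value:.2f}")
```

Unlike label-level Jaccard, none of these metrics needs the two tiers to share a vocabulary; they only need both tiers to agree on which files carry content.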

Sprint-3 cleanup, 24 hours later, cleared all four acceptance numbers:

| Metric | Target | Result |
|---|---|---|
| FCA | ≥ 0.95 | 1.00 (corpus-normalized to 00-Meta/ + 05-Memory/ exclusion: the Tier-1 ingest by-design skips these directories, so Tier-2 has to be filtered the same way for parity) |
| CD | ≥ 0.35 (revised) | 0.40 |
| XR_T1 | ≥ 0.95 | 1.00 |
| XR_T2 | ≥ 0.80 | 1.00 |

The wider lesson — captured as a separate evergreen wiki — is "when the metric isn't moving and the algorithm work is honest, change the metric, not the algorithm." Option-B was three weeks of iteration on the wrong axis. The fix was thirty minutes of inspecting actual output.

The tree-sitter pre-pass integration code stayed (env-gated VAULT_KO_TREESITTER=1, default off) as an independent KO-DB-only capability — the defines_* triplets are still useful for code-symbol queries against the structured-fact layer. They just aren't a Jaccard bridge.

Details: ../06-Audits/2026-05-20 Option-B premise empirical refutation — graphify vocabulary is markdown not code, ../07-Decisions/2026-05-20 Two-tier complementarity over Jaccard label-overlap, ../07-Decisions/2026-05-20 CD target revision — narrative-structural asymmetry, metric-design-pivot-not-algorithm.

Two 14-day reflections#

Two weeks of dense public work shifted how I think about agentic-memory systems in two specific ways.

First: the "self-improving" framing interests me less and less. The original v1.0.0 essay went out under the "self-improving vault" headline, and that was honest then. But 80% of the actual work in the two weeks since has been silent-failure detection, schema-migration discipline, and downstream-grep playbooks. The "self-improvement" is a surface phenomenon; underneath is a different problem class: integrity maintenance. The vault doesn't get better because it's smarter; it gets better because every week we uncover another silent failure and close it at the tooling layer. That's not self-improvement. It's incident-driven hardening. The two terms point at different things.

Second: the subagent-fanout pattern scales further than I expected two weeks ago, and the bottleneck has migrated. Two weeks ago the question was "is $0 marginal cost enough to justify running this many subagents?" Answer: easily. The new question is "is my review-capacity enough to absorb the output the flotilla produces?" — and I don't have a clean answer to that yet. The next tooling-layer probably isn't subagent-level. It's subagent-output-aggregation-level: tools that consolidate 31 fanout outputs into a human-sized review packet. That's a different problem.


If you've stayed with the project this far, the next milestone — B-8 RSI Critic apply-mode live — is gated to W23 (2026-06-01..07). (The other line that was on this milestone yesterday, Option-B Jaccard ≥0.05, got reframed mid-sprint — see the section-5 postscript above. The complementarity targets cleared today.) Open an issue if you want to test-drive any of this on your own vault — the safety-gates are documented (multi-layer-safety-gate) and the auto-patch tool can dry-run against your repo without writing anything.


Cross-references (epilogue)#