Metric-design pivot, not algorithm#
When a target metric doesn't move after multiple well-executed algorithm changes, suspect the metric itself, not the algorithm. A misformed metric is invisible until you stress-test it — every improvement looks like a failure.
The pattern#
You have a target metric M and a goal value T. You ship algorithm changes A1, A2, A3 — each one should structurally help. M stays roughly flat. The instinct is "we need A4 / better A3 / a tweak to A2." The honest move is often "M is the wrong metric."
Concrete signals that M is misformed:
- Different domain-knowledgable extractors should agree on M but they have low M-overlap by-design (different vocabularies, different resolutions).
- A secondary axis is at ceiling already (e.g. an XR or coverage axis = 1.0) while M is at the floor — the agreement signal is there, just not where M is looking.
- The algorithm changes shrink the denominator without growing the numerator in M's formula. Each "improvement" makes M technically worse via union-growth, even though the actual content quality improves.
Worked example — Jaccard label-overlap (2026-05-19→20)#
The vault-graph-diff between Memgraph Tier-1 (LLM-extracted narrative concepts) and graphify Tier-2 (markdown-section-paths) had a Phase-4 acceptance gate of Jaccard ≥ 0.05. Movement after 4 algorithm changes:
| Algorithm change | Jaccard before | Jaccard after | Delta |
|---|---|---|---|
| Phase-1 Tier-A noise-cleanup (-791 entity) | 0.0070 | 0.0070 | +0 |
| Phase-2 Tier-C composite (-3074 entity) | 0.0070 | 0.0078 | +0.0008 |
| Phase-3 selective re-extract vocab v3 | 0.0078 | 0.0071 | -0.0007 |
| Option-B tree-sitter prepass (planned, +156 typed entities) | 0.0071 | 0.0068 (projected) | -0.0003 |
Four serious algorithm interventions, all empirically near-zero. The premise (Tier-1 and Tier-2 should share vocabulary) was empirically false — graphify-out inspection showed it parsed 367 .md + 2 .json + 1 .sh, zero Python source files. The "tree-sitter + Leiden" framing in graphify's README was about markdown parsing, NOT code-symbol extraction.
The pivot was metric-design, not algorithm: drop the Jaccard target, replace with file-coverage agreement (FCA) + co-occurrence density (CD) + cross-reference rate (XR). Empirical effect after 171-file backfill:
| Metric | Baseline | Final | Status |
|---|---|---|---|
| Jaccard (old) | 0.0070 | ~0.0070 | flat (still misformed) |
| FCA (new) | 0.4676 | 0.9297 | ceiling reached |
| XR_T1 (new) | 1.0000 | 1.0000 | unchanged-truth — was always there |
| XR_T2 (new) | 0.4461 | 0.8965 | PASS |
The XR_T1 = 1.0 unchanged across baseline → all checkpoints → final is the giveaway: every LLM-extracted concept has a graphify structural anchor. That signal existed at baseline. Jaccard hid it under a structurally-wrong question ("do the labels match?") when the right question was ("do they cover the same files?").
When to suspect the metric#
| Signal | Likely interpretation |
|---|---|
| 3+ algorithm changes that "should" help, M stays flat | M's denominator is the wrong universe |
| Multi-axis metric set has 1 axis already at ceiling | That ceiling-axis is the real signal; M is hiding it |
| M is a label-overlap / set-intersection metric over disjoint-by-design vocabularies | Use file-overlap / source-overlap instead |
| M decreases as you add what should be valuable data (because union grows faster than intersection) | M is a ratio that punishes the numerator growth you want |
Pattern reuse — when this applies beyond Jaccard#
- BLEU/ROUGE for paraphrase eval — same family: lexical-overlap doesn't measure semantic equivalence, and a "better" paraphrase model often reduces BLEU. The fix is replacing with NLI-judge / semantic-similarity, not tuning more on BLEU.
- Recall@K with K too small — if recall@5 plateaus while recall@50 keeps growing, the gate is the K, not the retrieval algorithm.
- Coverage-pct against an unstable denominator — if you can shrink the denominator by exclusion-policy faster than you can grow the numerator by ingest, the metric is structurally unfair.
Related#
- ../07-Decisions/2026-05-20 Two-tier complementarity over Jaccard label-overlap — the ADR this pattern is extracted from
- ../06-Audits/2026-05-20 Option-B premise empirical refutation — graphify vocabulary is markdown not code — the diagnostic that triggered the pivot
- tuning-vs-production-recall-honest-reporting — sibling pattern (metric-honesty vs marketing-recall)
- vendor-feature-verify-before-workaround — sibling pattern (verify-the-data-not-the-tooling-metadata)
- two-tier-graph-extraction — the system this lesson came from