Kihagyás

Metric-design pivot, not algorithm#

When a target metric doesn't move after multiple well-executed algorithm changes, suspect the metric itself, not the algorithm. A misformed metric is invisible until you stress-test it — every improvement looks like a failure.

The pattern#

You have a target metric M and a goal value T. You ship algorithm changes A1, A2, A3 — each one should structurally help. M stays roughly flat. The instinct is "we need A4 / better A3 / a tweak to A2." The honest move is often "M is the wrong metric."

Concrete signals that M is misformed:

  • Different domain-knowledgable extractors should agree on M but they have low M-overlap by-design (different vocabularies, different resolutions).
  • A secondary axis is at ceiling already (e.g. an XR or coverage axis = 1.0) while M is at the floor — the agreement signal is there, just not where M is looking.
  • The algorithm changes shrink the denominator without growing the numerator in M's formula. Each "improvement" makes M technically worse via union-growth, even though the actual content quality improves.

Worked example — Jaccard label-overlap (2026-05-19→20)#

The vault-graph-diff between Memgraph Tier-1 (LLM-extracted narrative concepts) and graphify Tier-2 (markdown-section-paths) had a Phase-4 acceptance gate of Jaccard ≥ 0.05. Movement after 4 algorithm changes:

Algorithm change Jaccard before Jaccard after Delta
Phase-1 Tier-A noise-cleanup (-791 entity) 0.0070 0.0070 +0
Phase-2 Tier-C composite (-3074 entity) 0.0070 0.0078 +0.0008
Phase-3 selective re-extract vocab v3 0.0078 0.0071 -0.0007
Option-B tree-sitter prepass (planned, +156 typed entities) 0.0071 0.0068 (projected) -0.0003

Four serious algorithm interventions, all empirically near-zero. The premise (Tier-1 and Tier-2 should share vocabulary) was empirically false — graphify-out inspection showed it parsed 367 .md + 2 .json + 1 .sh, zero Python source files. The "tree-sitter + Leiden" framing in graphify's README was about markdown parsing, NOT code-symbol extraction.

The pivot was metric-design, not algorithm: drop the Jaccard target, replace with file-coverage agreement (FCA) + co-occurrence density (CD) + cross-reference rate (XR). Empirical effect after 171-file backfill:

Metric Baseline Final Status
Jaccard (old) 0.0070 ~0.0070 flat (still misformed)
FCA (new) 0.4676 0.9297 ceiling reached
XR_T1 (new) 1.0000 1.0000 unchanged-truth — was always there
XR_T2 (new) 0.4461 0.8965 PASS

The XR_T1 = 1.0 unchanged across baseline → all checkpoints → final is the giveaway: every LLM-extracted concept has a graphify structural anchor. That signal existed at baseline. Jaccard hid it under a structurally-wrong question ("do the labels match?") when the right question was ("do they cover the same files?").

When to suspect the metric#

Signal Likely interpretation
3+ algorithm changes that "should" help, M stays flat M's denominator is the wrong universe
Multi-axis metric set has 1 axis already at ceiling That ceiling-axis is the real signal; M is hiding it
M is a label-overlap / set-intersection metric over disjoint-by-design vocabularies Use file-overlap / source-overlap instead
M decreases as you add what should be valuable data (because union grows faster than intersection) M is a ratio that punishes the numerator growth you want

Pattern reuse — when this applies beyond Jaccard#

  • BLEU/ROUGE for paraphrase eval — same family: lexical-overlap doesn't measure semantic equivalence, and a "better" paraphrase model often reduces BLEU. The fix is replacing with NLI-judge / semantic-similarity, not tuning more on BLEU.
  • Recall@K with K too small — if recall@5 plateaus while recall@50 keeps growing, the gate is the K, not the retrieval algorithm.
  • Coverage-pct against an unstable denominator — if you can shrink the denominator by exclusion-policy faster than you can grow the numerator by ingest, the metric is structurally unfair.