Benchmark data-pipeline fidelity gotchas#
[!info] When to apply
Building or running any benchmark where the test pipeline ingests data, runs queries, and matches results against ground-truth labels. Data-pipeline fidelity (mapping, dedup, schema consistency) is often a bigger source of measurement error than the algorithm being benchmarked. Quantify the data-pipeline gotchas BEFORE you trust any aggregate number.
A pattern#
Test-pipeline failure surface:
Ingest →        Index →         Query →         Result →        Match
↓               ↓               ↓               ↓               ↓
dedup           schema-evol     fetch-K         path-resolve    ground-truth
batches         path-rename     topK-bias       mapping-drift   label-noise
↓               ↓               ↓               ↓               ↓
orphans         mismatch        truncation      filter-loss     inflated metric
Failure mode: a benchmark reports 91% R@5, but the actual production-recall is ~78%. The 13pp gap is caused by gotchas in the data-pipeline, not in the retrieval algorithm.
Concrete case — RRF hybrid-fusion 91% → 78% (2026-05-20)#
Gotcha 1: Multi-batch ingest with duplicates → inflated RRF score#
Ingest run 1: 89 sessions, title="Session" → 89 obsIds
Ingest run 2: 573 docs (incl. same 89 sessions), title="VaultDoc" → 573 obsIds
↑
NOW: 89 docs have 2 obsIds each
Symptom: agentmemory's smart-search returns both copies for the same content. RRF score-aggregation double-counts the same document, inflating its rank.
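To see how the double-counting can arise, here is a minimal sketch of reciprocal rank fusion. It assumes the fusion step keys scores on the resolved path rather than on the obsId, which is one plausible way duplicates get summed. The IDs, paths, ranker lists, and the constant `k=60` are illustrative and not taken from agentmemory.

```python
from collections import defaultdict

def rrf_fuse(rankings, key, k=60):
    """Reciprocal rank fusion: each ranker adds 1 / (k + rank) to the fused key."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[key(doc_id)] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical map: foo.md was ingested twice and so has two obsIds.
id_to_path = {"obs_B": "bar.md", "obs_A1": "foo.md", "obs_A2": "foo.md"}
to_path = id_to_path.get

# Two rankers (e.g. keyword and vector): first on a clean index,
# then on an index that also contains the duplicate obs_A2.
clean = [["obs_B", "obs_A1"], ["obs_B", "obs_A1"]]
duplicated = [["obs_B", "obs_A1", "obs_A2"], ["obs_B", "obs_A1", "obs_A2"]]

clean_order = rrf_fuse(clean, to_path)
dup_order = rrf_fuse(duplicated, to_path)
```

On the clean index `bar.md` ranks first. With the duplicate present, `foo.md` collects two contributions per ranker and overtakes it, although neither ranker ever placed it first.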
Detection: query for a known doc; if the same _path appears twice in top-10, you have duplicates.
Fix: reset ingest state, re-ingest from a clean state (one entry per file), and maintain a persistent ID→path map for the production system. In our case this removed 6-10pp of inflation.
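The detection step can be sketched as a small helper. It assumes search results arrive as an ordered list of IDs and that an ID→path map is available; the function name and sample data are hypothetical.

```python
from collections import Counter

def duplicate_paths(result_ids, id_to_path, top_n=10):
    """Return {path: count} for every path that appears more than once in the top-N."""
    paths = [id_to_path[i] for i in result_ids[:top_n] if i in id_to_path]
    return {path: n for path, n in Counter(paths).items() if n > 1}

# Hypothetical search output where one session was ingested twice.
id_to_path = {"obs_1": "sessions/foo.md", "obs_2": "sessions/foo.md", "obs_3": "notes/bar.md"}
dupes = duplicate_paths(["obs_1", "obs_3", "obs_2"], id_to_path)
```

An empty result for a set of known-good queries is weak evidence that the index is clean; any non-empty result means the reported rank of that document cannot be trusted.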
Gotcha 2: ID→path mapping inconsistent across ingest batches#
ingest-batch-1.json: {"obsId_A": "sessions/foo.md"}
ingest-batch-2.json: {"obsId_B": "sessions/foo.md"} # same path, different ID!
benchmark uses ID_TO_PATH_MAP = ingest-batch-2.json
→ Returns obsId_A from search? PATH MISSING → counted as MISS
→ Returns obsId_B from search? PATH FOUND → counted as HIT
Symptom: same query returns inconsistent results depending on which batch's IDs the search engine prefers.
Fix: single global ID→path map, merged across all ingest batches (or rebuild from scratch).
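A minimal sketch of the merge, assuming each batch file has already been loaded into a dict of obsId to path. The helper names are hypothetical. The second helper lists paths that ended up with more than one ID, which is also the duplicate condition from Gotcha 1.

```python
def merge_id_maps(*batch_maps):
    """Merge per-batch ID→path maps; fail loudly if one ID maps to two paths."""
    merged = {}
    for batch in batch_maps:
        for obs_id, path in batch.items():
            if merged.get(obs_id, path) != path:
                raise ValueError(f"{obs_id} maps to both {merged[obs_id]} and {path}")
            merged[obs_id] = path
    return merged

def paths_with_multiple_ids(id_to_path):
    """Return {path: sorted IDs} for paths reachable through more than one ID."""
    by_path = {}
    for obs_id, path in id_to_path.items():
        by_path.setdefault(path, []).append(obs_id)
    return {p: sorted(ids) for p, ids in by_path.items() if len(ids) > 1}

batch_1 = {"obsId_A": "sessions/foo.md"}
batch_2 = {"obsId_B": "sessions/foo.md"}
global_map = merge_id_maps(batch_1, batch_2)
multi = paths_with_multiple_ids(global_map)
```

With the merged map, both `obsId_A` and `obsId_B` resolve to the same path, so a hit no longer depends on which ID the search engine happens to return.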
Gotcha 3: Title-leak via auto-title generation#
Some storage systems (agentmemory in our case; verify the behavior of your own store) auto-generate a title field from the first N characters of content when none is set explicitly. If your query-mining anchors on headings or filenames (which often appear in those first N chars), the index gets a free signal that inflates measured recall.
Detection: re-ingest with explicitly-generic title (title="VaultDoc"), measure delta.
Concrete delta in our case: 16pp inflation (92% with path-leak title → 76% with generic title on 89-session corpus).
Fix: explicit generic title, or post-hoc strip the auto-title from search matches before scoring.
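Before re-ingesting, a rough pre-check can flag which mined queries are exposed to this leak. The sketch below assumes the auto-title is simply the first N characters of the content; the value `n=80` and the helper names are hypothetical, and the real truncation rule of your store may differ.

```python
def auto_title(content, n=80):
    """Emulate an auto-generated title: the first n characters (an assumption)."""
    return content[:n]

def leaked_queries(pairs, n=80):
    """Return queries whose text appears verbatim inside the emulated auto-title.

    pairs: iterable of (query_text, gold_document_content).
    """
    leaked = []
    for query, content in pairs:
        if query.lower() in auto_title(content, n).lower():
            leaked.append(query)
    return leaked

doc = "# Deploy checklist\n" + "x" * 100 + " rollback window"
flagged = leaked_queries([("deploy checklist", doc), ("rollback window", doc)])
```

A high share of flagged queries suggests the measured recall depends on the title field, which is what the generic-title re-ingest then quantifies.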
Gotcha 4: Corpus-size mismatch between systems being compared#
System A indexes 89 docs.
System B indexes 573 docs.
Query Q matches doc-X in System A's top-5 (recall = 1.0).
Query Q matches doc-X in System B's top-5 (recall = 1.0).
↑
Looks equal — but System A had 89-doc-pool, System B had 573-doc-pool.
System A's recall is INFLATED: its candidate pool is ~6x smaller (89 vs 573 docs), so the task is easier.
Fix: enforce identical corpus boundaries (ingest the same N docs into both systems, or filter both to the same retrievable subset).
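The filtering variant of this fix can be sketched as follows, assuming results are already resolved to paths and each query has one gold path. All names and sample data are hypothetical. Note that post-hoc filtering does not change how each system ranked within its own pool, so ingesting the same N docs remains the stricter option.

```python
def recall_at_k(results, gold, allowed, k=5):
    """Recall@k restricted to the `allowed` corpus.

    results: {query: ranked list of paths}; gold: {query: gold path}.
    Queries whose gold path is outside `allowed` are skipped.
    """
    hits = total = 0
    for query, gold_path in gold.items():
        if gold_path not in allowed:
            continue
        ranked = [p for p in results.get(query, []) if p in allowed][:k]
        hits += gold_path in ranked
        total += 1
    return hits / total if total else 0.0

corpus_a = {"x.md", "y.md"}
corpus_b = {"x.md", "y.md", "z.md"}
shared = corpus_a & corpus_b  # the subset both systems can retrieve

gold = {"q1": "x.md", "q2": "z.md"}
results_b = {"q1": ["z.md", "x.md"], "q2": ["z.md"]}

unfiltered = recall_at_k(results_b, gold, corpus_b, k=1)
on_shared = recall_at_k(results_b, gold, shared, k=1)
```

Scoring every system against `shared` keeps the denominator and the candidate pool identical across the comparison.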
Gotcha 5: Query-distribution overfit#
If you mine queries via methodology M1, tune the system on those queries, and report recall measured on the same M1 queries — you've overfit to M1's bias.
Detection: build a 2nd query-set via a different methodology M2 (e.g., heading-mining vs IDF-mining). Measure the delta.
Concrete delta in our case: tuning 85.39% (IDF-mining) vs held-out 69.66% (heading-mining) = 15.73pp, roughly 16pp of methodology overfit.
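The comparison is simple arithmetic, sketched here with the two recall figures above. The tolerance `MAX_GAP_PP` is a hypothetical threshold chosen for illustration, not a value from the audit.

```python
def methodology_gap_pp(tuning_recall_pct, heldout_recall_pct):
    """Gap in percentage points between tuning-set and held-out recall."""
    return round(tuning_recall_pct - heldout_recall_pct, 2)

MAX_GAP_PP = 5.0  # hypothetical tolerance before suspecting overfit

gap = methodology_gap_pp(85.39, 69.66)
overfit_suspected = gap > MAX_GAP_PP
```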
Gotcha 6: Result-filtering after fetch#
search.fetch(topK=5) → returns 5 IDs
filter for `id in valid_id_set` → 3 IDs remain (2 dropped as orphans)
take top-K=5 → only 3 results
# Methodology bug: the 2 dropped IDs may have been valid matches that we couldn't resolve
Symptom: fewer than K results reach scoring; measured recall is biased downward, but by an amount that varies from query to query rather than by a constant offset.
Fix: either (a) fetch more (topK=fetch_k > K) and filter, or (b) ensure valid_id_set is complete (no orphans).
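Option (a) can be sketched as below. The `search` callable, its `(query, top_k)` signature, and the default `fetch_k=20` are assumptions for illustration; substitute the call your search engine actually exposes.

```python
def top_k_resolved(search, query, valid_ids, k=5, fetch_k=20):
    """Over-fetch, drop unresolvable IDs, then cut to k.

    Returns (resolved top-k IDs, number of fetched IDs that were dropped).
    """
    candidates = search(query, fetch_k)
    resolved = [i for i in candidates if i in valid_ids]
    return resolved[:k], len(candidates) - len(resolved)

# Stand-in search engine returning a fixed ranking with two orphan IDs (x, y).
def fake_search(query, top_k):
    return ["a", "x", "b", "y", "c", "d", "e"][:top_k]

valid_id_set = {"a", "b", "c", "d", "e"}
naive, naive_dropped = top_k_resolved(fake_search, "q", valid_id_set, k=5, fetch_k=5)
padded, padded_dropped = top_k_resolved(fake_search, "q", valid_id_set, k=5, fetch_k=7)
```

Logging the dropped count per query is worthwhile either way: a non-zero value means `valid_id_set` has orphans, and over-fetching only hides that rather than fixing it.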
Detection checklist#
Before publishing a benchmark number, run this checklist:
- Index dedup: query for a known-good doc; does it appear once or multiple times in top-N?
- Mapping consistency: is each `id` in the search result resolvable to exactly one `path`?
- Title/metadata explicit: did I set title/metadata explicitly, or is auto-generation injecting leak-signal?
- Corpus parity: do all systems being compared index the same N docs (boundary-tight)?
- Query-distribution diversity: are queries mined with ≥2 methodologies; is the recall similar on both?
- Result-filtering loss: is `len(top_K_returned)` consistently == K, or sometimes less?
Generalizes to#
- Vector DB benchmarks (Pinecone, Weaviate, Chroma, Milvus) — same dedup/mapping/title-leak gotchas
- RAG eval frameworks (RAGAS, TruLens, Phoenix) — same pipeline fidelity questions
- MCP-tool benchmarks — auto-generated metadata can leak signals
- A/B test analysis — corpus-size mismatch is the silent overfit-source in product experiments
- ML training-set vs test-set hygiene — same theme, different abstraction layer
Source verified#
- RRF 91% → 78% finding: ../06-Audits/2026-05-20 Production-stack v2 — RRF fusion CLI + systemd + cron-mirror + cross-validation
- Head-to-head TIE methodology: ../06-Audits/2026-05-20 agentmemory head-to-head LongMemEval-S R@5 — TIE 52.81 percent, 22pp ensemble-gain potential