Tuning vs production-recall — honest reporting#
[!info] When to apply
Before publishing a single-number benchmark result (R@K, accuracy, F1, recall, etc.), cross-validate on at least 2 methodology-distinct query/test sets. The "tuning-recall" (measured on the set you iterated against) is almost always inflated relative to the "production-recall" (measured on unseen workloads). Quantify the gap. Publish the average, not the best.
A pattern#
Build tuning-set T (methodology M1)
Build held-out-set H (methodology M2, same domain, NO overlap in tuning iterations)
Tuning recall = recall(system, T) # the number you optimized
Held-out recall = recall(system, H) # the number you should report
Production recall ≈ (tuning + held-out) / 2 (often a fair estimate)
Production recall ≈ min(tuning, held-out) (pessimistic but safer for users)
Never publish only the tuning number. It misleads downstream users about what they'll actually experience.
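The pattern above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark harness used here: the `Query` tuple shape, the `search` callable signature, and the `RecallReport` name are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical shapes, assumed for illustration only:
# a query is (query_text, relevant_doc_id); `search` returns ranked doc ids.
Query = tuple[str, str]
SearchFn = Callable[[str, int], Sequence[str]]


def recall_at_k(search: SearchFn, queries: Sequence[Query], k: int = 5) -> float:
    """Fraction of queries whose relevant doc appears in the top-k results."""
    if not queries:
        raise ValueError("empty query set")
    hits = sum(1 for text, doc_id in queries if doc_id in search(text, k)[:k])
    return hits / len(queries)


@dataclass(frozen=True)
class RecallReport:
    tuning: float    # recall on T, the set you optimized against
    held_out: float  # recall on H, built with a different methodology

    @property
    def average(self) -> float:
        return (self.tuning + self.held_out) / 2

    @property
    def pessimistic(self) -> float:
        return min(self.tuning, self.held_out)

    @property
    def gap(self) -> float:
        # Positive gap = tuning number is inflated relative to held-out.
        return self.tuning - self.held_out


def honest_report(search: SearchFn, tuning_set: Sequence[Query],
                  held_out_set: Sequence[Query], k: int = 5) -> RecallReport:
    """Measure both sets with the same system and the same k."""
    return RecallReport(
        tuning=recall_at_k(search, tuning_set, k),
        held_out=recall_at_k(search, held_out_set, k),
    )
```

Returning both numbers in one object makes it harder to publish the tuning figure alone, because the gap is always computed alongside it.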
Concrete example — RRF hybrid-fusion (2026-05-20 finding)#
We built vault-search-fusion (an RRF hybrid of vault-search + agentmemory) and tuned fetch-k and k_rrf on an 89-query LongMemEval-style query set mined via IDF (rare-term 2-grams from the session body).
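For readers unfamiliar with the two tuned knobs, here is a generic sketch of Reciprocal Rank Fusion. It is not the vault-search-fusion implementation. The function name, the default values (60 is a commonly used RRF constant), and the tie-breaking rule are assumptions for illustration. `fetch_k` is taken to mean how many candidates are pulled from each retriever before fusing.

```python
from collections import defaultdict
from typing import Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k_rrf: int = 60,
             fetch_k: int = 20, top_n: int = 5) -> list[str]:
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion.

    Each doc scores sum(1 / (k_rrf + rank)) over the lists it appears in,
    where rank is 1-based and only the first fetch_k entries of each list count.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking[:fetch_k], start=1):
            scores[doc_id] += 1.0 / (k_rrf + rank)
    # Highest score first; ties broken by doc id so the output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]
```

A larger `k_rrf` flattens the difference between top and lower ranks, while a larger `fetch_k` lets documents ranked low by one retriever still be rescued by the other. Both change which documents land in the top 5, which is why they are easy to overfit to a single query set.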
| Step | What we measured | Result |
|---|---|---|
| 1. Initial benchmark (with duplicates) | IDF-mined, mixed-state agentmemory | 91.01% R@5 |
| 2. Clean re-setup (575 unique docs) | Same IDF-mined queries (TUNING) | 85.39% R@5 |
| 3. Cross-validation (heading-mined queries, HELD-OUT) | Same RRF system, different query-mining | 69.66% R@5 |
| 4. Honest average | (85.39 + 69.66) / 2 | 77.5% R@5 |
The 91% would have been the marketing number. It contained a roughly 6pp duplicate artifact plus roughly 8pp of tuning overfit, about 13.5pp of total inflation (91.01 − 77.5) relative to the estimated production recall. The correct number to publish was 77.5% (or the pessimistic 69.66%).
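The inflation breakdown can be re-derived from the table with a few subtractions. The sketch below uses only the three measured R@5 values from the table. The function and key names are made up for the example, and "production" is taken to be the average of tuning and held-out, as in step 4.

```python
def decompose_inflation(initial: float, tuning: float, held_out: float) -> dict[str, float]:
    """Split the gap between the first headline number and the honest average.

    All values are percentages (R@5). The production estimate is assumed to be
    the tuning/held-out average.
    """
    average = (tuning + held_out) / 2
    return {
        "average": average,
        "duplicate_artifact": initial - tuning,  # removed by the clean re-ingest
        "tuning_overfit": tuning - average,      # removed by cross-validation
        "total_inflation": initial - average,
        "pessimistic": min(tuning, held_out),
    }


if __name__ == "__main__":
    # Values from steps 1-3 of the table above.
    for name, value in decompose_inflation(91.01, 85.39, 69.66).items():
        print(f"{name:>20}: {value:.2f}")
```

The two components sum to the total by construction, so any mismatch in a published breakdown points to rounding or to a mislabeled step.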
Hidden inflation sources#
| Source | Typical inflation | Detection |
|---|---|---|
| Data duplicates (multiple ingest-batches, near-duplicate content) | 3-10pp | Re-ingest from clean state, verify single entry per source |
| Title-leak / metadata-leak (target tokens appear in indexed metadata) | 5-20pp | Use generic title/metadata, measure delta |
| Query-distribution overfit (test queries mined the same way as tuning queries) | 5-15pp | Use a different query-mining method on held-out |
| Corpus-size mismatch (one system indexes more than another) | 5-30pp | Enforce identical corpus boundaries |
| Reranker over-tuned to test-set (cross-encoder seeing tuning queries) | 3-10pp | Hold out 30% of queries from any cross-encoder fine-tuning |
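The first row's detection step ("verify single entry per source") can be partly automated. The sketch below is a minimal check that assumes the index can be exported as a mapping from doc id to body text. That export format and the function name are assumptions. It only catches exact duplicates after case and whitespace normalization, so near-duplicate content would need a fuzzier method.

```python
import hashlib
from collections import defaultdict


def find_duplicate_docs(docs: dict[str, str]) -> list[list[str]]:
    """Group doc ids whose normalized body text is identical.

    Normalization here is deliberately simple: lowercase and collapse whitespace.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for doc_id, body in docs.items():
        normalized = " ".join(body.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)
    # Keep only hashes shared by more than one doc id.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

An empty result is a reasonable precondition for calling a benchmark run a "fresh re-ingest". A non-empty result is worth recording next to the reported number.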
When honest-reporting matters most#
- OSS project benchmark in README (users will quote it) → publish honest cross-val
- HN/Reddit launch posts (community will fact-check) → publish honest cross-val + methodology
- Internal vs external recall claims (sales-deck vs engineering-doc) → same number both places, no double-standard
- Marketing pages (claims become commitment) → use the pessimistic number (min of tuning/held-out), not average
When tuning-recall is OK to cite#
- Internal optimization decision ("which knob to turn next") — tuning recall is the right signal for direction
- Ablation studies ("removing component X drops tuning by Y") — relative deltas hold across methodologies
- Engineering changelog / sprint-retro (clearly labeled as "on tuning set X")
Implementation tips#
- Always have ≥2 query-mining methodologies for the same domain. Build them up-front, before any tuning iteration. E.g., for retrieval: IDF-mined, heading-mined, hand-curated, NL-paraphrase (LLM-generated, on held-out content).
- Label benchmark output clearly: R@5 (tuning, n=89, IDF), R@5 (held-out, n=89, heading), R@5 (average). Don't hide which is which.
- Document the duplication-state of the index in every benchmark: "fresh re-ingest", "incremental updates", "with X duplicates of Y files". Duplicates are the most common 3-10pp inflation source and are easy to overlook.
- Re-publish corrected numbers when you find inflation. Don't suppress. The community respects honesty more than precision.
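One way to keep the labels from drifting is to generate them from the same data that produced the number. The helper below is a hypothetical sketch; the function name and the two-decimal output format are assumptions, chosen to match the label style suggested in the tips above.

```python
from typing import Optional


def format_result(metric: str, value_pct: float, split: str,
                  n: Optional[int] = None, method: Optional[str] = None) -> str:
    """Render a benchmark number with its split, sample size and mining method."""
    details = [split]
    if n is not None:
        details.append(f"n={n}")
    if method is not None:
        details.append(method)
    return f"{metric} ({', '.join(details)}): {value_pct:.2f}%"


if __name__ == "__main__":
    print(format_result("R@5", 85.39, "tuning", n=89, method="IDF"))
    print(format_result("R@5", 69.66, "held-out", n=89, method="heading"))
    print(format_result("R@5", 77.5, "average"))
```

Because the split name is a required argument, a number cannot be printed without saying which set it came from.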
Generalizes to#
- G-Eval / LLM-as-judge ground-truth ceiling (g-eval-bias-mitigation-pattern): the measuring classifier's own noise sets a ceiling on observable agreement
- A/B test reporting: the winning condition on the training cohort is not necessarily the winning condition on a new cohort
- Model-card publication: a fine-tuned model's "achieves X%" claim is usually inflated by 5-15pp vs production
- OSS retrieval benchmarks: BEIR/MTEB leaderboards have known overfit dynamics (researchers iterate against the public test set)
Source verified#
- RRF hybrid-fusion finding: ../06-Audits/2026-05-20 Production-stack v2 — RRF fusion CLI + systemd + cron-mirror + cross-validation
- G-Eval ceiling analogue: g-eval-bias-mitigation-pattern "Mining-classifier ground-truth-ceiling (B-8 100-bullet κ=0.708 finding)"
- BEIR/MTEB known overfit: industry consensus, no single citation