Reranker cost optimization — NOT size, trigger rate#
🎧 Audio overview#
- Deep-dive podcast (NotebookLM 2-host, ~5 min, EN): reranker-cost-optimization-not-size.en-podcast.mp3 (39 MB) — "Why Smaller AI Models Worsen Search"
The problem#
Cross-encoder reranker (bge-reranker-v2-m3, 568MB) takes warm wall-clock 20s / 30 candidates × max_length=256 on CPU. The naive mitigation: smaller model. But an A/B test refuted this: bge-reranker-base (277MB) gives a 3.84× speedup (warm-median 5.4s), BUT the <500ms target is UNREACHABLE on CPU with any XLM-R-base model.
A/B test result (5 queries × 3 repeats, daemon warm)#
| Metric | v2-m3 (568MB) | base (277MB) | Δ |
|---|---|---|---|
| Wall-clock warm median | 20.9s | 5.4s | 3.84× speedup |
Server-side rerank_ms median | 20.7s | 5.2s | 3.97× |
| RPC overhead | ~270ms | ~270ms | — |
| Top-3 overlap | — | — | 12/15 = 80% |
| RAM peak (both loaded) | 4543MB | 5001MB | +458MB |
| Cold-load latency | ~24s | 17.7s | -26% |
Verdict: opt-in (--reranker-model base) released, default shift NOT recommended due to 10-15% precision loss on noisy top-3.
The real cost map#
Reranker total cost = (trigger_rate × per_invocation_cost)
per_invocation_cost = (encode_time × n_candidates) + tokenize_overhead
REDUCTION axes:
┌─ trigger_rate ⇣ ── smart-trigger (cosine < 0.65) ────── -30-50% calls
│ ── score-gap smart-skip ─────────────── -30% additional
│
├─ encode_time ⇣ ── model-size reduction (v2-m3 → base) ── -75% per call
│ ── ONNX-INT8 quantization ────────────── -50% additional
│ ── batch-size optimization ───────────── -10-20%
│
├─ n_candidates ⇣ ── top-K reduce (50 → 30) ─────────── -40%
│ ── early-termination cosine-cap ────── -10-30%
│
└─ tokenize_overhead ⇣ ── max_length reduce (256 → 128) ── -50% / acc-loss 5-10%
── query-cache (LRU 5min TTL) ──── -50% for repeat queries
Best ROI: trigger-rate optimization (-50%) + ONNX-INT8 (-50%) → 4× speedup without changing model size = ~5s wall-clock (still 10× over <500ms target).
Sub-second CPU is unreachable — accept and route#
An XLM-R-base cross-encoder with 30-candidate × 256-token reranking on CPU has a fundamental latency floor of ~3-4s (matmul-bound). The <500ms target is reachable only via:
- GPU — 30-50× speedup (10s → 200-300ms), but infrastructure cost
- ColBERT-style late interaction — pre-computed doc embeddings, sub-50ms query time (Reimers et al. 2020), but query-side encoding still needed
- DistilBERT-quantize-ONNX — 50MB model, INT8 + AVX2 → ~1-2s (still 2-4× over target)
- Re-route: skip reranking, do dense top-K fine-tuning instead — query-specific cross-encoder rerank only on top-3 (n_candidates=3, ~1s reachable)
When the 3.84× opt-in is worthwhile#
- ✅ Bulk retrieval (>30 queries) where 10-15% precision loss is tolerated
- ✅ Latency budget 5-10s range (NOT interactive)
- ❌ Interactive UX (search-as-you-type, <500ms target)
- ❌ High-precision use case (audit, citation)
Design principles#
- Trigger-rate FIRST — score-gap smart-skip + cosine-cap is the cheapest implementation, no RAM cost
- Per-task model choice —
--reranker-model {v2-m3|base|auto}flag, multi-model cacheRerankerSingleton - Default OFF for base — v2-m3 stays default due to 80% top-3 overlap
- ONNX-INT8 follow-up — quantization-aware re-export, ~50% additional speedup
- Query cache 5min TTL — repeat queries (cron-monitor, dashboard-refresh) -50% cost
Trade-offs#
- ⚠️ Sub-second CPU target unreachable — accept and prioritize trigger reduction
- ⚠️ Multi-model cache RAM +458MB (both loaded)
- ✅ 3.84× speedup in opt-in mode (5s wall-clock still manageable)
- ✅ Precision loss 10-15% acceptable in bulk-retrieval context
Live application#
A search service with --reranker-model {v2-m3|base|auto} flag is in production. Multi-model cache RerankerSingleton per-HF-id lazy-load. Smart-trigger threshold 0.65 stays constant. Follow-up: score-gap smart-skip + ONNX-INT8 + query cache.
Related#
- smart-trigger-cost-pattern — cheap-baseline + expensive-second-pass base pattern
- memgraph-ce-feature-limits — native vector index (cosine first-pass)
- sv-01-memory-architecture — search architecture research
- vendor-feature-verify-before-workaround — analogous lesson (release-cycle check)