Kihagyás

G-Eval bias-mitigation pattern#

TL;DR: When the LLM that generated the content also judges it ("Claude scores Claude"), self-enhancement bias inflates scores by ~25% in published measurements. Our 4-block bias-mitigation prompt cut auto-promotion from 10/10 → 6/10 on a 10-bullet paired sample, average confidence 0.880 → 0.760. But a 30-sample paired calibration revealed an honest catch: bias-mitigation is symmetric (tightens both Pass AND Fail classes), causing 47% Pass-recall loss (7/15 good bullets falsely discarded). Conclusion: opt-in env-var, NOT default shift. The asymmetry signal is the critical metric most evals don't measure.

Origin: Originally written in Hungarian as part of MyForge Vault 11.11 — Superintelligent Vault project. Source: g-eval-bias-mitigation-pattern (Hungarian version).

What this is NOT#

  • NOT a benchmark — n=10 and n=30 paired samples are small. Treat numbers as directional signal, not statistical proof.
  • NOT a recommended default — our own 30-sample calibration produced a CONDITIONAL PASS with 47% Pass-recall loss, so we run v0.3 as opt-in only.
  • NOT a substitute for multi-judge ensemble — if you can afford 3+ judges from different families, do that instead. This pattern is for cost-sensitive single-judge setups.
  • NOT magic prompt engineering — the 4 bias blocks are textbook (self-enhancement, verbosity, position, halo). The contribution is the measured asymmetry finding, not the prompt itself.

The problem#

An LLM-as-judge setup (G-Eval, in-house subagent scorer) exhibits inherent bias when the generator and the judge come from the same model family:

  • Self-enhancement bias — Claude scoring its own outputs (+~25% in published measurements)
  • Verbosity bias — longer text scores better ("more = more thoughtful")
  • Position bias — first/last option preferred
  • Halo / authority bias — confident wording (MUST, ALWAYS, kötelező) scores higher

The Claude-Code subagent-scorer setup hits all of these, so it's the critical risk for any Claude-judges-Claude eval pipeline.

The pattern#

Embed 4 explicit bias blocks + calibration anchors + CoT bias-self-check in the G-Eval prompt template:

You are scoring a Learning bullet. Be aware of these biases:

1. **Self-enhancement bias** — assume the bullet was authored by a DIFFERENT agent (not yourself). Score the content, not familiarity.
2. **Verbosity bias** — length does NOT correlate with quality. Brevity is a feature. A 1-sentence Learning can outscore a 3-paragraph one.
3. **Position bias** — judge the substance, not the order.
4. **Halo/authority bias** — confident wording ("MUST", "always") does NOT make a Learning truer.

Calibration anchors:
- BAD-BUT-VERBOSE: "A well-considered brand narrative is an important part of any project..." → discard (platitude)
- GOOD-AND-TERSE: "Hostinger ICANN-redirect ignores Cache-Control headers" → auto-prop (specific, novel, actionable)

Before scoring, write one line:
"Bias-self-check: I detected the following biases in my draft scoring: [...]"

Then score the bullet on dim1-4 (1-5 scale).

Live measurement (10-bullet paired sample)#

Same 10-bullet input, v0.2 prompt vs v0.3 bias-mitigated:

Metric v0.2 v0.3 Δ
avg confidence 0.880 0.760 -0.12
auto-prop count 10/10 6/10 -4 (40%)
dim2 avg 4.2 3.6 -0.6 (verbosity de-bias)
dim3 avg 3.8 3.2 -0.6 (reusability de-bias)

The 4 re-routed bullets were exactly the typical self-enhancement / verbosity targets (brand narrative, category enumeration, "well-articulated" psychological observation, single-tool fact).

When it's worth applying#

  • ✅ Generator AND judge are the same LLM family (Claude-Claude, GPT-GPT, Gemini-Gemini)
  • ✅ Cost-sensitive eval (multi-judge ensemble is expensive) — explicit bias mitigation yields 5%+ accuracy gain
  • ✅ Borderline-confidence range (0.70-0.85), where bias can dominate the verdict

Trade-off#

  • ⚠️ The v0.3 prompt uses more tokens (calibration anchor + bias-self-check CoT) — ~20% token overhead per scoring
  • ⚠️ Initial measurement shows auto-prop count drops drastically (10→6) — so threshold ramps slow down
  • ✅ In exchange: the scorer gives non-bias-shifted results → higher quality bar at the crystallize pipeline

30-sample paired calibration — symmetric tightening#

A 30-sample paired calibration produced a CONDITIONAL PASS verdict — an important nuance to the simple "bias-debias → better verdict" narrative:

Metric v0.2 v0.3 Goal Status
0 false-promotion (Fail → auto-prop) 0/15 0/15 0 ✅ both
Gold-agreement 60% 66.7% +5% ✅ +6.7%
Pass-set confidence drop 0.916 0.773 -0.10 ✅ -0.142
Fail-set confidence drop (new metric) 0.502 0.271 -0.10 ✅ -0.231 (~2× goal)
Fail → Fail recall 11/15 15/15 100% ✅ 100% (v0.3)
Pass-recall (v0.3 false-discard) 0/15 7/15 0% 47% Pass-recall loss

Lesson (new): the bias-mitigation prompt is NOT asymmetric tightening (only on the Fail class) — it is symmetric tightening on both classes. The Pass-confidence drop 0.916→0.773 = 7/15 Pass bullets falsely discarded. Precision goal is met, but production replacement with v0.3 is non-trivial — opt-in env-var (SCORER_VERSION=v03) is recommended instead of default shift.

Recommendation (3 options)#

Option Threshold Risk Pass-recall Use case
A (low risk, recommended) v0.2 default + v0.3 opt-in env-var LOW 100% (v0.2) precision-priority, minimum noise
B (medium) v0.3 default + threshold 0.95→0.85 MED 53% (partial compensation) balanced
C (NOT recommended) v0.3 default, threshold unchanged HIGH 53% pure precision focus

Reusable insight: every bias-mitigation prompt evaluation must measure BOTH the Pass-set AND Fail-set confidence drop — the asymmetry is the critical signal.

Audio overview#

  • EN narration (Charon voice): [[.vault-nb/audio-overviews/g-eval-bias-mitigation-pattern.en.mp3]]
  • HU narration (Kore voice): [[.vault-nb/audio-overviews/g-eval-bias-mitigation-pattern.hu.mp3]]

Generated via Gemini 3.1 Flash TTS preview. ~1-2 minutes each. See gemini-3-1-flash-tts-pipeline for the pipeline.