G-Eval bias-mitigation pattern#

TL;DR: When the LLM that generated the content also judges it ("Claude scores Claude"), self-enhancement bias inflates scores by ~25% in published measurements. Our 4-block bias-mitigation prompt cut auto-promotion from 10/10 → 6/10 on a 10-bullet paired sample, average confidence 0.880 → 0.760. But a 30-sample paired calibration revealed an honest catch: bias-mitigation is symmetric (tightens both Pass AND Fail classes), causing 47% Pass-recall loss (7/15 good bullets falsely discarded). Conclusion: opt-in env-var, NOT default shift. The asymmetry signal is the critical metric most evals don't measure.

Origin: Originally written in Hungarian as part of MyForge Vault 11.11 — Superintelligent Vault project. Source: g-eval-bias-mitigation-pattern (Hungarian version).

What this is NOT#

NOT a benchmark — n=10 and n=30 paired samples are small. Treat numbers as directional signal, not statistical proof.
NOT a recommended default — our own 30-sample calibration produced a CONDITIONAL PASS with 47% Pass-recall loss, so we run v0.3 as opt-in only.
NOT a substitute for multi-judge ensemble — if you can afford 3+ judges from different families, do that instead. This pattern is for cost-sensitive single-judge setups.
NOT magic prompt engineering — the 4 bias blocks are textbook (self-enhancement, verbosity, position, halo). The contribution is the measured asymmetry finding, not the prompt itself.

The problem#

An LLM-as-judge setup (G-Eval, in-house subagent scorer) exhibits inherent bias when the generator and the judge come from the same model family:

Self-enhancement bias — Claude scoring its own outputs (+~25% in published measurements)
Verbosity bias — longer text scores better ("more = more thoughtful")
Position bias — first/last option preferred
Halo / authority bias — confident wording (MUST, ALWAYS, kötelező) scores higher

The Claude-Code subagent-scorer setup hits all of these, so it's the critical risk for any Claude-judges-Claude eval pipeline.

The pattern#

Embed 4 explicit bias blocks + calibration anchors + CoT bias-self-check in the G-Eval prompt template:

You are scoring a Learning bullet. Be aware of these biases:

1. **Self-enhancement bias** — assume the bullet was authored by a DIFFERENT agent (not yourself). Score the content, not familiarity.
2. **Verbosity bias** — length does NOT correlate with quality. Brevity is a feature. A 1-sentence Learning can outscore a 3-paragraph one.
3. **Position bias** — judge the substance, not the order.
4. **Halo/authority bias** — confident wording ("MUST", "always") does NOT make a Learning truer.

Calibration anchors:
- BAD-BUT-VERBOSE: "A well-considered brand narrative is an important part of any project..." → discard (platitude)
- GOOD-AND-TERSE: "Hostinger ICANN-redirect ignores Cache-Control headers" → auto-prop (specific, novel, actionable)

Before scoring, write one line:
"Bias-self-check: I detected the following biases in my draft scoring: [...]"

Then score the bullet on dim1-4 (1-5 scale).

Live measurement (10-bullet paired sample)#

Same 10-bullet input, v0.2 prompt vs v0.3 bias-mitigated:

Metric	v0.2	v0.3	Δ
avg confidence	0.880	0.760	-0.12
auto-prop count	10/10	6/10	-4 (40%)
dim2 avg	4.2	3.6	-0.6 (verbosity de-bias)
dim3 avg	3.8	3.2	-0.6 (reusability de-bias)

The 4 re-routed bullets were exactly the typical self-enhancement / verbosity targets (brand narrative, category enumeration, "well-articulated" psychological observation, single-tool fact).

When it's worth applying#

✅ Generator AND judge are the same LLM family (Claude-Claude, GPT-GPT, Gemini-Gemini)
✅ Cost-sensitive eval (multi-judge ensemble is expensive) — explicit bias mitigation yields 5%+ accuracy gain
✅ Borderline-confidence range (0.70-0.85), where bias can dominate the verdict

Trade-off#

⚠️ The v0.3 prompt uses more tokens (calibration anchor + bias-self-check CoT) — ~20% token overhead per scoring
⚠️ Initial measurement shows auto-prop count drops drastically (10→6) — so threshold ramps slow down
✅ In exchange: the scorer gives non-bias-shifted results → higher quality bar at the crystallize pipeline

30-sample paired calibration — symmetric tightening#

A 30-sample paired calibration produced a CONDITIONAL PASS verdict — an important nuance to the simple "bias-debias → better verdict" narrative:

Metric	v0.2	v0.3	Goal	Status
0 false-promotion (Fail → auto-prop)	0/15	0/15	0	✅ both
Gold-agreement	60%	66.7%	+5%	✅ +6.7%
Pass-set confidence drop	0.916	0.773	-0.10	✅ -0.142
Fail-set confidence drop (new metric)	0.502	0.271	-0.10	✅ -0.231 (~2× goal)
Fail → Fail recall	11/15	15/15	100%	✅ 100% (v0.3)
Pass-recall (v0.3 false-discard)	0/15	7/15	0%	❌ 47% Pass-recall loss

Lesson (new): the bias-mitigation prompt is NOT asymmetric tightening (only on the Fail class) — it is symmetric tightening on both classes. The Pass-confidence drop 0.916→0.773 = 7/15 Pass bullets falsely discarded. Precision goal is met, but production replacement with v0.3 is non-trivial — opt-in env-var (SCORER_VERSION=v03) is recommended instead of default shift.

Recommendation (3 options)#

Option	Threshold	Risk	Pass-recall	Use case
A (low risk, recommended)	v0.2 default + v0.3 opt-in env-var	LOW	100% (v0.2)	precision-priority, minimum noise
B (medium)	v0.3 default + threshold 0.95→0.85	MED	53% (partial compensation)	balanced
C (NOT recommended)	v0.3 default, threshold unchanged	HIGH	53%	pure precision focus

Reusable insight: every bias-mitigation prompt evaluation must measure BOTH the Pass-set AND Fail-set confidence drop — the asymmetry is the critical signal.

Audio overview#

EN narration (Charon voice): [[.vault-nb/audio-overviews/g-eval-bias-mitigation-pattern.en.mp3]]
HU narration (Kore voice): [[.vault-nb/audio-overviews/g-eval-bias-mitigation-pattern.hu.mp3]]

Generated via Gemini 3.1 Flash TTS preview. ~1-2 minutes each. See gemini-3-1-flash-tts-pipeline for the pipeline.

Crystallization-protocol — host protocol
verification-step-before-claim — independent eval signal
claude-code-subagent-fanout — fanout pattern that often pairs with G-Eval scoring