LLM-as-judge · Self-preference · Leaderboards · Bias · Reproducibility

The judge in the mirror: LLM evaluators favor their own family, but cannot see why

IOV LABS audited self-preference in LLM-as-judge setups on four current frontier models, judging blind and scoring against a consensus baseline that separates bias from quality. Every judge inflates its own family over the neutral panel, by about 14 points on average. But the standard story is wrong: only one model recognizes its own outputs, yet all four self-prefer. The bias is implicit. Beneath it, the first-shown answer wins 63% of the time, and response length correlates with the verdict at 0.98.

A growing share of machine evaluation is machines evaluating machines: leaderboards ranked by a judge model, reinforcement learning from AI feedback, agents that grade their own work before continuing. All of it assumes the judge is impartial. We tested that on four current frontier models in two families, and the assumption does not hold.

+0.14: mean self-preference index (own family vs neutral panel)

1 of 4: models that recognize their own outputs, yet all 4 self-prefer

63% · 0.98: position bias (first wins) and length-winrate correlation

A fourteen-point thumb on the scale

Each model answered 24 open-ended prompts. Each model then judged which of two responses was better, blind to authorship and in both presentation orders, for 1,152 pairwise comparisons in total. The trick is the baseline: a judge preferring its own output is bias only if it does so beyond what the other judges think of the same text. Scored against that leave-one-out consensus, every judge inflates its own family. The mean self-preference index is +0.14, and GPT-4o reaches +0.21. The bias operates at the family level, not just at the level of the exact model: a Claude judge favors both Claude models, and a GPT judge favors both GPT models. For a same-vendor leaderboard, the judge's vendor is on the scale.
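The comparison count can be checked with a short enumeration. This is a minimal sketch of one design that yields the stated total, assuming every judge sees every unordered pair of the four models' responses on every prompt, once in each order. The article does not spell out the enumeration, so the scheme and the function name `enumerate_comparisons` are assumptions.

```python
from itertools import combinations

# Model names come from the article; the enumeration scheme is an assumption.
MODELS = ["GPT-4o", "GPT-4o-mini", "Claude Sonnet 4.6", "Claude Haiku 4.5"]
N_PROMPTS = 24


def enumerate_comparisons(models, n_prompts):
    """Yield (judge, prompt_id, first_shown, second_shown) for each blind comparison."""
    for judge in models:
        for prompt_id in range(n_prompts):
            # Every unordered pair of response authors ...
            for a, b in combinations(models, 2):
                # ... presented in both orders, so position effects can be separated out.
                yield (judge, prompt_id, a, b)
                yield (judge, prompt_id, b, a)


comparisons = list(enumerate_comparisons(MODELS, N_PROMPTS))
# 4 judges * 24 prompts * 6 pairs * 2 orders
print(len(comparisons))
```

Under this assumed design the total is 1,152, which matches the figure reported above.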

Self-preference index by judge (own family win rate minus neutral consensus; higher = more biased):

GPT-4o: +0.21

Claude Haiku 4.5: +0.14

Claude Sonnet 4.6: +0.14

GPT-4o-mini: +0.07
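The index can be sketched in a few lines. This is an illustration under stated assumptions, not the audit's actual code: it treats the consensus as the pooled votes of all judges other than the one being scored, it only counts cross-family pairs, and the tuple layout and the function name `self_preference_index` are hypothetical. The toy judgments are made up; only the four per-judge values in `REPORTED` come from the article.

```python
FAMILY = {
    "GPT-4o": "gpt",
    "GPT-4o-mini": "gpt",
    "Claude Sonnet 4.6": "claude",
    "Claude Haiku 4.5": "claude",
}


def self_preference_index(judgments, judge):
    """Own-family win rate for `judge` minus the same rate among all other judges.

    judgments: iterable of (judge, model_a, model_b, winner), winner is model_a or model_b.
    """
    fam = FAMILY[judge]
    own_votes, panel_votes = [], []
    for j, a, b, winner in judgments:
        if FAMILY[a] == FAMILY[b]:
            continue  # same-family pairs carry no family-preference signal
        own_family_won = FAMILY[winner] == fam
        # Leave-one-out: the scored judge's votes never enter the consensus.
        (own_votes if j == judge else panel_votes).append(own_family_won)
    return sum(own_votes) / len(own_votes) - sum(panel_votes) / len(panel_votes)


# Toy judgments (invented for illustration) on two cross-family pairs.
toy = [
    ("GPT-4o", "GPT-4o", "Claude Sonnet 4.6", "GPT-4o"),
    ("GPT-4o", "GPT-4o-mini", "Claude Haiku 4.5", "GPT-4o-mini"),
    ("GPT-4o-mini", "GPT-4o", "Claude Sonnet 4.6", "GPT-4o"),
    ("GPT-4o-mini", "GPT-4o-mini", "Claude Haiku 4.5", "GPT-4o-mini"),
    ("Claude Sonnet 4.6", "GPT-4o", "Claude Sonnet 4.6", "Claude Sonnet 4.6"),
    ("Claude Sonnet 4.6", "GPT-4o-mini", "Claude Haiku 4.5", "GPT-4o-mini"),
    ("Claude Haiku 4.5", "GPT-4o", "Claude Sonnet 4.6", "GPT-4o"),
    ("Claude Haiku 4.5", "GPT-4o-mini", "Claude Haiku 4.5", "Claude Haiku 4.5"),
]
toy_index = self_preference_index(toy, "GPT-4o")

# Per-judge values as reported in the article; their mean is the headline figure.
REPORTED = {"GPT-4o": 0.21, "Claude Haiku 4.5": 0.14, "Claude Sonnet 4.6": 0.14, "GPT-4o-mini": 0.07}
mean_reported = round(sum(REPORTED.values()) / len(REPORTED), 2)
print(toy_index, mean_reported)
```

Subtracting the consensus rate is what separates bias from quality: if the other judges also prefer the text, the preference is not counted against the judge.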

But it cannot see its own face

The leading account says judges favor their own outputs because they recognize them as their own. On these models that link breaks. Only Claude Sonnet 4.6 can identify its own work above chance; the other three show no recognition at all, and they still self-prefer. The bias is therefore implicit and stylistic: a pull toward prose drawn from one's own training distribution, not a deliberate "that one is mine." A debiasing strategy aimed at suppressing recognition would miss it entirely.
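One way to make "above chance" concrete is an exact one-sided binomial test. The sketch below assumes a two-alternative recognition task ("which of these two responses is yours?") with a chance rate of 0.5. The article does not describe the recognition protocol, so the task format, the function name `binomial_p_value`, and the counts in the example are all hypothetical.

```python
from math import comb


def binomial_p_value(successes, trials, chance=0.5):
    """Exact one-sided probability of at least `successes` correct picks under pure guessing."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical counts, not the audit's data: 18 of 24 correct versus 12 of 24 correct.
p_recognizer = binomial_p_value(18, 24)
p_guesser = binomial_p_value(12, 24)
print(p_recognizer < 0.05, p_guesser < 0.05)
```

A judge that fails such a test but still shows a positive self-preference index fits the implicit-bias reading described above.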

The cruder biases underneath

Before self-preference is even in view, two blunter biases dominate. The first-shown response wins 63 percent of the time, a position bias large enough to flip many verdicts, which is why both orders must be run. And length predicts the verdict almost deterministically, with a correlation of 0.98: the longer response wins, almost without exception. Any pipeline that fixes the order or ignores length is measuring presentation, not quality.
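Both checks are cheap to run on any judging pipeline. The following sketch assumes verdicts are stored as "first" or "second", and that the length effect is measured as a Pearson correlation between response length and win rate. The article does not state how its 0.98 was computed, so that choice, the function names, and the toy inputs are assumptions.

```python
from math import sqrt


def first_position_win_rate(verdicts):
    """Share of comparisons won by the first-shown response; 0.5 means no position bias."""
    return sum(v == "first" for v in verdicts) / len(verdicts)


def pearson(xs, ys):
    """Plain Pearson correlation, e.g. between response length and win rate."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def order_consistent_winner(winner_ab, winner_ba):
    """Keep a verdict only if the same response wins in both orders; otherwise no verdict."""
    return winner_ab if winner_ab == winner_ba else None


# Toy inputs, invented for illustration.
rate = first_position_win_rate(["first", "first", "second", "first", "second", "first", "first", "second"])
r = pearson([100, 200, 300, 400], [0.2, 0.4, 0.6, 0.8])
print(rate, r, order_consistent_winner("model_x", "model_y"))
```

Discarding order-inconsistent verdicts is one conservative option; counting them as ties or half-wins is another, and the choice should be fixed before the run.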

An LLM judge is a mirror that flatters its own reflection, and it does so without knowing the face is its own. Objectivity here is not a property of a better judge; it is a property of a panel, assembled across families, with the crude knobs held still.

Why it matters

Self-grading agents and same-vendor leaderboards inherit a structural generosity toward themselves of about fourteen points, and reinforcement from a model's own preference signal optimizes toward its own distribution by the same margin. The honest design assumes the judge is partial and builds the cross-family panel a partial judge requires. We ship the pre-registered design, the blind judgments, and the cached run.
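A cross-family panel can be as simple as a family-balanced vote. This is a sketch of one possible aggregation rule, not the design shipped with the audit: each family's judges are averaged first, so a family cannot gain weight by contributing more judges, and a panel drawn from a single family is rejected. The function name `panel_verdict` and the vote format are hypothetical.

```python
from collections import defaultdict


def panel_verdict(votes, family_of):
    """Family-balanced vote over per-judge verdicts.

    votes: {judge_name: "A" or "B"}; family_of: {judge_name: family_name}.
    Returns "A", "B", or None when the families split evenly.
    """
    by_family = defaultdict(list)
    for judge, vote in votes.items():
        by_family[family_of[judge]].append(1.0 if vote == "A" else 0.0)
    if len(by_family) < 2:
        raise ValueError("panel must span at least two model families")
    # Average within each family, then across families, so each family counts once.
    score = sum(sum(v) / len(v) for v in by_family.values()) / len(by_family)
    if score > 0.5:
        return "A"
    if score < 0.5:
        return "B"
    return None


family_of = {
    "GPT-4o": "gpt",
    "GPT-4o-mini": "gpt",
    "Claude Sonnet 4.6": "claude",
    "Claude Haiku 4.5": "claude",
}
split = panel_verdict(
    {"GPT-4o": "A", "GPT-4o-mini": "A", "Claude Sonnet 4.6": "B", "Claude Haiku 4.5": "B"},
    family_of,
)
leaning = panel_verdict(
    {"GPT-4o": "A", "GPT-4o-mini": "A", "Claude Sonnet 4.6": "A", "Claude Haiku 4.5": "B"},
    family_of,
)
print(split, leaning)
```

When the families split along vendor lines, the sketch returns no verdict instead of letting the larger or louder family decide.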
