A growing share of machine evaluation is machines evaluating machines: leaderboards ranked by a judge model, reinforcement learning from AI feedback, agents that grade their own work before continuing. All of it assumes the judge is impartial. We tested that assumption on four current frontier models from two families, and it does not hold.
A fourteen-point thumb on the scale
Each model answered 24 open-ended prompts. Each then served as judge, blind to authorship and in both presentation orders, choosing the better of two responses, for 1,152 pairwise comparisons in total. The key is the baseline: a judge preferring its own output counts as bias only if it does so more than every other judge prefers the same text. Scored against that leave-one-out consensus, every judge inflates its own family. The mean self-preference index is +0.14, and GPT-4o reaches +0.21. The bias operates at the family level, not just for the exact model: a Claude judge favors both Claude models, and a GPT judge favors both GPT models. For a same-vendor leaderboard, the judge's vendor is on the scale.
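The arithmetic and the baseline can be sketched in a few lines of Python. This is a minimal illustration, not the study's code. It assumes that the 1,152 total decomposes as every unordered pair of the four models, times 24 prompts, times two orders, times four judges, which matches the stated total but is not spelled out in the text. It also assumes one plausible definition of the index: the judge's rate of picking its own family in cross-family pairs, minus the rate at which all other judges pick that family. The model labels and the record layout are hypothetical.

```python
from itertools import combinations

# Hypothetical labels: the text names only GPT-4o and Claude Sonnet 4.6.
FAMILY_OF = {
    "claude-a": "claude", "claude-b": "claude",
    "gpt-a": "gpt", "gpt-b": "gpt",
}

def comparison_count(n_models, n_prompts, n_orders=2):
    """Assumed design: every model judges every unordered pair, in both orders."""
    n_pairs = len(list(combinations(range(n_models), 2)))
    return n_pairs * n_prompts * n_orders * n_models

def self_preference_index(judgments, judge, family_of):
    """Judge's own-family pick rate minus the leave-one-out consensus rate.

    judgments: list of dicts with keys "judge", "a", "b", "winner" (model names).
    Only pairs that pit the judge's family against another family are scored.
    """
    fam = family_of[judge]
    own, others = [], []
    for j in judgments:
        a_in = family_of[j["a"]] == fam
        b_in = family_of[j["b"]] == fam
        if a_in == b_in:
            continue  # both or neither from the judge's family: uninformative
        picked_own_family = family_of[j["winner"]] == fam
        (own if j["judge"] == judge else others).append(picked_own_family)
    if not own or not others:
        raise ValueError("need judgments from the judge and from other judges")
    return sum(own) / len(own) - sum(others) / len(others)
```

A positive value means the judge rates its own family higher than the rest of the panel does on the same pairs; zero means it agrees with the consensus.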
But it cannot see its own face
The leading account says judges favor their own outputs because they recognize them as their own. On these models that link breaks. Only Claude Sonnet 4.6 can identify its own work above chance; the other three show zero recognition, yet self-prefer just as strongly. The bias is therefore implicit and stylistic, a pull toward prose drawn from one's own training distribution, not a deliberate "that one is mine." A debiasing strategy aimed at suppressing recognition would miss it entirely.
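What "above chance" means can be made concrete with an exact binomial test. The sketch below assumes a two-alternative recognition task in which guessing succeeds half the time; the text does not describe the recognition protocol, so the chance level and the function name are assumptions made for illustration.

```python
from math import comb

def p_value_above_chance(correct, trials, chance=0.5):
    """One-sided exact binomial p-value: P(X >= correct) if the judge is guessing."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )
```

A small p-value indicates recognition that guessing is unlikely to produce. A judge can fail this test and still show a positive self-preference index, which is the dissociation described above.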
The cruder biases underneath
Before self-preference is even in view, two blunter biases dominate. The first-shown response wins 63 percent of the time, a position bias large enough to flip many verdicts, which is why both orders must be run. Length predicts the verdict almost deterministically, with a correlation of 0.98: the longer response is judged better almost without exception. Any pipeline that fixes the order or ignores length is measuring presentation, not quality.
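Both checks are cheap to run on any set of judgments. The following sketch assumes each trial is recorded as a (first shown, second shown, winner) tuple and uses a plain Pearson correlation; the text does not say which quantities the 0.98 relates or which coefficient was used, so treat the pairing of length difference with verdict as an assumption. The `consistent_winner` helper illustrates one common way to use both orders, counting a verdict only when it survives the swap.

```python
import math

def first_position_win_rate(trials):
    """Share of trials won by whichever response was shown first."""
    return sum(1 for first, _, winner in trials if winner == first) / len(trials)

def consistent_winner(winner_ab, winner_ba):
    """Keep a verdict only if the same response wins in both presentation orders."""
    return winner_ab if winner_ab == winner_ba else None

def pearson(xs, ys):
    """Pearson correlation, e.g. length difference versus verdict score."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

An unbiased judge would give a first-position win rate near 0.5 when both orders are run, since each response is shown first equally often.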
An LLM judge is a mirror that flatters its own reflection, and it does so without knowing the face is its own. Objectivity here is not a property of a better judge; it is a property of a panel, assembled across families, with the crude knobs held still.
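One way to assemble such a panel is to give each family an equal voice regardless of how many judges it contributes. The sketch below is an assumption about how a cross-family panel could aggregate votes, not the aggregation used in the study; the function name and the vote layout are hypothetical.

```python
from collections import defaultdict

def panel_share(votes, family_of, candidate):
    """Family-balanced vote share for one candidate in one pairing.

    votes: list of (judge, winner) tuples, covering both presentation orders.
    Votes are averaged within each judge family first, then across families,
    so a family with more judges does not outvote the others.
    """
    by_family = defaultdict(list)
    for judge, winner in votes:
        by_family[family_of[judge]].append(1.0 if winner == candidate else 0.0)
    family_means = [sum(v) / len(v) for v in by_family.values()]
    return sum(family_means) / len(family_means)
```

With two judges from one family voting for a response and one judge from another family voting against it, a raw majority gives two thirds while the family-balanced share is one half, which is the intended effect of weighting by family rather than by judge.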
Why it matters
Self-grading agents and same-vendor leaderboards inherit a structural generosity toward themselves of about fourteen points, and reinforcement learning on a model's own preference signal pulls toward its own distribution by the same margin. The honest design assumes the judge is partial and builds the cross-family panel that a partial judge requires. We ship the pre-registered design, the blind judgments, and the cached run.