Pilot · MIT

The Judge in the Mirror: Self-Preference in LLM Evaluators, Without Self-Recognition

Authors
Han Kim
Papers
IOV Labs · open study · 5pp · 2026-06-06

Abstract

Language models increasingly grade language models: on leaderboards, in reinforcement learning from AI feedback, and in agents that check their own work. All of these uses assume an impartial judge. We audit that assumption on four current frontier models from two families, under blind conditions, with a consensus baseline that separates bias from genuine quality. Each model answers 24 open-ended prompts; each then judges, blind to authorship and in both presentation orders, which of two responses is better, for a total of 1,152 pairwise comparisons. The self-preference index, defined as a judge's win rate for its own family minus the leave-one-out consensus of the other judges on the same responses, is positive for every model, with a mean of +0.14 (GPT-4o reaches +0.21), and it operates at the family level rather than only at the level of the exact model. Yet the standard explanation, that judges favor outputs they recognize as their own, fails: only one of the four models can identify its own outputs above chance, while all four self-prefer. The bias is implicit and stylistic, an affinity for one's own distribution, not a recognition of authorship. We also find two generic judge pathologies that dwarf careful use: a position bias (the first response wins 63 percent of the time) and a near-deterministic length preference (correlation 0.98). We close with a discussion of evaluation as a social act and of Goodhart's law when the judge is also a contestant.
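As a reading aid, the arithmetic behind the comparison count and the definition of the self-preference index can be sketched in a few lines of Python. This is a minimal illustration, not the study's code. The function names, the tuple layout of `verdicts`, and the choice to pool the other judges' verdicts into a single rate are all assumptions. The count uses one decomposition that reproduces the stated total: every judge sees every unordered pair of models on every prompt, in both orders.

```python
from math import comb


def count_comparisons(n_models=4, n_prompts=24, n_orders=2):
    """One decomposition (assumed) that reproduces the abstract's total:
    prompts x unordered model pairs x presentation orders x judges."""
    return n_prompts * comb(n_models, 2) * n_orders * n_models


def self_preference_index(verdicts, judge, judge_family):
    """Judge's win rate for its own family minus the pooled rate of all
    other judges on the same comparisons (leave-one-out consensus).

    verdicts: list of (judge_name, comparison_id, winner_family) tuples,
    restricted to comparisons that pit judge_family against another family.
    This layout is hypothetical, chosen only for illustration.
    """
    own = {cid: fam for name, cid, fam in verdicts if name == judge}
    # Keep only other judges' verdicts on comparisons this judge also rated.
    others = [fam for name, cid, fam in verdicts
              if name != judge and cid in own]
    own_rate = sum(f == judge_family for f in own.values()) / len(own)
    consensus = sum(f == judge_family for f in others) / len(others)
    return own_rate - consensus


# Toy data (invented for illustration): judge "a1" belongs to family "A".
toy = [
    ("a1", 1, "A"), ("a1", 2, "A"), ("a1", 3, "A"), ("a1", 4, "B"),
    ("b1", 1, "A"), ("b1", 2, "B"), ("b1", 3, "B"), ("b1", 4, "B"),
    ("b2", 1, "A"), ("b2", 2, "A"), ("b2", 3, "B"), ("b2", 4, "B"),
]
print(count_comparisons())
print(self_preference_index(toy, "a1", "A"))
```

A positive value means the judge favors its own family more often than the other judges do on the same responses. Subtracting the consensus is what lets the index separate bias from genuine quality: a family whose responses really are better raises both rates and leaves the difference near zero.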
