When the Judge Is Wrong: An LLM-as-Judge Reliability Benchmark Scored Against Ground Truth
Abstract
"LLM-as-judge" is now the default evaluation method, but the judge is itself a fallible model with its own biases. Most studies measure a judge by its agreement with humans or with other judges. Both measures are confounded, because raters and judges can share a bias and be wrong together. We instead measure judges against ground truth: each item has one known-correct answer and one plausible but wrong answer, so accuracy can be scored directly and individual biases isolated. Across five frontier judges (GPT-4o, GPT-4o-mini, GPT-4.1, Claude Sonnet 4.6, Claude Haiku 4.5) we find a clean dissociation. On 39 objective items, including common misconceptions and counterintuitive-reasoning traps, judges are near-perfect (97–100% truth-accuracy), show no position bias, are not fooled by padding the wrong answer, are perfectly self-consistent, and are well-calibrated. Yet on 29 matched-quality pairs, where both answers are fully correct and differ only in length, the same judges strongly prefer the longer answer (72–100%). A self-preference probe shows a modest own-family lean (a +13pt cross-family gap) once a length confound is controlled by differencing. The classic position bias appears solved; the classic verbosity bias is alive and strong, but surfaces only when quality is tied. Practical reading: LLM-as-judge is reliable for verifiable tasks and risky for subjective grading, where it rewards length over substance.
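As a rough illustration of what scoring against ground truth could look like, the sketch below computes truth-accuracy, a first-slot pick rate (a simple position-bias signal), and a longer-answer preference rate. It is not the benchmark's actual harness: the `Judge` callable, the `Item` record, the "A"/"B" verdict format, and the choice to show every pair in both orders are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical judge interface (an assumption, not the paper's API):
# takes (question, answer_a, answer_b) and returns the verdict "A" or "B".
Judge = Callable[[str, str, str], str]


@dataclass
class Item:
    question: str
    correct: str  # known-correct answer
    wrong: str    # plausible but wrong answer


def score_objective(judge: Judge, items: List[Item]) -> Dict[str, float]:
    """Score a judge on items with a known-correct answer, in both orders."""
    n = len(items)
    hits = 0         # verdicts that select the correct answer
    first_picks = 0  # verdicts that select slot A, whatever it holds
    consistent = 0   # items where both orders select the same underlying answer
    for it in items:
        v1 = judge(it.question, it.correct, it.wrong)  # correct answer in slot A
        v2 = judge(it.question, it.wrong, it.correct)  # correct answer in slot B
        ok1 = v1 == "A"
        ok2 = v2 == "B"
        hits += int(ok1) + int(ok2)
        first_picks += int(v1 == "A") + int(v2 == "A")
        consistent += int(ok1 == ok2)
    return {
        "truth_accuracy": hits / (2 * n),
        "first_slot_rate": first_picks / (2 * n),  # 0.5 means no slot preference
        "order_consistency": consistent / n,
    }


def longer_preference_rate(judge: Judge, pairs: List[Tuple[str, str, str]]) -> float:
    """Fraction of verdicts favoring the longer of two equally correct answers.

    Each pair is (question, short_answer, long_answer); both orders are shown.
    """
    picks = 0
    for question, short, long_ in pairs:
        picks += int(judge(question, short, long_) == "B")  # long answer in slot B
        picks += int(judge(question, long_, short) == "A")  # long answer in slot A
    return picks / (2 * len(pairs))
```

Under these assumptions, an unbiased and accurate judge would score near 1.0 on `truth_accuracy` and near 0.5 on `first_slot_rate`, while a `longer_preference_rate` well above 0.5 on matched-quality pairs would indicate the verbosity bias described above.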
Keywords
- LLM-as-judge
- evaluation
- ground truth
- verbosity bias
- position bias
- self-preference
- calibration
- reliability
- reproducibility