Open benchmark · MIT

When the Judge Is Wrong: An LLM-as-Judge Reliability Benchmark Scored Against Ground Truth

Authors: Han Kim
Papers: IOV Labs · open benchmark · 12pp · 2026-05-30

Abstract

"LLM-as-judge" is now the default evaluation method, but the judge is itself a fallible model with biases. Most studies measure a judge by its agreement with humans or with other judges. Both measures are confounded, since raters and judges can share a bias and be wrong together. We instead measure judges against ground truth: each item pairs a known-correct answer with a plausible but wrong one, so accuracy is scored directly and biases are isolated. Across five frontier judges (GPT-4o, GPT-4o-mini, GPT-4.1, Claude Sonnet 4.6, Claude Haiku 4.5) we find a clean dissociation. On 39 objective items, including common misconceptions and counterintuitive-reasoning traps, judges are near-perfect (97–100% truth-accuracy), show no position bias, are not fooled by padding the wrong answer, are perfectly self-consistent, and are well-calibrated. Yet on 29 matched-quality pairs, where both answers are fully correct and differ only in length, the same judges strongly prefer the longer one (72–100%). A self-preference probe shows a modest own-family lean (+13pt cross-family gap) once a length confound is controlled by differencing. The classic position bias appears solved; the classic verbosity bias is alive and strong, but surfaces only when quality is tied. Practical reading: LLM-as-judge is reliable for verifiable tasks and risky for subjective grading, where it rewards length over substance.
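To make the scoring idea concrete, here is a minimal sketch of how truth-accuracy, position bias, and length preference could be computed from pairwise verdicts. It is not the benchmark's actual harness: the judge signature, the function names `truth_accuracy` and `longer_rate`, the toy judge `prefers_longer`, and the choice to show every pair in both orders are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface (an assumption, not the paper's API):
# takes (question, answer_a, answer_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def truth_accuracy(judge: Judge, items: List[Tuple[str, str, str]]) -> Tuple[float, float]:
    """Score a judge on (question, correct, wrong) items.

    Each item is shown in both orders. Returns (accuracy, first_slot_rate);
    with balanced orders, a first_slot_rate near 0.5 suggests no position bias.
    """
    hits = first_picks = trials = 0
    for question, correct, wrong in items:
        for correct_first in (True, False):
            a, b = (correct, wrong) if correct_first else (wrong, correct)
            verdict = judge(question, a, b)
            # The judge is right when it picks the slot holding the correct answer.
            hits += (verdict == "A") == correct_first
            first_picks += verdict == "A"
            trials += 1
    return hits / trials, first_picks / trials


def longer_rate(judge: Judge, pairs: List[Tuple[str, str, str]]) -> float:
    """Share of verdicts favoring the longer answer on (question, short, long)
    pairs where both answers are assumed equally correct."""
    picks = trials = 0
    for question, short, long_ in pairs:
        for long_first in (True, False):
            a, b = (long_, short) if long_first else (short, long_)
            picks += (judge(question, a, b) == "A") == long_first
            trials += 1
    return picks / trials


def prefers_longer(question: str, a: str, b: str) -> str:
    """Toy stand-in for a judge with pure verbosity bias (illustration only)."""
    return "A" if len(a) >= len(b) else "B"
```

Swapping `prefers_longer` for a wrapper around a real model call would leave the scoring unchanged. Showing each pair in both orders is one simple way to keep slot position from contaminating the accuracy and length numbers.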

Keywords

LLM-as-judge
evaluation
ground truth
verbosity bias
position bias
self-preference
calibration
reliability
reproducibility
