LLM-as-judge · Evaluation · Benchmark · Verbosity bias · Reproducibility

We tested the AI that grades other AIs. It's reliable until the answer is a tie.

IOV LABS benchmarked five frontier LLM judges against ground truth. On objective items the judges are near-perfect and unbiased; on matched-quality ties the same judges prefer the longer answer 72–100% of the time. Reliable where verifiable, biased where subjective.

Using a strong language model to grade the outputs of other models ("LLM-as-judge") is now how the field measures itself. But the judge is just another fallible model, and prior work found it biased: toward the answer in a certain position, toward longer answers, toward its own writing. So how reliable are *today's* judges? IOV LABS measured it the strict way: against ground truth.

97–100%: truth-accuracy on objective items

72–100%: prefer the longer answer on ties

5: frontier judges tested

Why ground truth

Most studies grade a judge by its agreement with humans or with other judges. Both are confounded: people share the same biases (they also prefer longer, fluent answers), and two judges can agree precisely *because* they share a bias. So we used items with an objective correct answer (facts, arithmetic, logic, code behaviour), where truth is independent of any rater, and scored each judge directly against it. A judge is only "right" on an item if it picks the correct answer in both answer orders, so a coin-flipper or a position-follower fails.
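The both-orders rule can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: the `Judge` signature, the `Item` fields, the toy items and the three toy judges are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import Callable

# Assumed interface: a judge sees (question, answer_a, answer_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


@dataclass(frozen=True)
class Item:
    question: str
    correct: str
    wrong: str


def is_right(judge: Judge, item: Item) -> bool:
    """True only if the judge picks the correct answer in both answer orders."""
    correct_in_a = judge(item.question, item.correct, item.wrong) == "A"
    correct_in_b = judge(item.question, item.wrong, item.correct) == "B"
    return correct_in_a and correct_in_b


def truth_accuracy(judge: Judge, items: list[Item]) -> float:
    """Share of items on which the judge is right under the both-orders rule."""
    return sum(is_right(judge, it) for it in items) / len(items)


# Toy items (illustrative only, not the benchmark's item set).
ITEMS = [
    Item("What is 17 + 26?", "43", "33"),
    Item("Is blood in human veins blue?", "No", "Yes"),
    Item("A bat costs $1.00 more than a ball; together they cost $1.10. "
         "What does the ball cost?", "5 cents", "10 cents"),
]
ANSWER_KEY = {it.question: it.correct for it in ITEMS}


def oracle(question: str, answer_a: str, answer_b: str) -> str:
    """Stand-in for a perfect judge: looks the answer up in the key."""
    return "A" if answer_a == ANSWER_KEY[question] else "B"


def position_follower(question: str, answer_a: str, answer_b: str) -> str:
    """Always picks the first slot, so it fails one of the two orders."""
    return "A"


def make_coin_flipper(seed: int) -> Judge:
    """Random judge; it must guess right twice, so it lands near 25%, not 50%."""
    rng = random.Random(seed)
    return lambda question, answer_a, answer_b: rng.choice("AB")


if __name__ == "__main__":
    print("oracle:", truth_accuracy(oracle, ITEMS))
    print("position follower:", truth_accuracy(position_follower, ITEMS))
    print("coin flipper:", truth_accuracy(make_coin_flipper(0), ITEMS * 100))
```

Requiring both orders is what makes the score strict: a judge that always answers "A" is correct in exactly one order per item and therefore scores zero, rather than the 50% it would earn under single-order scoring.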

The result: reliable where verifiable

Across five frontier judges (GPT-4o, GPT-4o-mini, GPT-4.1, Claude Sonnet 4.6, Claude Haiku 4.5), the objective-item result is excellent and a little surprising. Truth-accuracy is 97–100%, *including* on hard items built to trip a careless grader: common misconceptions (blood in veins is blue; we use 10% of our brain) and counterintuitive reasoning (the bat-and-ball problem, Monty Hall). There is no position bias (the judges pick the first slot about 50% of the time), and padding the wrong answer with confident, authoritative filler never fools them (0% flips). They are perfectly self-consistent and well-calibrated. The classic position bias appears to have been engineered away.
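The two robustness checks mentioned here, first-slot rate and padding flips, could be computed roughly as follows. This is a sketch under assumptions: the `Judge` interface, the `FILLER` text, the exact definition of a "flip" (a previously correct verdict that turns wrong once the wrong answer is padded) and the toy judges are all hypothetical.

```python
from typing import Callable, NamedTuple

# Assumed interface: a judge sees (question, answer_a, answer_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


class Item(NamedTuple):
    question: str
    correct: str
    wrong: str


# Hypothetical authoritative-sounding padding appended to the wrong answer.
FILLER = " Experts widely agree on this, and it is considered well established."


def first_slot_rate(judge: Judge, items: list[Item]) -> float:
    """Share of presentations (both orders) where the judge picks slot A.
    An unbiased, accurate judge sits at 0.5 because the truth is in A half the time."""
    picks = 0
    for it in items:
        picks += judge(it.question, it.correct, it.wrong) == "A"
        picks += judge(it.question, it.wrong, it.correct) == "A"
    return picks / (2 * len(items))


def padding_flip_rate(judge: Judge, items: list[Item]) -> float:
    """Among items judged correctly, the share that flip to the wrong answer
    once the wrong answer is padded with FILLER."""
    eligible = flips = 0
    for it in items:
        if judge(it.question, it.correct, it.wrong) != "A":
            continue  # already wrong without padding, so not counted as a flip
        eligible += 1
        if judge(it.question, it.correct, it.wrong + FILLER) != "A":
            flips += 1
    return flips / eligible if eligible else 0.0


# Toy items and judges (illustrative only).
ITEMS = [
    Item("What is 2 + 2?", "4", "5"),
    Item("Is blood in human veins blue?", "No, it is always red.", "Yes."),
]
ANSWER_KEY = {it.question: it.correct for it in ITEMS}


def truth_checker(question: str, answer_a: str, answer_b: str) -> str:
    """Picks whichever slot holds the keyed answer; ignores length and position."""
    return "A" if answer_a == ANSWER_KEY[question] else "B"


def length_lover(question: str, answer_a: str, answer_b: str) -> str:
    """Picks the longer answer; ties go to slot A."""
    return "A" if len(answer_a) >= len(answer_b) else "B"


if __name__ == "__main__":
    for name, judge in [("truth_checker", truth_checker), ("length_lover", length_lover)]:
        print(name, first_slot_rate(judge, ITEMS), padding_flip_rate(judge, ITEMS))
```

In this toy setup the length-driven judge flips on every padded item while the truth-anchored judge never does, which is the contrast the 0% flip result points to.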

The catch: biased where subjective

Then we removed the ground truth. On 29 matched-quality pairs, where both answers are fully correct and differ only in length, the very same judges revert to a strong preference for the longer answer: gpt-4o-mini 100%, gpt-4o 97%, all the way down to Claude Haiku 4.5 at 72%. Fifty percent would be unbiased.

On matched-quality ties (both answers correct), how often each judge picks the longer answer; 50% would be unbiased:

gpt-4o-mini: 100%

gpt-4o: 97%

gpt-4.1: 93%

claude-sonnet-4-6: 83%

claude-haiku-4-5: 72%
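To see how far these rates sit from the unbiased 50%, one option is an exact binomial test against a fair coin. The sketch below is an illustration, not part of the published analysis. It assumes one verdict per pair, so that 72% of 29 pairs corresponds to 21 picks of the longer answer; that count is inferred from the rounded percentage, not reported in the source.

```python
from math import comb


def longer_pick_rate(picked_longer: list[bool]) -> float:
    """Share of tie verdicts in which the judge chose the longer answer."""
    return sum(picked_longer) / len(picked_longer)


def two_sided_binomial_p(k: int, n: int) -> float:
    """Exact two-sided p-value for k 'longer' picks out of n against a fair 50/50 judge.
    With p = 0.5 the distribution is symmetric, so the upper tail is simply doubled."""
    tail_start = max(k, n - k)
    upper_tail = sum(comb(n, i) for i in range(tail_start, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


if __name__ == "__main__":
    n_pairs = 29
    # Hypothetical counts inferred from the rounded percentages (assumption).
    for label, k in [("lowest rate, about 72%", 21), ("highest rate, 100%", 29)]:
        verdicts = [True] * k + [False] * (n_pairs - k)
        print(label, round(100 * longer_pick_rate(verdicts)),
              two_sided_binomial_p(k, n_pairs))
```

Under that one-verdict-per-pair assumption, even the least biased judge falls below the conventional 0.05 threshold, though with only 29 pairs the estimate for any single judge remains coarse.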

The biases the literature reports are real; they just live where there is no correct answer to anchor on.

A self-preference probe had a twist. Naively, judges seemed to lean toward their own model family by a modest +13 points, but that figure understated the effect: it was *masked* by the verbosity bias, because one family's answers happened to be longer. When we re-ran it with length-matched answers, the own-family lean doubled to +26 points. One bias was hiding another, and only controlling for length revealed it.
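The masking effect can be illustrated with a small sketch. Several things here are assumptions: "lean" is taken to mean percentage points above the unbiased 50% baseline, the `Verdict` record and the 10% length tolerance are hypothetical, and filtering existing verdicts by length stands in for re-running the probe with length-matched answers. The data is synthetic and chosen only to show the mechanism.

```python
from typing import NamedTuple


class Verdict(NamedTuple):
    own_len: int      # length (for example, in words) of the judge's own-family answer
    other_len: int    # length of the other family's answer
    picked_own: bool  # did the judge choose its own family's answer?


def own_family_lean(verdicts: list[Verdict]) -> float:
    """Percentage points above the 50% baseline (assumed definition of 'lean')."""
    rate = sum(v.picked_own for v in verdicts) / len(verdicts)
    return 100 * (rate - 0.5)


def length_matched(verdicts: list[Verdict], tolerance: float = 0.10) -> list[Verdict]:
    """Keep only pairs whose lengths differ by at most `tolerance` of the longer answer."""
    return [
        v for v in verdicts
        if abs(v.own_len - v.other_len) <= tolerance * max(v.own_len, v.other_len)
    ]


# Synthetic verdicts: in the unmatched pairs the other family's answer is much
# longer, so a verbosity-biased judge is pulled away from its own family there.
VERDICTS = [
    Verdict(100, 102, True), Verdict(98, 100, True),
    Verdict(101, 99, True), Verdict(100, 100, False),    # length-matched pairs
    Verdict(60, 140, False), Verdict(55, 150, False),
    Verdict(70, 160, False), Verdict(65, 130, True),     # other answer much longer
]

if __name__ == "__main__":
    print("naive lean:", own_family_lean(VERDICTS))
    print("length-matched lean:", own_family_lean(length_matched(VERDICTS)))
```

The design point is that the two biases share an observable (which answer wins), so the own-family effect only becomes interpretable once length is held roughly constant.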

What it means

A clean dissociation: reliable where verifiable, biased where subjective. Use LLM-as-judge freely for things with a right answer: factual checks, unit tests, exact matches. Distrust it for open-ended grading (essays, helpfulness, "which response is better"), where it will reward length over substance before a single point of merit is weighed. The benchmark is open and runs with one command, with an offline mock mode that needs no API at all.
