Using a strong language model to grade the outputs of other models ("LLM-as-judge") is now how the field measures itself. But the judge is just another fallible model, and prior work found it biased: toward the answer in a certain position, toward longer answers, toward its own writing. So how reliable are *today's* judges? IOV LABS measured it the strict way: against ground truth.
Why ground truth
Most studies grade a judge by its agreement with humans or with other judges. Both are confounded: people share the same biases (they also prefer longer, fluent answers), and two judges can agree precisely *because* they share a bias. So we used items with an objective correct answer (facts, arithmetic, logic, code behaviour), where truth is independent of any rater, and scored each judge directly against it. A judge is only "right" on an item if it picks the correct answer in both answer orders, so a coin-flipper or a position-follower fails.
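The both-orders rule can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual code: the `Judge` callable, the `"A"`/`"B"` return convention, and the toy items and judges are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface (an assumption for this sketch):
# takes (question, answer_in_slot_A, answer_in_slot_B) and returns "A" or "B".
Judge = Callable[[str, str, str], str]

# Each item is (question, correct_answer, wrong_answer).
Item = Tuple[str, str, str]


def strict_correct(judge: Judge, question: str, correct: str, wrong: str) -> bool:
    """True only if the judge picks the correct answer in both slot orders."""
    first = judge(question, correct, wrong)    # correct answer shown in slot A
    second = judge(question, wrong, correct)   # correct answer shown in slot B
    return first == "A" and second == "B"


def truth_accuracy(judge: Judge, items: List[Item]) -> float:
    """Fraction of items the judge gets right under the both-orders rule."""
    hits = sum(strict_correct(judge, q, c, w) for q, c, w in items)
    return hits / len(items)


# Toy data, invented for illustration only.
ITEMS: List[Item] = [
    ("What is 7 * 8?", "56", "54"),
    ("Is blood in veins blue?", "No", "Yes"),
]
ANSWER_KEY = {q: c for q, c, _ in ITEMS}


def always_first(question: str, a: str, b: str) -> str:
    """A pure position-follower: always prefers slot A."""
    return "A"


def oracle(question: str, a: str, b: str) -> str:
    """A judge that knows the answer key, whatever the order."""
    return "A" if a == ANSWER_KEY[question] else "B"


position_follower_score = truth_accuracy(always_first, ITEMS)
oracle_score = truth_accuracy(oracle, ITEMS)
```

The position-follower is right in one order and wrong in the other on every item, so it never earns credit. Under the same rule a coin-flipper gets an item right only about a quarter of the time, rather than half.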
The result: reliable where verifiable
Across five frontier judges (GPT-4o, GPT-4o-mini, GPT-4.1, Claude Sonnet 4.6, Claude Haiku 4.5), the objective-item result is excellent and a little surprising. Truth-accuracy is 97–100%, *including* on hard items built to trip a careless grader: common misconceptions (blood in veins is blue; we use 10% of our brain) and counterintuitive reasoning (the bat-and-ball problem, Monty Hall). There is no position bias (~50% first-slot), and padding the wrong answer with confident, authoritative filler never fools them (0% flips). They are perfectly self-consistent and well-calibrated. The classic position bias appears to have been engineered away.
The catch: biased where subjective
Then we removed the ground truth. On 29 matched-quality pairs, where both answers are fully correct and differ only in length, the very same judges revert to a strong preference for the longer answer: GPT-4o-mini 100%, GPT-4o 97%, all the way down to Claude Haiku 4.5 at 72%. Fifty percent would be unbiased.
The biases the literature reports are real; they just live where there is no correct answer to anchor on.
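To make the "fifty percent would be unbiased" baseline concrete, here is a small sketch of how a longer-answer preference rate could be computed and compared with chance. The function names are hypothetical, the ten verdicts are invented toy data rather than benchmark results, and the exact binomial test is one reasonable choice of check, not necessarily the one used in the study.

```python
from math import comb
from typing import List


def longer_preference_rate(picked_longer: List[bool]) -> float:
    """Share of matched-quality pairs where the judge chose the longer answer."""
    return sum(picked_longer) / len(picked_longer)


def two_sided_binomial_p(k: int, n: int) -> float:
    """Exact two-sided p-value for k 'longer' picks out of n under p = 0.5.

    With p = 0.5 the distribution is symmetric, so the two-sided value is
    twice the upper tail starting at the more extreme of k and n - k.
    """
    extreme = max(k, n - k)
    tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy verdicts (invented): the judge picked the longer answer 9 times out of 10.
toy_verdicts = [True] * 9 + [False]

rate = longer_preference_rate(toy_verdicts)
p_value = two_sided_binomial_p(sum(toy_verdicts), len(toy_verdicts))
```

A rate near 0.5 with a large p-value is what an unbiased judge would produce on pairs that differ only in length. A rate far above 0.5 with a small p-value points to a systematic preference.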
A self-preference probe had a twist. Naively, judges seemed to lean toward their own model family by a modest +13 points, but that figure was *masked* by the verbosity bias, because one family's answers happened to be longer. When we re-ran the probe with length-matched answers, the own-family lean doubled to +26 points: one bias was hiding another, and only controlling length revealed it.
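The masking effect can be reproduced on toy data. The sketch below makes several assumptions that the text does not confirm: the "lean" is taken to be the own-family pick rate minus 50%, in percentage points; length is measured in whitespace-separated words; and length matching is approximated by filtering pairs within a tolerance, whereas the study may instead have regenerated answers at matched length. The `Verdict` record and both functions are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Verdict:
    own_answer: str     # answer written by the judge's own model family
    other_answer: str   # answer written by the other family
    picked_own: bool    # did the judge prefer its own family's answer?


def own_family_lean(verdicts: List[Verdict]) -> float:
    """Own-family pick rate minus 50%, in percentage points (assumed definition)."""
    rate = sum(v.picked_own for v in verdicts) / len(verdicts)
    return 100 * (rate - 0.5)


def length_matched(verdicts: List[Verdict], tolerance: float = 0.10) -> List[Verdict]:
    """Keep only pairs whose word counts differ by at most `tolerance` (relative)."""
    kept = []
    for v in verdicts:
        own_len = len(v.own_answer.split())
        other_len = len(v.other_answer.split())
        if abs(own_len - other_len) <= tolerance * max(own_len, other_len):
            kept.append(v)
    return kept


# Toy verdicts (invented). In two pairs the other family's answer is much
# longer, and the judge's preference for length overrides its own-family lean.
toy = [
    Verdict("a b c d e", "v w x y z", picked_own=True),
    Verdict("f g h i j", "q r s t u", picked_own=True),
    Verdict("a b", "q r s t u v w x", picked_own=False),
    Verdict("c d", "k l m n o p q r", picked_own=False),
]

naive_lean = own_family_lean(toy)                     # length left uncontrolled
matched_lean = own_family_lean(length_matched(toy))   # length controlled
```

In this toy set the naive lean is zero because the two biases cancel, and the lean appears only once the pairs of unequal length are set aside.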
What it means
A clean dissociation: reliable where verifiable, biased where subjective. Use LLM-as-judge freely for things with a right answer: factual checks, unit tests, exact matches. Distrust it for open-ended grading (essays, helpfulness, "which response is better"), where it will reward length over substance before a single point of merit is weighed. The benchmark is open and runs with one command, with an offline mock mode that needs no API at all.