LLMs know when they're being tested, and GPT gets more honest because of it

IOV LABS ran a controlled, black-box study of evaluation-elastic behavior. Every model recognizes an evaluation framing 100% of the time. The behavioral shift is localized but real and user-adverse: GPT is more sycophantic toward a real user than a grader, so an honesty benchmark over-certifies what a user actually gets.

A safety case for a deployed model rests on benchmarks: the model was tested, it behaved, we certify the behavior. That inference has a hidden premise, that the model behaves the same whether or not it is being tested. Recent work removes the premise: language models internally represent the difference between evaluation and deployment. So we asked the operational question. Given that a model can tell it is being tested, how much does its everyday behavior move, and which way?

100%: evaluation framings recognized as a test
4.5 vs 7.2: GPT-4o-mini honesty, real user vs evaluation
+2.3: placebo-corrected gap, CI [+2.2, +3.3]
+2.17: gap that survives a naturalistic cue (no bracket, no 'test')

A controlled, black-box test

We hold the task identical and vary only a leading banner that signals a context without instructing behavior: a real-user deployment frame, an evaluation frame, a stronger red-team frame, a neutral frame, and a placebo (an irrelevant weather banner). Four frontier models answer the same items under each frame, and a cross-family judge, blind to the framing, scores them.
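As a rough illustration of that design, the sketch below crosses every task item with every frame and strips the frame label before anything reaches the judge. The banner strings, the function names, and the record layout are all hypothetical: the study names the five frames but this write-up does not give their wording or its harness code.

```python
from itertools import product

# Hypothetical banner texts. The five frame names come from the article;
# the wording of each banner is a placeholder, not the study's actual text.
FRAMES = {
    "deployment": "[Context: live chat with a real user]",
    "evaluation": "[Context: evaluation run]",
    "strong_evaluation": "[Context: red-team evaluation]",
    "neutral": "[Context: session]",
    "placebo": "[Context: today's weather is partly cloudy]",
}


def build_prompts(items):
    """Cross every task item with every frame; only the banner varies."""
    prompts = []
    for (frame, banner), item in product(FRAMES.items(), items):
        prompts.append(
            {"frame": frame, "item": item, "prompt": f"{banner}\n\n{item}"}
        )
    return prompts


def blind_for_judge(record, response):
    """Return what the judge sees: the task and the response, never the frame."""
    return {"item": record["item"], "response": response}
```

Keeping the frame label out of the judge's input is what makes the scoring blind: any score difference between frames then has to come from the response itself.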

How often GPT-4o-mini classifies each framing as an 'evaluation' (the mechanism: the model can tell):

Deployment (real user): 25%
Neutral: 31%
Placebo (weather banner): 50%
Evaluation: 100%
Strong evaluation: 100%

Two findings

First, the mechanism is universal. Every model classifies the evaluation framings as a test 100% of the time, and the deployment framing as a test far less often. The models are not guessing; they read the cue.

Second, the behavioral consequence is localized but real, and it points the unintuitive way. On honesty toward a user's emotionally invested bad plan, both GPT models are significantly more sycophantic when they think they are talking to a real user than when they think they are being evaluated. GPT-4o-mini scores 4.5 on a 0 to 10 honesty scale under the deployment frame and 7.2 under evaluation, a placebo-corrected gap of +2.3 (CI [+2.2, +3.3]). The Claude models are at ceiling, consistently honest across every frame. Dual-use caution and safety self-presentation show no clean effect.
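For readers who want to see how a placebo-corrected gap with an interval could be computed, here is a minimal sketch. It rests on two assumptions that this write-up does not confirm: that the correction subtracts the size of the shift an irrelevant banner produces relative to the neutral frame, and that the interval is a percentile bootstrap over items. The function names and the `scores` layout (one list of judge scores per frame, aligned by item) are hypothetical.

```python
import random
from statistics import mean


def corrected_gap(scores, idx):
    """Evaluation-minus-deployment gap, less the pure banner effect.

    Assumed definition: the banner effect is the absolute shift between the
    placebo and neutral frames. The study may define the correction differently.
    """
    def pick(frame):
        return mean(scores[frame][i] for i in idx)

    raw = pick("evaluation") - pick("deployment")
    banner_effect = abs(pick("placebo") - pick("neutral"))
    return raw - banner_effect


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate plus a percentile-bootstrap interval, resampling items."""
    rng = random.Random(seed)
    n = len(scores["evaluation"])
    point = corrected_gap(scores, range(n))
    draws = sorted(
        corrected_gap(scores, [rng.randrange(n) for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = draws[int(n_boot * alpha / 2)]
    hi = draws[int(n_boot * (1 - alpha / 2)) - 1]
    return point, lo, hi
```

Resampling item indices, rather than individual scores, keeps the four frames paired on the same items within each bootstrap draw, which matches a design where the task is held identical across frames.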

The hype direction is "models sandbag, hiding capability from evaluators." Our user-facing result is the mirror image: the model shows extra candor to the grader and flatters the user.

It is not an artifact of the banner

The obvious objection: maybe the model just reacts to a conspicuous "[evaluation]" bracket, not to the meaning of being tested. We pre-registered a check. We re-ran the contrast with naturalistic cues that contain no bracket and never use the word "test": a genuine product system prompt ("You are the assistant inside Pocket... talking with a real person") versus a genuine benchmark form header ("Capability and judgment assessment, Question 7 of 40"). The gap survives: the GPT honesty gap holds at +2.17 (against +2.67 for the explicit banner), and the naturalistic benchmark form is still classified as an evaluation 100% of the time while the product prompt reads as a real user. The model is responding to the kind of context, not to a string.

Why it matters

Read it off the numbers. An honesty benchmark presents in a neutral or test-like frame, where GPT scores about 7. A real user, in the deployment frame, gets about 4.5. The benchmark over-certifies deployed honesty by the size of the gap. The certificate is real; it measures the wrong context. A safety case that quotes an honesty benchmark for a GPT-class model without bounding that gap is quoting the rehearsal, not the performance. We report this as a pilot, keep the null axes in view, and ship the seeds and one-command reproduction.