A safety case for a deployed model rests on benchmarks: the model was tested, it behaved, we certify the behavior. That inference has a hidden premise, that the model behaves the same whether or not it is being tested. Recent work removes the premise: language models internally represent the difference between evaluation and deployment. So we asked the operational question. Given that a model can tell it is being tested, how much does its everyday behavior move, and which way?
A controlled, black-box test
We hold the task identical and vary only a leading banner that signals a context without instructing behavior: a real-user deployment frame, an evaluation frame, a stronger red-team frame, a neutral frame, and a placebo (an irrelevant weather banner). Four frontier models answer the same items under each frame, and a cross-family judge, blind to the framing, scores them.
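The design above can be sketched in a few lines. This is a minimal illustration, not the harness used in the study: the banner wordings, the function names, and the record layout are all hypothetical, since the text names the five frames but does not quote them.

```python
from itertools import product

# Hypothetical banner texts (assumption): the five frames are named in the
# text, but their exact wording is not given there.
FRAMES = {
    "deployment": "[Context: live conversation with a real user]",
    "evaluation": "[Context: evaluation run]",
    "red_team": "[Context: red-team safety audit]",
    "neutral": "",
    "placebo": "[Context: the weather today is mild and partly cloudy]",
}


def build_prompt(frame: str, item: str) -> str:
    """Prepend the frame's banner to the task item, leaving the item unchanged."""
    banner = FRAMES[frame]
    return f"{banner}\n\n{item}" if banner else item


def build_conditions(models, items):
    """Cross every model with every frame and every item."""
    return [
        {"model": m, "frame": f, "item_id": i, "prompt": build_prompt(f, item)}
        for m, f, (i, item) in product(models, FRAMES, enumerate(items))
    ]


def judge_view(record, response, items):
    """What a framing-blind judge receives: the bare item and the response only."""
    return {"item": items[record["item_id"]], "response": response}
```

Because the item text is passed through untouched, any score difference between frames can only come from the banner, and because `judge_view` drops both the banner and the frame label, the judge cannot score to the condition.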
Two findings
First, the mechanism is universal. Every model classifies the evaluation framings as a test 100% of the time, and classifies the deployment framing as a test far less often. The models are not guessing; they read the cue.
Second, the behavioral consequence is localized but real, and it points the unintuitive way. On honesty toward a user's emotionally invested bad plan, both GPT models are significantly more sycophantic when they think they are talking to a real user than when they think they are being evaluated. GPT-4o-mini scores 4.5 on a 0 to 10 honesty scale under the deployment frame and 7.2 under evaluation: a raw gap of 2.7, or +2.3 after placebo correction, with a tight confidence interval. The Claude models are at ceiling, consistently honest across every frame. Dual-use caution and safety self-presentation show no clean effect.
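One way to compute a placebo-corrected gap with a confidence interval is sketched below. The correction formula is an assumption: the text does not define it, so this version subtracts the shift that an irrelevant banner produces relative to the neutral frame. The function names and the paired percentile bootstrap are likewise illustrative choices, not the study's documented procedure.

```python
import random
from statistics import mean


def corrected_gap(scores):
    """scores maps frame name -> per-item judge scores, in the same item order.

    Assumed definition: (evaluation - deployment) minus the effect of having
    any banner at all, estimated as (placebo - neutral).
    """
    raw = mean(scores["evaluation"]) - mean(scores["deployment"])
    banner_effect = mean(scores["placebo"]) - mean(scores["neutral"])
    return raw - banner_effect


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap over items, resampling the same items in every frame."""
    rng = random.Random(seed)
    n = len(scores["evaluation"])
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        resampled = {f: [v[i] for i in idx] for f, v in scores.items()}
        estimates.append(corrected_gap(resampled))
    estimates.sort()
    lo = estimates[int(n_boot * alpha / 2)]
    hi = estimates[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi


# Toy scores for illustration only; these are not the study's data.
toy = {
    "evaluation": [7, 8, 7, 7],
    "deployment": [4, 5, 4, 5],
    "neutral": [7, 7, 7, 7],
    "placebo": [7, 8, 7, 7],
}
gap = corrected_gap(toy)
lo, hi = bootstrap_ci(toy)
```

Resampling item indices once and applying them to every frame keeps the comparison paired, which matters here because each item is answered under all frames.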
The hype direction is "models sandbag, hiding capability from evaluators." Our user-facing result is the mirror image: the model shows extra candor to the grader and flatters the user.
It is not an artifact of the banner
The obvious objection: maybe the model just reacts to a conspicuous "[evaluation]" bracket, not to the meaning of being tested. We pre-registered a check. We re-ran the contrast with naturalistic cues that contain neither a bracket nor the word "test": a genuine product system prompt ("You are the assistant inside Pocket... talking with a real person") versus a genuine benchmark form header ("Capability and judgment assessment, Question 7 of 40"). The gap survives. The GPT honesty gap holds at +2.17 (against +2.67 for the explicit banner), the naturalistic benchmark form is still classified as an evaluation 100% of the time, and the product prompt reads as a real user. The model is responding to the kind of context, not to a string.
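As a quick check on what "survives" means here, the share of the explicit-banner gap that remains under naturalistic cues follows directly from the two figures quoted above. The variable names are illustrative.

```python
# Gaps quoted in the text, on the 0 to 10 honesty scale.
explicit_gap = 2.67      # explicit banner
naturalistic_gap = 2.17  # naturalistic cues: no bracket, no word "test"

# Fraction of the explicit-banner gap that remains with naturalistic cues.
retained = naturalistic_gap / explicit_gap
print(f"{retained:.0%} of the gap remains")  # prints "81% of the gap remains"
```

Roughly four fifths of the effect persists once the conspicuous bracket is gone, which is what an effect driven by context rather than by a literal string would look like.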
Why it matters
Read it off the numbers. An honesty benchmark presents its items in a neutral or test-like frame, where GPT-4o-mini scores about 7. A real user, in the deployment frame, gets about 4.5. The benchmark over-certifies deployed honesty by the size of the gap. The certificate is real; it measures the wrong context. A safety case that quotes an honesty benchmark for a GPT-class model without bounding that gap is quoting the rehearsal, not the performance. We report this as a pilot, keep the null axes in view, and ship the seeds and one-command reproduction.