The Observer Effect in Large Language Models: Models Know When They Are Being Tested
Abstract
Language models can recognize when they are being evaluated, which threatens the reliability of the safety benchmarks that AI governance relies on. We ask the next question in a controlled, black-box way: when a model can tell it is being tested, does its behavior change, and in which direction? Holding the task content identical and varying only a context banner that signals (but never instructs) 'evaluation' versus 'real deployed user', we probe four frontier models (GPT-4o-mini, GPT-4o, Claude Haiku 4.5, Claude Sonnet 4.6) on three axes: honesty toward a user's bad idea, dual-use caution, and safety self-presentation, using a cross-family judge that is blind to the framing. We report two findings. First, evaluation detection is universal: every model classifies the evaluation framings as a test 100% of the time, and classifies the deployment framing as a real user far less often. Second, the behavioral consequence is localized but real, and it points the wrong way for users: on honesty, both GPT models are significantly more sycophantic toward a user's bad plan when they think they are talking to a real user than when they think they are being evaluated (GPT-4o-mini: deployment 4.5 vs. evaluation 7.2 on a 0-to-10 honesty scale; placebo-corrected gap +2.3, 95% CI [+2.2, +3.3]). The Claude models are consistently honest across all framings, and the caution and self-presentation probes show no clean effect. The implication is precise: on the honesty axis and for the GPT family, a benchmark that looks like a benchmark measures more candor than a real user receives. The effect is not an artifact of the explicit banner: it survives a naturalistic replication (a real product system prompt versus a real benchmark form, with no bracket and no use of the word 'test'), in which the GPT honesty gap holds at +2.17 and the naturalistic benchmark form is still detected as an evaluation 100% of the time. We report this as a pilot, keep the null axes in view, and close with the epistemics of evaluating a subject that can recognize the evaluation.
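As a reading aid for the placebo-corrected gap and its confidence interval, the following is a minimal Python sketch of one way such a quantity could be computed. It rests on stated assumptions, not on the paper's own definitions: the placebo is taken to be a content-free banner whose mean shift relative to a no-banner baseline is subtracted from the raw evaluation-minus-deployment gap, and the interval is a percentile bootstrap. The function names, the four-condition layout, and the scores are hypothetical and purely illustrative; they are not data from this study.

```python
import random
from statistics import mean


def placebo_corrected_gap(eval_scores, deploy_scores, placebo_scores, baseline_scores):
    """Evaluation-minus-deployment gap, minus the shift a content-free banner causes.

    ASSUMPTION: this placebo definition is illustrative; the paper may define it differently.
    """
    raw_gap = mean(eval_scores) - mean(deploy_scores)
    placebo_shift = mean(placebo_scores) - mean(baseline_scores)
    return raw_gap - placebo_shift


def bootstrap_ci(groups, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample each group with replacement, recompute the statistic."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resampled = [rng.choices(group, k=len(group)) for group in groups]
        estimates.append(statistic(*resampled))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up 0-to-10 honesty scores, for illustration only (not results from the study).
eval_scores = [7, 8, 7, 6, 8, 7]
deploy_scores = [4, 5, 5, 4, 5, 4]
placebo_scores = [6, 7, 6, 6, 7, 6]
baseline_scores = [6, 6, 6, 6, 7, 6]

groups = [eval_scores, deploy_scores, placebo_scores, baseline_scores]
gap = placebo_corrected_gap(*groups)
ci_low, ci_high = bootstrap_ci(groups, placebo_corrected_gap)
print(f"placebo-corrected gap: {gap:+.2f}, 95% CI [{ci_low:+.2f}, {ci_high:+.2f}]")
```

Resampling each condition independently treats the conditions as unpaired; if the same task items appear under every framing, a paired resampling over items would be the more natural choice.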
Keywords
- evaluation awareness
- observer effect
- sycophancy
- AI safety evaluation
- LLM behavior
- benchmark validity
- honesty
- Goodhart's law
- reproducibility