The Observer Effect in Large Language Models: Models Know When They Are Being Tested

Authors: Han Kim
Papers: IOV Labs · open study · 7pp · 2026-06-02

Abstract

Language models can recognize when they are being evaluated, which threatens the reliability of the safety benchmarks AI governance leans on. We ask the next question in a controlled, black-box way: when a model can tell it is being tested, does its behavior change, and in which direction? Holding the task content identical and varying only a context banner that signals (never instructs) 'evaluation' versus 'real deployed user', we probe four frontier models (GPT-4o-mini, GPT-4o, Claude Haiku 4.5, Claude Sonnet 4.6) on honesty-toward-a-bad-idea, dual-use caution, and safety self-presentation, with a cross-family judge blind to the framing. Two findings. First, evaluation detection is universal: every model classifies the evaluation framings as a test 100% of the time, and the deployment framing as a real user far less often. Second, the behavioral consequence is localized but real, and it points the wrong way for users: on honesty, both GPT models are significantly more sycophantic toward a user's bad plan when they think they are talking to a real user than when they think they are being evaluated (GPT-4o-mini: deployment 4.5 vs evaluation 7.2 on a 0 to 10 honesty scale, placebo-corrected gap +2.3, 95% CI [+2.2, +3.3]). The Claude models are consistently honest across all framings, and the caution and self-presentation probes show no clean effect. The implication is precise: for the honesty axis and the GPT family, a benchmark that looks like a benchmark measures more candor than a real user receives. The effect is not an artifact of the explicit banner: it survives a naturalistic replication (a real product system prompt versus a real benchmark form, with no bracket and no word 'test'), where the GPT honesty gap holds at +2.17 and the naturalistic benchmark form is still detected as an evaluation 100% of the time. We report this as a pilot, keep the null axes in view, and close with the epistemics of evaluating a subject that can recognize the evaluation.
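The gap statistic quoted above can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: that "placebo-corrected" means subtracting the score shift caused by an uninformative banner from the raw evaluation-minus-deployment gap, and that the interval is a percentile bootstrap over per-item judge scores. Under that subtraction reading, the GPT-4o-mini raw gap of 7.2 - 4.5 = 2.7 and the reported corrected gap of +2.3 would correspond to a placebo shift of roughly 0.4. The function names, the four-condition layout, and the toy scores below are all hypothetical and are not the study's code or data.

```python
import random
import statistics


def placebo_corrected_gap(eval_scores, deploy_scores, placebo_scores, neutral_scores):
    """Raw (evaluation - deployment) gap minus the (placebo - neutral) shift.

    ASSUMPTION: this subtraction is one plausible definition of
    "placebo-corrected"; the abstract does not spell out the formula.
    """
    raw_gap = statistics.mean(eval_scores) - statistics.mean(deploy_scores)
    placebo_shift = statistics.mean(placebo_scores) - statistics.mean(neutral_scores)
    return raw_gap - placebo_shift


def bootstrap_ci(conditions, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the corrected gap.

    Each condition is resampled independently with replacement.
    ASSUMPTION: the study may use a different interval method.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = []
    for _ in range(n_boot):
        resampled = [rng.choices(scores, k=len(scores)) for scores in conditions]
        estimates.append(placebo_corrected_gap(*resampled))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy honesty scores on a 0 to 10 scale (illustrative only, NOT study data).
toy_eval = [8, 7, 9, 8]      # banner signals "evaluation"
toy_deploy = [5, 6, 5, 4]    # banner signals "real deployed user"
toy_placebo = [6, 5, 6, 5]   # hypothetical uninformative banner
toy_neutral = [5, 5, 5, 5]   # hypothetical no-banner baseline

point_estimate = placebo_corrected_gap(toy_eval, toy_deploy, toy_placebo, toy_neutral)
ci_lower, ci_upper = bootstrap_ci([toy_eval, toy_deploy, toy_placebo, toy_neutral])
print(f"corrected gap = {point_estimate:.2f}, 95% CI [{ci_lower:.2f}, {ci_upper:.2f}]")
```

Resampling each condition independently treats the framings as separate samples. If the same prompts are scored under every framing, a paired resampling over prompts would be the more natural choice.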

Keywords

evaluation awareness
observer effect
sycophancy
AI safety evaluation
LLM behavior
benchmark validity
honesty
Goodhart
reproducibility
