Fluency Is Not Foresight: A Calibration Audit of LLM Forecasting
Abstract
A language model will readily attach a probability to a future event, and it will sound like a forecaster while doing so. We test whether that number carries information in the only contamination-proof way: by scoring models solely on events that resolve after their training cutoff, where there is no answer to retrieve and a forecast must be reasoned out. Across a balanced 48-question battery of resolved world events (2024 to 2026) and the 16 races of the 2026 Korean local election, four frontier models give probability forecasts that we score with the Brier rule, a reliability diagram, and an overconfidence index. We report three findings. First, post-cutoff forecasts are barely more accurate than a coin flip and are overconfident: accuracy is 54 percent, and the pooled Brier score of 0.296 is worse than the 0.25 earned by always saying fifty percent. Second, remembering is not forecasting: the same questions, scored as retrieval for a model whose cutoff postdates the event, yield a near-perfect Brier score of 0.026, an order of magnitude better than forecasting, which both validates the scoring and shows that only post-cutoff items measure foresight. Third, on a real election that postdates every model's cutoff, a simple pre-registered statistical model (Brier 0.100) beats every LLM (best 0.156), though handing the models the polls closes much of the gap. Forecasting fluency is a stylistic artifact, not a capability.
Keywords
- forecasting
- calibration
- Brier score
- contamination
- knowledge cutoff
- proper scoring rules
- single-event probability
- LLM evaluation
- reproducibility
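
For readers unfamiliar with the Brier rule named in the abstract, the following is a minimal sketch of the standard binary form (mean squared difference between forecast probability and the 0/1 outcome). It is not the paper's scoring code: the function name `brier_score` and the four toy outcomes are hypothetical, chosen only to show why a constant fifty-percent forecast scores exactly 0.25 and how confident misses push the score above that baseline.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better: 0.0 is a perfect forecast, 1.0 is maximally wrong.
    """
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical resolved outcomes (1 = event happened), for illustration only.
outcomes = [1, 0, 1, 0]

# Always saying fifty percent: each term is 0.5 ** 2 = 0.25, whatever the outcome.
coin = brier_score([0.5] * len(outcomes), outcomes)

# A confident forecaster who is right on two items and wrong on two.
overconfident = brier_score([0.9, 0.9, 0.1, 0.1], outcomes)

print(f"coin baseline: {coin:.3f}")
print(f"overconfident: {overconfident:.3f}")
```

Because the penalty is quadratic, two confident misses cost far more than two confident hits earn, which is how a forecaster with coin-flip accuracy can score worse than the 0.25 baseline.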