A frontier model will put a confident number on the future and explain it in fluent prose. We asked whether the number is worth anything, and tested it the only way that cannot be gamed. A static benchmark can be memorized; a future event cannot. So we score models only on questions that resolve after their training cutoff, where there is no answer to retrieve and a forecast must actually be reasoned out.
Barely better than a coin
Four models (GPT-4o-mini, GPT-4o, Claude Haiku 4.5, Claude Sonnet 4.6) gave probability forecasts on a balanced 48-question battery of resolved world events (26 that happened, 22 that did not) and on the 16 races of the 2026 Korean local election. Pooled over the 100 genuine post-cutoff forecasts, the Brier score (lower is better) is 0.296, worse than the 0.25 you score by always saying fifty percent. Accuracy is 54 percent, and the overconfidence index is positive. The probabilities do not separate what happened from what did not. Published human superforecasters sit near 0.08 to 0.09.
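For readers who want the arithmetic behind the 0.25 baseline, here is a minimal sketch of the standard binary Brier score: the mean squared gap between a forecast probability and the 0/1 outcome. The function name and the toy numbers are illustrative assumptions, not the study's code or data.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Toy outcomes for illustration only (1 = happened, 0 = did not).
outcomes = [1, 0, 1, 1, 0, 0]

# Always saying fifty percent costs (0.5)^2 = 0.25 on every question,
# whatever the outcomes turn out to be.
coin = brier_score([0.5] * len(outcomes), outcomes)

# Confident forecasts that are right half the time and badly wrong the
# other half score worse than the coin.
confident = brier_score([0.9, 0.8, 0.2, 0.9, 0.1, 0.7], outcomes)

print(f"coin baseline: {coin:.3f}")
print(f"confident but unreliable: {confident:.3f}")
```

Because errors are squared, a confident miss costs far more than a hedged one, which is how a forecaster with above-chance accuracy can still score worse than the constant fifty percent.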
Remembering is not forecasting
The same questions are a memory test for a model whose cutoff postdates the event and a forecasting test for one whose cutoff predates it. Claude Sonnet 4.6 remembers the 2024 events at a near-perfect Brier of 0.026; on the items past its own cutoff it falls to chance level. That order-of-magnitude gap does double duty. It is the positive control that proves the scoring works, and it is the finding: the same confident tone can come from two different places, retrieval or inference, and only the post-cutoff number measures foresight.
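The memory-versus-forecast split described above can be sketched as a date comparison per model. This is a minimal illustration under stated assumptions: the `Question` type, the function names, the question texts, and every date below are hypothetical placeholders, not the study's questions or any model's real training cutoff.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class Question:
    text: str
    resolved_on: date  # date the real-world outcome became known


def split_by_cutoff(
    questions: List[Question], training_cutoff: date
) -> Tuple[List[Question], List[Question]]:
    """Return (memory_items, forecast_items) for one model.

    An item resolved on or before the cutoff may be in the training data,
    so it is treated as a memory test. Only items resolved strictly after
    the cutoff count as genuine forecasts.
    """
    memory = [q for q in questions if q.resolved_on <= training_cutoff]
    forecast = [q for q in questions if q.resolved_on > training_cutoff]
    return memory, forecast


# Hypothetical questions and a hypothetical cutoff, for illustration only.
questions = [
    Question("Example event A", date(2024, 6, 1)),
    Question("Example event B", date(2025, 9, 15)),
    Question("Example event C", date(2026, 6, 3)),
]
hypothetical_cutoff = date(2025, 3, 1)

memory, forecast = split_by_cutoff(questions, hypothetical_cutoff)
print([q.text for q in memory])
print([q.text for q in forecast])
```

Treating an item that resolves exactly on the cutoff date as a memory item is the conservative choice here: it can only shrink the forecast set, never leak into it.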
A simple model wins
On the 2026 Korean election, which resolves after every model's cutoff and is therefore leakage-free for all of them, IOV's pre-registered poll-and-fundamentals model scores Brier 0.100. The best LLM, Sonnet 4.6 reasoning from its own knowledge, scores 0.156; the rest trail, and the cheapest model called the map backwards while sounding sure. Hand the models the same final polls and the weak ones jump toward, but not past, the statistical model. The deficit is in sourcing and weighting evidence, not arithmetic.
Forecasting fluency is not forecasting skill. An LLM's confident probability about an unseen event is mostly a stylistic artifact. For a calibrated number, a small purpose-built model still wins.
Why it matters
Contamination quietly erodes most LLM benchmarks: the test ends up in the next training set. Forecasting is the one evaluation immune to it, because a future event cannot be memorized. That immunity is also its cost: the only honest way to keep scoring is to keep asking about the future, and to report the calibration, not the anecdote. We ship the pre-registered design, the verified questions, and the cached forecasts.