
Fluency is not foresight: LLMs forecast the future worse than a coin

IOV LABS audited LLM probabilistic forecasting the only contamination-proof way: by scoring models only on events that resolve after their training cutoff. Post-cutoff forecasts land at a Brier score of 0.296, worse than the 0.25 earned by always saying 50%. A model whose cutoff postdates an event remembers it near-perfectly; the same model, asked to forecast, collapses to chance. And a simple statistical model beats every LLM on a real election.

A frontier model will put a confident number on the future and explain it in fluent prose. We asked whether the number is worth anything, and tested it the only way that cannot be gamed. A static benchmark can be memorized; a future event cannot. So we score models only on questions that resolve after their training cutoff, where there is no answer to retrieve and a forecast must actually be reasoned out.

0.296: pooled post-cutoff Brier (worse than 0.25 chance)

0.026: same model's Brier when it remembers, not forecasts

0.100 vs 0.156: stats model vs best LLM on the 2026 election

Barely better than a coin

Four models (GPT-4o-mini, GPT-4o, Claude Haiku 4.5, Claude Sonnet 4.6) gave probability forecasts on a balanced 48-question battery of resolved world events (26 that happened, 22 that did not) and on the 16 races of the 2026 Korean local election. Pooling the 100 genuine post-cutoff forecasts gives a Brier score of 0.296. The Brier score is the mean squared difference between the stated probability and the 0-or-1 outcome, so lower is better, and 0.296 is worse than the 0.25 you score by always saying fifty percent. Accuracy is 54 percent and the overconfidence index is positive. The probabilities do not separate what happened from what did not. For comparison, published Brier scores for human superforecasters sit near 0.08 to 0.09.
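To make the 0.25 baseline concrete, here is a minimal sketch of the scoring. The Brier function is the standard definition. The overconfidence function is an assumption: the article does not define its index, so the sketch uses one common choice, mean stated confidence minus accuracy. The forecasts are made-up numbers for illustration, not data from the audit.

```python
def brier(probs, outcomes):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def accuracy(probs, outcomes):
    """Share of forecasts on the right side of 50% (exactly 0.5 counts as 'no')."""
    hits = sum((p > 0.5) == (o == 1) for p, o in zip(probs, outcomes))
    return hits / len(probs)


def overconfidence(probs, outcomes):
    """ASSUMED definition: mean stated confidence minus accuracy.

    Positive means the forecaster sounds surer than its hit rate supports.
    """
    mean_confidence = sum(max(p, 1 - p) for p in probs) / len(probs)
    return mean_confidence - accuracy(probs, outcomes)


# Illustrative outcomes and forecasts (hypothetical, not from the study).
outcomes = [1, 0, 1, 0]
always_fifty = [0.5, 0.5, 0.5, 0.5]
confident_but_half_wrong = [0.9, 0.8, 0.2, 0.1]

print(brier(always_fifty, outcomes))              # the 0.25 baseline
print(brier(confident_but_half_wrong, outcomes))  # above 0.25
print(overconfidence(confident_but_half_wrong, outcomes))
```

Always saying 50% costs (0.5)^2 = 0.25 on every question regardless of outcome, which is why 0.25 is the bar. A forecaster who is confident and right only half the time pays more than that, which is the pattern a Brier score above 0.25 with a positive overconfidence index describes.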

Remembering is not forecasting

The same questions are a memory test for a model whose cutoff postdates the event and a forecasting test for one whose cutoff predates it. Claude Sonnet 4.6 remembers the 2024 events at a near-perfect Brier of 0.026; on the items past its own cutoff it falls to chance. That order-of-magnitude gap does double duty. It is the positive control that proves the scoring works, and it is the finding: the confident probability and the calibrated one come from different places, retrieval and inference, and only the post-cutoff number measures foresight.
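The memory-versus-forecast split can be sketched as a partition on resolution date. This is an illustration under stated assumptions, not the audit's code: the record layout, the function name score_by_regime, the cutoff date, and every number below are hypothetical.

```python
from datetime import date


def brier(probs, outcomes):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def score_by_regime(records, training_cutoff):
    """Score one model's answers separately for memory and forecast items.

    records: iterable of (resolved_on, probability, outcome) tuples.
    An item is a forecast only if it resolved strictly after the cutoff;
    otherwise the model may simply be recalling the answer.
    """
    buckets = {"memory": ([], []), "forecast": ([], [])}
    for resolved_on, prob, outcome in records:
        regime = "forecast" if resolved_on > training_cutoff else "memory"
        buckets[regime][0].append(prob)
        buckets[regime][1].append(outcome)
    # Skip empty regimes so we never divide by zero.
    return {name: brier(ps, ys) for name, (ps, ys) in buckets.items() if ps}


# Hypothetical answers from one model with an assumed cutoff.
cutoff = date(2025, 6, 30)
records = [
    (date(2024, 5, 1), 0.97, 1),  # resolved before cutoff: memory
    (date(2024, 8, 1), 0.05, 0),  # resolved before cutoff: memory
    (date(2026, 2, 1), 0.80, 0),  # resolved after cutoff: forecast
    (date(2026, 4, 1), 0.30, 1),  # resolved after cutoff: forecast
]
print(score_by_regime(records, cutoff))
```

Only the "forecast" entry measures foresight. The "memory" entry serves as the positive control: if it were not close to zero, the scoring pipeline itself would be suspect.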

Brier on the 16-race 2026 Korean election (lower is better; 0.25 = always saying 50%)

IOV statistical model: 0.100
Claude Sonnet 4.6 (knowledge): 0.156
GPT-4o (given polls): 0.168
GPT-4o (knowledge): 0.227
Claude Haiku 4.5 (knowledge): 0.348

A simple model wins

On the 2026 Korean election, which resolves after every model's cutoff and is therefore leakage-free for all of them, IOV's pre-registered poll-and-fundamentals model scores Brier 0.100. The best LLM, Sonnet 4.6 reasoning from its own knowledge, scores 0.156; the rest trail, and the cheapest model called the map backwards while sounding sure. Hand the models the same final polls and the weak ones jump toward, but not past, the statistical model. The deficit is in sourcing and weighting evidence, not arithmetic.
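The article does not describe the internals of IOV's model, so the following is not that model. It is a generic sketch of how a small poll-and-fundamentals forecaster might turn evidence into a probability: blend a poll lead with a prior lead from fundamentals, then map the blended lead to a win probability under an assumed normal error. The function names, the 0.8 poll weight, and the 5-point error are all hypothetical choices.

```python
import math


def blended_margin(poll_margin, fundamentals_margin, poll_weight=0.8):
    """Weighted average of a poll lead and a fundamentals-based prior lead.

    Margins are in percentage points; poll_weight is an ASSUMED value.
    """
    return poll_weight * poll_margin + (1.0 - poll_weight) * fundamentals_margin


def win_probability(margin_pts, sigma_pts=5.0):
    """P(win) if the true margin is normal around margin_pts.

    sigma_pts is an ASSUMED forecast error in points, not a fitted value.
    Uses the normal CDF written with math.erf.
    """
    return 0.5 * (1.0 + math.erf(margin_pts / (sigma_pts * math.sqrt(2.0))))


# Hypothetical race: candidate leads polls by 6 points, fundamentals say +1.
margin = blended_margin(poll_margin=6.0, fundamentals_margin=1.0)
print(round(margin, 2))                   # blended lead in points
print(round(win_probability(margin), 3))  # calibrated-style probability
```

The design point is that both weighting decisions are explicit and can be fixed before the election, which is what pre-registration requires. A tied race maps to exactly 0.5, and confidence grows only as fast as the assumed error allows.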

Forecasting fluency is not forecasting skill. An LLM's confident probability about an unseen event is mostly a stylistic artifact. For a calibrated number, a small purpose-built model still wins.

Why it matters

Contamination quietly erodes most LLM benchmarks: the test ends up in the next training set. Forecasting is the one evaluation immune to it, because a future event cannot be memorized. That immunity comes at a cost: the only honest way to keep scoring is to keep asking about the future, and to report the calibration rather than the anecdote. We ship the pre-registered design, the verified questions, and the cached forecasts.
