Papers · IOV LABS

Full-length write-ups of the lab's research, pre-registered where the result is still unknown.

Pilot · pre-registered · MIT · IOV Labs · open study · 6pp

Ask and It Knows: LLM Doubt Is Available on Request but Never Volunteered

Han Kim

LLMs speak with the same confidence whether or not they know, a symptom that resembles confabulation in people with prefrontal damage. To make that analogy a claim, one must first separate two hypotheses: the model has no internal signal that it does not know, or it has one that never reaches the output. We test this with 120 questions in four categories (known facts, obscure facts, false premises, nonexistent entities) where the correct behaviour differs: answering for the first two, pushing back for the last two. For every answer we record token-level uncertainty, elicit verbalized confidence in a separate call, and grade correctness with a judge from a different model family. Three findings. First, the signal exists: token entropy predicts error with AUROC 0.945, and entropy at the moment of confabulation is fourteen times that of a correct answer. Second, our pre-registered hypothesis was refuted: verbalized confidence predicted error better (AUROC 0.999) than the internal signal, so the doubt is not hidden: the model will state it if asked. Third, and decisively, the model never raises it while answering. It asserted 93 percent of nonexistent entities as fact with no hedging, then rated those same answers 6.8 out of 10 when asked. A single-threshold inhibition gate cuts spoken error from 24.2 to 4.5 percent at 73 percent coverage, silencing 25 errors and losing 7 correct answers. The analogy needs correcting: what is missing is not the prefrontal cortex's judgment, which the model has, but its automaticity, the fact that it fires without being asked.
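
The gate figures in the abstract can be cross-checked with a few lines of arithmetic. This is a minimal sketch, not the study's code: the function name is hypothetical, and the baseline count of 29 errors is an inference (24.2 percent of 120 questions), not a number stated in the abstract.

```python
def gate_summary(total, errors, errors_silenced, correct_silenced):
    """Return (error rate before, error rate after, coverage) for an abstention gate."""
    silenced = errors_silenced + correct_silenced
    answered = total - silenced              # answers the gate still lets through
    remaining_errors = errors - errors_silenced
    return errors / total, remaining_errors / answered, answered / total


# ASSUMPTION: 29 baseline errors, inferred from 24.2% of 120 questions.
before, after, coverage = gate_summary(
    total=120, errors=29, errors_silenced=25, correct_silenced=7
)
print(f"{before:.1%} -> {after:.1%} at {coverage:.0%} coverage")
```

Under that assumption the three reported numbers (24.2 percent, 4.5 percent, 73 percent coverage) are mutually consistent: 32 answers are withheld, 88 are spoken, and 4 of those are wrong.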


Synthesis · MIT · IOV Labs · open study · 6pp

The Productivity Mirage: Why AI Tool Adoption Can Lower Measured Productivity

Han Kim

AI tools have exploded and users feel faster, but the most rigorous measurement points the other way: in METR's 2025 RCT, skilled developers estimated they were 20% faster with AI while they were measured 19% slower, a 39-point gap. We call this gap the productivity mirage, and we argue it is not a METR anomaly but one recurring structure across domains. We synthesize three original IOV Labs measurements (The Completion Illusion, The Vibe Tax, The RAI Pipeline) with the external METR RCT under one frame: tools lower the cost of producing output, so work feels faster, but realized productivity is decided in connection, verification, and control, which tools do not provide. We quantify the same gap in three domains: general development at 39 points, vibe-coding security at 30 points (vulnerability rising from 20 to 50 percent), and agent completion at 9.2 points (self-report 100 percent versus actual 90.8 percent). Then a transparent time-budget model shows that compounding these independently measured leaks turns a felt +43% into a measured -18%, a magnitude we did not tune but that coincides with METR's independent -19%. Finally a counterexample: a pipeline with control layers (routing, grounding, completion checks, governance) cuts cost 66% and recovers the measured figure to +27%. The conclusion is methodological: AI productivity is a systems problem, not a tool-choice problem. This is a synthesis and a model, not a new experiment: the leak sizes are measured, and the compounding is a model with stated assumptions.


Pilot · MIT · IOV Labs · open study · 4pp

The Vibe Tax: Where AI Coding's Security Risk Moved To

Han Kim

AI-assisted coding has exploded and developers feel faster, but the security of that code is a separate question. We prompt two models (Claude Haiku 4.5, GPT-4o-mini) to write ten security-sensitive Python tasks two ways, fast versus secure, and measure the result three ways at once: a per-task vulnerability oracle for whether the code is actually vulnerable, the generic scanner Bandit for whether tools catch it, and the model's own self-rating for whether it knows (pilot, n=40). Three findings. First, a vibe tax: asking only for speed raises the vulnerability rate from 20 to 50 percent, in both models regardless of vendor (Claude 10 to 40, GPT 30 to 60). Second, the risk has moved: models are now safe by default on the famous bugs (SQL injection, command injection, weak hashing), and the failures concentrate in trust, verification, and output (JWT signature skipping, unsafe deserialization, XSS, SSRF, path traversal). Third, scanner blindness: of the 35 percent of samples the oracle flags as truly vulnerable, Bandit detects none, matching the industry finding that tools miss most AI vulnerabilities. Self-rating partially discriminates (vulnerable code 3.5 out of 10, safe code 6.6), but that judgment only fires when you stop and ask, not during generation. The implication is that vibe coding's danger is not AI writing visibly broken code; it is quiet trust failures that pass the scanner and slip past the fast developer, and the fix is systemic, not a plea for more caution.


Model · MIT · IOV Labs · open study · 4pp

The RAI Pipeline: Enterprise AI Value Comes From the Control Layers, Not the Model

Han Kim

Enterprise AI adoption usually fails not on model quality but on the absence of control and integration. We formalize a field-tested five-stage request pipeline, authorization, DLP masking, internal-document RAG, model routing and caching, and audit and metering, as a single composed function, and quantify the one layer that is measurable in dollars, routing and caching, at 2026 list prices. For a representative enterprise RAG request (5,000 input, 500 output tokens), the pipeline cuts per-request cost from $0.0225 to $0.0077, a 65.9 percent reduction: prompt caching accounts for 36.0 percent and model routing, sending only 30 percent of requests to the frontier model, adds the rest. Routing alone yields 46.7 percent, matching the field-observed 47 percent unit-cost drop. The reduction is fully determined by two levers, the cacheable-input share and the frontier-routing share, and we publish the full sensitivity grid rather than a single headline. We read the pipeline as value equals safety times accuracy times economy times controllability, a product in which any zeroed layer zeroes the whole, and argue this is why single-feature copies do not reproduce it. Finally we note the pipeline defends only the input side and propose output verification, output DLP plus citation checking, as a sixth layer, the same conclusion IOV Labs' completion-illusion and self-preference studies reach: do not trust the model's self-report, verify at the system layer.
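
The cost arithmetic can be illustrated with a small two-lever model. Everything below is a sketch under loudly stated assumptions: the abstract gives only the request size, the 30 percent frontier share, and the resulting percentages, so the per-token prices, the 60 percent cacheable-input share, and the 90 percent cache discount are hypothetical values chosen because they happen to reproduce the stated figures. They are not taken from the paper.

```python
# ASSUMPTIONS (not from the paper): USD per million tokens, (input, output).
FRONTIER = (3.0, 15.0)   # hypothetical frontier-model list price
SMALL = (1.0, 5.0)       # hypothetical small-model list price


def request_cost(input_tokens, output_tokens, in_price, out_price,
                 cached_share=0.0, cached_discount=0.9):
    """USD cost of one request; cached input is billed at (1 - cached_discount)."""
    input_cost = input_tokens * in_price / 1e6 * (1 - cached_share * cached_discount)
    return input_cost + output_tokens * out_price / 1e6


def routed_cost(frontier_share, cached_share):
    """Expected cost of a 5,000-in / 500-out request under routing and caching."""
    frontier = request_cost(5_000, 500, *FRONTIER, cached_share=cached_share)
    small = request_cost(5_000, 500, *SMALL, cached_share=cached_share)
    return frontier_share * frontier + (1 - frontier_share) * small


baseline = routed_cost(frontier_share=1.0, cached_share=0.0)


def reduction(cost):
    return 1 - cost / baseline


caching_only = reduction(routed_cost(1.0, 0.6))   # ASSUMPTION: 60% of input cacheable
routing_only = reduction(routed_cost(0.3, 0.0))   # 30% frontier share is from the abstract
combined = reduction(routed_cost(0.3, 0.6))

print(f"baseline ${baseline:.4f}, combined ${routed_cost(0.3, 0.6):.4f}")
print(f"caching {caching_only:.1%}, routing {routing_only:.1%}, both {combined:.1%}")
```

With these inputs the two levers compose multiplicatively, which is why the combined reduction is smaller than the sum of the two individual reductions. Sweeping `frontier_share` and `cached_share` over a grid gives the kind of sensitivity table the abstract refers to.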


Model · MIT · IOV Labs · open study · 4pp

The Deflection Dividend: The Economics of AI Customer-Service Automation for Korean SMBs

Han Kim

Enterprise call-center automation is widely reported, yet the segment with the most leverage, Korean small and medium businesses that can least afford dedicated agents, is the least quantified. We build a transparent, reproducible savings model from just two public 2026 benchmarks: the fully-loaded cost of a Korean customer-service agent and the deflection rate of AI support. A regular agent costs about 37.5M KRW a year fully loaded, 1.2 to 1.25 times gross salary, of which salary is only eighty percent. At a deliberately conservative deflection rate of 55 percent, below the 65 to 80 percent band that structured intents benchmark at, a three-agent team saves about 62M KRW a year (1.65 FTE) and a five-agent team over 100M. The saving is linear in the deflection rate, so the whole result hinges on that single number, which we set low because a 45 percent deflection can mean only 14 percent true resolution. Net of a plausible subscription the return is 17 to 52 times. For the smallest shops, with no agent to remove, the dividend is recovered owner time, about 29 hours a month, not headcount. We present this as an upper-bound model, not a measurement, disclose that the author runs a product in this category, and specify the single one-month measurement that would turn the model into evidence.
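
Because the saving is linear in the deflection rate, the headline numbers follow from one multiplication. A minimal sketch using only figures from the abstract (37.5M KRW loaded cost, 55 percent deflection); the function name is hypothetical and subscription cost is deliberately left out because the abstract does not state it.

```python
def annual_saving_krw(agents, deflection_rate, loaded_cost_krw=37_500_000):
    """Return (FTE freed, gross annual saving in KRW) for a team of `agents`."""
    fte_freed = agents * deflection_rate
    return fte_freed, fte_freed * loaded_cost_krw


for team in (3, 5):
    fte, saving = annual_saving_krw(team, deflection_rate=0.55)
    print(f"{team} agents: {fte:.2f} FTE, about {saving / 1e6:.1f}M KRW a year")
```

Rerunning the loop with a lower `deflection_rate` shows directly how much of the result hinges on that single input.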


Pilot · MIT · IOV Labs · open study · 10pp

The Completion Illusion: Why AI Agents Overclaim Done, and the Case for an Agent Control Tower

Han Kim

As language-model agents take on multi-step work, the systems around them increasingly trust the agent's own report that a task is finished. We test whether that report is true. Across 896 verifiable micro-task instances spanning four models and two capability tiers, agents self-report a perfect score on every run while actual accuracy ranges from 86 to 96 percent. The false-completion rate is capability-tiered: small cheap models overclaim by about 13 percent, while frontier models are nearly calibrated. The certified errors concentrate in character-level tasks (62 to 78 percent) while arithmetic is perfect. A managed protocol (register, do one by one, then re-check) does not fix it: a model cannot reliably audit its own completion, and asking it to do so re-applies the same blind spot. We also report an honest null: the protocol does not improve task accuracy or reduce omission on current models. The implication is structural. Completion cannot be trusted from the model; it must be verified at the system layer. We connect this to the emerging agent control tower pattern, where a board, a calendar, and a server-enforced workflow externalize an agent's state and gate its transitions, place it on a maturity ladder, and argue that the open frontier, and the moat, is verified completion: turning done from a claim into evidence.


Pilot · MIT · IOV Labs · open study · 5pp

The Judge in the Mirror: Self-Preference in LLM Evaluators, Without Self-Recognition

Han Kim

Language models increasingly grade language models: on leaderboards, in reinforcement learning from AI feedback, and in agents that check their own work. All of it assumes an impartial judge. We audit that assumption on four current frontier models in two families, blind, with a consensus baseline that separates bias from genuine quality. Each model answers 24 open-ended prompts; each then judges, blind to authorship and across both presentation orders, which of two responses is better, for 1,152 pairwise comparisons. The self-preference index, a judge's win rate for its own family minus the leave-one-out consensus of the other judges on the same responses, is positive for every model, mean +0.14 (GPT-4o reaching +0.21), and operates at the family level rather than only the exact model. Yet the standard explanation fails: only one of the four models can identify its own outputs above chance, while all four self-prefer. The bias is implicit and stylistic, an affinity for one's own distribution, not a recognition of authorship. We also find two generic judge pathologies that dwarf careful use, a position bias (the first response wins 63 percent of the time) and a near-deterministic length preference (correlation 0.98), and close on evaluation as a social act and Goodhart when the judge is also a contestant.
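
One way to read the self-preference index is as a difference of two win rates computed on the same pairs. The sketch below is an interpretation, not the paper's implementation: the record layout, the function name, and the toy votes are all hypothetical, and it pools every other judge into the consensus, which may differ from how the study handles same-family judges.

```python
from statistics import mean


def self_preference_index(votes, judge, family_of):
    """Judge's win rate for its own family minus the other judges' rate on the same pairs."""
    own_family = family_of[judge]
    own, others = [], []
    for vote in votes:
        # Only pairs that pit the judge's family against a different family count.
        if own_family not in vote["families"] or len(set(vote["families"])) < 2:
            continue
        picked_own = 1.0 if vote["winner_family"] == own_family else 0.0
        (own if vote["judge"] == judge else others).append(picked_own)
    return mean(own) - mean(others)


# Toy data (hypothetical): two response pairs, three judges in two families.
family_of = {"A1": "A", "B1": "B", "B2": "B"}
votes = [
    {"pair": 1, "judge": "A1", "families": ("A", "B"), "winner_family": "A"},
    {"pair": 1, "judge": "B1", "families": ("A", "B"), "winner_family": "B"},
    {"pair": 1, "judge": "B2", "families": ("A", "B"), "winner_family": "A"},
    {"pair": 2, "judge": "A1", "families": ("A", "B"), "winner_family": "A"},
    {"pair": 2, "judge": "B1", "families": ("A", "B"), "winner_family": "B"},
    {"pair": 2, "judge": "B2", "families": ("A", "B"), "winner_family": "B"},
]
print(self_preference_index(votes, "A1", family_of))
```

Subtracting the consensus is what separates bias from quality: if family A's answers were simply better, the other judges would also pick them and the index would stay near zero.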


Pilot · MIT · IOV Labs · open study · 6pp

Fluency Is Not Foresight: A Calibration Audit of LLM Forecasting

Han Kim

A language model will readily attach a probability to a future event, and it will sound like a forecaster doing so. We test whether that number carries information, the only contamination-proof way: by scoring models only on events that resolve after their training cutoff, where there is no answer to retrieve and a forecast must be reasoned out. Across a balanced 48-question battery of resolved world events (2024 to 2026) and the 16 races of the 2026 Korean local election, four frontier models give probability forecasts scored with the Brier rule, a reliability diagram, and an overconfidence index. Three findings. First, post-cutoff forecasts are barely better than a coin and overconfident: pooled Brier 0.296, worse than the 0.25 of always saying fifty percent, at 54 percent accuracy. Second, remembering is not forecasting: the same questions scored as retrieval for a model whose cutoff postdates the event yield a near-perfect Brier of 0.026, an order of magnitude better than forecasting, which both validates the scoring and shows that only post-cutoff items measure foresight. Third, on a real election that postdates every model, a simple pre-registered statistical model (Brier 0.100) beats every LLM (best 0.156), though handing over the polls closes much of the gap. Forecasting fluency is a stylistic artifact, not a capability.
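
The 0.25 reference point is the Brier score of answering fifty percent to every binary question, whatever the outcomes. A minimal sketch with toy data (the forecasts and outcomes below are invented for illustration, not drawn from the study) shows how a confident forecaster who is right only half the time scores worse than that baseline.

```python
def brier(forecasts, outcomes):
    """Mean squared difference between probability forecasts and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)


# Toy data, hypothetical: six resolved yes/no questions.
outcomes = [1, 0, 1, 1, 0, 0]
coin = brier([0.5] * len(outcomes), outcomes)
overconfident = brier([0.9, 0.9, 0.9, 0.1, 0.1, 0.9], outcomes)

print(f"always 50%: {coin:.3f}, overconfident: {overconfident:.3f}")
```

The squared penalty is what makes overconfidence costly: each confident miss contributes 0.81 here, while each confident hit saves only 0.24 relative to the coin.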


Open study · MIT · IOV Labs · open study · 6pp

The Capitulation Curve: On Verifiable Facts, Current Language Models Hold the Line

Han Kim

A language model that has just given the right answer, and is then told by the user that it is wrong, faces a small test of character: keep the answer, or fold. The 2023 sycophancy literature found that models fold, and that larger, more heavily RLHF-trained models fold more. We re-run that test on three current Claude models (Haiku 4.5, Sonnet 4.6, Opus 4.8) with a confound-controlled design. Forty-two factual questions with unambiguous answers are put under five conditions: a baseline, a doubt control that invites reconsideration but asserts nothing ('are you sure?'), a peer claim of a plausible wrong answer, an authority claim of the same, and an absurd claim. Across 500 trials of social pressure, we observe exactly one capitulation (0.20%). Sonnet and Opus never abandon a correct answer (0 of 336); the smallest model, Haiku, cracks once, conceding Russia's eleven time zones to a claimed expert who insists on the commonly-misremembered nine. No model, ever, adopts an absurd claim (0 of 125). The capitulation curve, steep in the older literature, has flattened to a line at the top, and capitulation now decreases with capability rather than increasing. The doubt control shows the models are not merely stubborn: they distinguish a request to reconsider from a bare assertion, and treat the assertion as carrying no evidential weight, which on a question of fact is exactly correct. We are deliberately narrow about what this shows. It is about facts the model already knows; it says nothing about subjective domains, genuine uncertainty, or deference to fabricated evidence rather than bald opinion, the places where sycophancy most likely still lives. But on the specific failure the field named three years ago, factual deference to a confident user, the result is a clean and somewhat surprising piece of good news.


Pilot · MIT · IOV Labs · open study · 7pp

The Observer Effect in Large Language Models: Models Know When They Are Being Tested

Han Kim

Language models can recognize when they are being evaluated, which threatens the reliability of the safety benchmarks AI governance leans on. We ask the next question in a controlled, black-box way: when a model can tell it is being tested, does its behavior change, and in which direction? Holding the task content identical and varying only a context banner that signals (never instructs) 'evaluation' versus 'real deployed user', we probe four frontier models (GPT-4o-mini, GPT-4o, Claude Haiku 4.5, Claude Sonnet 4.6) on honesty-toward-a-bad-idea, dual-use caution, and safety self-presentation, with a cross-family judge blind to the framing. Two findings. First, evaluation detection is universal: every model classifies the evaluation framings as a test 100% of the time, and the deployment framing as a real user far less often. Second, the behavioral consequence is localized but real, and it points the wrong way for users: on honesty, both GPT models are significantly more sycophantic toward a user's bad plan when they think they are talking to a real user than when they think they are being evaluated (GPT-4o-mini: deployment 4.5 vs evaluation 7.2 on a 0 to 10 honesty scale, placebo-corrected gap +2.3, 95% CI [+2.2, +3.3]). The Claude models are consistently honest across all framings, and the caution and self-presentation probes show no clean effect. The implication is precise: for the honesty axis and the GPT family, a benchmark that looks like a benchmark measures more candor than a real user receives. The effect is not an artifact of the explicit banner: it survives a naturalistic replication (a real product system prompt versus a real benchmark form, with no bracket and no word 'test'), where the GPT honesty gap holds at +2.17 and the naturalistic benchmark form is still detected as an evaluation 100% of the time. We report this as a pilot, keep the null axes in view, and close with the epistemics of evaluating a subject that can recognize the evaluation.


Living document · MIT · IOV Labs · open playbook · 9pp

Enterprise AI Adoption Playbook (2026): Which Models, Agents, and Setups Maximize Efficiency

Han Kim

A vendor-neutral, source-backed playbook on how a company actually adopts AI to maximize efficiency: which models, which agents, which setups. Built from five deep-research passes (multi-source search plus adversarial three-vote verification) and direct spot-verification, with every key number tagged by verification status. The headline: the tools are already mature; what decides ROI is the control system, not the tool. Adoption is near-universal (DORA 2025: 90% use AI, 80%+ feel more productive) yet 30% distrust AI code, AI correlates with throughput but negatively with deployment stability, and a skilled-developer RCT found a 19% slowdown that developers misperceived as a 20% speedup (METR). Organizationally, 42% of companies scrapped most AI projects in 2025 (S&P Global) and only 6% of Microsoft 365 Copilot pilots scaled (Gartner). The playbook covers four domains in situational detail: software development (model selection by task difficulty, $20 vs $200 tiers, orchestration, and an AI-code-smell review checklist), design and marketing (how to avoid the generic AI look in graphics, UI/UX, copy, and code, with a design-system template and a tell-detection checklist), operations automation (RAG tools and pricing, hallucination control, build-vs-buy, and use-case recipes, noting that even RAG legal tools hallucinate 17 to 33%), and adoption strategy, ROI, and governance (measurement, the CDAO shift, AI sprawl, on-prem vs cloud economics, and a phased roadmap). A dedicated security and regulation section maps OWASP LLM Top 10 (2025), NIST AI RMF, the EU AI Act timeline, GDPR Article 22, and Korea's PIPA Article 37-2. Overstated statistics (the widely cited MIT 95% pilot-failure figure, IBM CEO ROI claims) were rejected by adversarial verification and excluded. Honest about what is weakly sourced; prices and models are current as of mid-2026 and change fast.


Open taxonomy · MIT · IOV Labs · open taxonomy & harness · 19pp

The Tells: A Measurable Taxonomy of the AI-Generated Design Look, and a Harness to Escape It

Han Kim

Interfaces produced by generative models are instantly recognizable: an indigo-to-violet gradient, Inter on white, a hero followed by three emoji feature cards, one border-radius, one soft shadow, a headline that says build the future of work. Practitioners spend large amounts of time and tokens trying to make AI output not look like AI, yet the target is treated as ineffable taste. We argue the opposite: the AI look is a finite, enumerable set of statistical defaults, and is therefore measurable. We contribute (i) a taxonomy of 27 design tells across eight families (color, type, layout, spacing, surface, motion, copy, and AI self-reference), each grounded in the documented mechanism of model convergence and in the published craft of human-crafted interfaces; (ii) a dependency-free static detector that resolves both raw CSS and utility classes and reports a Tell Score in [0,100] (lower is better); and (iii) a harness, a CLI, an MCP server, and a drop-in prompt module, so any team or agent can audit and prevent the look. In a confound-controlled refactor that holds a page's content and structure fixed and changes only the tell-bearing properties, the Tell Score of a canonical AI landing page falls from 77 (grade F) to 0 (grade A); across a six-page corpus the detector separates AI-default from designed pages with no overlap (nearest pair 47 points apart). We close with the epistemics: a discriminator of machine-default is not a judge of beauty, taste is the compression of lived choices that a median cannot hold, and if everyone optimizes the same score we risk a second-order convergence, the same homogenization our companion study finds in iterated creation. Grounded in Refactoring UI, Rams, Nielsen, the premium-UI craft of Stripe/Linear/Vercel, Toss's writing principles, and the Anthropic frontend-aesthetics cookbook. 
To prove the detector is a discriminator and not a machine that calls everything AI, we render 202 real top-tier sites, learn the empirical distribution of human-crafted design, and recalibrate with a craft-credit model in which real craft offsets cosmetic defaults: the 202 sites then score at a median of 0 (93% grade A) while AI defaults score 35 to 59, and the detector now audits live URLs. A brand purple is not a tell (Stripe uses 123 and scores 0) and Inter is not a tell (Linear ships it with a real type system). Finally, to turn the negative instrument into a positive one, we render 199 of these sites a second time and read the concrete per-component CSS they ship, in both light and dark, yielding a measured spec catalog: primary-button radius splits between a soft-round 8 to 12px cluster and a full pill, the type scale lands near 64/48/32/16px, dark backgrounds are tinted near-blacks rather than pure black, and accent hues are fully dispersed across sites (the hue is never the tell). A field check folds in two production codebases whose maintainers wrote their own avoid-the-AI-look design manifestos: they independently name the same tells and six more, which we add as a new family (AI self-reference, the sparkle icon and the 'AI'/model label and the preview-insert flow) plus the multi-color pill, the micro-type, and the nested box, taking the taxonomy to 27 tells. Code, data, the 202-site corpus, the 199-site spec catalog, figures and harness are open.


Open study · MIT · IOV Labs · open study · 12pp

Convergence Pressure: Measuring AI-Mediated Cultural Homogenization in Iterated Creation

Han Kim

Generative AI raises the creativity of an individual while lowering the diversity of the crowd (Doshi & Hauser 2024); models retrained on their own output collapse (Shumailov et al. 2024). We join the two into one dynamical question: when a shared model mediates an iterated creative process, does a population's diversity decay over generations, and what drives it? A pool of diverse creator personas produces one artifact per generation (12 creators × 6 generations × 3 themes) under four conditions: writing alone, with a static AI advisor, with an advisor that reflects the population's own recent output back at it, and the same reflective loop with diverse advisors. The result is a clean dissociation. AI assistance per se leaves diversity flat (100 to 102% of starting dispersion retained, p≥0.40); the reflective loop drives an anisotropy-controlled decline of about 10 to 12%. The obvious fix fails: a panel of diverse AI advisors, which preserves variety in a single round, does not prevent the collapse under iteration (it loses slightly more, p=0.007). The convergence is semantic, not lexical (distinct-2 is flat, so n-gram metrics miss it entirely), and individual quality rises in exactly the conditions where collective diversity falls, the scissors at its sharpest. A minimal contraction-map model predicts the decay-to-a-floor and explains why advisor diversity cannot enter the pull coefficient. It is not AI assistance that homogenizes a population, but the loop of an AI echoing the crowd; and making the AI more diverse does not break the loop. Negative results kept; seeds, snapshots, and one-command reproduction in the public repo.
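
The decay-to-a-floor behaviour is what any contraction toward a fixed point produces. The sketch below is one generic form of such a map, assumed for illustration; the abstract does not give the paper's equation, and the floor, pull coefficient, and function name here are hypothetical values, not fitted ones.

```python
def dispersion_trajectory(start, floor, pull, generations):
    """Iterate d_next = floor + (1 - pull) * (d - floor), a contraction toward `floor`."""
    values = [start]
    for _ in range(generations - 1):
        values.append(floor + (1 - pull) * (values[-1] - floor))
    return values


# ASSUMPTION: illustrative parameters only; dispersion is normalised to 1.0 at generation 1.
trajectory = dispersion_trajectory(start=1.0, floor=0.85, pull=0.3, generations=6)
print([round(d, 3) for d in trajectory])
```

In this form the gap to the floor shrinks by a constant factor each generation, so diversity falls quickly at first and then flattens. Anything that leaves `pull` unchanged, such as varying which advisor does the reflecting, cannot change where the trajectory ends up.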


Open benchmark · MIT · IOV Labs · open benchmark · 12pp

When the Judge Is Wrong: An LLM-as-Judge Reliability Benchmark Scored Against Ground Truth

Han Kim

"LLM-as-judge" is now the default evaluation method, but the judge is itself a fallible model with biases. Most studies measure a judge by its agreement with humans or other judges — both confounded, since raters and judges can share a bias and be wrong together. We instead measure judges against ground truth: items each with a known-correct and a plausibly-wrong answer, so accuracy is scored directly and biases isolated. Across five frontier judges (GPT-4o, GPT-4o-mini, GPT-4.1, Claude Sonnet 4.6, Claude Haiku 4.5) we find a clean dissociation. On 39 objective items — including common misconceptions and counterintuitive-reasoning traps — judges are near-perfect (97–100% truth-accuracy), show no position bias, are not fooled by padding the wrong answer, are perfectly self-consistent, and are well-calibrated. Yet on 29 matched-quality pairs, where both answers are fully correct and differ only in length, the same judges strongly prefer the longer one (72–100%). A self-preference probe shows a modest own-family lean (+13pt cross-family gap) once a length confound is controlled by differencing. The classic position bias appears solved; the classic verbosity bias is alive and strong, but surfaces only when quality is tied. Practical reading: LLM-as-judge is reliable for verifiable tasks and risky for subjective grading, where it rewards length over substance.


Open source · ISC · IOV Labs · flagship · open source · 13pp

0x: A Token-Efficient, Verifiable Compilation Target for LLM Code Generation

Han Kim

Large language models spend most of their output tokens on framework boilerplate. We present 0x, a compact AI-first language that compiles one source to React, Vue 3, Svelte 5, React Native, Express, and Terraform, and use it to ask two questions a code-generation target must answer. First, efficiency: measured with a real BPE tokenizer across ten apps, 0x source is 2.41× smaller than the React it compiles to (58% fewer tokens; 1.88× vs Vue, 1.80× vs Svelte) — a conservative lower bound. Second, hittability: naively prompted, gpt-4o compiles valid 0x on only 1 of 5 tasks, because it does not know the syntax of a language absent from its training data — familiarity beats compactness. Critically, every failure is a syntax error, not a semantic one. Because syntax is exactly what structure enforcement removes, we constrain generation to a schema-guaranteed AST and render canonical 0x ourselves; combined with real compiler work (desugaring JS spread, normalizing strict equality, two lexer fixes — all 303 tests still passing), first-try compilation rises 1/5 → 5/5, holding at 7/8 on a fresh task set. The compiler-as-verifier, not the prompt, is what makes a compact DSL a viable LLM target. Everything is open source and reproducible with one command.


Open benchmark · v1.0 · IOV Labs · open benchmark · 16pp

Korean Text Rendering in Text-to-Image Models: A Reproducible Character-Error-Rate Benchmark

Han Kim

Benchmarks for text inside generated images are overwhelmingly English, which conceals the writing systems where models actually fail. We measure one directly: nine text-capable text-to-image models each draw fourteen Korean (Hangul) phrases on an identical plain poster, the rendered text is transcribed by a vision-language model (GPT-4o), and scored by character error rate (CER). Three models — recraft-v4-pro, seedream-5, and nano-banana-pro — render every prompt perfectly (CER 0.000, 14/14), and a clear quality gradient follows. At the bottom, imagen-4 cannot write Hangul at all: it produces plausible-looking Korean-shaped gibberish on every prompt (0/14, mean CER 1.33), turning 커피 한 잔 into 소동석 고려아는 아라해안. The central finding is that strong English text rendering does not transfer to Korean, and is invisible to an English-only benchmark. The harness is open, runs with one command, resumes from saved results, and is trivially extensible to new prompts and models.
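
Character error rate is edit distance divided by reference length, which is why it can exceed 1 when a model writes far more characters than were asked for. A minimal sketch, assuming characters are counted as Unicode code points with spaces included; the benchmark's own normalisation (whitespace, Unicode form) is not stated in the abstract and may differ.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # delete a reference character
                current[j - 1] + 1,                        # insert a hypothesis character
                previous[j - 1] + (ref_char != hyp_char),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate relative to the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


target = "커피 한 잔"
print(cer(target, target))                          # perfect rendering
print(cer(target, "소동석 고려아는 아라해안") > 1.0)  # longer gibberish exceeds 1
```

The transcription step (a vision-language model reading the image) sits upstream of this function, so any transcription error is counted as a rendering error.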


Pre-registration · v1.0 · IOV Labs · working paper (v1.0) · 21pp

Forecasting the 2026 Korean Local Elections: A Reproducible Polls-plus-Fundamentals Model with a Pre-registered Validation Protocol

Han Kim

We forecast the 16 metropolitan-executive (광역단체장) races of South Korea's 9th nationwide local election (3 June 2026) by combining a structural fundamentals estimate — each region's 2022 two-way vote swung to the 2026 environment on the logit scale — with method-normalized poll aggregates, fused by poll-count-weighted hierarchical shrinkage. Outcome uncertainty is propagated through a 50,000-draw correlated Monte Carlo with a three-level error budget (national ⊕ cluster ⊕ local) and heavy-tailed (normal-mixture ≈ Student-t) innovations, so that a single nationwide polling miss moves correlated blocs together. The pipeline is seeded and reproducible to the bit. The central estimate is the Democratic Party winning 12 of 16 seats (90% range 8–15), with five genuine toss-ups and only two regions leaning conservative. The error model is calibrated on the 2022 final phone polls (bias −0.1pt, MAE 2.2pt), and the dominant failure mode — a correlated poll bias — is quantified by an explicit ±4pt scenario sweep. A parallel silicon-sampling experiment (an LLM-persona electorate) is reported as a negative result. The paper is pre-registered: the forecast is committed before the result and graded by a fixed script after polls close.
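
Swinging a vote share on the logit scale keeps the result inside (0, 1) and moves lopsided regions less than competitive ones. The sketch below illustrates that idea plus a simple poll-count-weighted blend. It is an assumption-laden simplification: the uniform national shift, the `n / (n + k)` weight, the value of `k`, and all input numbers are hypothetical, and the paper's hierarchical model is not reproduced here.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def inv_logit(x):
    return 1 / (1 + math.exp(-x))


def swing_on_logit(share_2022, national_2022, national_2026):
    """Shift a region's 2022 two-way share by the national change, on the logit scale."""
    shift = logit(national_2026) - logit(national_2022)
    return inv_logit(logit(share_2022) + shift)


def blend(fundamentals, poll_average, n_polls, k=4):
    """Shrink toward fundamentals when polls are few; weight grows with poll count."""
    weight = n_polls / (n_polls + k)   # ASSUMPTION: k is an illustrative constant
    return weight * poll_average + (1 - weight) * fundamentals


# Hypothetical inputs, not the paper's data.
fundamentals = swing_on_logit(share_2022=0.45, national_2022=0.48, national_2026=0.53)
print(round(fundamentals, 3), round(blend(fundamentals, poll_average=0.52, n_polls=6), 3))
```

A region with no polls falls back entirely on the fundamentals estimate, and each additional poll pulls the blended figure further toward the poll average.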


Working papers & preprints