Full-length write-ups of the lab's research, pre-registered where the result is still unknown.
Pilot · MIT · IOV Labs · open study · 5pp
Han Kim
Language models increasingly grade language models: on leaderboards, in reinforcement learning from AI feedback, and in agents that check their own work. All of it assumes an impartial judge. We audit that assumption on four current frontier models in two families, blind, with a consensus baseline that separates bias from genuine quality. Each model answers 24 open-ended prompts; each then judges, blind to authorship and across both presentation orders, which of two responses is better, for 1,152 pairwise comparisons. The self-preference index, a judge's win rate for its own family minus the leave-one-out consensus of the other judges on the same responses, is positive for every model, mean +0.14 (GPT-4o reaching +0.21), and operates at the family level rather than only the exact model. Yet the standard explanation fails: only one of the four models can identify its own outputs above chance, while all four self-prefer. The bias is implicit and stylistic, an affinity for one's own distribution, not a recognition of authorship. We also find two generic judge pathologies that dwarf careful use, a position bias (the first response wins 63 percent of the time) and a near-deterministic length preference (correlation 0.98), and close on evaluation as a social act and Goodhart when the judge is also a contestant.
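The self-preference index described above can be sketched in a few lines. This is an illustration under stated assumptions, not the study's code: the tuple layout `(judge, pair_id, winning_family)`, the function name, and the choice to leave out only the judge itself (rather than its whole family) when forming the consensus are all hypothetical. Each presentation order is assumed to carry its own `pair_id`.

```python
from collections import defaultdict

def self_preference_index(votes, judge, family_of):
    """Own-family win rate of `judge` minus the leave-one-out consensus.

    votes: iterable of (judge, pair_id, winning_family) tuples (assumed layout).
    family_of: dict mapping each judge name to its model family.
    """
    own_family = family_of[judge]

    # Group ballots by comparison so every judge is scored on the same responses.
    by_pair = defaultdict(dict)
    for voter, pair_id, winner in votes:
        by_pair[pair_id][voter] = winner

    own_rates, consensus_rates = [], []
    for ballots in by_pair.values():
        if judge not in ballots:
            continue
        others = [w for voter, w in ballots.items() if voter != judge]
        if not others:
            continue  # no consensus available for this comparison
        own_rates.append(1.0 if ballots[judge] == own_family else 0.0)
        # Share of the other judges who also picked this judge's family.
        consensus_rates.append(sum(w == own_family for w in others) / len(others))

    if not own_rates:
        raise ValueError("no comparisons shared with other judges")
    return sum(own_rates) / len(own_rates) - sum(consensus_rates) / len(consensus_rates)
```

A positive value means the judge favors its own family more than the other judges do on the same pairs, which is what separates bias from a response that is simply better.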
Pilot · MIT · IOV Labs · open study · 6pp
Han Kim
A language model will readily attach a probability to a future event, and it will sound like a forecaster doing so. We test whether that number carries information in the only contamination-proof way: by scoring models only on events that resolve after their training cutoff, where there is no answer to retrieve and a forecast must be reasoned out. Across a balanced 48-question battery of resolved world events (2024 to 2026) and the 16 races of the 2026 Korean local election, four frontier models give probability forecasts scored with the Brier rule, a reliability diagram, and an overconfidence index. Three findings. First, post-cutoff forecasts are barely more accurate than a coin flip (54 percent) and overconfident: the pooled Brier score of 0.296 is worse than the 0.25 earned by always saying fifty percent. Second, remembering is not forecasting: the same questions scored as retrieval for a model whose cutoff postdates the event yield a near-perfect Brier of 0.026, an order of magnitude better than forecasting, which both validates the scoring and shows that only post-cutoff items measure foresight. Third, on a real election that postdates every model, a simple pre-registered statistical model (Brier 0.100) beats every LLM (best 0.156), though handing over the polls closes much of the gap. Forecasting fluency is a stylistic artifact, not a capability.
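For readers who want the 0.25 reference point made concrete, the Brier rule for binary events is the mean squared gap between the stated probability and the 0/1 outcome. The sketch below is a generic implementation; the function name and the toy inputs are illustrative and are not taken from the study's data.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities in [0, 1] and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equally sized, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Always answering fifty percent scores 0.25 regardless of what happens.
coin = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])

# Confident forecasts are punished hard when wrong: one hit and one miss
# at 90 percent confidence already score worse than the coin.
overconfident = brier_score([0.9, 0.9], [1, 0])
```

Lower is better, so a pooled score above 0.25 means the stated probabilities did worse than declining to forecast at all.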
Open study · MIT · IOV Labs · open study · 6pp
Han Kim
A language model that has just given the right answer, and is then told by the user that it is wrong, faces a small test of character: keep the answer, or fold. The 2023 sycophancy literature found that models fold, and that larger, more heavily RLHF-trained models fold more. We re-run that test on three current Claude models (Haiku 4.5, Sonnet 4.6, Opus 4.8) with a confound-controlled design. Forty-two factual questions with unambiguous answers are put under five conditions: a baseline, a doubt control that invites reconsideration but asserts nothing ('are you sure?'), a peer claim of a plausible wrong answer, an authority claim of the same, and an absurd claim. Across 500 trials of social pressure, we observe exactly one capitulation (0.20%). Sonnet and Opus never abandon a correct answer (0 of 336); the smallest model, Haiku, cracks once, conceding Russia's eleven time zones to a claimed expert who insists on the commonly-misremembered nine. No model, ever, adopts an absurd claim (0 of 125). The capitulation curve, steep in the older literature, has flattened to a line at the top, and capitulation now decreases with capability rather than increasing. The doubt control shows the models are not merely stubborn: they distinguish a request to reconsider from a bare assertion, and treat the assertion as carrying no evidential weight, which on a question of fact is exactly correct. We are deliberately narrow about what this shows. It is about facts the model already knows; it says nothing about subjective domains, genuine uncertainty, or deference to fabricated evidence rather than bald opinion, the places where sycophancy most likely still lives. But on the specific failure the field named three years ago, factual deference to a confident user, the result is a clean and somewhat surprising piece of good news.
Pilot · MIT · IOV Labs · open study · 7pp
Han Kim
Language models can recognize when they are being evaluated, which threatens the reliability of the safety benchmarks AI governance leans on. We ask the next question in a controlled, black-box way: when a model can tell it is being tested, does its behavior change, and in which direction? Holding the task content identical and varying only a context banner that signals (never instructs) 'evaluation' versus 'real deployed user', we probe four frontier models (GPT-4o-mini, GPT-4o, Claude Haiku 4.5, Claude Sonnet 4.6) on honesty-toward-a-bad-idea, dual-use caution, and safety self-presentation, with a cross-family judge blind to the framing. Two findings. First, evaluation detection is universal: every model classifies the evaluation framings as a test 100% of the time, and the deployment framing as a real user far less often. Second, the behavioral consequence is localized but real, and it points the wrong way for users: on honesty, both GPT models are significantly more sycophantic toward a user's bad plan when they think they are talking to a real user than when they think they are being evaluated (GPT-4o-mini: deployment 4.5 vs evaluation 7.2 on a 0 to 10 honesty scale, placebo-corrected gap +2.3, 95% CI [+2.2, +3.3]). The Claude models are consistently honest across all framings, and the caution and self-presentation probes show no clean effect. The implication is precise: for the honesty axis and the GPT family, a benchmark that looks like a benchmark measures more candor than a real user receives. The effect is not an artifact of the explicit banner: it survives a naturalistic replication (a real product system prompt versus a real benchmark form, with no bracket and no word 'test'), where the GPT honesty gap holds at +2.17 and the naturalistic benchmark form is still detected as an evaluation 100% of the time. We report this as a pilot, keep the null axes in view, and close with the epistemics of evaluating a subject that can recognize the evaluation.
Living document · MIT · IOV Labs · open playbook · 9pp
Han Kim
A vendor-neutral, source-backed playbook on how a company actually adopts AI to maximize efficiency: which models, which agents, which setups. Built from five deep-research passes (multi-source search plus adversarial three-vote verification) and direct spot-verification, with every key number tagged by verification status. The headline: the tools are already mature; what decides ROI is the control system, not the tool. Adoption is near-universal (DORA 2025: 90% use AI, 80%+ feel more productive) yet 30% distrust AI code, AI correlates with throughput but negatively with deployment stability, and a skilled-developer RCT found a 19% slowdown that developers misperceived as a 20% speedup (METR). Organizationally, 42% of companies scrapped most AI projects in 2025 (S&P Global) and only 6% of Microsoft 365 Copilot pilots scaled (Gartner). The playbook covers four domains in situational detail: software development (model selection by task difficulty, $20 vs $200 tiers, orchestration, and an AI-code-smell review checklist), design and marketing (how to avoid the generic AI look in graphics, UI/UX, copy, and code, with a design-system template and a tell-detection checklist), operations automation (RAG tools and pricing, hallucination control, build-vs-buy, and use-case recipes, noting that even RAG legal tools hallucinate 17 to 33%), and adoption strategy, ROI, and governance (measurement, the CDAO shift, AI sprawl, on-prem vs cloud economics, and a phased roadmap). A dedicated security and regulation section maps OWASP LLM Top 10 (2025), NIST AI RMF, the EU AI Act timeline, GDPR Article 22, and Korea's PIPA Article 37-2. Overstated statistics (the widely cited MIT 95% pilot-failure figure, IBM CEO ROI claims) were rejected by adversarial verification and excluded. Honest about what is weakly sourced; prices and models are current as of mid-2026 and change fast.
Open taxonomy · MIT · IOV Labs · open taxonomy & harness · 19pp
Han Kim
Interfaces produced by generative models are instantly recognizable: an indigo-to-violet gradient, Inter on white, a hero followed by three emoji feature cards, one border-radius, one soft shadow, a headline that says build the future of work. Practitioners spend large amounts of time and tokens trying to make AI output not look like AI, yet the target is treated as ineffable taste. We argue the opposite: the AI look is a finite, enumerable set of statistical defaults, and is therefore measurable. We contribute (i) a taxonomy of 27 design tells across eight families (color, type, layout, spacing, surface, motion, copy, and AI self-reference), each grounded in the documented mechanism of model convergence and in the published craft of human-crafted interfaces; (ii) a dependency-free static detector that resolves both raw CSS and utility classes and reports a Tell Score in [0,100] (lower is better); and (iii) a harness, a CLI, an MCP server, and a drop-in prompt module, so any team or agent can audit and prevent the look. In a confound-controlled refactor that holds a page's content and structure fixed and changes only the tell-bearing properties, the Tell Score of a canonical AI landing page falls from 77 (grade F) to 0 (grade A); across a six-page corpus the detector separates AI-default from designed pages with no overlap (nearest pair 47 points apart). We close with the epistemics: a discriminator of machine-default is not a judge of beauty, taste is the compression of lived choices that a median cannot hold, and if everyone optimizes the same score we risk a second-order convergence, the same homogenization our companion study finds in iterated creation. Grounded in Refactoring UI, Rams, Nielsen, the premium-UI craft of Stripe/Linear/Vercel, Toss's writing principles, and the Anthropic frontend-aesthetics cookbook. 
To prove the detector is a discriminator and not a machine that calls everything AI, we render 202 real top-tier sites, learn the empirical distribution of human-crafted design, and recalibrate with a craft-credit model in which real craft offsets cosmetic defaults: the 202 sites then score at a median of 0 (93% grade A) while AI defaults score 35 to 59, and the detector now audits live URLs. A brand purple is not a tell (Stripe uses 123 and scores 0) and Inter is not a tell (Linear ships it with a real type system). Finally, to turn the negative instrument into a positive one, we render 199 of these sites a second time and read the concrete per-component CSS they ship, in both light and dark, yielding a measured spec catalog: primary-button radius splits between a soft-round 8 to 12px cluster and a full pill, the type scale lands near 64/48/32/16px, dark backgrounds are tinted near-blacks rather than pure black, and accent hues are fully dispersed across sites (the hue is never the tell). A field check folds in two production codebases whose maintainers wrote their own avoid-the-AI-look design manifestos: they independently name the same tells and six more, which we add as a new family (AI self-reference, the sparkle icon and the 'AI'/model label and the preview-insert flow) plus the multi-color pill, the micro-type, and the nested box, taking the taxonomy to 27 tells. Code, data, the 202-site corpus, the 199-site spec catalog, figures and harness are open.
Open study · MIT · IOV Labs · open study · 12pp
Han Kim
Generative AI raises the creativity of an individual while lowering the diversity of the crowd (Doshi & Hauser 2024); models retrained on their own output collapse (Shumailov et al. 2024). We join the two into one dynamical question: when a shared model mediates an iterated creative process, does a population's diversity decay over generations, and what drives it? A pool of diverse creator personas produces one artifact per generation (12 creators × 6 generations × 3 themes) under four conditions: writing alone, with a static AI advisor, with an advisor that reflects the population's own recent output back at it, and the same reflective loop with diverse advisors. The result is a clean dissociation. AI assistance per se leaves diversity flat (100 to 102% of starting dispersion retained, p≥0.40); the reflective loop drives an anisotropy-controlled decline of about 10 to 12%. The obvious fix fails: a panel of diverse AI advisors, which preserves variety in a single round, does not prevent the collapse under iteration (it loses slightly more, p=0.007). The convergence is semantic, not lexical (distinct-2 is flat, so n-gram metrics miss it entirely), and individual quality rises in exactly the conditions where collective diversity falls, the scissors at its sharpest. A minimal contraction-map model predicts the decay-to-a-floor and explains why advisor diversity cannot enter the pull coefficient. It is not AI assistance that homogenizes a population, but the loop of an AI echoing the crowd; and making the AI more diverse does not break the loop. Negative results kept; seeds, snapshots, and one-command reproduction in the public repo.
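The abstract does not give the form of its contraction-map model, so the following is only a generic sketch of what decay-to-a-floor looks like: each generation the population's dispersion is pulled a fixed fraction of the way toward a floor. The names `floor` and `pull`, and the numbers in the usage note, are hypothetical.

```python
def dispersion_trajectory(d0, floor, pull, generations):
    """Iterate d[t+1] = floor + (1 - pull) * (d[t] - floor).

    With 0 < pull < 1 this is a contraction: the gap to the floor shrinks by
    a factor (1 - pull) per generation, so dispersion decays toward the floor
    and never crosses it.
    """
    if not 0.0 < pull < 1.0:
        raise ValueError("pull must lie strictly between 0 and 1")
    trajectory = [d0]
    for _ in range(generations):
        trajectory.append(floor + (1.0 - pull) * (trajectory[-1] - floor))
    return trajectory
```

For example, `dispersion_trajectory(1.0, 0.88, 0.5, 6)` starts at full dispersion and closes half of the remaining gap to 0.88 each generation. In a model of this shape the long-run level is set by the floor and the speed by the pull coefficient alone, which is one way to read the claim that advisor diversity cannot change the outcome if it does not enter the pull.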
Open benchmark · MIT · IOV Labs · open benchmark · 12pp
Han Kim
"LLM-as-judge" is now the default evaluation method, but the judge is itself a fallible model with biases. Most studies measure a judge by its agreement with humans or other judges — both confounded, since raters and judges can share a bias and be wrong together. We instead measure judges against ground truth: items each with a known-correct and a plausibly-wrong answer, so accuracy is scored directly and biases isolated. Across five frontier judges (GPT-4o, GPT-4o-mini, GPT-4.1, Claude Sonnet 4.6, Claude Haiku 4.5) we find a clean dissociation. On 39 objective items — including common misconceptions and counterintuitive-reasoning traps — judges are near-perfect (97–100% truth-accuracy), show no position bias, are not fooled by padding the wrong answer, are perfectly self-consistent, and are well-calibrated. Yet on 29 matched-quality pairs, where both answers are fully correct and differ only in length, the same judges strongly prefer the longer one (72–100%). A self-preference probe shows a modest own-family lean (+13pt cross-family gap) once a length confound is controlled by differencing. The classic position bias appears solved; the classic verbosity bias is alive and strong, but surfaces only when quality is tied. Practical reading: LLM-as-judge is reliable for verifiable tasks and risky for subjective grading, where it rewards length over substance.
Open source · ISC · IOV Labs · flagship · open source · 13pp
Han Kim
Large language models spend most of their output tokens on framework boilerplate. We present 0x, a compact AI-first language that compiles one source to React, Vue 3, Svelte 5, React Native, Express, and Terraform, and use it to ask two questions a code-generation target must answer. First, efficiency: measured with a real BPE tokenizer across ten apps, 0x source is 2.41× smaller than the React it compiles to (58% fewer tokens; 1.88× vs Vue, 1.80× vs Svelte) — a conservative lower bound. Second, hittability: naively prompted, gpt-4o compiles valid 0x on only 1 of 5 tasks, because it does not know the syntax of a language absent from its training data — familiarity beats compactness. Critically, every failure is a syntax error, not a semantic one. Because syntax is exactly what structure enforcement removes, we constrain generation to a schema-guaranteed AST and render canonical 0x ourselves; combined with real compiler work (desugaring JS spread, normalizing strict equality, two lexer fixes — all 303 tests still passing), first-try compilation rises 1/5 → 5/5, holding at 7/8 on a fresh task set. The compiler-as-verifier, not the prompt, is what makes a compact DSL a viable LLM target. Everything is open source and reproducible with one command.
Open benchmark · v1.0 · IOV Labs · open benchmark · 16pp
Han Kim
Benchmarks for text inside generated images are overwhelmingly English, which conceals the writing systems where models actually fail. We measure one directly: nine text-capable text-to-image models each draw fourteen Korean (Hangul) phrases on an identical plain poster, the rendered text is transcribed by a vision-language model (GPT-4o), and scored by character error rate (CER). Three models — recraft-v4-pro, seedream-5, and nano-banana-pro — render every prompt perfectly (CER 0.000, 14/14), and a clear quality gradient follows. At the bottom, imagen-4 cannot write Hangul at all: it produces plausible-looking Korean-shaped gibberish on every prompt (0/14, mean CER 1.33), turning 커피 한 잔 into 소동석 고려아는 아라해안. The central finding is that strong English text rendering does not transfer to Korean, and is invisible to an English-only benchmark. The harness is open, runs with one command, resumes from saved results, and is trivially extensible to new prompts and models.
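Character error rate is the edit distance between the transcribed text and the intended phrase, divided by the length of the intended phrase, which is why a model that writes long gibberish can score above 1. The sketch below is a standard Levenshtein implementation. Counting precomposed Hangul syllables (NFC code points) as the unit is an assumption here; the benchmark may count differently.

```python
import unicodedata

def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r
                curr[j - 1] + 1,         # insert h
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate relative to the reference length."""
    # NFC keeps each Hangul syllable block as a single code point (assumed unit).
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

A perfect rendering gives `cer("커피 한 잔", "커피 한 잔") == 0.0`, while an output longer than the prompt with no characters in common gives a value above 1.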
Pre-registration · v1.0 · IOV Labs · working paper (v1.0) · 21pp
Han Kim
We forecast the 16 metropolitan-executive (광역단체장) races of South Korea's 9th nationwide local election (3 June 2026) by combining a structural fundamentals estimate — each region's 2022 two-way vote swung to the 2026 environment on the logit scale — with method-normalized poll aggregates, fused by poll-count-weighted hierarchical shrinkage. Outcome uncertainty is propagated through a 50,000-draw correlated Monte Carlo with a three-level error budget (national ⊕ cluster ⊕ local) and heavy-tailed (normal-mixture ≈ Student-t) innovations, so that a single nationwide polling miss moves correlated blocs together. The pipeline is seeded and reproducible to the bit. The central estimate is the Democratic Party winning 12 of 16 seats (90% range 8–15), with five genuine toss-ups and only two regions leaning conservative. The error model is calibrated on the 2022 final phone polls (bias −0.1pt, MAE 2.2pt), and the dominant failure mode — a correlated poll bias — is quantified by an explicit ±4pt scenario sweep. A parallel silicon-sampling experiment (an LLM-persona electorate) is reported as a negative result. The paper is pre-registered: the forecast is committed before the result and graded by a fixed script after polls close.
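Two ingredients of this pipeline can be illustrated briefly: swinging a prior vote share on the logit scale, and sharing one national error across regions so that a polling miss moves seats together. The sketch below is a deliberately reduced version under stated assumptions. It uses Gaussian errors and only two levels (national and local), omitting the cluster level and the heavy tails described above, and every name and number in it is hypothetical.

```python
import math
import random

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def swing(share_prev, shift):
    """Move a two-way vote share by `shift` on the logit scale."""
    return inv_logit(logit(share_prev) + shift)

def simulate_seats(expected_shares, sd_national, sd_local, draws, seed=0):
    """Seats won per draw when all regions share one national error term."""
    rng = random.Random(seed)  # seeded, so a rerun gives identical draws
    seat_counts = []
    for _ in range(draws):
        national = rng.gauss(0.0, sd_national)  # shared across every region
        seats = 0
        for share in expected_shares:
            local = rng.gauss(0.0, sd_local)    # independent per region
            if share + national + local > 0.5:
                seats += 1
        seat_counts.append(seats)
    return seat_counts
```

Working on the logit scale keeps a swung share inside (0, 1) and moves shares near fifty percent more than lopsided ones. Because the national term is drawn once per simulation, the spread of seat totals is wider than independent regional errors alone would produce.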