News

Latest from the lab

·2 min read

The judge in the mirror: LLM evaluators favor their own family, but cannot see why

IOV LABS audited self-preference in LLM-as-judge on four current frontier models, blind, with a consensus baseline that separates bias from quality. Every judge inflates its own family over the neutral panel by about 14 points. But the standard story is wrong: only one model recognizes its own outputs, yet all four self-prefer. The bias is implicit. Beneath it, the first-shown answer wins 63% of the time and length predicts the verdict at 0.98.

LLM-as-judge · Self-preference · Leaderboards
·2 min read

Fluency is not foresight: LLMs forecast the future worse than a coin

IOV LABS audited LLM probabilistic forecasting the only contamination-proof way, scoring models only on events that resolve after their training cutoff. Post-cutoff forecasts land at Brier 0.296, worse than always saying 50%. A model that postdates an event remembers it near-perfectly; the same model forecasting collapses to chance. And a simple statistical model beats every LLM on a real election.

Forecasting · Calibration · Brier score
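For readers unfamiliar with the metric in the forecasting item above, here is a minimal sketch of the Brier score for binary events, showing why always saying 50% scores exactly 0.25, so a score of 0.296 sits on the wrong side of that baseline. The function name `brier_score` and the outcome list are illustrative assumptions, not the lab's code or data.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better: 0.0 is perfect, 1.0 is confidently wrong every time.
    """
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    if not forecasts:
        raise ValueError("need at least one forecast")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical resolved events (1 = happened, 0 = did not); not real data.
outcomes = [1, 0, 0, 1, 1, 0]

# Always answering 50% costs (0.5 - o)^2 = 0.25 on every event,
# whatever the outcome, so the baseline is 0.25 by construction.
coin_baseline = brier_score([0.5] * len(outcomes), outcomes)

# An overconfident forecaster who is right only half the time does worse.
overconfident = brier_score([0.9, 0.9, 0.1, 0.1], [1, 0, 0, 1])

print(coin_baseline, overconfident)
```

Any forecaster scoring above 0.25 on binary events would have done better by abstaining at 50% on every question.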
·4 min read

A model that knows it is right does not fold, even when you swear it is wrong

IOV LABS re-ran the 2023 sycophancy test on three current Claude models with a control that separates reconsideration from deference. Across 500 trials of social pressure on facts the models knew, there was exactly one capitulation. Sonnet 4.6 and Opus 4.8 never abandoned a correct answer; no model ever accepted an absurd one. Capitulation now decreases with capability, inverting the older result.

Sycophancy · LLM behavior · AI safety
·3 min read

LLMs know when they're being tested, and GPT gets more honest because of it

IOV LABS ran a controlled, black-box study of evaluation-elastic behavior. Every model recognizes an evaluation framing 100% of the time. The behavioral shift is localized but real and user-adverse: GPT is more sycophantic toward a real user than a grader, so an honesty benchmark over-certifies what a user actually gets.

Evaluation awareness · AI safety · Sycophancy
·2 min read

The tools are mature. ROI is decided by the control system, not the tool.

IOV LABS built a vendor-neutral, source-backed playbook on how a company actually adopts AI to maximize efficiency: which models, agents, and setups. Five deep-research passes with adversarial verification across dev, design, ops, governance, and security. Overstated stats were rejected.

Enterprise AI · AI adoption · Coding agents
·10 min read

The AI design look is not taste. It is a finite set of defaults, and we measured it.

IOV LABS taxonomized the AI-generated design look into 27 measurable tells across eight families and built a transparent detector, the Tell Score. Holding a page's content fixed and changing only the tell-bearing choices drops the score from 77 (F) to 0 (A). Two production codebases with their own anti-AI design manifestos independently confirmed the tells and named six more, now a family H (AI self-reference). Ships as a CLI, an MCP server, and a drop-in prompt so any team or agent can audit and prevent it.

AI design · Generative UI · Design systems
·3 min read

It is not AI assistance that homogenizes a culture. It is the loop.

IOV LABS ran a controlled study of AI-mediated cultural homogenization. Static AI help leaves a population's diversity flat; a reflective loop, the AI echoing the crowd's recent hits, collapses it by 10 to 12% over six generations. The obvious fix fails: diverse AI advisors do not prevent it.

Generative AI · Cultural homogenization · Feedback loops
·2 min read

We tested the AI that grades other AIs. It's reliable until the answer is a tie.

IOV LABS benchmarked LLM-as-judge against ground truth across 5 frontier models. On objective items they are near-perfect and unbiased; on matched-quality ties the same judges prefer the longer answer 72-100% of the time. Reliable where verifiable, biased where subjective.

LLM-as-judge · Evaluation · Benchmark
·1 min read

We built a forecasting model for the 2026 local elections and we'll grade it

IOV LABS built an AI persona + polls/fundamentals model for the 2026 Korean local elections. The method and the full pre-registered forecast are out now; after June 3 we grade every prediction against the real result.

Election forecasting · AI personas · Poll aggregation
·1 min read

IOV LABS enters the permanent scientific record: our work now has a DOI

The same citation backbone that anchors peer-reviewed science now anchors IOV LABS. Two repositories, 0x-lang and the Korean text-rendering benchmark, are minted with permanent DOIs on Zenodo, the open-science archive run by CERN. Every release from here is frozen, versioned, and citable for good.

DOI · Zenodo · Open science
·2 min read

IOV LABS founder joins the global research record with an ORCID iD

IOV LABS founder Han Kim now holds an ORCID iD, entering the same research-identity system used by universities, journals and funders. For an independent AI lab that stakes its credibility on reproducibility, it makes the lab's open work permanently citable and accountable.

ORCID · Open research · Verification
·2 min read

We benchmarked how well image models draw Korean. One can't at all.

IOV LABS ran a reproducible benchmark of Korean text rendering across 9 image models on 14 Hangul prompts. Three models made zero character errors; imagen-4 rendered every prompt as gibberish, scoring 0 of 14. The benchmark is open and runs with one command.

Benchmark · Image generation · Korean
·4 min read

Can a cheap verifier gate generative media, and route it?

A new IOV LABS research note carries 0x-lang's compiler-as-checker idea into image and video generation: a fast automatic verifier used as both a quality gate and a routing label. Twenty-four sources, twenty-five claims, all verified.

Research · Image generation · Video generation
·5 min read

0x-lang: a token benchmark and a verifiable-codegen study

0x source uses about 2.4 times fewer tokens than the React it compiles to, and constrained decoding plus three compiler fixes raised a model's first-try compile rate from one in five to five in five. IOV LABS published the benchmark and study in full.

0x-lang · Benchmark · LLM code generation
·4 min read

IOV LABS launches in Seoul as an open-source AI research lab

IOV LABS, a new AI research lab, launched in Seoul with a focus on open-source developer tools and reproducible benchmarks. The lab said it will publish its work in English and Korean, beginning with 0x-lang, a programming language aimed at AI code generation.

Announcement · Research lab · Open source