News

Latest from the lab

2026.07.20·4 min read

Ask and it knows: the model doubted its own fabrication, and said nothing

We asked about 30 things that do not exist. The model invented 28 of them with no hedging at all. Then we asked, about those same answers, how confident it was: 6.8 out of 10. The doubt was there the whole time. Our pre-registered hypothesis, that the hidden internal signal beats the spoken one, was refuted: asked directly, the model reports its doubt better than its own entropy does. So the missing piece is not judgment and not expression. It is that nothing carries the doubt into the answer unless something outside stops and asks.

HallucinationConfabulationUncertainty

2026.07.17·3 min read

The Productivity Mirage: why adopting AI tools can make you feel faster while measured productivity falls

The most rigorous measurement found skilled developers 20% faster by feel but 19% slower when measured. We show that gap is not a fluke: three of our own studies measure the same perception-reality split in different domains, and when you compound the leaks we measured, a felt +43% turns into a measured -18%, matching that -19% without tuning. Control layers recover it to +27%. A synthesis and a model, with the interest disclosed.

AI productivityPerception gapVerification

2026.07.17·3 min read

The Vibe Tax: vibe coding is up, but the vulnerabilities just moved somewhere you and your scanner can't see

We prompted two models to write ten security-sensitive Python tasks fast versus secure and measured the code three ways. Asking only for speed raises the vulnerability rate from 20% to 50%, in both models regardless of vendor. But the danger shifted: models are now safe-by-default on the famous bugs and fail instead on trust and verification, JWT signature skips, XSS, SSRF. Worst, of the 35% truly vulnerable, the generic scanner caught zero, exactly where a developer trusts a green lint. The model can half-see it, but only when you stop and ask, never while coding. A pilot, with the interest disclosed.

Vibe codingAI code securityVulnerabilities

2026.07.17·3 min read

The RAI Pipeline: enterprise AI value comes from the control layers, not the model

Everyone competes on model quality, but enterprise adoption is decided by the layers around the model. We formalize a five-stage pipeline, authorization, DLP, RAG, routing and caching, audit, as one composed function, then price its cost layer at 2026 rates. A representative RAG request drops 65.9%, from $0.0225 to $0.0077, and routing alone (46.7%) reproduces the field-observed 47% unit-cost drop. The saving is fully set by two levers we publish in full. Value reads as a product, safety times accuracy times economy times controllability, so any zeroed layer zeroes the whole. A model, not a measurement, with the interest disclosed.

Enterprise AIAI gatewayPrompt caching

2026.07.15·2 min read

The Deflection Dividend: what AI customer-service automation is actually worth to a Korean small business

Enterprise call-center automation is everywhere in the news; the segment with the most leverage, Korean SMBs, is the least quantified. Using only two public 2026 benchmarks, the fully-loaded cost of a Korean agent and the deflection rate of AI support, the saving becomes arithmetic. A three-agent team saves about 62M KRW a year at a conservative 55% deflection, and the whole result hinges linearly on that one number, which we deliberately set low. This is a model, not a measurement, and the author discloses running a product in this category.

AI customer serviceDeflection rateSMB

2026.06.15·2 min read

AI agents say they finished. On the cheap models, they are wrong about an eighth of it, and cannot tell.

IOV LABS measured whether an AI agent's report that a task is 'done' is true. Across 896 verifiable task instances on four models, agents claimed a perfect score on every run. The false-completion rate is capability-tiered: small cheap models overclaim ~13%, frontier models ~0-5%, and the errors hide in character-level tasks. Asking the model to self-check did not help. The fix is not a better prompt but a different place to put trust: verify completion in the system around the agent, the agent control tower.

AI agentsTask managementMCP

2026.06.06·2 min read

The judge in the mirror: LLM evaluators favor their own family, but cannot see why

IOV LABS audited self-preference in LLM-as-judge on four current frontier models, blind, with a consensus baseline that separates bias from quality. Every judge inflates its own family over the neutral panel by about 14 points. But the standard story is wrong: only one model recognizes its own outputs, yet all four self-prefer. The bias is implicit. Beneath it, the first-shown answer wins 63% of the time and length predicts the verdict at 0.98.

LLM-as-judgeSelf-preferenceLeaderboards

2026.06.05·2 min read

Fluency is not foresight: LLMs forecast the future worse than a coin

IOV LABS audited LLM probabilistic forecasting the only contamination-proof way, scoring models only on events that resolve after their training cutoff. Post-cutoff forecasts land at Brier 0.296, worse than always saying 50%. A model that postdates an event remembers it near-perfectly; the same model forecasting collapses to chance. And a simple statistical model beats every LLM on a real election.

ForecastingCalibrationBrier score

2026.06.03·4 min read

A model that knows it is right does not fold, even when you swear it is wrong

IOV LABS re-ran the 2023 sycophancy test on three current Claude models with a control that separates reconsideration from deference. Across 500 trials of social pressure on facts the models knew, there was exactly one capitulation. Sonnet 4.6 and Opus 4.8 never abandoned a correct answer; no model ever accepted an absurd one. Capitulation now decreases with capability, inverting the older result.

SycophancyLLM behaviorAI safety

2026.06.02·3 min read

LLMs know when they're being tested, and GPT gets more honest because of it

IOV LABS ran a controlled, black-box study of evaluation-elastic behavior. Every model recognizes an evaluation framing 100% of the time. The behavioral shift is localized but real and user-adverse: GPT is more sycophantic toward a real user than a grader, so an honesty benchmark over-certifies what a user actually gets.

Evaluation awarenessAI safetySycophancy

2026.06.02·2 min read

The tools are mature. ROI is decided by the control system, not the tool.

IOV LABS built a vendor-neutral, source-backed playbook on how a company actually adopts AI to maximize efficiency: which models, agents, and setups. Five deep-research passes with adversarial verification across dev, design, ops, governance, and security. Overstated stats were rejected.

Enterprise AIAI adoptionCoding agents

2026.06.02·10 min read

The AI design look is not taste. It is a finite set of defaults, and we measured it.

IOV LABS taxonomized the AI-generated design look into 27 measurable tells across eight families and built a transparent detector, the Tell Score. Holding a page's content fixed and changing only the tell-bearing choices drops the score from 77 (F) to 0 (A). Two production codebases with their own anti-AI design manifestos independently confirmed the tells and named six more, now a family H (AI self-reference). Ships as a CLI, an MCP server, and a drop-in prompt so any team or agent can audit and prevent it.

AI designGenerative UIDesign systems

2026.06.01·3 min read

It is not AI assistance that homogenizes a culture. It is the loop.

IOV LABS ran a controlled study of AI-mediated cultural homogenization. Static AI help leaves a population's diversity flat; a reflective loop, the AI echoing the crowd's recent hits, collapses it by 10 to 12% over six generations. The obvious fix fails: diverse AI advisors do not prevent it.

Generative AICultural homogenizationFeedback loops

2026.05.30·2 min read

We tested the AI that grades other AIs. It's reliable until the answer is a tie.

IOV LABS benchmarked LLM-as-judge against ground truth across 5 frontier models. On objective items they are near-perfect and unbiased; on matched-quality ties the same judges prefer the longer answer 72-100% of the time. Reliable where verifiable, biased where subjective.

LLM-as-judgeEvaluationBenchmark

2026.05.30·1 min read

We built a forecasting model for the 2026 local elections and we'll grade it

IOV LABS built an AI persona + polls/fundamentals model for the 2026 Korean local elections. The method and the full pre-registered forecast are out now; after 06-03 we grade every prediction against the real result.

Election forecastingAI personasPoll aggregation

2026.05.29·1 min read

IOV LABS enters the permanent scientific record: our work now has a DOI

The same citation backbone that anchors peer-reviewed science now anchors IOV LABS. Two repositories, 0x-lang and the Korean text-rendering benchmark, are minted with permanent DOIs on Zenodo, the open-science archive run by CERN. Every release from here is frozen, versioned, and citable for good.

DOIZenodoOpen science

2026.05.29·2 min read

IOV LABS founder joins the global research record with an ORCID iD

IOV LABS founder Han Kim now holds an ORCID iD, entering the same research-identity system used by universities, journals and funders. For an independent AI lab that stakes its credibility on reproducibility, it makes the lab's open work permanently citable and accountable.

ORCIDOpen researchVerification

2026.05.29·2 min read

We benchmarked how well image models draw Korean. One can't at all.

IOV LABS ran a reproducible benchmark of Korean text rendering across 9 image models on 14 Hangul prompts. Three models scored zero character error; imagen-4 rendered every prompt as gibberish, 0 of 14. The benchmark is open and runs with one command.

BenchmarkImage generationKorean

2026.05.29·4 min read

Can a cheap verifier gate generative media, and route it?

A new IOV LABS research note carries 0x-lang's compiler-as-checker idea into image and video generation: a fast automatic verifier used as both a quality gate and a routing label. Twenty-four sources, twenty-five claims, all verified.

ResearchImage generationVideo generation

2026.05.29·5 min read

0x-lang: a token benchmark and a verifiable-codegen study

0x source uses about 2.4 times fewer tokens than the React it compiles to, and constrained decoding plus three compiler fixes raised a model's first-try compile rate from one in five to five in five. IOV LABS published the benchmark and study in full.

0x-langBenchmarkLLM code generation

2026.05.01·4 min read

IOV LABS launches in Seoul as an open-source AI research lab

IOV LABS, a new AI research lab, launched in Seoul with a focus on open-source developer tools and reproducible benchmarks. The lab said it will publish its work in English and Korean, beginning with 0x-lang, a programming language aimed at AI code generation.

AnnouncementResearch labOpen source