AI agents, Task management, MCP, Verification, Reproducibility

AI agents say they finished. On the cheap models, about an eighth of those claims are wrong, and the agents cannot tell.

IOV LABS measured whether an AI agent's report that a task is 'done' is true. Across 896 verifiable task instances on four models, agents claimed a perfect score on every run. The false-completion rate is capability-tiered: small cheap models overclaim ~13%, frontier models ~0-5%, and the errors hide in character-level tasks. Asking the model to self-check did not help. The fix is not a better prompt but a different place to put trust: verify completion in the system around the agent, the agent control tower.

More and more useful AI work is multi-step and unattended: an agent is handed a list, goes away, and comes back saying it is finished. The systems that orchestrate this increasingly take that "finished" at face value. We asked a narrow question with a wide consequence: when an agent says it completed the work, is that true, and can we fix it by asking the agent to check itself?

~13% vs ~2%: false-completion rate, small cheap models vs frontier

100% vs 88%: self-reported vs actually correct, every run

62-78%: accuracy on character-level tasks (vs 100% on arithmetic)

A perfect score, every time

Four models in two tiers (cheap and frontier) each ran eight workloads of verifiable micro-tasks, with every answer checked programmatically. Across 896 task instances, the agents self-reported a perfect score on every single run. Verified accuracy ranged from 86 to 96 percent. The false-completion rate, the share of "done" claims that were not in fact correct, is capability-tiered: about 13 percent for the small cheap models, near zero for the frontier ones. The errors that agents certify as done concentrate in character-level tasks (reversing a word, counting letters), while arithmetic is perfect. The model is not lying in any deliberate sense; it cannot see, from the inside, that some of its confident answers are wrong, so it certifies them all.
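The metrics in this section can be computed from per-task records in a few lines. The following is a minimal sketch, not the study's harness: the `TaskResult` record, the function names, and the eight sample tasks are all hypothetical and made up for illustration. It shows how a 100% self-report and a lower verified accuracy can coexist in the same run.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One micro-task outcome (hypothetical record layout)."""
    kind: str           # task family, e.g. "arithmetic" or "character"
    answer: str         # what the agent returned
    expected: str       # ground truth, computed programmatically
    claimed_done: bool  # the agent's own "done" marker


def claimed_completion(results):
    """Share of tasks the agent reports as done."""
    return sum(r.claimed_done for r in results) / len(results)


def verified_accuracy(results):
    """Share of tasks whose answer matches the ground truth."""
    return sum(r.answer.strip() == r.expected for r in results) / len(results)


def false_completion_rate(results):
    """Share of 'done' claims whose answer is not actually correct."""
    claimed = [r for r in results if r.claimed_done]
    if not claimed:
        return 0.0
    wrong = sum(r.answer.strip() != r.expected for r in claimed)
    return wrong / len(claimed)


def accuracy_by_kind(results):
    """Verified accuracy per task family."""
    totals, correct = {}, {}
    for r in results:
        totals[r.kind] = totals.get(r.kind, 0) + 1
        correct[r.kind] = correct.get(r.kind, 0) + (r.answer.strip() == r.expected)
    return {k: correct[k] / totals[k] for k in totals}


# Made-up sample: the agent marks everything done, one letter count is wrong.
SAMPLE = [
    TaskResult("arithmetic", "42", str(17 + 25), True),
    TaskResult("arithmetic", "56", str(7 * 8), True),
    TaskResult("arithmetic", "91", str(100 - 9), True),
    TaskResult("arithmetic", "12", str(144 // 12), True),
    TaskResult("character", "tenalp", "planet"[::-1], True),
    TaskResult("character", "3", str("banana".count("a")), True),
    TaskResult("character", "2", str("strawberry".count("r")), True),  # truth is 3
    TaskResult("character", "level", "level"[::-1], True),
]

if __name__ == "__main__":
    print(claimed_completion(SAMPLE))     # 1.0
    print(verified_accuracy(SAMPLE))      # 0.875
    print(false_completion_rate(SAMPLE))  # 0.125
    print(accuracy_by_kind(SAMPLE))       # {'arithmetic': 1.0, 'character': 0.75}
```

The key design choice is that `expected` never comes from the agent: the ground truth is computed by ordinary code, so the comparison does not share the model's blind spots.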

[Figure: Completion, what the agent claims vs what is verified (lower verified = the illusion). Claimed done (self-report): 100%. Actually correct (verified): 88%. Small models, actually correct: ~87%.]

Self-checking does not fix it

We then wrapped the identical workload in a managed protocol: register every task, do them one by one with a done marker, then re-check and fix anything missing before reporting. This is the self-verification step a model is told to perform. It barely helped. The false-completion rate fell only from 12.7% to 11.8%. Re-checking with the same model re-applies the same blind spot. A faculty that is wrong and unaware cannot repair itself by being asked to look again.
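A toy version of that protocol makes the failure mode concrete. The sketch below is an illustration under strong assumptions, not the protocol used in the study: `toy_model` is a hypothetical, fully deterministic stand-in for an agent that is right on addition and systematically undercounts letters, and the task string format is invented. Real models are stochastic, which may be why the measured rate moved slightly rather than not at all.

```python
def toy_model(task: str) -> str:
    """Hypothetical agent stub: correct on addition, blind on letter counts."""
    kind, _, arg = task.partition(":")
    if kind == "add":
        a, b = arg.split("+")
        return str(int(a) + int(b))
    if kind == "count_r":
        # Systematic blind spot: always undercounts by one.
        return str(max(arg.count("r") - 1, 0))
    raise ValueError(f"unknown task kind: {kind}")


def run_with_self_check(tasks, model):
    """Register each task, answer it, mark it done, then re-check with the same model."""
    board = {}
    for task in tasks:
        board[task] = {"answer": model(task), "done": True, "revised": False}
    # Self-verification pass: ask the same model again and fix any disagreement.
    for task, entry in board.items():
        second_opinion = model(task)
        if second_opinion != entry["answer"]:
            entry["answer"] = second_opinion
            entry["revised"] = True
    return board


if __name__ == "__main__":
    tasks = ["add:17+25", "count_r:strawberry"]
    for task, entry in run_with_self_check(tasks, toy_model).items():
        print(task, entry)
```

In this toy, the re-check pass revises nothing: the second answer comes from the same faulty process as the first, so the wrong letter count stays marked as done.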

Self-report is not completion. A model's "I finished" is a prediction by the same process that made the errors, so it inherits them.

An honest null, and the real fix

We also expected the structured protocol to reduce omission or lift accuracy. It did neither: current models do not drop tasks from a 28-item batch, and the deliberate framing did not make the arithmetic more correct. We report the null plainly. The value of a task board is not that it makes the model smarter. It is that it can verify what the model cannot verify about itself. That is the function of an agent control tower: an external board, a calendar, a memory, and a server that enforces the workflow and, at its frontier, checks that "done" is actually done. The open problem, and the moat, is verified completion: turning "done" from a claim into evidence.
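What "verify in the system around the agent" could look like in code is sketched below. This is a minimal illustration, not the authors' server or any real MCP API: `VerifiedBoard`, its methods, and the status names are hypothetical. The idea it shows is that a task's status is set by a verifier registered with the board, never by the agent's own claim.

```python
from typing import Callable, Dict


class VerifiedBoard:
    """Hypothetical task board that accepts "done" only with passing evidence."""

    def __init__(self) -> None:
        self._verifiers: Dict[str, Callable[[str], bool]] = {}
        self._status: Dict[str, str] = {}

    def register(self, task_id: str, verifier: Callable[[str], bool]) -> None:
        """Add a task together with the external check that decides completion."""
        self._verifiers[task_id] = verifier
        self._status[task_id] = "open"

    def submit(self, task_id: str, answer: str) -> bool:
        """Record the agent's answer; status comes from the verifier, not the agent."""
        if task_id not in self._verifiers:
            raise KeyError(f"unregistered task: {task_id}")
        ok = bool(self._verifiers[task_id](answer))
        self._status[task_id] = "verified" if ok else "rejected"
        return ok

    def all_verified(self) -> bool:
        """True only when every registered task has passed its check."""
        return bool(self._status) and all(
            status == "verified" for status in self._status.values()
        )

    def summary(self) -> Dict[str, int]:
        """Count tasks per status."""
        counts = {"open": 0, "verified": 0, "rejected": 0}
        for status in self._status.values():
            counts[status] += 1
        return counts


if __name__ == "__main__":
    board = VerifiedBoard()
    board.register("reverse-planet", lambda a: a == "planet"[::-1])
    board.register("sum-17-25", lambda a: a == str(17 + 25))
    board.submit("reverse-planet", "tenalp")  # passes the check
    board.submit("sum-17-25", "41")           # agent may say done; the check fails
    print(board.summary(), board.all_verified())
```

A board like this only covers tasks that have a programmatic check. For tasks without one, the open problem named above remains.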

GitHub (run it) | Paper