More and more useful AI work is multi-step and unattended: an agent is handed a list, goes away, and comes back saying it is finished. The systems that orchestrate this increasingly take that "finished" at face value. We asked a narrow question with a wide consequence: when an agent says it completed the work, is that true, and can we fix it by asking the agent to check itself?
A perfect score, every time
Four models in two tiers (cheap and frontier) each ran eight workloads of verifiable micro-tasks, with every answer checked programmatically. Across 896 task instances, the agents self-reported a perfect score on every single run. Verified accuracy ranged from 86 to 96 percent. The false-completion rate, the share of tasks reported "done" whose answers were not in fact correct, is capability-tiered: about 13 percent for the small cheap models, near zero for the frontier ones. The errors the agents certified as done concentrate in character-level tasks (reversing a word, counting letters), while arithmetic is perfect. The model is not lying in any deliberate sense; it cannot see, from the inside, that some of its confident answers are wrong, so it certifies them all.
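To make the metric concrete, here is a minimal sketch of how a false-completion rate could be computed from checked answers. The record layout, the function name false_completion_rate, and the four sample tasks are illustrative assumptions, not the harness or data used in the study.

```python
from typing import Callable, List, Tuple

# Hypothetical record layout (an assumption for illustration):
# (task description, agent's answer, agent claimed "done", programmatic checker)
Record = Tuple[str, str, bool, Callable[[str], bool]]


def false_completion_rate(records: List[Record]) -> float:
    """Share of tasks reported "done" whose answer fails the external check."""
    claimed = [r for r in records if r[2]]
    if not claimed:
        return 0.0
    wrong = sum(1 for _, answer, _, check in claimed if not check(answer))
    return wrong / len(claimed)


# Made-up sample answers: one character-level task is wrong, arithmetic is right.
records: List[Record] = [
    ("reverse 'planet'", "tenalp", True, lambda a: a == "planet"[::-1]),
    ("count 'r' in 'strawberry'", "2", True,
     lambda a: a == str("strawberry".count("r"))),
    ("17 + 25", "42", True, lambda a: a == str(17 + 25)),
    ("reverse 'banana'", "ananab", True, lambda a: a == "banana"[::-1]),
]

self_reported = sum(r[2] for r in records) / len(records)
rate = false_completion_rate(records)
print(f"self-reported: {self_reported:.0%}, false-completion rate: {rate:.0%}")
# prints "self-reported: 100%, false-completion rate: 25%"
```

The point of the sketch is that the two numbers come from different sources: the self-reported score reads only the agent's claim, while the false-completion rate requires a checker the agent does not control.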
Self-checking does not fix it
We then wrapped the identical workload in a managed protocol: register every task, do them one by one with a done marker, then re-check and fix anything missing before reporting. This is the self-verification a model is told to perform. It barely helped. The false-completion rate fell only from 12.7 percent to 11.8 percent. Re-checking with the same model re-applies the same blind spot. A faculty that is wrong and unaware cannot repair itself by being asked to look again.
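The register, do, re-check loop can be sketched with a toy stand-in for the agent. Everything here is hypothetical: toy_model is a deterministic stub with one built-in blind spot (it drops a letter from a doubled pair when reversing a word), and managed_run is an assumed shape for the protocol, not the one used in the study. Real models are stochastic, which is presumably why the measured rate moved slightly rather than not at all.

```python
from typing import Callable, Dict, List, Optional, Tuple

Task = Tuple[str, str]  # (kind, argument), an assumed encoding


def toy_model(task: Task) -> str:
    """Hypothetical agent stub with a systematic character-level blind spot."""
    kind, arg = task
    if kind == "reverse":
        rev = arg[::-1]
        # Blind spot: collapse the first doubled letter it encounters.
        for i in range(len(rev) - 1):
            if rev[i] == rev[i + 1]:
                return rev[:i] + rev[i + 1:]
        return rev
    if kind == "add":
        a, b = arg.split("+")
        return str(int(a) + int(b))
    raise ValueError(f"unknown task kind: {kind}")


def ground_truth(task: Task) -> str:
    """External checker that the agent never sees."""
    kind, arg = task
    if kind == "reverse":
        return arg[::-1]
    a, b = arg.split("+")
    return str(int(a) + int(b))


def managed_run(tasks: List[Task], model: Callable[[Task], str]) -> Dict[Task, Optional[str]]:
    board: Dict[Task, Optional[str]] = {t: None for t in tasks}  # register
    for t in tasks:                                              # do, mark done
        board[t] = model(t)
    for t in tasks:                                              # self re-check
        second = model(t)
        if board[t] is None or second != board[t]:
            board[t] = second
    return board


tasks = [("reverse", "letter"), ("reverse", "planet"), ("add", "17+25")]
board = managed_run(tasks, toy_model)
self_reported = sum(answer is not None for answer in board.values())
verified = sum(board[t] == ground_truth(t) for t in tasks)
print(f"self-reported done: {self_reported}/3, verified correct: {verified}/3")
# prints "self-reported done: 3/3, verified correct: 2/3"
```

The re-check pass finds nothing to fix because it asks the same function the same question: "letter" comes back as "retel" both times, so the board still reports every task as done.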
Self-report is not completion. A model's "I finished" is a prediction by the same process that made the errors, so it inherits them.
An honest null, and the real fix
We also expected the structured protocol to reduce omission or lift accuracy. It did neither: current models do not drop tasks from a 28-item batch, and the deliberate framing did not make the arithmetic more accurate. We report the null plainly. The value of a task board is not that it makes the model smarter. It is that it can verify what the model cannot verify about itself. That is the function of an agent control tower: an external board, a calendar, a memory, and a server that enforces the workflow and, at its frontier, checks that "done" is actually done. The open problem, and the moat, is verified completion: turning "done" from a claim into evidence.
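As a rough illustration of verified completion, the sketch below shows a board that records an agent's claim and an external check as separate facts. The class and method names (VerifiedBoard, register, submit, report) are hypothetical and do not describe any particular product's API; the verifiers are simple programmatic checks of the kind used for the micro-tasks above.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Task:
    description: str
    verifier: Callable[[str], bool]  # external check, independent of the agent
    claimed: bool = False            # the agent said "done"
    verified: bool = False           # the external check agreed
    answer: str = ""


class VerifiedBoard:
    """Hypothetical task board that keeps claims and evidence apart."""

    def __init__(self) -> None:
        self.tasks: Dict[str, Task] = {}

    def register(self, task_id: str, description: str,
                 verifier: Callable[[str], bool]) -> None:
        self.tasks[task_id] = Task(description, verifier)

    def submit(self, task_id: str, answer: str) -> bool:
        """Record the agent's claim, then run the external verifier on it."""
        task = self.tasks[task_id]
        task.claimed = True
        task.answer = answer
        task.verified = task.verifier(answer)
        return task.verified

    def report(self) -> Dict[str, int]:
        return {
            "registered": len(self.tasks),
            "claimed": sum(t.claimed for t in self.tasks.values()),
            "verified": sum(t.verified for t in self.tasks.values()),
        }


board = VerifiedBoard()
board.register("t1", "reverse 'letter'", lambda a: a == "letter"[::-1])
board.register("t2", "count 'r' in 'strawberry'",
               lambda a: a == str("strawberry".count("r")))

board.submit("t1", "retel")  # agent claims done; the verifier rejects it
board.submit("t2", "3")      # agent claims done; the verifier accepts it
summary = board.report()
print(summary)
# prints {'registered': 2, 'claimed': 2, 'verified': 1}
```

Keeping "claimed" and "verified" as separate fields is the design choice that matters: an orchestrator reading this board can act on the verified count instead of the agent's own report.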