AI designGenerative UIDesign systemsAI slopHarness

The AI design look is not taste. It is a finite set of defaults, and we measured it.

IOV LABS taxonomized the AI-generated design look into 27 measurable tells across eight families and built a transparent detector, the Tell Score. Holding a page's content fixed and changing only the tell-bearing choices drops the score from 77 (F) to 0 (A). Two production codebases with their own anti-AI design manifestos independently confirmed the tells and named six more, now a family H (AI self-reference). Ships as a CLI, an MCP server, and a drop-in prompt so any team or agent can audit and prevent it.

You can feel it before you can name it: the indigo-to-violet gradient, Inter on white, a hero followed by three cards each with an emoji, one border-radius on everything, one soft shadow under everything, and a headline that promises you can *build the future of work*. It is the look of a page that a model made, and teams now burn real time and tokens trying to make their AI output not look like AI. The usual explanation is that escaping the look takes taste, a thing the model lacks and cannot be written down.

We think that is half right. The model does lack taste. But the thing to be escaped is not ineffable. The AI look is a *distributional artifact*: the set of choices that are individually unremarkable but collectively over-represented because they sit at the mode of the training data. Indigo is the default because Tailwind's own examples defaulted every button to it years ago, and the model learned that "modern" correlates with purple. Anything finite and recurrent can be written down, weighted, and measured.

77 → 0

Tell Score, same page, tells removed

tells across 8 families

47 pt

gap between AI-default and designed

The look, enumerated

We catalogued the recurring signals into 27 tells across eight families: color, type, layout, spacing, surface, motion, copy, and a field-derived one, AI self-reference (more below). Each tell is defined operationally (what to detect), justified by why a model defaults to it, and paired with the fix a designer would make. The loudest are the obvious ones, the indigo palette and the Inter default, but the most *telling* are the quiet ones: interactive elements with no designed hover, focus, or active states, and a missing focus ring, the residue of microstates and accessibility that were never considered.

A transparent detector reads a single HTML file, resolves both raw CSS and Tailwind utility classes, and reports a Tell Score from 0 to 100, the weighted fraction of the tells that fired. Lower is better: zero reads as authored, high reads as the median of the training set. There is no learned model and no opaque number; every point traces to a named tell and a quoted piece of evidence. Each tell also carries a memorable nickname so the taxonomy is quotable, *The Sparkle Tax* for the reflex of bolting a sparkle icon onto anything "AI", *Lorem Shipsum* for placeholder copy that shipped, *Box-in-a-Box* for a card nested in a card.

ai-default landing

77 · F

slop dashboard

54 · D

slop pricing

47 · D

refined (same page)

0 · A

designed changelog

0 · A

designed pricing

0 · A

Tell Score by page (0–100, lower is better; big bar = strong AI-default signature)

Same page, the tells removed

The headline result is deliberately confound-controlled. We take one landing page and refactor it so that *only the tell-bearing properties change*: the font, the color system, the alignment and grid, the radius hierarchy, the elevation and hairlines, the motion and microstates, and the wording. Same product, same four sections, same information. Nothing else moves. The Tell Score falls from 77 (grade F, textbook slop) to 0 (grade A). Any change is design, not content.

The same landing page before and after removing the tells — Same product, same sections, same information. Left: the AI default (77, F). Right: only the tell-bearing choices changed (0, A).

Crossfade from the AI-default page to the refined page as the score counts down — Removing the tells one family at a time: color, type, layout, surface, motion, copy.

Across a six-page corpus the detector separates AI-default pages (mean 60) from designed pages (mean 0) with no overlap. We are honest about what this shows: the designed pages score zero *by construction*, because they were authored to be tell-free. That demonstrates the fixes are sufficient to zero the score; it is not a claim that real sites in the wild score zero. The load-bearing results are the confound-controlled refactor and the clean separation.

Validated on 202 real sites

A fair objection: maybe the detector just flags everything as AI. So we rendered 202 design-led, human-crafted sites (Stripe, Linear, Toss, Apple, Vercel, Figma, Notion, and 195 more) in headless Chrome, read their computed styles, and learned where real design actually sits. Recalibrated on that data, the 202 sites score at a median of 0, while the AI-default pages still stand far out at 35 to 59. The instrument distinguishes machine-default from human-crafted on real data, not just our own fixtures.

202 real top-tier sites cluster near a Tell Score of 0; AI defaults stand out — 202 real human-crafted sites score at a median of 0; the AI-default pages stand out at 35 to 59.

The data also corrects the folklore. A brand purple is not a tell: Stripe paints 123 purple accents and still scores 0, because a custom face, optical tracking and a radius hierarchy sit next to it. Inter is not a tell either: a quarter of top sites use it, Linear among them, with a real type system. So we score with a craft-credit model, where compensating craft offsets cosmetic defaults. No single signal is the tell. The AI look is the bundle of defaults with nothing decided next to them. The same detector now audits a live URL, not only a code string.

From "don't" to "do": a spec catalog

A detector only ever says what *not* to do. The question from anyone trying to actually build is the positive one: then what numbers should I use? So we read the answer off the same evidence. We rendered 199 of those sites a second time, and this round recorded the concrete CSS they ship per component, in both light and dark, twice per site under each prefers-color-scheme. The result is a measured reference, not a house style.

Measured button radii, type scale, dark backgrounds and accent colors across 199 sites — Measured across 199 design-led sites: button radius splits between soft-round and pill, the type scale lands near 64/48/32/16px, dark backgrounds are tinted near-blacks, and accent hues are fully dispersed.

A few of the numbers. Primary-button corner radius does not converge on one value: it splits between a soft-rounded 8 to 12px cluster and a full pill, with sharp corners a deliberate minority and only 13% of buttons carrying any shadow. The real type scale lands near 64 / 48 / 32px for the heading levels over a 16px body, headlines on a tight line-height with frequent negative tracking. Content containers center on a 1200px median; spacing follows a 4/8px grid, but only about 70% of the time, not the religious adherence a generator assumes. And dark mode has a grammar: the page background is almost never pure black, it is a *tinted* near-black, raised surfaces sit a step lighter rather than a hairline border away, and text is an off-white. The full per-component tables, and a per-site appendix, ship in the repo as reference/COMPONENT-SPECS.md; the harness now carries these targets so its guidance is two-sided.

To prove the targets are buildable and not just describable, we assembled one landing page straight from those medians: a 1200px container, a 64/48/32px serif type scale, 8px buttons with every microstate, an owned amber accent instead of the default indigo, and a dark mode that follows the measured grammar. It is a single self-contained file, and it scores a Tell Score of 0 in both palettes.

A landing page built straight from the catalog medians, shown light and dark, scoring 0 — One self-contained page assembled from the catalog medians (1200px grid, 64/48/32px serif scale, 8px buttons, an owned amber accent). Tell Score 0 in both palettes; the dark side follows the measured grammar, a tinted near-black, not pure black.

The sample page crossfading between light and dark — Same file, two palettes. Dark is #0e1014 with surfaces a step lighter, the grammar the catalog measured.

Korean type is different, in two ways and not a third

The catalog above is mostly Western, and hangul is set differently from Latin. So we ran the same extraction over 48 Korean design-led sites (Toss, Kakao, 당근, 무신사, 29CM, 오늘의집, 배민, 업비트, and more) and compared. The honest result separates a real difference from a folk one.

Korean web type vs the global set: Pretendard dominance, smaller body size, equal line-height — 48 Korean sites vs the global set. The real differences are the font (Pretendard) and a smaller body size (14px vs 16px); line-height is nearly identical.

Two things really change. The font: Pretendard is the Korean web's Inter, the free, well-hinted default, on 44% of these sites as the body face and a hangul sans on 69%. So the type tell translates directly, bare Pretendard with no scale reads machine-default exactly as bare Inter does, and the owned move is a commissioned face (Toss Product Sans, Bithumb Trading Sans, Gmarket Sans, Wanted Sans). And the body size: Korean text sets smaller, a 14px median against 16px globally, because hangul carries more ink per glyph and Korean product culture is denser. The folk difference is line-height: hangul is often said to need far more leading, but the measured median is about 1.5 in both, because Pretendard already ships generous leading. One honest caveat: the Korean h1 median is bimodal, design-led product sites use 56 to 90px heroes like the West while portals and commerce are banner-driven with small or absent display headlines, so the low aggregate is a density culture, not a different idea of a headline.

What two teams had already banned

The sharpest confirmation came from the opposite direction. We had access to two production codebases whose maintainers, with no knowledge of this taxonomy, had already written their own internal design manifestos titled, in effect, *avoid the AI look*, and logged the cleanup with per-instance counts: one patched roughly 600 sub-12px labels to a 12px floor, the other listed every Sparkles icon to remove. Both are private commercial products, so we treat them anonymously as a dark-mode media tool and a productivity assistant.

Two findings matter. First, they independently named the tells the detector already had: the indigo default, the blue-to-purple gradient (banned in so many words as "AI 풍 그라데이션"), emoji-as-icon, the generic font, vague system-voice copy. Practitioners writing only to stop their own products from looking generated arrived at the same list, the strongest evidence the families are real and not an artifact of our framing.

Second, the field saw further. Both led with a register the taxonomy had not covered: the interface announcing itself as AI. We added it as a new family, H, AI self-reference, with three tells, plus three more in existing families:

- AI-cliché iconography, the Sparkles / Wand / Bot / Brain / Cpu icon set bolted onto any "AI" feature. - Labeling the feature "AI" or naming the model, "AI-powered", "AI 분석", an exposed GPT-4 or Claude in the UI. Name the function by what it does; show the model only in settings. - Generate, preview, then insert, the assistant-panel ceremony. Apply the result into the content and let the user undo. - Multi-color pill-badge inflation, a row of status pills each in a different bright hue. - Sub-legible micro-type, scattered 9 to 11px labels; set a 12px floor. - Nested card-in-card chrome, the "double box"; use one outer card with a flat divided list.

Six field-named tells the taxonomy missed, and which of two manifestos named each — Two production codebases with owner-authored anti-AI design manifestos named six tells the taxonomy missed: a new family H (AI self-reference) plus the multi-color pill, micro-type, and nested box.

These six raise the taxonomy to 27 tells. They are detected by the code detector you run on the markup an agent just wrote, where icon imports and label text are visible; the live computed-style audit does not see them yet, which we state as a boundary, not a result.

A harness, not just a verdict

The taxonomy ships as a tool, three ways, all from one source. A command-line linter whose exit code gates CI. An MCP server, so a coding agent can audit the UI it just wrote *before* showing it to you, get the specific fixes, and iterate. And a drop-in prompt module that turns each detected tell into a preventive instruction you can paste into a system prompt, CLAUDE.md, or a v0/Lovable instructions field. The workflow is generate, score, fix, re-score.

Install it in one line

It is on PyPI and ships as a Claude Code plugin. The fastest path is the plugin: run /plugin marketplace add hankimis/ai-design-tells, then /plugin install ai-design-tells@iov-labs, and the MCP server is registered for you. Any other MCP client (Cursor, Claude Desktop) can point at the uvx command instead, and a plain pip install ai-design-tells gives the ai-design-tells command in your terminal. However it is wired, the agent then audits the UI it just wrote before showing it to you, returning each fired tell with its nickname, the evidence, and the fix, and looping until the Tell Score drops. The plugin and uvx paths run the server through uv, so install it once with curl -LsSf https://astral.sh/uv/install.sh | sh (or brew install uv); the plain pip path does not need it.

"It looks like AI" does not mean a machine made it. It means no one in particular made it. The gradient is not ugly; it is unowned. Removing the tells is the act of putting an owner back into the page.

There is a last irony, and it is the reason we keep the score humble. Our companion study, *Convergence Pressure*, found that what homogenizes a population is not AI assistance but the *loop* of everyone consulting the same oracle. A single, widely adopted design score is such an oracle. If every team optimizes the same Tell Score with the same fixes, the escape from the indigo mean becomes a new mean. The right use of the harness is the one we argue for throughout: a detector of the *absence* of a decision, a prompt to make a choice, not a prescription of which choice to make. It should send you toward your own distribution, not toward a new shared one. Code, data, figures, and harness are public.

GitHub (run it)Paper