
We benchmarked how well image models draw Korean. One can't at all.

IOV LABS ran a reproducible benchmark of Korean text rendering across 9 image models on 14 Hangul prompts. Three models rendered every prompt with zero character error; imagen-4 rendered none of them correctly (0 of 14) and produced gibberish each time. The benchmark is open and runs with one command.

Most benchmarks for text inside generated images are written in English, which hides the writing systems where models actually break. IOV LABS measured one of them directly. Nine text-capable image models were each asked to draw 14 Korean phrases on a plain poster, and the rendered text was read back by GPT-4o and scored by character error rate (CER). The result is a clear ranking and one blunt failure.

0 / 14 imagen-4 exact renders

3 models at zero error

9 x 14 models x Hangul prompts

Figure: The same prompts drawn by nine models, cycling through all fourteen. Label shows model and CER.

What was measured

Each model received the same instruction for every prompt: a white poster with the target Hangul as the only text, in black sans-serif lettering. The phrases ranged from a simple greeting to hard cases such as the final-consonant clusters in 값을 and 맑음, the tense consonants in 떡볶이, a full sentence, place names, and digits mixed with Korean. GPT-4o transcribed whatever each model drew, and the transcription was scored against the target by character error rate (CER), with whitespace ignored. A CER of zero means a perfect render.
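The scoring step can be sketched in a few lines. This is a minimal illustration, not the harness's actual code: it assumes CER is the Levenshtein edit distance between target and transcription divided by the target length, counted over whole Hangul syllable blocks after NFC normalization and whitespace removal. The function names `normalize`, `edit_distance` and `cer` are hypothetical, and the real benchmark may normalize differently (for example, by counting individual jamo).

```python
import unicodedata


def normalize(text: str) -> str:
    """Compose Hangul to NFC and drop all whitespace (assumed normalization)."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in composed if not ch.isspace())


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a target character
                curr[j - 1] + 1,     # insert an extra character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(target: str, transcription: str) -> float:
    """Character error rate of a transcription against its target."""
    ref, hyp = normalize(target), normalize(transcription)
    if not ref:
        raise ValueError("target must contain at least one non-space character")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(cer("커피 한 잔", "커피한잔"))  # 0.0, spacing differences are ignored
    print(cer("맑음", "옹반재다"))        # 2.0, four edits against two characters
```

Under this definition the score is not capped at one: a transcription that shares nothing with the target and is also longer than it costs more edits than the target has characters.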

The result

Three models rendered every prompt perfectly: recraft-v4-pro, seedream-5 and nano-banana-pro, all at zero error across 14 prompts. gpt-image-2 and recraft-v4 followed closely. The cheaper and older models slipped on the harder strings, with ideogram-v3 managing only 5 of 14.

Exact Hangul renders out of 14 prompts (higher is better):

recraft-v4-pro: 14/14
nano-banana-pro: 14/14
gpt-image-2: 12/13
flux-2-flash: 9/14
ideogram-v3: 5/14
imagen-4: 0/14
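To make the two headline numbers concrete, here is a rough sketch of how per-prompt scores could be rolled up into "exact renders" and mean CER. It is an assumption about the aggregation, not the harness's code: `leaderboard` and the `scores` layout are hypothetical, and the values below are toy numbers, not benchmark results.

```python
from statistics import mean


def leaderboard(scores: dict[str, list[float]]) -> list[tuple[str, int, int, float]]:
    """Return (model, exact renders, scored cells, mean CER), best model first."""
    rows = []
    for model, cers in scores.items():
        if not cers:
            continue  # nothing scored yet for this model
        exact = sum(1 for c in cers if c == 0.0)  # a CER of zero is an exact render
        rows.append((model, exact, len(cers), mean(cers)))
    # Most exact renders first; ties broken by lower mean CER.
    rows.sort(key=lambda row: (-row[1], row[3]))
    return rows


# Illustrative values only, NOT benchmark results.
toy = {
    "model-a": [0.0, 0.0, 0.5],
    "model-b": [0.0, 0.0, 0.0],
    "model-c": [1.5, 2.0, 1.0],
}

for model, exact, total, avg in leaderboard(toy):
    print(f"{model:10s} {exact}/{total}  mean CER {avg:.2f}")
```

In this sketch the denominator is the number of scored cells, so a model with a failed or missing cell would show a smaller total than the others.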

Figure: The hardest prompt, 닭갈비 맛집, across all nine models.

The blunt failure

The standout is imagen-4, which scored 0 of 14 with a mean CER above one, meaning its transcriptions needed, on average, more character edits than the targets have characters. It did not merely make typos; it produced plausible-looking, Korean-shaped gibberish: 커피 한 잔 came back as 소동석 고려아는 아라해안, and 맑음 as 옹반재다. A model can be excellent at English text and still be unable to write Hangul at all, which is exactly the gap an English-only benchmark would never surface.

Figure: imagen-4 renders Korean text as gibberish. The label is the intended text; the image is what it actually drew.

Strong English text rendering does not transfer to Korean. You only see it if you measure it.

Open and reproducible

The benchmark is public and runs with one command, given a fal.ai key and an OpenAI key. The harness resumes from saved results, so a re-run retries only the failed cells, and the prompt and model lists are easy to extend. It is the first concrete experiment from the lab's research note on verifiable quality for generative media, where Korean text rendering was flagged as an open field.

Figure: The harness prints the full leaderboard in the terminal, reproducible with one command.
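The resume behaviour described above can be sketched as follows. This is a hypothetical illustration rather than the published harness: `run_grid`, the JSON layout, and the `score_cell` callback (standing in for the render, OCR and scoring steps) are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def run_grid(
    models: Iterable[str],
    prompts: Iterable[str],
    score_cell: Callable[[str, str], float],  # hypothetical: render, read back, score
    results_path: str,
) -> dict:
    """Score every (model, prompt) cell, skipping cells that already succeeded."""
    path = Path(results_path)
    results = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    prompts = list(prompts)
    for model in models:
        for prompt in prompts:
            key = f"{model}::{prompt}"
            if results.get(key, {}).get("status") == "ok":
                continue  # scored in an earlier run, do not pay for it again
            try:
                results[key] = {"status": "ok", "cer": score_cell(model, prompt)}
            except Exception as exc:  # e.g. an API timeout; retried on the next run
                results[key] = {"status": "failed", "error": str(exc)}
            # Save after every cell so an interrupted run keeps its progress.
            path.write_text(
                json.dumps(results, ensure_ascii=False, indent=2),
                encoding="utf-8",
            )
    return results
```

Because successful cells are keyed by model and prompt, adding a model or a prompt to the lists in this sketch only triggers work for the new cells.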
