Korean Text Rendering in Text-to-Image Models: A Reproducible Character-Error-Rate Benchmark
Abstract
Benchmarks for text inside generated images are overwhelmingly English, which conceals the writing systems where models actually fail. We measure one directly: nine text-capable text-to-image models each draw fourteen Korean (Hangul) phrases on an identical plain poster, the rendered text is transcribed by a vision-language model (GPT-4o), and the transcription is scored against the target phrase by character error rate (CER). Three models (recraft-v4-pro, seedream-5, and nano-banana-pro) render every prompt perfectly (CER 0.000, 14/14), and the remaining models fall along a clear quality gradient. At the bottom, imagen-4 cannot write Hangul at all: it produces plausible-looking Korean-shaped gibberish on every prompt (0/14, mean CER 1.33), turning 커피 한 잔 into 소동석 고려아는 아라해안. The central finding is that strong English text rendering does not transfer to Korean, and that this failure is invisible to an English-only benchmark. The harness is open, runs with one command, resumes from saved results, and is easy to extend with new prompts and models.
Keywords
- text-to-image generation
- visual text rendering
- Hangul
- Korean
- OCR
- character error rate
- evaluation
- benchmark
- reproducibility
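As a rough illustration of the metric named in the abstract, CER can be sketched as the Levenshtein edit distance between the target phrase and the transcription, divided by the length of the target. This is a minimal sketch under stated assumptions, not the harness's actual scoring code: the function names `levenshtein` and `cer` are hypothetical, and the choices to count spaces as characters and to apply Unicode NFC normalization (so precomposed syllables and decomposed jamo compare equal) are assumptions. Because the denominator is the target length, a transcription much longer than the target yields a CER above 1, which is how a mean CER such as 1.33 can arise.

```python
import unicodedata


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into "" costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length.

    Assumption: both strings are NFC-normalized and whitespace is kept.
    """
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    target = "커피 한 잔"
    print(cer(target, "커피 한 잔"))  # identical strings give a CER of zero
    # The gibberish output quoted in the abstract is longer than the target
    # and shares no syllables with it, so this sketch scores it above 1.
    print(cer(target, "소동석 고려아는 아라해안"))
```

Per-prompt values from this sketch need not match the paper's reported numbers, since the harness may normalize text differently (for example, by stripping whitespace) before scoring.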