
0x-lang: a token benchmark and a verifiable-codegen study

0x source uses about 2.4 times fewer tokens than the React it compiles to, and constrained decoding plus three compiler fixes raised a model's first-try compile rate from one in five to five in five. IOV LABS published the benchmark and study in full.

IOV LABS, a Seoul-based AI research lab, said on May 29 that it has published a benchmark and an accompanying study for 0x-lang, its open-source programming language. The work asks two questions that sit at the center of AI-assisted programming: how compact a language can be when measured the way a model actually consumes it, and whether a model can be made to write that language reliably. The lab reported that 0x source uses about 2.4 times fewer tokens than the React it compiles to, and that constraining the model's output, combined with three fixes to the compiler, raised the first-try compile rate from one in five to five in five.

Key figures: 2.4× fewer tokens than React; first-try compiles up from 1 to 5, out of 5; 7 of 8 on unseen tasks.

Figure: One 0x source compiling to React, Vue and Svelte at once.

Background

0x-lang compiles a single source file into six targets at once: React, Vue 3, Svelte 5, React Native, Express and Terraform. It ships with a language server for editor support and a Model Context Protocol (MCP) server so that AI agents can call the compiler directly. The lab frames the project around a single premise: as more software is written by language models rather than by people, the representations those models read and write deserve to be designed deliberately, not inherited by accident from training data. The study is the lab's attempt to test that premise with numbers rather than assertions.

The token benchmark

The benchmark deliberately measures token counts rather than lines of code. A language model does not read or pay for lines; it reads and is billed by the token, so tokens are the unit that determines both cost and the share of a context window a program consumes. Using a byte-pair-encoding tokenizer of the kind GPT models use, the lab counted tokens across all ten examples in the repository and compared each 0x source against the exact framework code the compiler generates from it, with no hand-tuning and no cherry-picking. Across those examples, 0x source came in at roughly 2.4 times fewer tokens than the generated React, 1.9 times fewer than the Vue, and 1.8 times fewer than the Svelte. The lab characterized the figures as a conservative lower bound and said the entire measurement is reproducible with a single command, npm run benchmark.

Figure: Tokens in 0x vs the framework code it compiles to (higher = more savings). vs React: 2.4×; vs Vue: 1.9×; vs Svelte: 1.8×.
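The comparison behind these figures reduces to a ratio of token counts. The following is a minimal sketch of that arithmetic, not the lab's measurement script. `crude_token_count` is a stand-in regex tokenizer rather than a real byte-pair encoder, the sample strings are invented, and pooling tokens across all examples before dividing is an assumption about how the per-framework figure is aggregated.

```python
import re
from typing import Callable, Iterable, Tuple


def crude_token_count(text: str) -> int:
    """Stand-in tokenizer (assumption): counts words and single punctuation marks.

    A real run would swap in a byte-pair-encoding tokenizer here.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))


def token_ratio(
    pairs: Iterable[Tuple[str, str]],
    count: Callable[[str], int] = crude_token_count,
) -> float:
    """Return generated-code tokens divided by source tokens, pooled over all pairs."""
    source_total = 0
    generated_total = 0
    for source, generated in pairs:
        source_total += count(source)
        generated_total += count(generated)
    if source_total == 0:
        raise ValueError("source side has no tokens")
    return generated_total / source_total


# Invented placeholder strings, not real 0x or React code.
examples = [
    ("a b c", "a b c d e f"),
    ("x y", "x y z w"),
]
ratio = token_ratio(examples)
print(f"{ratio:.1f}x fewer tokens in the source")
```

Under this definition, a ratio of 2.4 means the generated framework code contains 2.4 tokens for every token of source.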

Whether a model can write it

A compact representation is only useful if a model can produce it correctly, so the study tested first-try generation with gpt-4o. Given a task description and asked to return compiling 0x, the model succeeded on only one of five tasks under a plain prompt, and every failure was a syntax error rather than a logical one. The report attributes this to a simple fact: the model has never seen 0x in its training data. For any language outside the training distribution, the study argues, correctness depends far less on how compact the language is than on how tightly the generation process can be constrained. Compactness, in other words, is worthless without a way to keep the model on the rails.
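The first-try metric can be stated precisely: one generation attempt per task, no retries, scored by whether the compiler accepts the output. A minimal sketch follows. `generate` and `compiles` are hypothetical callables standing in for the model call and the 0x compiler, and the stub implementations exist only to make the example run.

```python
from typing import Callable, Sequence, Tuple


def first_try_compile_rate(
    tasks: Sequence[str],
    generate: Callable[[str], str],
    compiles: Callable[[str], bool],
) -> Tuple[int, int]:
    """Generate exactly once per task and count how many outputs compile."""
    passed = sum(1 for task in tasks if compiles(generate(task)))
    return passed, len(tasks)


# Stubs (assumptions): a fake model that echoes the task, and a fake compiler
# that accepts only strings starting with "ok".
def fake_generate(task: str) -> str:
    return task


def fake_compiles(source: str) -> bool:
    return source.startswith("ok")


tasks = ["ok counter", "bad todo", "bad form", "bad chat", "bad shop"]
passed, total = first_try_compile_rate(tasks, fake_generate, fake_compiles)
print(f"{passed} of {total} compiled on the first try")
```

Because the check is binary, the same harness can score a plain prompt and a constrained pipeline on identical tasks simply by swapping the `generate` callable.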

From prompt to structure

The instinct in such cases is to write a longer prompt with more examples, but the lab found that this helps only marginally and fails often. Its alternative is structural. Rather than asking the model to emit 0x text directly, the lab constrains it to emit a JSON abstract syntax tree whose shape is guaranteed by a schema, using structured-output decoding, and then renders canonical 0x from that tree with the compiler itself. The model never has to remember the surface syntax of an unfamiliar language; it only has to express intent inside a structure it cannot malform. The grammar stops being a memory test and becomes a guardrail.
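The shape of that pipeline can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the node fields, the `validate` and `render` helpers, and the rendered surface syntax are invented and are not the real 0x AST, schema, or language. In a hosted setup the structured-output feature would enforce the schema at decode time; the `validate` function below is a post-hoc check standing in for that guarantee.

```python
import json
from typing import Any, Dict

# Hypothetical node shape for illustration only; not the real 0x AST or schema.
REQUIRED = {"kind": str, "name": str, "state": list}


def validate(node: Dict[str, Any]) -> None:
    """Reject any tree that does not match the expected shape."""
    for key, expected in REQUIRED.items():
        if not isinstance(node.get(key), expected):
            raise ValueError(f"field {key!r} missing or not {expected.__name__}")
    if node["kind"] != "component":
        raise ValueError("unknown node kind")
    for entry in node["state"]:
        if not isinstance(entry, dict) or not isinstance(entry.get("name"), str):
            raise ValueError("each state entry needs a string name")
        if not isinstance(entry.get("init"), (int, str, bool)):
            raise ValueError("state init must be a simple literal")


def render(node: Dict[str, Any]) -> str:
    """Print the tree in one canonical toy syntax (invented here, not real 0x)."""
    lines = [f"component {node['name']}"]
    for entry in node["state"]:
        lines.append(f"  state {entry['name']} = {json.dumps(entry['init'])}")
    return "\n".join(lines)


# Stand-in for what a schema-constrained model call would return.
model_output = (
    '{"kind": "component", "name": "Counter", '
    '"state": [{"name": "count", "init": 0}]}'
)
tree = json.loads(model_output)
validate(tree)
print(render(tree))
```

The design point is that the model only fills in fields; spacing, keywords, and punctuation are produced by the renderer, so they cannot be misremembered.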

Three compiler fixes

Constraining the output carried the model most of the way, but it also exposed real gaps in the compiler's expression grammar that no prompt would have revealed. Closing them required three concrete changes. Native spread syntax was desugared in the parser so that idioms a model naturally produces would compile. Strict-equality operators were normalized in the tokenizer so that the two ways of writing equality resolve to one. And two lexing bugs that mishandled specific token boundaries were repaired. None of the three is glamorous, and together they are the difference between "almost compiles" and "compiles." With the constrained decoding and the fixes in place, the first-try compile rate rose to five of five, and held at seven of eight on a separate, previously unseen set of tasks. All 303 of the project's existing tests continue to pass.
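The second of those fixes, normalizing equality in the tokenizer, is easy to illustrate in isolation. This is a toy lexer written for this article under stated assumptions, not the 0x tokenizer: the token names are hypothetical, and the choice of `==` as the canonical spelling is arbitrary, since the report does not say which spelling the compiler settles on.

```python
import re
from typing import List, Tuple

# Longest operators first so "===" is not split into "==" and "=".
TOKEN_RE = re.compile(r"\s*(===|==|=|\w+|[^\w\s])")


def tokenize(source: str) -> List[Tuple[str, str]]:
    """Toy lexer: both equality spellings become the same EQ token."""
    tokens: List[Tuple[str, str]] = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            break  # only trailing whitespace remains
        text = match.group(1)
        if text in ("==", "==="):
            tokens.append(("EQ", "=="))  # canonical spelling chosen arbitrarily
        elif text == "=":
            tokens.append(("ASSIGN", text))
        elif text[0].isalnum() or text[0] == "_":
            tokens.append(("WORD", text))
        else:
            tokens.append(("PUNCT", text))
        pos = match.end()
    return tokens


# Both spellings produce an identical token stream.
print(tokenize("a === b") == tokenize("a == b"))
```

Normalizing at the lexer means every later stage sees a single equality token, so the parser and code generators need no special case for whichever spelling the model happens to emit.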

A fast compiler is a checker the model can be measured against: code either compiles or it does not.

Limitations and next steps

The lab was explicit about what the study does not yet establish. The evaluation set is small, at eight to ten tasks, and a single headline number should not be over-read until that set is larger. The expression grammar still has corners the schema does not fully cover, which means some valid programs remain harder to generate than they should be. The lab also wants to compile the grammar to GBNF so that local, open-weight models can be driven under the same guarantees as a hosted model, rather than relying on a single provider's structured-output feature. Each of these is named in the report as open work rather than a solved problem.
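To make the GBNF item concrete: GBNF is the plain-text grammar format used by llama.cpp to constrain sampling, with rules written as `name ::= alternatives`. The sketch below emits that format from a small in-memory rule table. The rule table, the `Symbol` encoding, and the toy equality grammar are all assumptions invented for illustration; this is neither the 0x grammar nor the lab's planned implementation.

```python
from typing import Dict, List, Tuple

# ("lit", text), ("ref", rule_name) or ("raw", gbnf_fragment)
Symbol = Tuple[str, str]


def emit_symbol(symbol: Symbol) -> str:
    """Render one grammar symbol, quoting and escaping literals."""
    kind, value = symbol
    if kind == "lit":
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    if kind in ("ref", "raw"):
        return value
    raise ValueError(f"unknown symbol kind: {kind}")


def to_gbnf(rules: Dict[str, List[List[Symbol]]]) -> str:
    """Emit one 'name ::= alt | alt' line per rule."""
    lines = []
    for name, alternatives in rules.items():
        rendered = [" ".join(emit_symbol(s) for s in alt) for alt in alternatives]
        lines.append(f"{name} ::= " + " | ".join(rendered))
    return "\n".join(lines)


# Toy grammar (assumption): a single equality expression, not the 0x grammar.
toy_rules: Dict[str, List[List[Symbol]]] = {
    "root": [[("ref", "expr")]],
    "expr": [[("ref", "ident"), ("lit", " == "), ("ref", "ident")]],
    "ident": [[("raw", "[a-z]+")]],
}
print(to_gbnf(toy_rules))
```

A grammar in this form would let a locally run model be constrained at the sampling level, which is the same role the hosted structured-output feature plays in the study.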

What the lab says

Han Kim, the founder of IOV LABS, said the result reframes what a compact language is for. "The interesting finding was not that 0x is small," he said. "It is that a small, well-defined language is one a compiler can check, and that turns the compiler into a verifier the model can be measured against." He added that the lab deliberately published the failures alongside the successes, including the one-in-five starting point and the retired line-based claim, because "a benchmark you cannot reproduce is marketing, and the only number worth reporting is the one someone else can re-run."

The benchmark and the full study, including the evaluation tasks and the measurement script, are open in the project's repository. The lab said it would rather readers re-run the numbers than take its word for them.

Read the study (EVAL.md) | Benchmark (REPORT.md) | GitHub