A benchmark for evaluating LLM code generation in Vera, a programming language designed for large language models (LLMs) to write.
Results from VeraBench v0.0.7 against Vera v0.0.108 across 50 problems, 6 models, and 4 modes per model.
Flagship tier:
| Model | Vera | Python | TypeScript |
|---|---|---|---|
| Kimi K2.5 | 100% | 86% | 91% |
| GPT-4.1 | 91% | 96% | 96% |
| Claude Opus 4 | 88% | 96% | 96% |
Sonnet tier:
| Model | Vera | Python | TypeScript |
|---|---|---|---|
| Kimi K2 Turbo | 83% | 88% | 79% |
| Claude Sonnet 4 | 79% | 96% | 88% |
| GPT-4o | 78% | 93% | 83% |
Kimi K2.5 writes perfect Vera code — 100% run_correct on both full-spec and spec-from-NL modes, beating Python (86%) and TypeScript (91%). This is the first model where Vera is the best language across the board.
Three models have beaten TypeScript on Vera: Kimi K2.5 (+9pp) and Kimi K2 Turbo (+4pp) in this run, and Claude Sonnet 4 in our initial v0.0.4 benchmark (83% vs 79%). The pattern is consistent across providers: Vera's mandatory contracts and typed slot references provide enough structure to compensate for zero training data.
Python remains the strongest target for most models. Claude, OpenAI, and Moonshot all hit 86–96% run_correct on Python. The gap between Python and Vera varies from 0pp (Kimi K2.5 spec-from-NL: both 100%) to 17pp (Claude Sonnet 4: 96% vs 79%).
Spec-from-NL is the harder test. When models must infer their own contracts from natural language, performance drops for most models — GPT-4.1 falls from 91% to 50%. But Kimi K2.5 holds at 100%, suggesting it has internalised Vera's type system well enough to author specifications from scratch.
Note: These are single-run results. LLM outputs are non-deterministic — individual problems can flip between pass and fail across runs. The v0.0.4 Claude Sonnet 4 result (83% Vera, 79% TypeScript) shifted to 79%/88% in the v0.0.7 re-run, illustrating this variance. Stable rates will require pass@k evaluation with multiple trials. This is early days — 50 problems, one run per model.
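Once multiple trials per problem are collected, the standard unbiased pass@k estimator (introduced with HumanEval by Chen et al., 2021) could be used to stabilise these rates. A minimal sketch — not code from VeraBench itself:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a problem
    c: number of those samples that passed
    k: evaluation budget
    Returns the probability that at least one of k samples passes.
    """
    if n - c < k:
        # Fewer failures than the budget: at least one pass is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 5 pass, `pass_at_k(10, 5, 1)` gives 0.5 — the single-run rates reported above are effectively noisy pass@1 estimates with n = 1.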
No LLM has ever been trained on Vera. There are no Vera examples on GitHub, no Stack Overflow answers, no tutorials — the language was created after these models' training cutoffs. Every token of Vera code in these results was written by a model that learned the language entirely from a single document (SKILL.md) provided in the prompt at evaluation time.
Python and TypeScript, by contrast, are among the most heavily represented languages in LLM training data — billions of lines of code, documentation, and Q&A. The fact that multiple models write better Vera than TypeScript despite this asymmetry suggests that language design matters more than training data volume. Vera's mandatory contracts, typed slot references, and explicit effect annotations give models enough structural guardrails that in-context instruction alone is sufficient — no pre-training required.
VeraBench measures whether LLMs write better code in a language designed for them. Vera uses typed slot references instead of variable names, mandatory contracts, and explicit algebraic effects — all features that should make LLM-generated code more verifiable.
The benchmark covers five difficulty tiers:
| Tier | Focus | What it tests |
|---|---|---|
| 1 | Pure arithmetic | Basic syntax, @T.n slot references, simple contracts |
| 2 | String & array ops | Built-in function discovery (domain_verb naming) |
| 3 | ADTs & match | Data type definition, De Bruijn indices in match arms |
| 4 | Recursion & termination | decreases clauses, Z3 verification |
| 5 | Multi-function & effects | IO, State, Exn, effect propagation across functions |
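To make the tiers concrete, here is a hypothetical Tier 1-style task sketched in the Python baseline (illustrative only — not an actual problem from the suite). The contrast it illustrates: what Python can only express as a runtime assertion, Vera requires as a mandatory contract checked statically with Z3.

```python
# Hypothetical Tier 1-style problem (not from the suite):
# "Return the sum of the squares of two non-negative integers."
def sum_of_squares(a: int, b: int) -> int:
    # In the Python baseline the precondition is at best a runtime assertion;
    # in Vera it would be a mandatory contract verified by `vera verify`.
    assert a >= 0 and b >= 0, "precondition: non-negative inputs"
    return a * a + b * b
```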
For each problem, we measure:
- check@1 — Does the code pass `vera check` on the first attempt?
- verify@1 — Does it pass `vera verify` (Z3 contract verification)?
- fix@1 — Given the error message, can the model fix it in one turn?
- run_correct — Does execution produce the correct output?
The same problems are also run in Python, TypeScript, and Aver as baselines. Aver is a Haskell-inspired language with zero LLM training data, providing a second data point alongside Vera for the zero-training-data thesis.
Cross-language comparison: For cross-language headline rates, use the T1–T4 aggregate. Tier 5 tests Vera's algebraic effect handlers, which other languages solve with fundamentally different native idioms. See #50.
- Python 3.11+
- Git
- Node.js 22+ (optional, for TypeScript baselines and generation)
- Aver (optional, for Aver baselines and generation)
```bash
git clone https://github.com/aallan/vera-bench.git
cd vera-bench
python -m venv .venv
source .venv/bin/activate
pip install -e ".[llm]"
```

The `[llm]` extra installs the Anthropic and OpenAI SDKs. Use `pip install -e .` if you only need validation (no model evaluation).
The `vera` command must be available on `$PATH`. Install it into the same environment, either from a local clone:

```bash
pip install -e /path/to/vera
```

or directly from GitHub:

```bash
pip install git+https://github.com/aallan/vera.git
```

Afterwards you should be able to print the Vera version from the terminal:

```bash
vera version
```

This should report v0.0.108 or later.
Once Vera is installed you can run the benchmark from the terminal:

```bash
# Validate all 60 problems and canonical solutions
vera-bench validate

# Run the benchmark against a model
export ANTHROPIC_API_KEY=sk-ant-...
vera-bench run --model claude-sonnet-4-20250514

# Run a single tier
vera-bench run --model claude-sonnet-4-20250514 --tier 1

# Run a single problem
vera-bench run --model claude-sonnet-4-20250514 --problem VB-T1-001

# Spec-from-NL mode (agent writes its own contracts)
vera-bench run --model claude-sonnet-4-20250514 --mode spec-from-nl

# Ask the same model to write Python, TypeScript, or Aver for comparison
vera-bench run --model claude-sonnet-4-20250514 --language python
vera-bench run --model claude-sonnet-4-20250514 --language typescript
vera-bench run --model claude-sonnet-4-20250514 --language aver

# Run canonical baselines as a reference
vera-bench baselines
vera-bench baselines --language typescript
vera-bench baselines --language aver

# Generate a combined report
vera-bench report results/

# Or run the full benchmark suite (all 8 targets) with one command
python scripts/run_full_benchmark.py
```

Supported providers: Anthropic (Claude), OpenAI (GPT), and Kimi (Moonshot). Set the appropriate API key environment variable (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, or `MOONSHOT_API_KEY`).
The Vera language reference (SKILL.md) is fetched automatically from veralang.dev when running Vera benchmarks. To use a local copy instead (e.g., for testing unreleased language features):

```bash
vera-bench run --model claude-sonnet-4-20250514 --skill-md /path/to/SKILL.md
```

Running `vera-bench report results/` generates `results/summary.md` with a summary table, per-tier breakdowns, and per-problem detail. Each `vera-bench run` writes incremental JSONL results (one line per problem attempt), so partial runs are resumable and always reportable. Results files are in `.gitignore` — they are generated artifacts, not checked in.
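Because results are plain JSONL, they are easy to post-process outside the built-in reporter. A sketch of per-tier aggregation, assuming a hypothetical line schema like `{"problem": "VB-T1-001", "tier": 1, "run_correct": true, ...}` — the real field names may differ:

```python
import json
from collections import defaultdict

def tier_pass_rates(path: str, metric: str = "run_correct") -> dict:
    """Compute per-tier pass rates from a VeraBench-style JSONL results file.

    Assumes each line is a JSON object with at least a "tier" number and a
    boolean field named by `metric` (hypothetical schema, for illustration).
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines from interrupted runs
            rec = json.loads(line)
            totals[rec["tier"]] += 1
            passed[rec["tier"]] += bool(rec[metric])
    return {t: passed[t] / totals[t] for t in sorted(totals)}
```

The same loop works for any of the four metrics by passing e.g. `metric="verify@1"`, provided the results lines carry that field.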
VeraBench is inspired by:
- HumanEval — 164 Python function completion problems
- MBPP — 974 Python problems from natural language
- DafnyBench — 782 Dafny verification annotation problems
DafnyBench demonstrated that tracking verification success rates over time attracts genuine research attention — success rates went from 68% to 96% across model generations in under two years. VeraBench aims to create the same longitudinal story for a language designed from scratch for LLM code generation.
```bibtex
@software{verabench2026,
  author = {Allan, Alasdair},
  title = {VeraBench: a benchmark suite for LLM code generation in Vera},
  year = {2026},
  url = {https://github.com/aallan/vera-bench}
}
```

VeraBench is licensed under the MIT License.
Copyright © 2026 Alasdair Allan
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

