This file is the practical playbook for agents working in this repo.
- Purpose: benchmark how coding models solve React Native tasks.
- Primary engine: `runner/` orchestrates discovery -> solve -> LLM judge -> summary.
- Dataset: evals under `evals/<category>/<eval-id>/` (laid out in the sketch below), typically with:
  - `prompt.md`
  - `requirements.yaml`
  - `app/` (baseline input)
  - `reference/` (reference output for judge context)
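A typical eval directory might look like this; the category and eval id are placeholders, and the inline notes just summarize the conventions described elsewhere in this file:

```
evals/
  <category>/
    <eval-id>/
      prompt.md           # implementation ask for the solver
      requirements.yaml   # judgeable requirements; required for the runner to discover the eval
      app/                # minimal baseline input
      reference/          # reference output used as judge context
```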
Benchmark execution uses two CLIs:
- `bun runner/run.ts` discovers evals and generates artifacts under the configured output directory.
- `bun runner/judge.ts` reads generated artifacts, runs LLM judging, and writes results under `results/<run-id>/`.
Generation details (`runner/solver/pipeline.ts`):
- `--model` is required and is always used for generation.
Judge details (`runner/evaluators/llm/run.ts`):
- `--model` is required, and the LLM judge always runs against generated artifacts.
Key output behavior:
- Per-eval results: `results/<run-id>/evals/<eval-id>.json`
- Summary: `results/<run-id>/summary.json` (see the output tree below)
- `--debug` adds judge prompt/output artifacts.
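Put together, a run's output directory looks roughly like this (run id and eval id are placeholders):

```
results/
  <run-id>/
    evals/
      <eval-id>.json   # per-eval judge result
    summary.json       # run-level summary
```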
Apply these before implementation:
- State assumptions explicitly. If uncertain, ask.
- If multiple interpretations exist, present them and ask, or choose one with justification; do not pick silently.
- If a simpler approach works, call it out, use it unless there is a clear reason not to, and say why.
- If something is unclear, stop and surface the exact ambiguity.
- Implement only what the task asks for.
- No speculative abstractions for single-use code.
- No drive-by refactors.
- Match surrounding style.
- Remove only dead code created by your own change (imports/variables your edits made unused); do not clean pre-existing dead code unless requested.
- If you notice unrelated issues, mention them but do not fix them unless asked.
- Every changed line must map directly to the request.
- Prefer minimal diff over broad rewrites.
- Keep edits local:
  - runner change -> touch `runner/**` (+ whitepaper if framework behavior changes)
  - eval content change -> touch only that eval directory
  - taxonomy/methodology/scoring change -> update `paper/benchmark-methodology-whitepaper.tex` and `docs/**` in the same PR
Pick the smallest command set that proves the change:
- Repo lint: `bun lint`
- Runner smoke run: `bun runner/run.ts --pattern "evals/<category>/<eval-id>/**" --model <solver-model> --output /tmp/evals-generated && bun runner/judge.ts --pattern "evals/<category>/<eval-id>/**" --model <judge-model> --input /tmp/evals-generated --debug --fail-fast`
- Full run (expensive): `bun runner/run.ts --model <solver-model> --output /tmp/evals-generated && bun runner/judge.ts --model <judge-model> --input /tmp/evals-generated`
- Unit tests (when runner logic changes): `bun test runner`
For bug fixes, prefer:
- Reproduce with a focused test or focused runner command.
- Implement fix.
- Re-run the same check to verify.
For multi-step tasks, include a short step plan with a verification checkpoint per step.
- `docs/testing-your-evals.md` and `docs/benchmarking-selected-models.md` are currently placeholders (TBD). Do not assume they contain workflow details.
- `testbench/` is currently not wired into the active runner pipeline (see `testbench/README.md`).
- Root linting ignores `evals/**`; benchmark scoring is requirement-judge based and does not run extra code-quality gates on eval outputs.
- Eval discovery depends on `requirements.yaml`; if that file is missing, the eval is invisible to the runner.
- `requirements.yaml` runtime validation currently enforces only the following (sketched below):
  - `version`
  - `requirements[]` with `id`, `description`, optional `weight`
  - extra fields (for example `inputs`) are ignored by the current runner parser
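Given only those enforced fields, a minimal `requirements.yaml` could look like this sketch; the ids, descriptions, and weight are hypothetical, and the `version` value is only illustrative:

```yaml
version: 1                 # illustrative value; only the field's presence is enforced per the rules above
requirements:
  - id: shows-empty-state
    description: The list screen renders an empty-state message when there are no items.
    weight: 2              # optional
  - id: persists-input
    description: Text entered in the input field survives an app reload.
```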
When adding or updating an eval:
- Keep it self-contained in one eval folder.
- Keep prompt forward-looking (implementation ask, not bug report narrative).
- Keep requirements atomic, concrete, and judgeable from files.
- Keep `app/` a minimal baseline and `reference/` aligned with the requirements.
- Run a targeted pattern command for that eval before broader runs.
Use this order when deciding intent:
1. `paper/benchmark-methodology-whitepaper.tex` (methodology source of truth)
2. `runner/**` source code (actual behavior)
3. Category research docs under `evals/<category>/README.md`
4. `docs/**`
5. Root `README.md`
The whitepaper is the authoritative specification for benchmark methodology, scoring, pipeline stages, and eval conventions. If runner behavior and the whitepaper disagree, the whitepaper defines intended behavior.
Mandatory sync rule: any PR that changes eval framework behavior (pipeline stages, scoring logic, requirement parsing, judge methodology, artifact schema, or authoring conventions) must include a corresponding update to `paper/benchmark-methodology-whitepaper.tex` in the same PR. Do not merge framework changes without updating the whitepaper.
- Package manager: `bun`
- Formatting: single quotes, no semicolons
- Imports: sorted (`simple-import-sort`)
- Keep TypeScript and linting changes compatible with existing configs
- Conventional Commits: `type(scope): description`
- Allowed types: `feat`, `fix`, `refactor`, `chore`, `docs`
- Lowercase, imperative, no trailing period
- Example: `fix(runner): handle missing judge rows as failed requirements`