This file is the practical playbook for agents working in this repo.
- Purpose: benchmark how coding models solve React Native tasks.
- Primary engine: `runner/` orchestrates discovery -> solve -> LLM judge -> summary.
- Dataset: evals under `evals/<category>/<eval-id>/` (laid out in the sketch below), typically with:
  - `prompt.md`
  - `requirements.yaml`
  - `app/` (baseline input)
  - `reference/` (reference output for judge context)
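A typical eval directory might look like this; the category and eval id are placeholders, and the inline notes just summarize the conventions described elsewhere in this file:

```
evals/
  <category>/
    <eval-id>/
      prompt.md           # implementation ask for the solver
      requirements.yaml   # judgeable requirements; required for the runner to discover the eval
      app/                # minimal baseline input
      reference/          # reference output used as judge context
```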
Benchmark execution uses two CLIs:
- `bun runner/run.ts` discovers evals and generates artifacts under the configured output directory.
- `bun runner/judge.ts` reads generated artifacts, runs LLM judging, and writes results under `results/<run-id>/`.
Generation details (`runner/solver/pipeline.ts`):
- `--model` is required and is always used for generation.
Judge details (`runner/evaluators/llm/run.ts`):
- `--model` is required, and the LLM judge always runs against generated artifacts.
Key output behavior:
- Per-eval results: `results/<run-id>/evals/<eval-id>.json`
- Summary: `results/<run-id>/summary.json` (see the output tree below)
- `--debug` adds judge prompt/output artifacts.
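Put together, a run's output directory looks roughly like this (run id and eval id are placeholders):

```
results/
  <run-id>/
    evals/
      <eval-id>.json   # per-eval judge result
    summary.json       # run-level summary
```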
Apply these before implementation:
- State assumptions explicitly. If uncertain, ask.
- If multiple interpretations exist, present them and ask, or choose one with justification; do not pick silently.
- If a simpler approach works, call it out, use it unless there is a clear reason not to, and say why.
- If something is unclear, stop and surface the exact ambiguity.
- Implement only what the task asks for.
- No speculative abstractions for single-use code.
- No drive-by refactors.
- Match surrounding style.
- Remove only dead code created by your own change (imports/variables your edits made unused); do not clean pre-existing dead code unless requested.
- If you notice unrelated issues, mention them but do not fix them unless asked.
- Every changed line must map directly to the request.
- Prefer minimal diff over broad rewrites.
- Keep edits local:
  - runner change -> touch `runner/**` (+ whitepaper if framework behavior changes)
  - eval content change -> touch only that eval directory
  - taxonomy/methodology/scoring change -> update `paper/benchmark-methodology-whitepaper.tex` and `docs/**` in the same PR
Pick the smallest command set that proves the change:
- Repo lint: `bun lint`
- Runner smoke run: `bun runner/run.ts --pattern "evals/<category>/<eval-id>/**" --model <solver-model> --output /tmp/evals-generated && bun runner/judge.ts --pattern "evals/<category>/<eval-id>/**" --model <judge-model> --input /tmp/evals-generated --debug --fail-fast`
- Full run (expensive): `bun runner/run.ts --model <solver-model> --output /tmp/evals-generated && bun runner/judge.ts --model <judge-model> --input /tmp/evals-generated`
- Unit tests (when runner logic changes): `bun test runner`
For bug fixes, prefer:
- Reproduce with a focused test or focused runner command.
- Implement fix.
- Re-run the same check to verify.
For multi-step tasks, include a short step plan with a verification checkpoint per step.
- `docs/testing-your-evals.md` and `docs/benchmarking-selected-models.md` are currently placeholders (TBD). Do not assume they contain workflow details.
- `testbench/` is currently not wired into the active runner pipeline (see `testbench/README.md`).
- Root linting ignores `evals/**`; benchmark scoring is requirement-judge based and does not run extra code-quality gates on eval outputs.
- Eval discovery depends on `requirements.yaml`; if that file is missing, the eval is invisible to the runner.
- `requirements.yaml` runtime validation currently enforces only the following (sketched below):
  - `version`
  - `requirements[]` with `id`, `description`, optional `weight`
  - extra fields (for example `inputs`) are ignored by the current runner parser
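Given only those enforced fields, a minimal `requirements.yaml` could look like this sketch; the ids, descriptions, and weight are hypothetical, and the `version` value is only illustrative:

```yaml
version: 1                 # illustrative value; only the field's presence is enforced per the rules above
requirements:
  - id: shows-empty-state
    description: The list screen renders an empty-state message when there are no items.
    weight: 2              # optional
  - id: persists-input
    description: Text entered in the input field survives an app reload.
```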
When adding or updating an eval:
- Keep it self-contained in one eval folder.
- Keep prompt forward-looking (implementation ask, not bug report narrative).
- Keep requirements atomic, concrete, and judgeable from files.
- Keep `app/` a minimal baseline and `reference/` aligned with the requirements.
- Run a targeted pattern command for that eval before broader runs.
Use this order when deciding intent:
1. `paper/benchmark-methodology-whitepaper.tex` (methodology source of truth)
2. `runner/**` source code (actual behavior)
3. Category research docs under `evals/<category>/README.md`
4. `docs/**`
5. Root `README.md`
The whitepaper is the authoritative specification for benchmark methodology, scoring, pipeline stages, and eval conventions. If runner behavior and the whitepaper disagree, the whitepaper defines intended behavior.
Mandatory sync rule: any PR that changes eval framework behavior (pipeline stages, scoring logic, requirement parsing, judge methodology, artifact schema, or authoring conventions) must include a corresponding update to `paper/benchmark-methodology-whitepaper.tex` in the same PR. Do not merge framework changes without updating the whitepaper.
- Package manager: `bun`
- Formatting: single quotes, no semicolons
- Imports: sorted (`simple-import-sort`)
- Keep TypeScript and linting changes compatible with existing configs
- Conventional Commits: `type(scope): description`
- Allowed types: `feat`, `fix`, `refactor`, `chore`, `docs`
- Lowercase, imperative, no trailing period
- Example: `fix(runner): handle missing judge rows as failed requirements`