Browser Agent Driver (bad CLI) — general-purpose agentic browser automation.
Required before merge:
- `pnpm lint` — type-check
- `pnpm check:boundaries` — architecture boundaries
- `pnpm test` — unit + integration (549 tests)
- Tier 1 deterministic gate on PRs and main
- Tier 2 staging gate when secrets available
Every PR with user-visible changes must include a changeset. Releases are fully automated via Changesets + OIDC publish; without a changeset file the post-merge workflow sees nothing to version and nothing publishes. Do this before opening the PR:
```
pnpm changeset           # pick patch / minor / major, write the summary
git add .changeset/*.md
```

Bump rules: patch for bug fixes, minor for additive features, major for breaking API/CLI changes. Changeset summaries land verbatim in CHANGELOG.md and must carry the same multi-rep numbers as the PR body (see Measurement Rigor #7). Chores, docs-only tweaks, internal refactors with no consumer impact don't need one.
Flow: PR merges to main → .github/workflows/changesets.yml opens a "Release: version packages" PR bumping package.json + CHANGELOG.md → merging that PR publishes to npm via OIDC.
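For reference, a changeset is a small markdown file with YAML frontmatter. The package name below is an assumption (use whatever name the repo publishes under); the summary line follows the required mean (min–max), N reps format:

```markdown
---
"bad": patch
---

Fix consent-dialog dismissal race before form submission. Wall-time 50s (35–75s), 3 reps, vs baseline 53s: comparable.
```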
No AI attribution in commits or PRs. Never add Co-Authored-By: Claude ... trailers to commit messages. Never add 🤖 Generated with [Claude Code] footers (or any equivalent AI-generated-by line) to PR bodies, issue comments, changesets, or commit messages. Commits and PRs should read as the author's own authored work. This applies to every commit and PR made in this repo, without exception.
Goal: reliable, performant completion of real user outcomes — for both persona-driven workflows and direct task inputs.
Non-goals: over-specializing for a single app; features without measurable completion gains.
- Model: `gpt-5.4`. Single-model unless `--model-adaptive` is set.
- Memory: ON by default. Disable with `--no-memory`.
- Wallet mode: only when `wallet.enabled=true` or extension paths provided.
- Evidence: `fast-explore` for iteration, `full-evidence` for release signoff.
- General-purpose first: Tangle personas/hints are optional, never required.
Adaptive routing (`--model-adaptive`) and trace scoring (`--trace-scoring`) stay flagged until non-regressive vs control.
- Tier 1 (deterministic/local): 100% required.
- Tier 2 (staging/auth): push to 100% through bug closure.
- Tier 3 (open web): tracked separately; must not regress Tier 1/2.
Track: pass rate, median duration, token usage, artifact completeness. Promotion: no pass-rate regression + meaningful latency/token improvement.
- One variable at a time. Treat each hypothesis as a challenger spec.
- Fast-explore sweeps first for broad testing. Full-evidence only for shortlisted winners.
- Seeded AB (`ab:experiment --seed <fixed>`) for reproducibility.
- Promote only when the bootstrap CI lower bound is positive and Tier 1/2 gates hold.
- Memory isolation per run during benchmarks.
- Stop early on unresolved provider quota/auth issues.
- Parallelize repetitions within one experiment. One clean experiment at a time for promotion.
- Keep pushing autonomously until baseline improves, challenger is rejected, or user input is needed.
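The bootstrap-CI promotion rule above can be sketched as a percentile bootstrap on the mean difference. Function names and sample numbers are illustrative, not the `ab:experiment` implementation:

```javascript
// Percentile-bootstrap lower bound on the challenger − baseline mean
// difference. Illustrative sketch only, not the ab:experiment code.
function bootstrapCiLowerBound(baseline, challenger, iters = 10000, alpha = 0.05) {
  const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const resample = (xs) => xs.map(() => xs[Math.floor(Math.random() * xs.length)]);
  const deltas = [];
  for (let i = 0; i < iters; i++) {
    deltas.push(mean(resample(challenger)) - mean(resample(baseline)));
  }
  deltas.sort((a, b) => a - b);
  return deltas[Math.floor(alpha * iters)]; // one-sided 95% lower bound
}

// Pass rates over reps (higher is better): promote only if the bound is > 0.
const lower = bootstrapCiLowerBound([0.6, 0.6, 0.8], [0.9, 1.0, 0.9]);
const promote = lower > 0;
```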
The single-run trap is the #1 way bad changes get shipped. gpt-5.4 reasoning latency on a single complete action alone has 3–7s spread on identical inputs. Long-form runs have 20s+ run-to-run spread from normal LLM variance. Anything that doesn't survive these rules does not get merged.
- No single-run speed/turn/cost claims. Ever. Any claim of the form "X turns / Y seconds / Z dollars" requires minimum 3 reps and must be reported as mean (min–max), N reps. Best-case alone is forbidden in PRs, changesets, pursuit docs, and progress.md.
- Spread test before promotion. If `(challenger_mean − baseline_mean)` is less than the worst-case spread of either side, the result is "comparable", not an improvement. Say so in the writeup.
- The baseline must be re-measured under the same conditions as the challenger (same scenario, same model, same day, same machine). Stale baselines from prior generations are reference points, not promotion gates.
- `scripts/run-multi-rep.mjs` is the canonical multi-rep harness for ad-hoc single-config validation (one config, N reps, mean/min/max table). For A/B comparisons use `npm run ab:experiment` (bootstrap CIs + Wilson). For research-pipeline experiments use `--two-stage` (1 rep screen → 5 rep validate). If a measurement isn't going through one of those three, you must run one of them yourself before claiming a result.

  ```
  node scripts/run-multi-rep.mjs \
    --cases bench/scenarios/cases/long-form-dashboard.json \
    --config bench/scenarios/configs/planner-on.mjs \
    --reps 3 --modes fast-explore --label gen7-planner \
    --out agent-results/multi-rep-gen7
  ```
- Cost claims still need ≥3 reps. Token counts per LLM call are deterministic, but the number of LLM calls varies run-to-run, so per-run cost is variable.
- Quality wins (pass rate, completion accuracy) need ≥5 reps because pass/fail is binary and a single flake either way swings the rate by 20% on a 5-case set.
- PR description = honest data. The PR body, the changeset, and the pursuit doc must all carry the same multi-rep numbers. If you find variance after writing them, you must update them before merge — overstated numbers shipped to a changeset are a release-blocker, not a "fix in next gen."
- The summary table format (use this verbatim):

  | metric    | baseline (mean) | challenger (mean) | Δ      | reps | min/max challenger | verdict    |
  |-----------|-----------------|-------------------|--------|------|--------------------|------------|
  | wall-time | 53s             | 50s               | -3s    | 3    | 35s / 75s          | comparable |
  | LLM calls | 9               | 10.5              | +1.5   | 3    | 7 / 16             | comparable |
  | $ per run | $0.89           | $0.30             | -$0.59 | 3    | $0.22 / $0.41      | win        |

- "Mechanism is sound" is not validation. A sound mechanism + a lucky run is exactly how single-run overclaims happen. Validation comes from reps, not from confidence in the design.
- When a result smells too good (>3× best-known baseline), require ≥5 reps before writing it down anywhere. Big wins are the exact regime where variance hides.
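The multi-rep summary and spread test above can be sketched in a few lines; helper names are illustrative, and real numbers come from the harness, not hand-rolled code:

```javascript
// Summarize N reps as mean/min/max, then apply the spread test:
// a mean delta inside either side's worst-case spread is "comparable".
function summarize(runs) {
  const mean = runs.reduce((a, b) => a + b, 0) / runs.length;
  return {
    mean,
    min: Math.min(...runs),
    max: Math.max(...runs),
    spread: Math.max(...runs) - Math.min(...runs),
    n: runs.length,
  };
}

function verdict(baselineRuns, challengerRuns) {
  const b = summarize(baselineRuns);
  const c = summarize(challengerRuns);
  const delta = b.mean - c.mean; // positive = challenger faster (wall-time)
  return delta > Math.max(b.spread, c.spread) ? "win" : "comparable";
}

// 3-rep wall-time example (seconds): means differ by 3s, spread is 40s.
verdict([50, 53, 56], [35, 40, 75]); // → "comparable"
```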
Operating manual: docs/EVAL-RIGOR.md — three canonical commands (`bench:validate`, `ab:experiment`, `research:pipeline --two-stage`) + the verbatim summary table format. No PR may be merged with metric claims that did not come out of one of those three.
Automated hypothesis testing: `pnpm research:pipeline --queue bench/research/<queue>.json`
- Two-stage (recommended): `--two-stage` screens all hypotheses (1 rep), validates candidates (5 reps). ~40% cheaper than flat runs with better statistical power for winners.
- Cost estimation: `--estimate` shows expected cost before running.
- Parallel hypotheses: `--hypothesis-concurrency N` runs N hypotheses simultaneously.
- Filtering: `--max-priority N`, `--hypothesis <id>`, `--resume` to skip completed.
- Decision logic: `promote` (CI lower > 0, or neutral + efficiency gain), `reject` (CI upper < 0), `inconclusive`.
- Completed hypotheses get `priority: 99` + `result:` annotation in the queue file.
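The decision logic above can be sketched as follows; the field names are assumptions, not the pipeline's actual schema:

```javascript
// Sketch of the promote / reject / inconclusive decision rule described
// above. Field names are illustrative, not the research:pipeline schema.
function decide({ ciLower, ciUpper, efficiencyGain }) {
  if (ciLower > 0) return "promote";   // clear quality win
  if (ciUpper < 0) return "reject";    // clear regression
  if (efficiencyGain) return "promote"; // neutral on quality, but cheaper/faster
  return "inconclusive";
}

decide({ ciLower: 0.02, ciUpper: 0.10, efficiencyGain: false });  // "promote"
decide({ ciLower: -0.05, ciUpper: 0.04, efficiencyGain: true });  // "promote"
decide({ ciLower: -0.08, ciUpper: -0.01, efficiencyGain: false }); // "reject"
```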
Fail fast on terminal blockers:
- `chrome-error://`, bot challenges, missing API keys → abort immediately with reason.
- API key must match provider (don't let `OPENAI_API_KEY` route to `anthropic`).
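A minimal sketch of the fail-fast check above; the patterns and env-var handling are illustrative assumptions, not the actual implementation:

```javascript
// Return a terminal-blocker reason (abort the run) or null (keep going).
// Patterns are illustrative; real detection lives in the agent runtime.
function terminalBlocker(url, pageText) {
  if (url.startsWith("chrome-error://")) return "chrome-error page";
  if (/captcha|verify you are human|bot challenge/i.test(pageText)) {
    return "bot challenge";
  }
  if (!process.env.OPENAI_API_KEY && !process.env.ANTHROPIC_API_KEY) {
    return "missing API key";
  }
  return null; // no terminal blocker
}
```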
Page interaction:
- Dismiss cookie/consent dialogs before form submissions. Re-verify action took effect after dismissal.
- Auto-submit search forms (press Enter after typing in `searchbox` role elements).
- Detect A-B-A-B oscillation (menu toggle loops) → redirect to search or direct URL.
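The A-B-A-B check above amounts to comparing the last four observed states; a minimal illustrative sketch:

```javascript
// True when the last four page states alternate between two values,
// i.e. the agent is stuck in a toggle loop. Illustrative sketch only.
function isOscillating(stateHistory) {
  if (stateHistory.length < 4) return false;
  const [a, b, c, d] = stateHistory.slice(-4);
  return a === c && b === d && a !== b;
}

isOscillating(["menu-open", "menu-closed", "menu-open", "menu-closed"]); // → true
isOscillating(["home", "search", "results", "detail"]);                  // → false
```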
Budget management:
- Action timeout: `min(30s, caseTimeout/8)` — prevents one stuck click from exhausting the run.
- Snapshot budget: filter decorative elements, 16k char cap on non-first turns.
- First-turn LLM calls must not consume the whole case budget.
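The action-timeout rule from the list above, as a one-liner (names are illustrative):

```javascript
// Each action gets at most 30s, or an eighth of the case budget when
// that is smaller. Name is illustrative, not the actual config key.
const actionTimeoutMs = (caseTimeoutMs) =>
  Math.min(30_000, Math.floor(caseTimeoutMs / 8));

actionTimeoutMs(120_000); // → 15000 (15s per action on a 2-minute case)
actionTimeoutMs(600_000); // → 30000 (capped at 30s on a 10-minute case)
```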
Verification:
- Verifier sees `budgetSnapshot()`, same as the agent (not the raw snapshot).
- Rejection feedback escalates: first rejection → navigate to content; second+ → demand strategy change.
Benchmarks:
- Separate anti-bot/unreachable sites from core reliability scorecards.
- Supervisor should consume screenshots when available (behind config flag).
- Outer experiment concurrency `1` for promotion-grade studies.
Wallet validation lives in bench/wallet/. Chromium-only (extension APIs are Chromium-specific).
Setup (one-time):
```
pnpm wallet:setup    # download MetaMask extension
pnpm wallet:onboard  # automate MetaMask first-run wizard
```

Running:

```
pnpm wallet:anvil       # start Anvil mainnet fork, seed 100 ETH + 10 WETH + 10k USDC
pnpm wallet:validate    # run all wallet cases
pnpm wallet:anvil:stop  # stop Anvil
```

Learned patterns:
- MetaMask 13.x onboarding: Welcome → "I have an existing wallet" → "Import using SRP" → SRP textarea → password → analytics → "Open wallet".
- MetaMask's LavaMoat blocks `page.evaluate()` on extension pages. Use CDP (`DOM.focus`, `Input.insertText`) or coordinate-based `page.mouse.click()` + `page.keyboard.type()`.
- The SRP textarea is invisible to Playwright locators (LavaMoat scuttling). CDP `DOM.querySelectorAll` with `pierce: true` finds it. Focus via CDP, type via `keyboard.type()` (fires React-compatible input events). `insertText` sets the value but doesn't trigger React onChange.
- "Open wallet" button stays disabled for ~20s during background sync. Force-click or navigate directly to `home.html`.
- Preflight `eth_requestAccounts` times out on first visit to any dApp — expected. The agent handles wallet connection during test turns.
- Hybrid RPC interception (cli.ts): only forward user-specific calls (`eth_getBalance` for the wallet address, `eth_call`/`eth_estimateGas` with the wallet address in from/data) to Anvil. Pool/protocol data goes to real endpoints. JSON-RPC normalization required: add `jsonrpc`/`id`, remove `chainId` (Aave omits standard fields). MetaMask's service worker RPC is handled separately by the HTTPS reverse proxy (rpc-proxy.mjs on port 8443) + host-resolver-rules.
- Anvil fork freshness: free RPCs (publicnode, 1rpc) retain ~128 blocks (~25 min) of state. Always restart Anvil before test runs. drpc.org is the most reliable free fork RPC. `run-wallet-validation.mjs` auto-restarts Anvil for the defi suite.
- Pre-warming: setup-anvil.mjs caches Aave contract state after fork creation so Anvil doesn't need upstream for user-specific queries.
- `pnpm exec bad` doesn't work in dev (pnpm doesn't self-link bin entries). Use `node dist/cli.js` directly.
- Test wallet: `test test test...junk` mnemonic → `0xf39Fd6e51aad88F6F4ce6aB8827279cffFb92266`.
- Seed USDC via storage slot manipulation (`anvil_setStorageAt` on slot 9) — faster than whale impersonation.
- DeFi validation (2026-03-12): 7/7 pass, $0.98 total, 267s. Connect: 5/5 (Uniswap, Aave, 1inch, SushiSwap). Swap: 2/2 (Uniswap, SushiSwap). Supply: 1/1 (Aave ETH).
- MetaMask RPC config: MUST keep `networkClientId: "mainnet"` as `type: "infura"` — changing it breaks MetaMask ("No Infura network client found"). Instead ADD a new `type: "custom"` endpoint with a UUID clientId alongside Infura, set as default. Modify LevelDB at `.agent-wallet-profile/Default/Local Extension Settings/<extId>/` using `classic-level`. Key: `NetworkController`. Also update `SelectedNetworkController.domains`.
- Setup order: `pnpm wallet:setup && pnpm wallet:onboard && pnpm wallet:anvil && pnpm wallet:configure && pnpm wallet:validate`.
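The JSON-RPC normalization called out in the hybrid-interception note above can be sketched like this; the shape is illustrative, not the cli.ts implementation:

```javascript
// Normalize a JSON-RPC request before forwarding to Anvil: add the
// jsonrpc/id fields some clients omit, and drop the non-standard chainId.
// Illustrative sketch only.
let nextId = 1;
function normalizeRpcRequest(req) {
  const { chainId, ...rest } = req; // strip chainId (Anvil doesn't expect it)
  return {
    jsonrpc: "2.0",
    id: nextId++, // overwritten by the caller's own id below, if present
    ...rest,
  };
}

normalizeRpcRequest({ method: "eth_getBalance", params: ["0xabc", "latest"], chainId: 1 });
// → { jsonrpc: "2.0", id: 1, method: "eth_getBalance", params: [...] }
```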
- Runtime: `--no-memory` + disable adaptive routing → control defaults.
- Wallet: restore legacy activation in `src/browser-launch.ts` if needed.
- Full: revert the feature commit on `main`.
Canonical in `docs/roadmap/browser-agent-ops.md`.
Canonical in `skills/`. Install via `npm run skills:install`.