
feat: merge-train/spartan#22980

Open
AztecBot wants to merge 58 commits into next from merge-train/spartan

Conversation


@AztecBot AztecBot commented May 6, 2026

BEGIN_COMMIT_OVERRIDE
fix(test): warp L1 forward when proposer scan hits EpochNotStable (#22967)
test(e2e): fail epochs tests on proposer-rollup-check-failed (#22965)
fix: grafana switch to aztec_status="proposed" (#22978)
chore: update benchmark scraper (#22984)
test(e2e): migrate simple epoch tests to pipelining (#22973)
chore: remove top-level yarn.lock (#22987)
refactor(archiver)!: unify L2BlockSource checkpoint lookups via query objects (#22933)
fix(sequencer): bounded sweep instead of event scan for governance proposal check (#22989)
fix(docs): allow webapp-tutorial yarn install to populate empty lockfile in CI (#23000)
test(e2e): enable pipelining in l1-reorgs and mbps redistribution tests (#23009)
fix(archiver): restore pending block height metric under pipelining (#22994)
chore(p2p): remove skipped validation result option (#23034)
refactor(p2p)!: remove slow tx collection flow (#22878)
chore(spartan): add next-net-clone environment config (#22995)
chore(sequencer): add context to proposer-rollup-check-failed logs (#23071)
test(e2e): wait for archiver sync before asserting pipelining (#22997)
refactor(node-rpc)!: remove deprecated AztecNode methods and L2BlockSource tip helpers (#22934)
feat(p2p): detect and track announce IP changes at runtime (#22405)
test: mark tx_stats_bench 10 TPS as flake-retryable on merge-train/spartan (#23083)
fix(sequencer): bind vote-only multicalls to target slot under pipelining (#23090)
feat(sequencer): build optimistically across pruning epoch boundary (#23056)
fix(sequencer): use chainTipsOverride.pending for log context (#23098)
test(e2e): relax post-boundary slot assertion in epochs_proof_at_boundary (#23108)
fix(bb-prover): pool long-lived bb verifier processes instead of spawning per-call (#23093)
fix(sequencer): anchor fee asset price modifier to predicted parent (#23113)
chore: error log when L1 head timestamp drifts (#22947)
fix(sequencer): override full parent checkpoint cell in pipelined simulation (#23073)
test(e2e): enable pipelining on missed l1 slot test (#23068)
fix: more robust metrics reporting in IRM monitor (#23038)
fix: preserve LMDB slashing protection (#23145)
test(e2e): enable pipelining on p2p tests (#23070)
fix(archiver): move L2 tips cache refresh out of write transactions (#23110)
test(e2e): fix data_withholding_slash flake by freezing L1 across restart (#23162)
END_COMMIT_OVERRIDE

spypsy and others added 6 commits April 16, 2026 11:32
## Summary

- Keep `getPublicIp()` at startup so the ENR always has a valid IP from the start
- Enable discv5 `enrUpdate` with `addrVotesToUpdateEnr: 1` and faster pings (10s) when `queryForIp` is enabled, so PONG votes can correct the IP at runtime if it changes (e.g. residential ISP, Cloud NAT rotation)
- Bridge discv5 IP changes to libp2p's AddressManager so peers see updated addresses
- Have the bootnode explicitly `addEnr()` on discovery to fix routing table gaps where nodes were never inserted
- Improve P2P observability: log KAD table state in peer manager heartbeats, log ENR additions with multiaddrs, log config at startup
- Small change to deploy scripts that allows us to define a full aztec image to deploy on a network rather than just `aztecprotocol/aztec:<tag>`

Fixes [A-310](https://linear.app/aztec-labs/issue/A-310/p2p-query-for-ip-should-detect-ip-changes)

Co-authored-by: Alex Gherghisan <alexghr@users.noreply.github.com>
Co-authored-by: danielntmd <162406516+danielntmd@users.noreply.github.com>
…2967)

## Motivation

The `e2e_epochs/epochs_missed_l1_publish` test fails intermittently when
its proposer-discovery scan looks too far into the future. The L1 rollup
contract reverts with `ValidatorSelection__EpochNotStable` for any epoch
whose randao sample timestamp is still ahead of `block.timestamp`, and
the test was scanning up to 60 slots (~15 epochs at the test's epoch
duration) ahead, well past the queryable horizon.

## Approach

Wrap the proposer scan in a retry loop that catches `EpochNotStable`,
warps L1 forward by one epoch, and re-queries the same candidate. After
each warp the scan also re-anchors the candidate to keep the +4 slot
margin from the new "now", so subsequent steps (the warp to `slotZero`
and sequencer start-up) still have headroom.
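
The warp-and-retry flow above can be sketched as a small self-contained model. All names here (`EPOCH_SLOTS`, `scanProposer`, `warpL1`, `findProposer`) are invented stand-ins for the real e2e test helpers, not the aztec-packages APIs.

```typescript
const EPOCH_SLOTS = 8; // slots per epoch in this toy model
const SLOT_MARGIN = 4; // the +4 slot margin kept from the new "now"

class EpochNotStableError extends Error {}

let l1Now = 0; // fake L1 clock, measured in slots

// Throws if the candidate's epoch sample time is still ahead of the current
// L1 time, mirroring the ValidatorSelection__EpochNotStable revert.
function scanProposer(candidateSlot: number): number {
  const epochStart = Math.floor(candidateSlot / EPOCH_SLOTS) * EPOCH_SLOTS;
  if (epochStart > l1Now) throw new EpochNotStableError();
  return candidateSlot;
}

function warpL1(slots: number): void {
  l1Now += slots;
}

// Catch EpochNotStable, warp L1 forward by one epoch, and re-anchor the
// candidate so subsequent steps still have headroom.
function findProposer(initialCandidate: number): number {
  let candidate = initialCandidate;
  for (;;) {
    try {
      return scanProposer(candidate);
    } catch (err) {
      if (!(err instanceof EpochNotStableError)) throw err;
      warpL1(EPOCH_SLOTS);
      candidate = Math.max(candidate, l1Now + SLOT_MARGIN);
    }
  }
}
```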

## Changes

- **end-to-end (tests)**: Replace the bounded `for` loop in
`epochs_missed_l1_publish.test.ts` with a try/catch retry that warps L1
on `EpochNotStable`.
These sequencer errors were previously ignored in some tests. That exemption
is removed, since this error should not happen; if it does, it is cause for analysis.

socket-security Bot commented May 6, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Added: `npm/@types/node@20.19.4`

View full report

spalladino and others added 7 commits May 6, 2026 16:12
… objects (#22933)

## Motivation

Clean up the checkpoint side of `L2BlockSource`. PR #22809 already
collapsed the block-side API into 4 query-shaped methods over 2 return
types; the checkpoint surface was left with the pre-refactor sprawl (9
narrow methods over 4 return shapes, parallel by-number / by-range /
by-epoch entrypoints, and a wire-level alias that conflated proposed and
confirmed checkpoints). This change applies the same simplification.

Fixes A-979

## Approach

`L2BlockSource` checkpoint methods reduce to 4 query-shaped readers
(`getCheckpoint`, `getCheckpoints`, `getCheckpointData`,
`getCheckpointsData`) over 2 return shapes (`PublishedCheckpoint`,
`CheckpointData`), plus a polymorphic
`getProposedCheckpointData(query?)` for the proposed-only path. Three
new query types live next to `BlockQuery`/`BlocksQuery`. On-disk format
and `BlockStore` primitives are unchanged — the simplification is at the
API boundary. The public RPC's `getCheckpoint` keeps the same wire
signature but gains a confirmed→proposed fallback (for
`{number}`/`{slot}`/`'proposed'` lookups) and `BadRequestError` guards
for incompatible `include*` flags.

## API surface change

### Methods removed from `L2BlockSource`

`getCheckpoints(from, limit)`, `getCheckpointData(n)`,
`getCheckpointDataRange(from, limit)`, `getCheckpointsForEpoch(epoch)`,
`getCheckpointsDataForEpoch(epoch)`, `getCheckpointNumberBySlot(slot)`,
`getLastCheckpoint()`, `getLastProposedCheckpoint()`. Dead methods on
`data_source_base` also removed: `getCheckpointHeader`,
`getLastBlockNumberInCheckpoint`, `getSynchedCheckpointNumber`.

### Methods added to `L2BlockSource`

```ts
getCheckpoint(query: CheckpointQuery): Promise<PublishedCheckpoint | undefined>
getCheckpoints(query: CheckpointsQuery): Promise<PublishedCheckpoint[]>
getCheckpointData(query: CheckpointQuery): Promise<CheckpointData | undefined>
getCheckpointsData(query: CheckpointsQuery): Promise<CheckpointData[]>
getProposedCheckpointData(query?: ProposedCheckpointQuery): Promise<ProposedCheckpointData | undefined>

type CheckpointQuery         = { number } | { slot } | { tag: 'checkpointed' | 'proven' | 'finalized' }
type CheckpointsQuery        = { from, limit } | { epoch }
type ProposedCheckpointQuery = { number } | { slot } | { tag: 'proposed' }
```

### Public RPC (`AztecNode`) wire-level changes

- `getCheckpointsDataForEpoch(epoch)` removed;
`getCheckpointsData(query: CheckpointsQuery)` added (range or epoch).
- `'latest'` removed from `CheckpointParameter`.
- `'proposed'` semantics changed: previously aliased to "latest
L1-confirmed checkpoint" (a documented foot-gun); now
`getCheckpoint('proposed')` strictly targets the proposed-checkpoint
store, and `getCheckpointNumber('proposed')` returns the proposed-tip
number with confirmed fallback.
- `getCheckpoint({ number }) / ({ slot })` now check confirmed first
then fall back to proposed; tag-based lookups (`'checkpointed'` /
`'proven'` / `'finalized'`) do not fall back.
- `getCheckpoint('proposed', { includeL1PublishInfo: true |
includeAttestations: true })` and the same flags on a by-number/by-slot
lookup that resolves to a proposed entry now throw `BadRequestError`
(proposed checkpoints have no L1 publish info or attestations).
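
The fallback and guard behaviour can be modelled in a few lines. `BadRequestError`, the `Cp` shape, and the `Map`-backed stores are simplifications for illustration, not the real archiver types.

```typescript
class BadRequestError extends Error {}
interface Cp { number: number; proposed: boolean }

function getCheckpoint(
  confirmed: Map<number, Cp>,
  proposed: Map<number, Cp>,
  number: number,
  opts: { includeL1PublishInfo?: boolean; includeAttestations?: boolean } = {},
): Cp | undefined {
  // By-number lookups check confirmed first, then fall back to proposed.
  const hit = confirmed.get(number) ?? proposed.get(number);
  if (hit?.proposed && (opts.includeL1PublishInfo || opts.includeAttestations)) {
    // Proposed checkpoints have no L1 publish info or attestations.
    throw new BadRequestError('include* flags are incompatible with proposed checkpoints');
  }
  return hit;
}
```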

### Types kept

`CheckpointData`, `CommonCheckpointData` (structural base of
`CheckpointData` / `ProposedCheckpointInput`), `ProposedCheckpointData`,
`ProposedCheckpointInput`, `PublishedCheckpoint`, `Checkpoint`. No
structural-type deletions.

Migration guidance for wallet/SDK consumers is in
`docs/docs-developers/docs/resources/migration_notes.md`.

## Changes

- **stdlib**: New query types (`CheckpointQuery`, `CheckpointsQuery`,
`ProposedCheckpointQuery`) + Zod schemas in `block/l2_block_source.ts`.
`'latest'` literal removed from `interfaces/checkpoint_parameter.ts`.
`NormalizedCheckpointDispatch` type for the server's parameter
normalizer. `ArchiverApiSchema` and `AztecNode` schema updated.
`computeL2ToL1MembershipWitness` switched to the new query shape.
- **archiver**: `data_source_base` adds `resolveCheckpointQuery` /
`resolveCheckpointsQuery` mirroring the block-side helpers, implements
the 4 confirmed methods plus the polymorphic proposed lookup.
`BlockStore` adds `getProposedCheckpointBySlot(slot)`. `MockArchiver`
and `mock_l2_block_source` updated to match the new interface.
- **aztec-node**: `server.ts` adds the confirmed→proposed fallback flow
with the two `BadRequestError` guards in `getCheckpoint`, sources all
tips from a single `getL2Tips()` call in `getCheckpointNumber`, and
routes the public RPC through the new internal methods. New
pure-projection helper `projectProposedToCheckpointResponse` in
`block_response_helpers.ts`.
- **consumer migrations**: prover-node (collapses two checkpoint fetches
into one `getCheckpoints({ epoch })`), world-state, slasher, sequencer
(`checkpoint_proposal_job`, `sequencer`), validator
(`proposal_handler`), `L2BlockStream`, pxe `block_stream_source`,
telemetry wrapper, and 10 e2e files updated to the new query shapes.
- **tests**: 48 new `it()` blocks covering each query discriminant, the
throw guards, the confirmed→proposed fallback, the polymorphic
`getProposedCheckpointData` dispatch, and
`BlockStore.getProposedCheckpointBySlot`.
- **docs**: `migration_notes.md` updated with the breaking changes for
downstream wallet/SDK consumers.
…oposal check (#22989)

## Motivation

`hasPayloadBeenProposed` (now `hasActiveProposalWithPayload`) used
`eth_getLogs` over the rollup's full L1 deployment range to find prior
`PayloadSubmitted` events. On long-lived rollups that range exceeds
typical RPC provider block-range caps and the call times out, silently
breaking the sequencer's "stop signaling for an already-proposed
payload" logic. The previous in-memory cache also permanently
blacklisted any payload it saw as proposed once, which is wrong: each
round on `EmpireBase` is independent and the same payload can
legitimately be re-signaled and re-submitted after a prior proposal
becomes Dropped/Rejected/Expired/Executed.

## Approach

Replace the log scan with a bounded view-call sweep over
`Governance.proposals`. The sweep walks newest -> oldest using
`proposalCount`, unwraps each proposal's `GSEPayload` via
`getOriginalPayload()`, and treats only
`Pending`/`Active`/`Queued`/`Executable` as "in an active proposal" --
terminal states allow re-signaling. The descent has a hard early-stop on
the protocol-wide proposal lifetime cap (`4 *
ConfigurationLib.TIME_UPPER = 360 days`), which is safe regardless of
per-proposal frozen configs because every config field is bounded by
`TIME_UPPER` on-chain. Two in-memory caches absorb the per-call cost
over time: terminal proposals (provably immutable on-chain) and wrapper
-> original payload unwraps (immutable bytecode).
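
The bounded sweep can be sketched as follows under stated assumptions: the `Proposal` shape, state names, and in-memory array are illustrative, while the real code issues view calls against `Governance.proposals` and unwraps `GSEPayload` wrappers on-chain.

```typescript
type State =
  | 'Pending' | 'Active' | 'Queued' | 'Executable'      // "in an active proposal"
  | 'Executed' | 'Rejected' | 'Dropped' | 'Expired';    // terminal: re-signaling allowed

interface Proposal { payload: string; state: State; createdAt: number }

const ACTIVE = new Set<State>(['Pending', 'Active', 'Queued', 'Executable']);
// 4 * ConfigurationLib.TIME_UPPER = 360 days, the protocol-wide lifetime cap.
const MAX_PROPOSAL_LIFETIME_SECONDS = 360 * 24 * 60 * 60;

function hasActiveProposalWithPayload(
  proposals: Proposal[], // index 0 = oldest; the sweep walks newest -> oldest
  payload: string,
  now: number,
): boolean {
  for (let i = proposals.length - 1; i >= 0; i--) {
    const p = proposals[i];
    // Hard early stop: anything older than the lifetime cap is provably
    // terminal, so the descent can end here.
    if (now - p.createdAt > MAX_PROPOSAL_LIFETIME_SECONDS) break;
    if (p.payload.toLowerCase() === payload.toLowerCase() && ACTIVE.has(p.state)) {
      return true;
    }
  }
  return false;
}
```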

## Changes

- **ethereum/contracts/governance**: New
`hasActiveProposalWithPayload(payload)` and `getProposalCount()` on
`ReadOnlyGovernanceContract`. Inlines a minimal `IProposerPayload` ABI
(just `getOriginalPayload`) to avoid generating a full artifact. Handles
`proposeWithLock`-style proposals (no GSEPayload wrapper) by catching
the unwrap revert and skipping.
- **ethereum/contracts/governance (types)**: Adds explicit types
(`Proposal`, `ProposalConfiguration`, `GovernanceConfiguration`,
`ProposeWithLockConfiguration`, `Ballot`) and maps the viem return
shapes of `getProposal` / `getConfiguration` onto them. `Proposal` now
carries both `cachedState` (raw stored) and `state` (live, time-derived
from `getProposalState`); `getProposal` issues both reads in parallel so
callers don't need a separate state RPC.
- **ethereum/contracts/governance (caching)**: Adds two memoization
layers on `ReadOnlyGovernanceContract`. Proposals are cached when
`state` is in any of the four terminal phases
(Executed/Rejected/Dropped/Expired) -- once terminal the entire struct
is provably immutable on-chain. Wrapper unwraps are keyed by wrapper
address and cached forever (deployed bytecode is immutable).
`GovernanceProposerContract` already memoizes its `getGovernance()`, so
the same `ReadOnlyGovernanceContract` instance (and its caches) is
reused across slots in the sequencer publisher.
- **ethereum/contracts/governance_proposer**: Drops the event-based
`hasPayloadBeenProposed`. Adds a memoized `getGovernance()` accessor and
a thin `hasActiveProposalWithPayload` delegate that resolves the
Governance address via the on-chain registry lookup.
- **ethereum/contracts/empire_base**: Removes `hasPayloadBeenProposed`
from `IEmpireBase` -- it's a Governance concern, not a generic empire
concern (slasher doesn't need it).
- **sequencer-client/publisher**: Removes the permanent
`payloadProposedCache` so the publisher re-checks every slot, allowing
re-signaling once a prior proposal is terminal. Switches the failure
mode from fail-closed to fail-open (a flaky L1 endpoint should not
silence governance participation; a duplicate signal is harmless).
Narrows the helper's `base` param from `IEmpireBase` to
`GovernanceProposerContract` since this code path is governance-only.
- **ethereum/contracts (tests)**: New `hasActiveProposalWithPayload`
describe block hitting a real anvil-deployed Governance. Impersonates
the `governanceProposer`, calls `Governance.propose` directly, and
etches hand-rolled mock wrapper bytecode at chosen addresses to drive
(wrapper, original) pairs. Covers: empty governance, live match, no
match, terminal state via warp, reverting wrapper
(proposeWithLock-style), descent past unrelated proposals,
case-insensitive match, and the 360-day hard cutoff via warp. Also adds
a sync-guard describe block that probes `Governance.updateConfiguration`
via impersonated `eth_call` to assert each of
`votingDelay`/`votingDuration`/`executionDelay`/`gracePeriod` accepts
`TIME_UPPER` and rejects `TIME_UPPER + 1` -- if those caps change
on-chain, this trips and `MAX_PROPOSAL_LIFETIME_SECONDS` must be
revisited.
- **sequencer-client/publisher (tests)**: Replaces the cache test with a
"re-checks each call so re-signaling resumes after terminal" test.
Updates the RPC-failure semantics test from fail-closed to fail-open.
…ile in CI (#23000)

## Summary

Fixes the `docs` build failure on `merge-train/spartan` (CI run
[25449092262](https://github.com/AztecProtocol/aztec-packages/actions/runs/25449092262),
log [27a4351a1e5e3568](http://ci.aztec-labs.com/27a4351a1e5e3568)).

## Problem

`validate-webapp-tutorial` in `docs/examples/bootstrap.sh` intentionally
starts each run with an empty `yarn.lock`, then runs `yarn install` to
populate it from the `link:` paths it just wrote into `package.json`. In
CI, Yarn 4 auto-enables `--immutable` when it detects `CI=1`, so the
install fails with `YN0028 (frozen lockfile exception)` because
populating an empty lockfile counts as modifying it.

```
➤ YN0028: │ The lockfile would have been modified by this install, which is explicitly forbidden.
➤ YN0000: · Failed with errors in 6s 829ms
ERROR: Contract artifact not found at /home/aztec-dev/aztec-packages/docs/target/pod_racing_contract-PodRacing.json
```

(The "Contract artifact not found" line is a downstream symptom — the
script doesn't run with `set -e`, so after `yarn install` fails it
continues into the artifact check and reports a misleading error.)

## Fix

Set `YARN_ENABLE_IMMUTABLE_INSTALLS=false` for that one `yarn install`
call, since populating the lockfile is the intended behaviour.

## Verification

Reproduced locally: `CI=true yarn install` against the webapp-tutorial
fails with `YN0028`; with `YARN_ENABLE_IMMUTABLE_INSTALLS=false` it
succeeds.

ClaudeBox log: https://claudebox.work/s/a1863de35053b544?run=1
@spalladino spalladino requested a review from a team as a code owner May 6, 2026 18:24

@ludamad ludamad left a comment


🤖 Auto-approved

@AztecBot AztecBot added this pull request to the merge queue May 6, 2026

AztecBot commented May 6, 2026

🤖 Auto-merge enabled after 4 hours of inactivity. This PR will be merged automatically once all checks pass.

@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 6, 2026
AztecBot and others added 4 commits May 7, 2026 04:17
…22994)

## Motivation

The `aztec.archiver.block_height` series with no status attribute
(rendered as the "Pending chain" line on the network, prover, and
fisherman Grafana dashboards) stopped being published a couple of weeks
ago. With pipelining enabled every checkpoint arriving from L1 already
has its blocks in the proposed store, so the L1 synchronizer always took
the new promotion fast path introduced in #22716, leaving
`checkpointsToAdd` empty and skipping the metric call.

## Approach

Record the checkpointed block-height metrics across all valid
checkpoints in the batch instead of only the ones routed through
`addCheckpoints`, so the promoted checkpoint contributes too. The
duration is averaged over the full batch since `addCheckpoints` performs
the work for both paths in a single transaction.

## Changes

- **archiver (`l1_synchronizer.ts`)**: Move the
`processNewCheckpointedBlocks` call to use `validCheckpoints` rather
than `checkpointsToAdd`, restoring the empty-status `block_height`,
`checkpoint_height`, `sync_block_count`, and `sync_per_checkpoint`
series under pipelining.

---------

Co-authored-by: Alex Gherghisan <alexghr@users.noreply.github.com>
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 7, 2026
AztecBot and others added 7 commits May 7, 2026 21:28
…ource tip helpers (#22934)

## Motivation

After three back-to-back unifications of the block/checkpoint APIs
(#22781, #22809, plus the two query-object refactors on this stack),
four `@deprecated` `AztecNode` RPC methods and three redundant
`L2BlockSource` tip-number helpers had outlived their replacements and
remained only as stop-gaps. This PR retires them and migrates every
caller to the canonical query-object APIs.

## Approach

Removed `isL1ToL2MessageSynced`, `getL2Tips`, `getBlockHeader`,
`getCheckpointedBlocks` from `AztecNode`, and `getProvenBlockNumber`,
`getCheckpointedL2BlockNumber`, `getFinalizedL2BlockNumber` from
`L2BlockSource`. Callers now use `getL1ToL2MessageCheckpoint`,
`getChainTips`, `getBlock(...).header`, `getBlocks(..., {
onlyCheckpointed, includeL1PublishInfo, includeAttestations })`, and
`getBlockNumber({ tag })` respectively. `BlockIncludeOptions` was split
into a single-block variant and a `BlocksIncludeOptions` extension so
`onlyCheckpointed` is rejected at the type level on `getBlock`. Internal
`BlockStore` primitives are intentionally kept since they remain the
underlying implementation.
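
As a toy illustration of the query-object shape that replaces the three removed tip helpers — the `MockNode` class and its tip values are invented for this sketch:

```typescript
type TipTag = 'proven' | 'finalized' | 'checkpointed';

class MockNode {
  private tips: Record<TipTag, number> = { proven: 10, finalized: 8, checkpointed: 12 };

  // One query-shaped reader replaces getProvenBlockNumber,
  // getFinalizedL2BlockNumber, and getCheckpointedL2BlockNumber.
  getBlockNumber(query: { tag: TipTag }): number {
    return this.tips[query.tag];
  }
}
```

A caller previously using `getProvenBlockNumber()` would now call `getBlockNumber({ tag: 'proven' })`.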

## Changes

- **stdlib (interfaces)**: dropped four `@deprecated` `AztecNode`
methods + their zod entries; dropped three tip-number helpers from
`L2BlockSource` and its archiver schema; split `BlockIncludeOptions`
into single- and range-block variants
- **aztec-node**: removed deprecated server impls; simplified
`getBlockNumber(tip)` to a single `getBlockNumber({ tag: tip })` call;
fixed `getL1ToL2MessageCheckpoint` to handle `messageIndex === 0n`
correctly (previously coerced to `undefined` via truthy check)
- **archiver**: dropped the now-unused tip-number passthroughs in
`data_source_base` and the `MockL2BlockSource` overrides
- **prover-node, p2p**: migrated `getProvenBlockNumber` callers to
`getBlockNumber({ tag: 'proven' })`
- **pxe**: adjusted `block_stream_source` to wrap `getChainTips()` into
the `L2Tips` shape required by `L2BlockStream`
- **txe**: added `l2TipsProvider` getter that adapts `getChainTips()`
for the TXE state machine
- **end-to-end (tests)**: migrated 15+ test files to the new APIs
(`getBlocks` with `onlyCheckpointed`/`includeTransactions` where bodies
are read, `getChainTips`, `getBlock(...).header`,
`getL1ToL2MessageCheckpoint(...) !== undefined`)
- **aztec-node, stdlib (tests)**: dropped tests of removed methods;
added unit tests covering the `messageIndex === 0n` edge case
- **docs**: updated the node-API generator to drop removed methods,
regenerated the operator API reference, and migrated `node_getL2Tips`
curl examples in the operator setup guides to `node_getChainTips`
@AztecBot AztecBot requested a review from charlielye as a code owner May 8, 2026 10:50
AztecBot and others added 8 commits May 8, 2026 09:08
…artan (#23083)

## Motivation

The `verifies transactions at 10 TPS` sub-test of
[`yarn-project/end-to-end/src/bench/tx_stats_bench.test.ts`](https://github.com/AztecProtocol/aztec-packages/blob/merge-train/spartan/yarn-project/end-to-end/src/bench/tx_stats_bench.test.ts)
is now reliably flaking on the `bench all` step of
`merge-train/spartan`. It has fired on at least two different
merge-train commits hours apart, with no relation to either commit's
diff:

| Run | Triggering merge-train commit | CI log |
|---|---|---|
| [25546251580](https://github.com/AztecProtocol/aztec-packages/actions/runs/25546251580) | #22934 (refactor(node-rpc)! removing deprecated AztecNode methods) | http://ci.aztec-labs.com/1778227975844707 |
| [25552992890](https://github.com/AztecProtocol/aztec-packages/actions/runs/25552992890) | #22405 (feat(p2p): detect and track announce IP changes at runtime) | http://ci.aztec-labs.com/1778237470322975 |

Both runs hit the same assertion:

```
● transaction benchmarks › verifies transactions at 10 TPS

  expect(received).toBe(expected) // Object.is equality
  Expected: true
  Received: false
    at bench/tx_stats_bench.test.ts:268
```

Sub-test failing log on the latest run:
http://ci.aztec-labs.com/ca459ca73d02002c (`bench all` parent:
http://ci.aztec-labs.com/90616bad7bf7ebaa).

The other three sub-tests in the suite (compression; single private
verify x20 serial; single public verify x20 serial) pass cleanly against
the same proven txs in both runs. The failure is in the stress sub-test
that fires 600 IVC verifications at 10/s with 8 concurrent IVC verifiers
(`BB_NUM_IVC_VERIFIERS=8`, `BB_IVC_CONCURRENCY=1`). At least one
verification returns `valid: false` under load.

## Cause

Neither triggering commit touches the IVC verifier path:
- #22934 is a pure node-rpc surface refactor.
- #22405 is p2p / discv5 ENR plumbing.

The two failures sharing this signature across unrelated diffs is strong
evidence that the flake is independent of the merge-train commit and
stems from the bench infrastructure itself.

The likely culprit is the recent bb-prover migration to the bb.js
`NativeUnixSocket` backend (#21564), which spawns a fresh bb subprocess
per Chonk verification via `withVerifierInstance`. Under 8x parallel
verifications on the CPU-isolated bench host (each verifier requesting
16 threads, 8 × 16 = 128 threads on 56 isolated cores), transient
verifier failures appear. The bench-output log shows continuous `bb.js -
Received signal 15, shutting down gracefully...` traffic during the 10
TPS phase — verifier instances are being torn down rapidly, and at least
one verification slips through with a stale/incomplete response. Because
the serial sub-tests (`numIterations = 20` sequential) pass cleanly in
both runs, this is a stress-only interaction, not a correctness
regression.

## Approach

Add `tx_stats_bench` to `.test_patterns.yml` with an `error_regex`
anchored to the test file's stack-trace line
(`tx_stats_bench.test.ts:<line>:<col>`), and assign `*charlie` as owner
(author of the bb.js migration). With this entry, `ci3/run_test_cmd`
retries the test once on failure and treats a single retry-pass as a
flake instead of a hard fail, unblocking the merge train for unrelated
commits while Charlie investigates the underlying concurrency
interaction with the bb.js backend.

The `error_regex` is intentionally narrow (file + line + column from the
stack trace) so other ways tx_stats_bench could fail (timeout, OOM,
infra) are still surfaced as hard fails.

## Changes

- `.test_patterns.yml`: add a `tx_stats_bench` entry with an error_regex
anchored to the test file's stack-trace line and `*charlie` as owner.

ClaudeBox logs:
- https://claudebox.work/s/6e7853d3a073145f?run=1 (initial diagnosis on
#22934 failure)
- https://claudebox.work/s/c12a360275f05ad3?run=1 (this update on #22405
recurrence)
…ning (#23090)

Slashing votes are EIP-712-signed for `targetSlot` (the pipelined
proposal slot, not the wall-clock slot) and submitted via
Multicall3.aggregate3 with allowFailure: true. The contract verifies the
signature against getCurrentSlot() derived from block.timestamp, so the
multicall must mine in the slot the vote was signed for or the inner
sub-call reverts silently and VoteCast is never emitted.

Two paths in the sequencer were sending vote-only multicalls without
delaying submission to the target-slot start:

1. CheckpointProposalJob.execute() if (!broadcast) branch — proposer
enqueued votes but did not build a checkpoint.
2. Sequencer.tryVoteWhenSyncFails — proposer enqueued votes in a slot
where archiver sync had not caught up.

Both now route through `sendRequestsAt(getTimestampForSlot(targetSlot))`
when proposer pipelining is enabled. The sync-failure path uses
fire-and-forget so the wait does not block the sequencer's work loop.
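
A minimal sketch of the slot-binding arithmetic, assuming a fixed genesis time and slot duration; `delayUntilSlotMs` is an invented helper standing in for the real `sendRequestsAt` machinery in the sequencer publisher.

```typescript
// L1 timestamp at which `slot` begins, given genesis time and slot duration
// (all in seconds).
function getTimestampForSlot(slot: bigint, genesisTime: bigint, slotDuration: bigint): bigint {
  return genesisTime + slot * slotDuration;
}

// How long to wait before submitting so the multicall mines in the slot the
// vote was signed for; submitting early or late makes the inner sub-call
// revert silently.
function delayUntilSlotMs(
  targetSlot: bigint,
  genesisTime: bigint,
  slotDuration: bigint,
  nowSec: bigint,
): number {
  const target = getTimestampForSlot(targetSlot, genesisTime, slotDuration);
  return target > nowSec ? Number(target - nowSec) * 1000 : 0;
}
```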
…23056)

## Motivation

At a pruning epoch boundary, today's `canProposeAtTime` simulation
pre-emptively reverts when an unproven epoch's deadline is about to
expire — even if the proof lands seconds later. The slot is silently
skipped. This loses a checkpoint window for no good reason: the
publisher's preCheck right before L1 submission is the authoritative
gate.

Similarly, the simulation overrides that pipelining applies to the
preCheck (such as overriding the pending chain with the last mined slot)
meant we were silently missing the case where the epoch prune did
trigger, so we sent the tx and it reverted. This is fixed by using
different override plans for the first simulation and the
right-before-submission simulation. That said, the sequencer publisher
checks are somewhat convoluted now, so I'm making a pass to simplify
them in a later PR.

## Approach

Apply a proven-override at the three pre-submission simulation sites
(`canProposeAt`, the globals builder, and enqueue-time
`validateCheckpointForSubmission`) that forces `pending == proven` so
`STFLib.canPruneAtTime` short-circuits to false. Submission's preCheck
runs without the override against real L1 state and decides whether to
actually send. A new structured `preparing-checkpoint` sequencer event
surfaces the override/parent state for tests. Tip storage now goes
through a single `makeChainTipsOverride` to avoid same-slot state-diff
clobbering.
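
A sketch of the combined override described above; the field names follow the PR text, but the object shape and validation are assumptions (the real method writes an L1 state-diff via `RollupContract`).

```typescript
interface ChainTips { pending?: bigint; proven?: bigint }

// Builds a single combined tips override, guarding `proven > pending`.
// Writing one state-diff instead of two avoids same-slot clobbering when
// pending and proven are overridden separately.
function makeChainTipsOverride(tips: ChainTips): ChainTips {
  const { pending, proven } = tips;
  if (pending !== undefined && proven !== undefined && proven > pending) {
    throw new Error('proven tip cannot exceed pending tip');
  }
  return { pending, proven };
}
```

Forcing `pending == proven` through this override is what makes `STFLib.canPruneAtTime` short-circuit to false in the pre-submission simulations.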

## Changes

- **archiver**: `isPruneDueAtSlot(slot)` on `L2BlockSource` replicates
`STFLib.canPruneAtTime` locally (no L1 RPC).
- **ethereum**: `RollupContract.makeChainTipsOverride({pending?,
proven?})` writes a single combined state-diff and guards `proven >
pending`. `forPendingCheckpoint(n)` → `withChainTips({pending?,
proven?})` on the simulation overrides builder.
- **sequencer-client (publisher)**: `enqueueProposeCheckpoint` accepts
`preCheckSimulationOverridesPlan` separately from
`simulationOverridesPlan`; the preCheck closure uses it (no fallback) so
the parent / proven overrides never reach pre-send validation.
- **sequencer-client (sequencer)**: applies the proven override at the
canProposeAt site, plumbs it through `prunePending` to
`CheckpointProposalJob` so the globals builder and enqueue-time
validation see it. New `pauseProposingForSlots` test-only config.
- **sequencer-client (events)**: new `preparing-checkpoint` event with
`targetSlot`, `checkpointNumber`, `hadProposedParent`, `provenOverride`.
- **ethereum (test infra)**: `Delayer.pauseNextTxUntil*` accept a
per-call timeout to support boundary tests that need to wait > 180s.
- **end-to-end (new tests)**:
`epochs_proof_at_boundary.parallel.test.ts` covers smoke + four boundary
scenarios — proof lands during pipeline sleep; proof lands well before
deadline; proof never lands (with parent); proof lands / never lands
without proposed parent — using structured events and `retryUntil`
rather than log greps.
- **stdlib + interfaces**: schemas and configs updated for the new RPC
method and the new sequencer config knob.
## Summary

Fixes a build error in
`sequencer-client/src/sequencer/sequencer.ts:454`:

```
error TS2551: Property 'pendingCheckpointNumber' does not exist on type 'SimulationOverridesPlan'. Did you mean 'pendingCheckpointState'?
```

The `pendingCheckpointNumber` field was removed from
`SimulationOverridesPlan` and replaced with `chainTipsOverride.pending`.
The log context in `proposeContext` was still referencing the old field.
Updated the reference to use
`simulationOverridesPlan?.chainTipsOverride?.pending`, matching the
existing usage on line 438.

## Test plan

- `yarn workspace @aztec/sequencer-client build` succeeds
…dary (#23108)

## Summary

Fixes flaky CI on `merge-train/spartan`
([run](https://github.com/AztecProtocol/aztec-packages/actions/runs/25570963690),
[log](http://ci.aztec-labs.com/1778262953204813)) where
`epochs_proof_at_boundary.parallel.test.ts > proof never lands so no
checkpoint submission is attempted` failed with:

```
expect(received).toBe(expected)
  Expected: 31
  Received: 32
> 312 |     expect(Number(firstPostBoundary.slot)).toBe(Number(boundarySlot) + 1);
```

## Root cause

The assertion's inline comment explicitly acknowledges this is
*empirical*: whether the on-chain prune fires in-tx at `boundarySlot+1`
or only at `boundarySlot+2` depends on real-time L1 / proposer-rebuild
timing. In this run, slot 31's pipelined propose still failed
(`Rollup__InvalidArchive`) and slot 32 was the first slot where the
propose was accepted and the checkpoint published.

The merge-train head — #23098 (one-line log-context fix) — cannot
influence this timing. The flake originated from #23056
(`feat(sequencer): build optimistically across pruning epoch boundary`)
earlier in the same train.

## Fix

Relax `toBe(boundarySlot + 1)` → `toBeLessThanOrEqual(boundarySlot + 2)`
for both the no-parent and with-parent variants of "proof never lands".
The lower bound is already enforced by
`waitForFirstCheckpointAfterBoundary` filtering for `slot >
boundarySlot`. The test's intent (a checkpoint lands in the new epoch
shortly after the boundary) is preserved.
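The relaxed bound can be pictured as a plain predicate (a sketch only — the helper name is illustrative, not from the test):

```typescript
// Hypothetical predicate capturing the relaxed check: strictly after the
// boundary (the lower bound is already enforced by the
// waitForFirstCheckpointAfterBoundary filter) and at most two slots past it.
function isAcceptablePostBoundarySlot(slot: bigint, boundarySlot: bigint): boolean {
  return slot > boundarySlot && slot <= boundarySlot + 2n;
}
```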

The other two boundary tests where the proof DOES land use
`checkpointNumber >= boundaryPublished.checkpoint`, not slot equality,
so they aren't affected.

Full analysis:
https://gist.github.com/AztecBot/b4010e694332cca93a51024915867e9a

## Test plan

CI on this PR. The container ClaudeBox runs in lacks Docker and a
writable cache, so local `./bootstrap.sh ci` could not be executed.


ClaudeBox log: https://claudebox.work/s/d49b46d7e0cb49a6?run=1
The ArchiverDataStoreUpdater used to call `l2TipsCache.refresh()` inside
the `db.transactionAsync()` callback for every writer path. Two issues:

1. Mid-tx visibility. `refresh()` reassigns its internal `#tipsPromise`
   synchronously, which was observable to other callers before LMDB had
   actually committed. A concurrent reader calling `getL2Tips()` after
   the reassignment but before commit picks up a promise loaded against
   the in-flight tx state, while a sibling read on
   `#proposedCheckpoints` directly outside the tx still sees pre-commit
   state — producing split-snapshot reads in the sequencer's
   `checkSync()`.

2. No rollback on tx abort. If the LMDB transaction threw or aborted,
   the cache had already been replaced with a promise loaded against
   in-flight writes that would never commit. Future readers would see a
   cache reflecting rolled-back state.

Refresh now runs after the writer transaction has fully committed, so
it loads from the committed store and is never replaced when the writer
aborts.

This does not close the JS-side race window completely — there is still
a small "tips lag store" window between LMDB commit returning and
`refresh()` finishing its `loadFromStore`. The sequencer's `checkSync()`
consistency checks (sequencer-client/src/sequencer/sequencer.ts ~L700)
already handle that residual window by detecting the mismatch and
returning undefined; those checks are intentionally left in place.
@AztecBot AztecBot enabled auto-merge May 8, 2026 23:18

AztecBot commented May 8, 2026

🤖 Auto-merge enabled after 4 hours of inactivity. This PR will be merged automatically once all checks pass.

…ning per-call (#23093)

## Why

Follow-up to #21564 (bb-prover bb.js migration) addressing the IVC
verification perf regression that surfaced in `tx_stats_bench`.

The migration kept the legacy spawn-per-verification model: every
chonk/ultra-honk verification through `BBCircuitVerifier` spawned a
fresh `bb` process and SIGTERMed it after one proof.
`BB_NUM_IVC_VERIFIERS=8` only capped concurrency at the queue layer
(`QueuedIVCVerifier`), not the number of bb processes.

That made the bench spawn ~600 bb processes over its 60s 10 TPS phase
inside an 8-CPU isolate. Two compounding problems:

1. ~50–100 ms of `bb` startup tax on every verification's hot path.
2. The bind→listen race in `NativeUnixSocket`: bb's socket file appears
after `bind()` but before `listen()`. A TS `connect()` landing in that
window gets `ECONNREFUSED`. Vanishingly rare under low load; reliable
flake under contention. Diagnosis at
http://ci.aztec-labs.com/735256f13a268733.

## What

### Make `BB_NUM_IVC_VERIFIERS` mean what its name says (commits
aa99817, 0f4cb77)

Pool of long-lived bb verifier processes instead of fresh-per-call. The
factory class is renamed `BBJsProverFactory` → `BBJsFactory` (it's used
for both proving and verifying) and given a single `getInstance():
Promise<BBJsApi & AsyncDisposable>` method:

- `new BBJsFactory(path)` → no pool. Every `getInstance()` spawns a
fresh bb that is destroyed on dispose. Same as the previous
`withFreshInstance` behaviour — used by `BBNativeRollupProver`, the AVM
proving tester, and ivc-integration helpers, so their semantics are
unchanged.
- `new BBJsFactory(path, { poolSize: N })` → pool of N long-lived bb
processes, lazily spawned on first acquire. Used by `BBCircuitVerifier`
with `poolSize: numConcurrentIVCVerifiers`.

Callers use `await using inst = await factory.getInstance()` for
RAII-style release, matching the codebase's preference for
`AsyncDisposable`. `BBCircuitVerifier.stop` (already wired through to
aztec-node shutdown) tears the pool down.
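The pooling shape can be sketched with a generic `acquire`/`release` pair (an illustration only, not the real `BBJsFactory` API, which instead hands out `AsyncDisposable` instances for `await using`):

```typescript
// Hypothetical fixed-size pool: instances are lazily spawned up to
// poolSize on first acquire; further acquirers wait until a release.
class InstancePool<T> {
  private free: T[] = [];
  private spawned = 0;
  private waiters: ((v: T) => void)[] = [];

  constructor(private readonly poolSize: number, private readonly spawn: () => T) {}

  async acquire(): Promise<{ value: T; release: () => void }> {
    const value =
      this.free.pop() ??
      (this.spawned < this.poolSize
        ? (this.spawned++, this.spawn()) // lazily spawn a new instance
        : await new Promise<T>(res => this.waiters.push(res)));
    return {
      value,
      release: () => {
        // Hand the instance to the oldest waiter, else back to the free list.
        const next = this.waiters.shift();
        next ? next(value) : this.free.push(value);
      },
    };
  }
}
```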

### Close the bind→listen race in bb.js (commit 8e519b0)

`barretenberg/ts/src/bb_backends/node/native_socket.ts`: retry
`connect()` on `ECONNREFUSED` with exponential backoff (capped at 50 ms)
up to the existing 5 s budget. Other socket errors fail fast as before.
Pool startup still spawns N bb processes in parallel, so the race
surface is reduced from ~600 to N — the retry handles the residual.
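The retry strategy can be sketched as follows — an illustration of the approach, not the actual `native_socket.ts` code (which targets bb's unix socket; a generic `net` connection is used here):

```typescript
import net from 'net';

// Retry connect() on ECONNREFUSED with exponential backoff capped at
// 50 ms, inside an overall time budget; any other error fails fast.
async function connectWithRetry(
  opts: net.NetConnectOpts,
  budgetMs = 5_000,
): Promise<net.Socket> {
  const deadline = Date.now() + budgetMs;
  let backoffMs = 1;
  for (;;) {
    try {
      return await new Promise<net.Socket>((resolve, reject) => {
        const sock = net.connect(opts);
        sock.once('connect', () => resolve(sock));
        sock.once('error', reject);
      });
    } catch (err) {
      const code = (err as NodeJS.ErrnoException).code;
      // Only the bind→listen race surfaces as ECONNREFUSED.
      if (code !== 'ECONNREFUSED' || Date.now() + backoffMs > deadline) throw err;
      await new Promise(r => setTimeout(r, backoffMs));
      backoffMs = Math.min(backoffMs * 2, 50); // cap the backoff
    }
  }
}
```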

### Server-side Chonk proof split (commit 97577cf)

`splitChonkProofToStructured` in TS had three hand-maintained constants
(`MERGE_PROOF_SIZE`, `ECCVM_PROOF_LENGTH`, `JOINT_PROOF_LENGTH`)
duplicating C++ values. When C++ shifted Chonk layout (e.g. databus
relation changes shrinking the oink portion in the previous round of
regressions), these went stale and verification failed deep in the
verifier with an opaque "OinkVerifier: num_public_inputs mismatch with
VK".

Add a new `ChonkVerifyFromFields` bbapi command that takes a flat
`Vec<bb::fr>` and calls `ChonkProof::from_field_elements` server-side,
then runs the verifier. The TS layer now passes flat fields straight
through — no layout knowledge, no hand-maintained constants.

- `bbapi_chonk.{hpp,cpp}`: new struct + `execute()`.
- `bbapi_execute.hpp`: register the variant.
- `bb_js_backend.ts`: `verifyChonkProof` calls the new API;
`splitChonkProofToStructured` and the 3 constants are deleted.

### Disposal robustness (commit 5cde220)

The first cut of `BBJsFactory` had three `.catch(() => {})` clauses that
silently swallowed bb `destroy()` errors, and an `initPool()` that
dropped already-spawned bb children if a sibling creation failed
(`Promise.all` short-circuit). Both would manifest as the Jest "worker
failed to exit gracefully" warning we hit on one test run.

Now: destroy errors propagate (`AggregateError` for the pool path);
`initPool` uses `allSettled` and tears down anything it spawned if any
sibling rejects.

### Playground bundle size (commit 1681d33)

The new `ChonkVerifyFromFields` bbapi variant tipped the playground main
entrypoint over the 1750 KB hard limit. Bumped to 1800 with a bump-log
entry.

## Effect

- `tx_stats_bench`: 600 bb spawns → 8 bb spawns at boot, then 8
long-lived processes serve every verification. The bind→listen race
surface drops 75×, *and* the residual is handled by the connect retry.
Per-call ~50–100 ms `bb` startup cost disappears from the verifier hot
path.
- Brittle TS Chonk constants are gone — Chonk layout changes in C++ can
no longer manifest as opaque verifier errors in TS.
- Disposal failures surface instead of leaking bb children.
- Behaviour for proving paths (`BBNativeRollupProver`, AVM tests,
ivc-integration) is unchanged — they still spawn fresh per call.

ClaudeBox log: https://claudebox.work/s/2d65052b0deaeab2?run=3

---------

Co-authored-by: Charlie <5764343+charlielye@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@spalladino spalladino requested a review from Thunkar as a code owner May 11, 2026 12:26
spalladino and others added 9 commits May 11, 2026 09:30
…23113)

## Motivation

Under proposer pipelining, checkpoint N's fee asset price modifier is
computed in slot N-1 before checkpoint N-1 has landed on L1. The
proposer was reading `rollupContract.getEthPerFeeAsset()`, which still
reflects the latest published checkpoint (commonly N-2), while L1 later
applies the modifier against checkpoint N-1's `ethPerFeeAsset`. The
mismatch produced a 1-checkpoint drift between the proposer's intended
new price and the price L1 actually stored, causing the e2e
price-convergence test to oscillate around the target instead of
converging.

## Approach

Threads the predicted parent's `ethPerFeeAsset` through to the modifier
computation. `buildPipelinedParentSimulationOverridesPlan` already
derives that fee header for global-variable simulation overrides; the
pending fee header on the resulting plan is used as the reference price
for the bps calculation. Non-pipelined paths and genesis (checkpoint <
2) fall back to today's `getEthPerFeeAsset()` read.

## Changes

- **ethereum**: `FeeAssetPriceOracle.computePriceModifier()` now takes
an optional `currentPriceE12`; when supplied, the L1 read is skipped.
- **sequencer-client**: `SequencerPublisher.getFeeAssetPriceModifier()`
forwards the optional predicted price; `checkpoint_proposal_job` reads
it from the pipelined simulation overrides plan and passes it through.
- **end-to-end**: enables proposer pipelining + `inboxLag: 2` in
`fee_asset_price_oracle_gossip.test.ts`.
- **ethereum (tests)**: adds 3 unit tests covering the L1-read fallback,
the predicted-price short-circuit, and concrete-value consistency with
`RollupContract.computeChildFeeHeader` (asserting modifier truncation
leaves a sub-bp gap to target).

Note: if pipelining is enabled at checkpoint ≥ 2 but
`proposedCheckpointData` is missing, the predicted parent will be
`undefined` and the code silently falls back to the stale L1 read.
That is a pre-existing failure mode, not introduced here.
…ulation (#23073)

## Motivation

When proposer pipelining is enabled, the sequencer simulates `propose()`
for checkpoint K one slot ahead while K-1 has not yet landed on L1. The
previous override only patched `tips.pending`, `archives[K-1]`, and
(sometimes) the fee header, leaving the rest of
`tempCheckpointLogs[K-1]` at storage zero. With `slotNumber` zeroed,
`canPruneAtTime` falsely declared the proof window expired, the contract
returned `proven` from `getEffectivePendingCheckpointNumber`, and the
precheck reverted with `Rollup__InvalidArchive` — surfacing as a
`proposer-rollup-check-failed` storm whenever a checkpoint took an extra
L1 block to land.

Additionally, fixes a bug in L2-to-L1 messages related to how the
`outHash` is computed by the proposer (see "include parent
checkpointOutHash when pipelining same-epoch builds").

Also adds sanity checks to `checkSync` to guard against race conditions
when querying archiver data.

## Approach

The simulated `tempCheckpointLogs[K-1]` cell is now byte-faithful with
what L1 will see once K-1 actually lands: header hash, out hash, payload
digest, slot number, and fee header. `blobCommitmentsHash` and
`attestationsHash` are intentionally left out — the propose path never
asserts on them. The override is built through a single per-cell helper
that throws on `slotNumber > uint32`, mirroring the on-chain
`SafeCast.toUint32`.
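The uint32 guard can be sketched as follows (helper name and error message are illustrative; the real check lives inside the per-cell override helper):

```typescript
const UINT32_MAX = 0xffff_ffffn;

// Mirror of the on-chain SafeCast.toUint32: reject any slot number that
// would not round-trip through the uint32 storage field.
function slotNumberToUint32(slotNumber: bigint): number {
  if (slotNumber < 0n || slotNumber > UINT32_MAX) {
    throw new Error(`slot number ${slotNumber} does not fit in uint32`);
  }
  return Number(slotNumber);
}
```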

## Changes

- **stdlib (`checkpoint/digest.ts`)**: new shared
`computeCheckpointPayloadDigest` helper. Archiver migrated to it.
- **ethereum (`rollup.ts` / `chain_state_override.ts`)**: replaces
`makeFeeHeaderOverride` with `makeTempCheckpointLogOverride`
(all-required) and `makeTempCheckpointLogPartialOverride` (subset).
Extends `PendingCheckpointOverrideState` and
`SimulationOverridesBuilder` with
`withPendingHeaderHash/OutHash/PayloadDigest/SlotNumber`. Plan
translation now goes through the partial helper so a missing fee header
no longer suppresses the rest.
- **sequencer-client**: `buildPipelinedParentSimulationOverridesPlan`
takes a `signatureContext`, populates the new fields when
`proposedCheckpointData` matches the parent, and guards against stale
entries. The inline override in `Sequencer` is consolidated through the
helper, with a defensive archive fallback when `proposedCheckpointData`
is absent. `CheckpointProposalJob` threads the signature context
through.
- **end-to-end (`epochs_mbps.parallel.test`)**: switches the test to the
pipelined-MBPS timing (12s L1 / 72s L2 / 5500ms blocks,
`enableProposerPipelining: true`, `perBlockAllocationMultiplier: 8`) and
asserts there are no `proposer-rollup-check-failed` events under normal
operation.
- **.test_patterns.yml**: marks the L2-to-L1-messages variant of the
test as `skip: true` for an unrelated `Tx dropped by P2P node` flake
under the new pipelined timing — tracked as a follow-up.
- **tests**: new unit tests for `makeTempCheckpointLogOverride`
(storage-slot round-trip via `getCheckpoint`, slot-overflow throw,
partial-emission), `withPending*` builders, and the
populated/empty/stale-checkpoint paths in
`buildPipelinedParentSimulationOverridesPlan`.
Enable pipelining on the missed L1 slot e2e test
Fixes an issue where a stuck request could update the gauge after it had
already been updated by subsequent requests that succeeded quickly.
Gauge updates are now forced to apply in sequence; an out-of-order
result is simply dropped.

Also adds some logging so we can see what's happening
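One way to picture the sequencing (a sketch, not the actual metrics code): each request takes a monotonically increasing token when it starts, and a completion only lands if no later-started request has already reported.

```typescript
// Hypothetical sequenced gauge: stale completions are dropped instead of
// clobbering a newer value.
class SequencedGauge {
  private nextToken = 0;
  private lastApplied = 0;
  value = 0;

  start(): number {
    return ++this.nextToken;
  }

  // Returns true if the value was applied, false if it was dropped.
  record(token: number, value: number): boolean {
    if (token <= this.lastApplied) return false; // out of order: drop
    this.lastApplied = token;
    this.value = value;
    return true;
  }
}
```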
- Preserve local validator slashing-protection records across the known
LMDB schema 1 -> 2 migration.
- Add a fail-closed schema mismatch policy for versioned stores and wire
it into signing protection.
- Add regression coverage for preserving legacy duty records and
refusing newer stored schemas.

Fixes
[A-1029](https://linear.app/aztec-labs/issue/A-1029/prevent-lmdb-slashing-protection-reset-on-schema-mismatch)
Fixes AztecProtocol/aztec-claude#888
Toggle pipelining on all e2e p2p tests
…23110)

Move every `l2TipsCache.refresh()` call in `ArchiverDataStoreUpdater`
out of the surrounding `db.transactionAsync` callback and into the
post-commit code path. This addresses two issues with the previous
in-transaction refresh:

1. **Mid-tx visibility.** `L2TipsCache.refresh()` reassigns its internal
`#tipsPromise` synchronously, which was observable to other callers
before LMDB committed. A concurrent reader calling `getL2Tips()` after
the reassignment but before commit would pick up a promise loaded
against in-flight tx state, while a sibling read on the store directly
outside the tx still saw pre-commit state.
2. **No rollback on tx abort.** If the LMDB transaction threw or
aborted, the cache was already replaced with a promise loaded against
in-flight writes that would never commit. Future readers saw a cache
reflecting rolled-back state.

Refresh now runs after the writer transaction has fully committed, so it
loads from the committed store and is never replaced when the writer
aborts.

## Notes

- This intentionally leaves a narrow JS-side race window between LMDB
commit returning and `refresh()` finishing its `loadFromStore`.
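The before/after ordering can be sketched as follows (the interface names are simplifications of the archiver's actual types):

```typescript
interface Db {
  transactionAsync<T>(fn: () => Promise<T>): Promise<T>;
}
interface TipsCache {
  refresh(): Promise<void>;
}

// Post-commit refresh: the cache is only repointed once the writer tx
// has committed, and never repointed at all if the tx throws/aborts.
async function commitThenRefresh(
  db: Db,
  cache: TipsCache,
  write: () => Promise<void>,
): Promise<void> {
  await db.transactionAsync(write); // all LMDB writes stay inside the tx
  await cache.refresh(); // loads from committed state only
}
```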
…tart (#23162)

## Motivation

`e2e_p2p_data_withholding_slash` was flaky because L1 raced past the
epoch-8 prune deadline (`aztecProofSubmissionEpochs=0` makes the
deadline ~32s after slot 17) while we stopped, wiped, and recreated the
4 validators (~28s). The recreated archivers detected the prune during
their initial L1 sync and emitted `L2PruneUnproven` for epoch 8 with the
original tx-carrying block, but `EpochPruneWatcher.start()` is only
invoked inside `void archiver.waitForInitialSync().then(...)` in
`aztec-node/server.ts`, so the listener wasn't attached yet and the
event dropped silently. The recreated validators then built an empty
epoch 10 on top of genesis which pruned cleanly later, producing 4
`VALID_EPOCH_PRUNED` offenses instead of the expected 4
`DATA_WITHHOLDING`.

## Approach

Pause anvil block production between `removeInitialNode` and `stopNodes`
so L1 stays inside epoch 8 across the recreate gap. The recreated
archivers then ingest checkpoint 1 cleanly during initial sync (no prune
fires, nothing to miss), `EpochPruneWatcher.start()` attaches its
listener, and we resume L1 with an explicit warp + mine + interval
restart so the deadline crossing is deterministic — the prune now fires
while the watcher is live, producing `DATA_WITHHOLDING` for epoch 8 as
the test expects. A `getCurrentEpoch < 9` assertion right after pausing
fails fast if the timing window ever tightens further.

## Changes

- **end-to-end (tests)**: in `data_withholding_slash.test.ts`, pause L1
mining after `removeInitialNode` and before `stopNodes`; resume after
`waitForP2PMeshConnectivity` by warping to current wall-clock time,
mining one L1 block, and restoring interval mining. Add a fail-fast
assertion that we are still in epoch 8 when we pause.