feat: merge-train/spartan#22980
Open
AztecBot wants to merge 58 commits into
Open
Conversation
## Summary - Keep `getPublicIp()` at startup so the ENR always has a valid IP from the start - Enable discv5 `enrUpdate` with `addrVotesToUpdateEnr: 1` and faster pings (10s) when `queryForIp` is enabled, so PONG votes can correct the IP at runtime if it changes (e.g. residential ISP, Cloud NAT rotation) - Bridge discv5 IP changes to libp2p's AddressManager so peers see updated addresses - Have the bootnode explicitly `addEnr()` on discovery to fix routing table gaps where nodes were never inserted - Improve P2P observability: log KAD table state in peer manager heartbeats, log ENR additions with multiaddrs, log config at startup - Small change to deploy scripts that allows us to define a full aztec image to deploy on a network rather than just `aztecprotcool/aztec:<tag>` Fixes [A-310](https://linear.app/aztec-labs/issue/A-310/p2p-query-for-ip-should-detect-ip-changes) Co-authored-by: Alex Gherghisan <alexghr@users.noreply.github.com> Co-authored-by: danielntmd <162406516+danielntmd@users.noreply.github.com>
…2967) ## Motivation The `e2e_epochs/epochs_missed_l1_publish` test fails intermittently when its proposer-discovery scan looks too far into the future. The L1 rollup contract reverts with `ValidatorSelection__EpochNotStable` for any epoch whose randao sample timestamp is still ahead of `block.timestamp`, and the test was scanning up to 60 slots (~15 epochs at the test's epoch duration) ahead, well past the queryable horizon. ## Approach Wrap the proposer scan in a retry loop that catches `EpochNotStable`, warps L1 forward by one epoch, and re-queries the same candidate. After each warp the scan also re-anchors the candidate to keep the +4 slot margin from the new "now", so subsequent steps (the warp to `slotZero` and sequencer start-up) still have headroom. ## Changes - **end-to-end (tests)**: Replace the bounded `for` loop in `epochs_missed_l1_publish.test.ts` with a try/catch retry that warps L1 on `EpochNotStable`.
These sequencer errors were ignored in some tests. Removing that since this error should not happen. If it does, it's cause for analysis.
Enable pipelining on `epochs_first_slot` and `simple_block_building`
Had been accidentally introduced in #22759
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
… objects (#22933) ## Motivation Clean up the checkpoint side of `L2BlockSource`. PR #22809 already collapsed the block-side API into 4 query-shaped methods over 2 return types; the checkpoint surface was left with the pre-refactor sprawl (9 narrow methods over 4 return shapes, parallel by-number / by-range / by-epoch entrypoints, and a wire-level alias that conflated proposed and confirmed checkpoints). This change applies the same simplification. Fixes A-979 ## Approach `L2BlockSource` checkpoint methods reduce to 4 query-shaped readers (`getCheckpoint`, `getCheckpoints`, `getCheckpointData`, `getCheckpointsData`) over 2 return shapes (`PublishedCheckpoint`, `CheckpointData`), plus a polymorphic `getProposedCheckpointData(query?)` for the proposed-only path. Three new query types live next to `BlockQuery`/`BlocksQuery`. On-disk format and `BlockStore` primitives are unchanged — the simplification is at the API boundary. The public RPC's `getCheckpoint` keeps the same wire signature but gains a confirmed→proposed fallback (for `{number}`/`{slot}`/`'proposed'` lookups) and `BadRequestError` guards for incompatible `include*` flags. ## API surface change ### Methods removed from `L2BlockSource` `getCheckpoints(from, limit)`, `getCheckpointData(n)`, `getCheckpointDataRange(from, limit)`, `getCheckpointsForEpoch(epoch)`, `getCheckpointsDataForEpoch(epoch)`, `getCheckpointNumberBySlot(slot)`, `getLastCheckpoint()`, `getLastProposedCheckpoint()`. Dead methods on `data_source_base` also removed: `getCheckpointHeader`, `getLastBlockNumberInCheckpoint`, `getSynchedCheckpointNumber`. ### Methods added to `L2BlockSource` ```ts getCheckpoint(query: CheckpointQuery): Promise<PublishedCheckpoint | undefined> getCheckpoints(query: CheckpointsQuery): Promise<PublishedCheckpoint[]> getCheckpointData(query: CheckpointQuery): Promise<CheckpointData | undefined> getCheckpointsData(query: CheckpointsQuery): Promise<CheckpointData[]> getProposedCheckpointData(query?: ProposedCheckpointQuery): Promise<ProposedCheckpointData | undefined> type CheckpointQuery = { number } | { slot } | { tag: 'checkpointed' | 'proven' | 'finalized' } type CheckpointsQuery = { from, limit } | { epoch } type ProposedCheckpointQuery = { number } | { slot } | { tag: 'proposed' } ``` ### Public RPC (`AztecNode`) wire-level changes - `getCheckpointsDataForEpoch(epoch)` removed; `getCheckpointsData(query: CheckpointsQuery)` added (range or epoch). - `'latest'` removed from `CheckpointParameter`. - `'proposed'` semantics changed: previously aliased to "latest L1-confirmed checkpoint" (a documented foot-gun); now `getCheckpoint('proposed')` strictly targets the proposed-checkpoint store, and `getCheckpointNumber('proposed')` returns the proposed-tip number with confirmed fallback. - `getCheckpoint({ number }) / ({ slot })` now check confirmed first then fall back to proposed; tag-based lookups (`'checkpointed'` / `'proven'` / `'finalized'`) do not fall back. - `getCheckpoint('proposed', { includeL1PublishInfo: true | includeAttestations: true })` and the same flags on a by-number/by-slot lookup that resolves to a proposed entry now throw `BadRequestError` (proposed checkpoints have no L1 publish info or attestations). ### Types kept `CheckpointData`, `CommonCheckpointData` (structural base of `CheckpointData` / `ProposedCheckpointInput`), `ProposedCheckpointData`, `ProposedCheckpointInput`, `PublishedCheckpoint`, `Checkpoint`. No structural-type deletions. Migration guidance for wallet/SDK consumers is in `docs/docs-developers/docs/resources/migration_notes.md`. ## Changes - **stdlib**: New query types (`CheckpointQuery`, `CheckpointsQuery`, `ProposedCheckpointQuery`) + Zod schemas in `block/l2_block_source.ts`. `'latest'` literal removed from `interfaces/checkpoint_parameter.ts`. `NormalizedCheckpointDispatch` type for the server's parameter normalizer. `ArchiverApiSchema` and `AztecNode` schema updated. `computeL2ToL1MembershipWitness` switched to the new query shape. - **archiver**: `data_source_base` adds `resolveCheckpointQuery` / `resolveCheckpointsQuery` mirroring the block-side helpers, implements the 4 confirmed methods plus the polymorphic proposed lookup. `BlockStore` adds `getProposedCheckpointBySlot(slot)`. `MockArchiver` and `mock_l2_block_source` updated to match the new interface. - **aztec-node**: `server.ts` adds the confirmed→proposed fallback flow with the two `BadRequestError` guards in `getCheckpoint`, sources all tips from a single `getL2Tips()` call in `getCheckpointNumber`, and routes the public RPC through the new internal methods. New pure-projection helper `projectProposedToCheckpointResponse` in `block_response_helpers.ts`. - **consumer migrations**: prover-node (collapses two checkpoint fetches into one `getCheckpoints({ epoch })`), world-state, slasher, sequencer (`checkpoint_proposal_job`, `sequencer`), validator (`proposal_handler`), `L2BlockStream`, pxe `block_stream_source`, telemetry wrapper, and 10 e2e files updated to the new query shapes. - **tests**: 48 new `it()` blocks covering each query discriminant, the throw guards, the confirmed→proposed fallback, the polymorphic `getProposedCheckpointData` dispatch, and `BlockStore.getProposedCheckpointBySlot`. - **docs**: `migration_notes.md` updated with the breaking changes for downstream wallet/SDK consumers.
…oposal check (#22989) ## Motivation `hasPayloadBeenProposed` (now `hasActiveProposalWithPayload`) used `eth_getLogs` over the rollup's full L1 deployment range to find prior `PayloadSubmitted` events. On long-lived rollups that range exceeds typical RPC provider block-range caps and the call times out, silently breaking the sequencer's "stop signaling for an already-proposed payload" logic. The previous in-memory cache also permanently blacklisted any payload it saw as proposed once, which is wrong: each round on `EmpireBase` is independent and the same payload can legitimately be re-signaled and re-submitted after a prior proposal becomes Dropped/Rejected/Expired/Executed. ## Approach Replace the log scan with a bounded view-call sweep over `Governance.proposals`. The sweep walks newest -> oldest using `proposalCount`, unwraps each proposal's `GSEPayload` via `getOriginalPayload()`, and treats only `Pending`/`Active`/`Queued`/`Executable` as "in an active proposal" -- terminal states allow re-signaling. The descent has a hard early-stop on the protocol-wide proposal lifetime cap (`4 * ConfigurationLib.TIME_UPPER = 360 days`), which is safe regardless of per-proposal frozen configs because every config field is bounded by `TIME_UPPER` on-chain. Two in-memory caches absorb the per-call cost over time: terminal proposals (provably immutable on-chain) and wrapper -> original payload unwraps (immutable bytecode). ## Changes - **ethereum/contracts/governance**: New `hasActiveProposalWithPayload(payload)` and `getProposalCount()` on `ReadOnlyGovernanceContract`. Inlines a minimal `IProposerPayload` ABI (just `getOriginalPayload`) to avoid generating a full artifact. Handles `proposeWithLock`-style proposals (no GSEPayload wrapper) by catching the unwrap revert and skipping. - **ethereum/contracts/governance (types)**: Adds explicit types (`Proposal`, `ProposalConfiguration`, `GovernanceConfiguration`, `ProposeWithLockConfiguration`, `Ballot`) and maps the viem return shapes of `getProposal` / `getConfiguration` onto them. `Proposal` now carries both `cachedState` (raw stored) and `state` (live, time-derived from `getProposalState`); `getProposal` issues both reads in parallel so callers don't need a separate state RPC. - **ethereum/contracts/governance (caching)**: Adds two memoization layers on `ReadOnlyGovernanceContract`. Proposals are cached when `state` is in any of the four terminal phases (Executed/Rejected/Dropped/Expired) -- once terminal the entire struct is provably immutable on-chain. Wrapper unwraps are keyed by wrapper address and cached forever (deployed bytecode is immutable). `GovernanceProposerContract` already memoizes its `getGovernance()`, so the same `ReadOnlyGovernanceContract` instance (and its caches) is reused across slots in the sequencer publisher. - **ethereum/contracts/governance_proposer**: Drops the event-based `hasPayloadBeenProposed`. Adds a memoized `getGovernance()` accessor and a thin `hasActiveProposalWithPayload` delegate that resolves the Governance address via the on-chain registry lookup. - **ethereum/contracts/empire_base**: Removes `hasPayloadBeenProposed` from `IEmpireBase` -- it's a Governance concern, not a generic empire concern (slasher doesn't need it). - **sequencer-client/publisher**: Removes the permanent `payloadProposedCache` so the publisher re-checks every slot, allowing re-signaling once a prior proposal is terminal. Switches the failure mode from fail-closed to fail-open (a flaky L1 endpoint should not silence governance participation; a duplicate signal is harmless). Narrows the helper's `base` param from `IEmpireBase` to `GovernanceProposerContract` since this code path is governance-only. - **ethereum/contracts (tests)**: New `hasActiveProposalWithPayload` describe block hitting a real anvil-deployed Governance. Impersonates the `governanceProposer`, calls `Governance.propose` directly, and etches hand-rolled mock wrapper bytecode at chosen addresses to drive (wrapper, original) pairs. Covers: empty governance, live match, no match, terminal state via warp, reverting wrapper (proposeWithLock-style), descent past unrelated proposals, case-insensitive match, and the 360-day hard cutoff via warp. Also adds a sync-guard describe block that probes `Governance.updateConfiguration` via impersonated `eth_call` to assert each of `votingDelay`/`votingDuration`/`executionDelay`/`gracePeriod` accepts `TIME_UPPER` and rejects `TIME_UPPER + 1` -- if those caps change on-chain, this trips and `MAX_PROPOSAL_LIFETIME_SECONDS` must be revisited. - **sequencer-client/publisher (tests)**: Replaces the cache test with a "re-checks each call so re-signaling resumes after terminal" test. Updates the RPC-failure semantics test from fail-closed to fail-open.
…ile in CI (#23000) ## Summary Fixes the `docs` build failure on `merge-train/spartan` (CI run [25449092262](https://github.com/AztecProtocol/aztec-packages/actions/runs/25449092262), log [27a4351a1e5e3568](http://ci.aztec-labs.com/27a4351a1e5e3568)). ## Problem `validate-webapp-tutorial` in `docs/examples/bootstrap.sh` intentionally starts each run with an empty `yarn.lock`, then runs `yarn install` to populate it from the `link:` paths it just wrote into `package.json`. In CI, Yarn 4 auto-enables `--immutable` when it detects `CI=1`, so the install fails with `YN0028 (frozen lockfile exception)` because populating an empty lockfile counts as modifying it. ``` ➤ YN0028: │ The lockfile would have been modified by this install, which is explicitly forbidden. ➤ YN0000: · Failed with errors in 6s 829ms ERROR: Contract artifact not found at /home/aztec-dev/aztec-packages/docs/target/pod_racing_contract-PodRacing.json ``` (The "Contract artifact not found" line is a downstream symptom — the script doesn't run with `set -e`, so after `yarn install` fails it continues into the artifact check and reports a misleading error.) ## Fix Set `YARN_ENABLE_IMMUTABLE_INSTALLS=false` for that one `yarn install` call, since populating the lockfile is the intended behaviour. ## Verification Reproduced locally: `CI=true yarn install` against the webapp-tutorial fails with `YN0028`; with `YARN_ENABLE_IMMUTABLE_INSTALLS=false` it succeeds. ClaudeBox log: https://claudebox.work/s/a1863de35053b544?run=1
Collaborator
Author
|
🤖 Auto-merge enabled after 4 hours of inactivity. This PR will be merged automatically once all checks pass. |
…ts (#23009) No major changes needed
…22994) ## Motivation The `aztec.archiver.block_height` series with no status attribute (rendered as the "Pending chain" line on the network, prover, and fisherman Grafana dashboards) stopped being published a couple of weeks ago. With pipelining enabled every checkpoint arriving from L1 already has its blocks in the proposed store, so the L1 synchronizer always took the new promotion fast path introduced in #22716, leaving `checkpointsToAdd` empty and skipping the metric call. ## Approach Record the checkpointed block-height metrics across all valid checkpoints in the batch instead of only the ones routed through `addCheckpoints`, so the promoted checkpoint contributes too. The duration is averaged over the full batch since `addCheckpoints` performs the work for both paths in a single transaction. ## Changes - **archiver (`l1_synchronizer.ts`)**: Move the `processNewCheckpointedBlocks` call to use `validCheckpoints` rather than `checkpointsToAdd`, restoring the empty-status `block_height`, `checkpoint_height`, `sync_block_count`, and `sync_per_checkpoint` series under pipelining. --------- Co-authored-by: Alex Gherghisan <alexghr@users.noreply.github.com>
…23071) Logging-only change
Alternative fix for flake in epochs_mbps.pipeline. See also https://gist.github.com/AztecBot/164f3e04bd74f48cafd6505433935421.
…ource tip helpers (#22934) ## Motivation After three back-to-back unifications of the block/checkpoint APIs (#22781, #22809, plus the two query-object refactors on this stack), four `@deprecated` `AztecNode` RPC methods and three redundant `L2BlockSource` tip-number helpers had outlived their replacements and remained only as stop-gaps. This PR retires them and migrates every caller to the canonical query-object APIs. ## Approach Removed `isL1ToL2MessageSynced`, `getL2Tips`, `getBlockHeader`, `getCheckpointedBlocks` from `AztecNode`, and `getProvenBlockNumber`, `getCheckpointedL2BlockNumber`, `getFinalizedL2BlockNumber` from `L2BlockSource`. Callers now use `getL1ToL2MessageCheckpoint`, `getChainTips`, `getBlock(...).header`, `getBlocks(..., { onlyCheckpointed, includeL1PublishInfo, includeAttestations })`, and `getBlockNumber({ tag })` respectively. `BlockIncludeOptions` was split into a single-block variant and a `BlocksIncludeOptions` extension so `onlyCheckpointed` is rejected at the type level on `getBlock`. Internal `BlockStore` primitives are intentionally kept since they remain the underlying implementation. ## Changes - **stdlib (interfaces)**: dropped four `@deprecated` `AztecNode` methods + their zod entries; dropped three tip-number helpers from `L2BlockSource` and its archiver schema; split `BlockIncludeOptions` into single- and range-block variants - **aztec-node**: removed deprecated server impls; simplified `getBlockNumber(tip)` to a single `getBlockNumber({ tag: tip })` call; fixed `getL1ToL2MessageCheckpoint` to handle `messageIndex === 0n` correctly (previously coerced to `undefined` via truthy check) - **archiver**: dropped the now-unused tip-number passthroughs in `data_source_base` and the `MockL2BlockSource` overrides - **prover-node, p2p**: migrated `getProvenBlockNumber` callers to `getBlockNumber({ tag: 'proven' })` - **pxe**: adjusted `block_stream_source` to wrap `getChainTips()` into the `L2Tips` shape required by `L2BlockStream` - **txe**: added `l2TipsProvider` getter that adapts `getChainTips()` for the TXE state machine - **end-to-end (tests)**: migrated 15+ test files to the new APIs (`getBlocks` with `onlyCheckpointed`/`includeTransactions` where bodies are read, `getChainTips`, `getBlock(...).header`, `getL1ToL2MessageCheckpoint(...) !== undefined`) - **aztec-node, stdlib (tests)**: dropped tests of removed methods; added unit tests covering the `messageIndex === 0n` edge case - **docs**: updated the node-API generator to drop removed methods, regenerated the operator API reference, and migrated `node_getL2Tips` curl examples in the operator setup guides to `node_getChainTips`
## Summary - Keep `getPublicIp()` at startup so the ENR always has a valid IP from the start - Enable discv5 `enrUpdate` with `addrVotesToUpdateEnr: 1` and faster pings (10s) when `queryForIp` is enabled, so PONG votes can correct the IP at runtime if it changes (e.g. residential ISP, Cloud NAT rotation) - Bridge discv5 IP changes to libp2p's AddressManager so peers see updated addresses - Have the bootnode explicitly `addEnr()` on discovery to fix routing table gaps where nodes were never inserted - Improve P2P observability: log KAD table state in peer manager heartbeats, log ENR additions with multiaddrs, log config at startup - Small change to deploy scripts that allows us to define a full aztec image to deploy on a network rather than just `aztecprotcool/aztec:<tag>` Fixes [A-310](https://linear.app/aztec-labs/issue/A-310/p2p-query-for-ip-should-detect-ip-changes)
…artan (#23083) ## Motivation The `verifies transactions at 10 TPS` sub-test of [`yarn-project/end-to-end/src/bench/tx_stats_bench.test.ts`](https://github.com/AztecProtocol/aztec-packages/blob/merge-train/spartan/yarn-project/end-to-end/src/bench/tx_stats_bench.test.ts) is now reliably flaking on the `bench all` step of `merge-train/spartan`. It has fired on at least two different merge-train commits hours apart, with no relation to either commit's diff: | Run | Triggering merge-train commit | CI log | |---|---|---| | [25546251580](https://github.com/AztecProtocol/aztec-packages/actions/runs/25546251580) | #22934 (refactor(node-rpc)! removing deprecated AztecNode methods) | http://ci.aztec-labs.com/1778227975844707 | | [25552992890](https://github.com/AztecProtocol/aztec-packages/actions/runs/25552992890) | #22405 (feat(p2p): detect and track announce IP changes at runtime) | http://ci.aztec-labs.com/1778237470322975 | Both runs hit the same assertion: ``` ● transaction benchmarks › verifies transactions at 10 TPS expect(received).toBe(expected) // Object.is equality Expected: true Received: false at bench/tx_stats_bench.test.ts:268 ``` Sub-test failing log on the latest run: http://ci.aztec-labs.com/ca459ca73d02002c (`bench all` parent: http://ci.aztec-labs.com/90616bad7bf7ebaa). The other three sub-tests in the suite (compression; single private verify x20 serial; single public verify x20 serial) pass cleanly against the same proven txs in both runs. The failure is in the stress sub-test that fires 600 IVC verifications at 10/s with 8 concurrent IVC verifiers (`BB_NUM_IVC_VERIFIERS=8`, `BB_IVC_CONCURRENCY=1`). At least one verification returns `valid: false` under load. ## Cause Neither triggering commit touches the IVC verifier path: - #22934 is a pure node-rpc surface refactor. - #22405 is p2p / discv5 ENR plumbing. The two failures sharing this signature across unrelated diffs is strong evidence that the flake is independent of the merge-train commit and stems from the bench infrastructure itself. The likely culprit is the recent bb-prover migration to the bb.js `NativeUnixSocket` backend (#21564), which spawns a fresh bb subprocess per Chonk verification via `withVerifierInstance`. Under 8x parallel verifications on the CPU-isolated bench host (each verifier requesting 16 threads, 8 × 16 = 128 threads on 56 isolated cores), transient verifier failures appear. The bench-output log shows continuous `bb.js - Received signal 15, shutting down gracefully...` traffic during the 10 TPS phase — verifier instances are being torn down rapidly, and at least one verification slips through with a stale/incomplete response. Because the serial sub-tests (`numIterations = 20` sequential) pass cleanly in both runs, this is a stress-only interaction, not a correctness regression. ## Approach Add `tx_stats_bench` to `.test_patterns.yml` with an `error_regex` anchored to the test file's stack-trace line (`tx_stats_bench.test.ts:<line>:<col>`), and assign `*charlie` as owner (author of the bb.js migration). With this entry, `ci3/run_test_cmd` retries the test once on failure and treats a single retry-pass as a flake instead of a hard fail, unblocking the merge train for unrelated commits while Charlie investigates the underlying concurrency interaction with the bb.js backend. The `error_regex` is intentionally narrow (file + line + column from the stack trace) so other ways tx_stats_bench could fail (timeout, OOM, infra) are still surfaced as hard fails. ## Changes - `.test_patterns.yml`: add a `tx_stats_bench` entry with an error_regex anchored to the test file's stack-trace line and `*charlie` as owner. ClaudeBox logs: - https://claudebox.work/s/6e7853d3a073145f?run=1 (initial diagnosis on #22934 failure) - https://claudebox.work/s/c12a360275f05ad3?run=1 (this update on #22405 recurrence)
…ning (#23090) Slashing votes are EIP-712-signed for `targetSlot` (the pipelined proposal slot, not the wall-clock slot) and submitted via Multicall3.aggregate3 with allowFailure: true. The contract verifies the signature against getCurrentSlot() derived from block.timestamp, so the multicall must mine in the slot the vote was signed for or the inner sub-call reverts silently and VoteCast is never emitted. Two paths in the sequencer were sending vote-only multicalls without delaying submission to the target-slot start: 1. CheckpointProposalJob.execute() if (!broadcast) branch — proposer enqueued votes but did not build a checkpoint. 2. Sequencer.tryVoteWhenSyncFails — proposer enqueued votes in a slot where archiver sync had not caught up. Both now route through `sendRequestsAt(getTimestampForSlot(targetSlot))` when proposer pipelining is enabled. The sync-failure path uses fire-and-forget so the wait does not block the sequencer's work loop.
…23056) ## Motivation At a pruning epoch boundary, today's `canProposeAtTime` simulation pre-emptively reverts when an unproven epoch's deadline is about to expire — even if the proof lands seconds later. The slot is silently skipped. This loses a checkpoint window for no good reason: the publisher's preCheck right before L1 submission is the authoritative gate. Similarly, the simulation overrides applied to the preCheck flight due to pipelining (as in overriding the pending chain with the last mined slot) meant that we were silently missing the case where the epoch prune did trigger, so we were sending the tx and reverting. This is fixed by having different plans for the first simulation and the right-before-submission simulation. That said, sequencer publisher checks are a bit convoluted now, so I'm making a pass at them to try and simplify in a later PR. ## Approach Apply a proven-override at the three pre-submission simulation sites (`canProposeAt`, the globals builder, and enqueue-time `validateCheckpointForSubmission`) that forces `pending == proven` so `STFLib.canPruneAtTime` short-circuits to false. Submission's preCheck runs without the override against real L1 state and decides whether to actually send. A new structured `preparing-checkpoint` sequencer event surfaces the override/parent state for tests. Tip storage now goes through a single `makeChainTipsOverride` to avoid same-slot state-diff clobbering. ## Changes - **archiver**: `isPruneDueAtSlot(slot)` on `L2BlockSource` replicates `STFLib.canPruneAtTime` locally (no L1 RPC). - **ethereum**: `RollupContract.makeChainTipsOverride({pending?, proven?})` writes a single combined state-diff and guards `proven > pending`. `forPendingCheckpoint(n)` → `withChainTips({pending?, proven?})` on the simulation overrides builder. - **sequencer-client (publisher)**: `enqueueProposeCheckpoint` accepts `preCheckSimulationOverridesPlan` separately from `simulationOverridesPlan`; the preCheck closure uses it (no fallback) so the parent / proven overrides never reach pre-send validation. - **sequencer-client (sequencer)**: applies the proven override at the canProposeAt site, plumbs it through `prunePending` to `CheckpointProposalJob` so the globals builder and enqueue-time validation see it. New `pauseProposingForSlots` test-only config. - **sequencer-client (events)**: new `preparing-checkpoint` event with `targetSlot`, `checkpointNumber`, `hadProposedParent`, `provenOverride`. - **ethereum (test infra)**: `Delayer.pauseNextTxUntil*` accept a per-call timeout to support boundary tests that need to wait > 180s. - **end-to-end (new tests)**: `epochs_proof_at_boundary.parallel.test.ts` covers smoke + four boundary scenarios — proof lands during pipeline sleep; proof lands well before deadline; proof never lands (with parent); proof lands / never lands without proposed parent — using structured events and `retryUntil` rather than log greps. - **stdlib + interfaces**: schemas and configs updated for the new RPC method and the new sequencer config knob.
## Summary Fixes a build error in `sequencer-client/src/sequencer/sequencer.ts:454`: ``` error TS2551: Property 'pendingCheckpointNumber' does not exist on type 'SimulationOverridesPlan'. Did you mean 'pendingCheckpointState'? ``` The `pendingCheckpointNumber` field was removed from `SimulationOverridesPlan` and replaced with `chainTipsOverride.pending`. The log context in `proposeContext` was still referencing the old field. Updated the reference to use `simulationOverridesPlan?.chainTipsOverride?.pending`, matching the existing usage on line 438. ## Test plan - `yarn workspace @aztec/sequencer-client build` succeeds
…dary (#23108) ## Summary Fixes flaky CI on `merge-train/spartan` ([run](https://github.com/AztecProtocol/aztec-packages/actions/runs/25570963690), [log](http://ci.aztec-labs.com/1778262953204813)) where `epochs_proof_at_boundary.parallel.test.ts > proof never lands so no checkpoint submission is attempted` failed with: ``` expect(received).toBe(expected) Expected: 31 Received: 32 > 312 | expect(Number(firstPostBoundary.slot)).toBe(Number(boundarySlot) + 1); ``` ## Root cause The assertion's inline comment explicitly acknowledges this is *empirical*: whether the on-chain prune fires in-tx at `boundarySlot+1` or only at `boundarySlot+2` depends on real-time L1 / proposer-rebuild timing. In this run, slot 31's pipelined propose still failed (`Rollup__InvalidArchive`) and slot 32 was the first slot where the propose was accepted and the checkpoint published. The merge-train head — #23098 (one-line log-context fix) — cannot influence this timing. The flake originated from #23056 (`feat(sequencer): build optimistically across pruning epoch boundary`) earlier in the same train. ## Fix Relax `toBe(boundarySlot + 1)` → `toBeLessThanOrEqual(boundarySlot + 2)` for both the no-parent and with-parent variants of "proof never lands". The lower bound is already enforced by `waitForFirstCheckpointAfterBoundary` filtering for `slot > boundarySlot`. The test's intent (a checkpoint lands in the new epoch shortly after the boundary) is preserved. The other two boundary tests where the proof DOES land use `checkpointNumber >= boundaryPublished.checkpoint`, not slot equality, so they aren't affected. Full analysis: https://gist.github.com/AztecBot/b4010e694332cca93a51024915867e9a ## Test plan CI on this PR. The container ClaudeBox runs in lacks docker / writeable cache, so local `./bootstrap.sh ci` could not be executed. ClaudeBox log: https://claudebox.work/s/d49b46d7e0cb49a6?run=1
The ArchiverDataStoreUpdater used to call `l2TipsCache.refresh()` inside the `db.transactionAsync()` callback for every writer path. Two issues: 1. Mid-tx visibility. `refresh()` reassigns its internal #tipsPromise synchronously, which was observable to other callers before LMDB had actually committed. A concurrent reader calling `getL2Tips()` after the reassignment but before commit picks up a promise loaded against the in-flight tx state, while a sibling read on `#proposedCheckpoints` directly outside the tx still sees pre-commit state — split-snapshot reads in the sequencer's `checkSync()`. 2. No rollback on tx abort. If the LMDB transaction threw or aborted, the cache had already been replaced with a promise loaded against in-flight writes that would never commit. Future readers would see a cache reflecting rolled-back state. Refresh now runs after the writer transaction has fully committed, so it loads from the committed store and is never replaced when the writer aborts. This does not close the JS-side race window completely — there is still a small "tips lag store" window between LMDB commit returning and `refresh()` finishing its `loadFromStore`. The sequencer's `checkSync()` consistency checks (sequencer-client/src/sequencer/sequencer.ts ~L700) already handle that residual window by detecting the mismatch and returning undefined; those checks are intentionally left in place.
…ctions" This reverts commit 4735e42.
Collaborator
Author
|
🤖 Auto-merge enabled after 4 hours of inactivity. This PR will be merged automatically once all checks pass. |
…ning per-call (#23093) ## Why Follow-up to #21564 (bb-prover bb.js migration) addressing the IVC verification perf regression that surfaced in `tx_stats_bench`. The migration kept the legacy spawn-per-verification model: every chonk/ultra-honk verification through `BBCircuitVerifier` spawned a fresh `bb` process and SIGTERMed it after one proof. `BB_NUM_IVC_VERIFIERS=8` only capped concurrency at the queue layer (`QueuedIVCVerifier`), not the number of bb processes. That made the bench spawn ~600 bb processes over its 60s 10 TPS phase inside an 8-CPU isolate. Two compounding problems: 1. ~50–100 ms of `bb` startup tax on every verification's hot path. 2. The bind→listen race in `NativeUnixSocket`: bb's socket file appears after `bind()` but before `listen()`. A TS `connect()` landing in that window gets `ECONNREFUSED`. Vanishingly rare under low load; reliable flake under contention. Diagnosis at http://ci.aztec-labs.com/735256f13a268733. ## What ### Make `BB_NUM_IVC_VERIFIERS` mean what its name says (commits aa99817, 0f4cb77) Pool of long-lived bb verifier processes instead of fresh-per-call. The factory class is renamed `BBJsProverFactory` → `BBJsFactory` (it's used for both proving and verifying) and given a single `getInstance(): Promise<BBJsApi & AsyncDisposable>` method: - `new BBJsFactory(path)` → no pool. Every `getInstance()` spawns a fresh bb that is destroyed on dispose. Same as the previous `withFreshInstance` behaviour — used by `BBNativeRollupProver`, the AVM proving tester, and ivc-integration helpers, so their semantics are unchanged. - `new BBJsFactory(path, { poolSize: N })` → pool of N long-lived bb processes, lazily spawned on first acquire. Used by `BBCircuitVerifier` with `poolSize: numConcurrentIVCVerifiers`. Callers use `await using inst = await factory.getInstance()` for RAII-style release, matching the codebase's preference for `AsyncDisposable`. `BBCircuitVerifier.stop` (already wired through to aztec-node shutdown) tears the pool down. ### Close the bind→listen race in bb.js (commit 8e519b0) `barretenberg/ts/src/bb_backends/node/native_socket.ts`: retry `connect()` on `ECONNREFUSED` with exponential backoff (capped at 50 ms) up to the existing 5 s budget. Other socket errors fail fast as before. Pool startup still spawns N bb processes in parallel, so the race surface is reduced from ~600 to N — the retry handles the residual. ### Server-side Chonk proof split (commit 97577cf) `splitChonkProofToStructured` in TS had three hand-maintained constants (`MERGE_PROOF_SIZE`, `ECCVM_PROOF_LENGTH`, `JOINT_PROOF_LENGTH`) duplicating C++ values. When C++ shifted Chonk layout (e.g. databus relation changes shrinking the oink portion in the previous round of regressions), these went stale and verification failed deep in the verifier with an opaque "OinkVerifier: num_public_inputs mismatch with VK". Add a new `ChonkVerifyFromFields` bbapi command that takes a flat `Vec<bb::fr>` and calls `ChonkProof::from_field_elements` server-side, then runs the verifier. The TS layer now passes flat fields straight through — no layout knowledge, no hand-maintained constants. - `bbapi_chonk.{hpp,cpp}`: new struct + `execute()`. - `bbapi_execute.hpp`: register the variant. - `bb_js_backend.ts`: `verifyChonkProof` calls the new API; `splitChonkProofToStructured` and the 3 constants are deleted. ### Disposal robustness (commit 5cde220) The first cut of `BBJsFactory` had three `.catch(() => {})` clauses that silently swallowed bb `destroy()` errors, and an `initPool()` that dropped already-spawned bb children if a sibling creation failed (`Promise.all` short-circuit). Both would manifest as the Jest "worker failed to exit gracefully" warning we hit on one test run. Now: destroy errors propagate (`AggregateError` for the pool path); `initPool` uses `allSettled` and tears down anything it spawned if any sibling rejects. ### Playground bundle size (commit 1681d33) The new `ChonkVerifyFromFields` bbapi variant tipped the playground main entrypoint over the 1750 KB hard limit. Bumped to 1800 with a bump-log entry. ## Effect - `tx_stats_bench`: 600 bb spawns → 8 bb spawns at boot, then 8 long-lived processes serve every verification. The bind→listen race surface drops 75×, *and* the residual is handled by the connect retry. Per-call ~50–100 ms `bb` startup cost disappears from the verifier hot path. - Brittle TS Chonk constants are gone — Chonk layout changes in C++ can no longer manifest as opaque verifier errors in TS. - Disposal failures surface instead of leaking bb children. - Behaviour for proving paths (`BBNativeRollupProver`, AVM tests, ivc-integration) is unchanged — they still spawn fresh per call. ClaudeBox log: https://claudebox.work/s/2d65052b0deaeab2?run=3 --------- Co-authored-by: Charlie <5764343+charlielye@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…23113) ## Motivation Under proposer pipelining, checkpoint N's fee asset price modifier is computed in slot N-1 before checkpoint N-1 has landed on L1. The proposer was reading `rollupContract.getEthPerFeeAsset()`, which still reflects the latest published checkpoint (commonly N-2), while L1 later applies the modifier against checkpoint N-1's `ethPerFeeAsset`. The mismatch produced a 1-checkpoint drift between the proposer's intended new price and the price L1 actually stored, causing the e2e price-convergence test to oscillate around the target instead of converging. ## Approach Threads the predicted parent's `ethPerFeeAsset` through to the modifier computation. `buildPipelinedParentSimulationOverridesPlan` already derives that fee header for global-variable simulation overrides; the pending fee header on the resulting plan is used as the reference price for the bps calculation. Non-pipelined paths and genesis (checkpoint < 2) fall back to today's `getEthPerFeeAsset()` read. ## Changes - **ethereum**: `FeeAssetPriceOracle.computePriceModifier()` now takes an optional `currentPriceE12`; when supplied, the L1 read is skipped. - **sequencer-client**: `SequencerPublisher.getFeeAssetPriceModifier()` forwards the optional predicted price; `checkpoint_proposal_job` reads it from the pipelined simulation overrides plan and passes it through. - **end-to-end**: enables proposer pipelining + `inboxLag: 2` in `fee_asset_price_oracle_gossip.test.ts`. - **ethereum (tests)**: adds 3 unit tests covering the L1-read fallback, the predicted-price short-circuit, and concrete-value consistency with `RollupContract.computeChildFeeHeader` (asserting modifier truncation leaves a sub-bp gap to target). Note: if pipelining is enabled at checkpoint ≥ 2 but `proposedCheckpointData` is missing, the predicted-parent will be `undefined` and the code silently falls back to the stale L1 read. That's a pre-existing failure-mode behavior, not introduced here.
…ulation (#23073) ## Motivation When proposer pipelining is enabled, the sequencer simulates `propose()` for checkpoint K one slot ahead while K-1 has not yet landed on L1. The previous override only patched `tips.pending`, `archives[K-1]`, and (sometimes) the fee header, leaving the rest of `tempCheckpointLogs[K-1]` at storage zero. With `slotNumber` zeroed, `canPruneAtTime` falsely declared the proof window expired, the contract returned `proven` from `getEffectivePendingCheckpointNumber`, and the precheck reverted with `Rollup__InvalidArchive` — surfacing as a `proposer-rollup-check-failed` storm whenever a checkpoint took an extra L1 block to land. Additionally, fixes a bug in L2-to-L1 messages related to how the `outHash` is computed by the proposer (see "include parent checkpointOutHash when pipelining same-epoch builds"). Also adds sanity checks to `checkSync` to guard against race conditions when querying archiver data. ## Approach The simulated `tempCheckpointLogs[K-1]` cell is now byte-faithful with what L1 will see once K-1 actually lands: header hash, out hash, payload digest, slot number, and fee header. `blobCommitmentsHash` and `attestationsHash` are intentionally left out — the propose path never asserts on them. The override is built through a single per-cell helper that throws on `slotNumber > uint32`, mirroring the on-chain `SafeCast.toUint32`. ## Changes - **stdlib (`checkpoint/digest.ts`)**: new shared `computeCheckpointPayloadDigest` helper. Archiver migrated to it. - **ethereum (`rollup.ts` / `chain_state_override.ts`)**: replaces `makeFeeHeaderOverride` with `makeTempCheckpointLogOverride` (all-required) and `makeTempCheckpointLogPartialOverride` (subset). Extends `PendingCheckpointOverrideState` and `SimulationOverridesBuilder` with `withPendingHeaderHash/OutHash/PayloadDigest/SlotNumber`. Plan translation now goes through the partial helper so a missing fee header no longer suppresses the rest. - **sequencer-client**: `buildPipelinedParentSimulationOverridesPlan` takes a `signatureContext`, populates the new fields when `proposedCheckpointData` matches the parent, and guards against stale entries. The inline override in `Sequencer` is consolidated through the helper, with a defensive archive fallback when `proposedCheckpointData` is absent. `CheckpointProposalJob` threads the signature context through. - **end-to-end (`epochs_mbps.parallel.test`)**: switches the test to the pipelined-MBPS timing (12s L1 / 72s L2 / 5500ms blocks, `enableProposerPipelining: true`, `perBlockAllocationMultiplier: 8`) and asserts there are no `proposer-rollup-check-failed` events under normal operation. - **.test_patterns.yml**: marks the L2-to-L1-messages variant of the test as `skip: true` for an unrelated `Tx dropped by P2P node` flake under the new pipelined timing — tracked as a follow-up. - **tests**: new unit tests for `makeTempCheckpointLogOverride` (storage-slot round-trip via `getCheckpoint`, slot-overflow throw, partial-emission), `withPending*` builders, and the populated/empty/stale-checkpoint paths in `buildPipelinedParentSimulationOverridesPlan`.
Enable pipelining on the missed L1 slot e2e test
Fixes an issue where stuck requests would update the gauge after it was already updated by subsequent requests that succeeded quickly. This forces the gauge to always be updated in sequence, or the result is just dropped Also added some logging so we can see what's happening
- Preserve local validator slashing-protection records across the known LMDB schema 1 -> 2 migration. - Add a fail-closed schema mismatch policy for versioned stores and wire it into signing protection. - Add regression coverage for preserving legacy duty records and refusing newer stored schemas. Fixes [A-1029](https://linear.app/aztec-labs/issue/A-1029/prevent-lmdb-slashing-protection-reset-on-schema-mismatch) Fixes AztecProtocol/aztec-claude#888
Toggle pipelining on all e2e p2p tests
…23110) Move every `l2TipsCache.refresh()` call in `ArchiverDataStoreUpdater` out of the surrounding `db.transactionAsync` callback and into the post-commit code path. This addresses two issues with the previous in-transaction refresh: 1. **Mid-tx visibility.** `L2TipsCache.refresh()` reassigns its internal `#tipsPromise` synchronously, which was observable to other callers before LMDB committed. A concurrent reader calling `getL2Tips()` after the reassignment but before commit would pick up a promise loaded against in-flight tx state, while a sibling read on the store directly outside the tx still saw pre-commit state. 2. **No rollback on tx abort.** If the LMDB transaction threw or aborted, the cache was already replaced with a promise loaded against in-flight writes that would never commit. Future readers saw a cache reflecting rolled-back state. Refresh now runs after the writer transaction has fully committed, so it loads from the committed store and is never replaced when the writer aborts. ## Notes - This intentionally leaves a narrow JS-side race window between LMDB commit returning and `refresh()` finishing its `loadFromStore`.
…tart (#23162) ## Motivation `e2e_p2p_data_withholding_slash` was flaky because L1 raced past the epoch-8 prune deadline (`aztecProofSubmissionEpochs=0` makes the deadline ~32s after slot 17) while we stopped, wiped, and recreated the 4 validators (~28s). The recreated archivers detected the prune during their initial L1 sync and emitted `L2PruneUnproven` for epoch 8 with the original tx-carrying block, but `EpochPruneWatcher.start()` is only invoked inside `void archiver.waitForInitialSync().then(...)` in `aztec-node/server.ts`, so the listener wasn't attached yet and the event dropped silently. The recreated validators then built an empty epoch 10 on top of genesis which pruned cleanly later, producing 4 `VALID_EPOCH_PRUNED` offenses instead of the expected 4 `DATA_WITHHOLDING`. ## Approach Pause anvil block production between `removeInitialNode` and `stopNodes` so L1 stays inside epoch 8 across the recreate gap. The recreated archivers then ingest checkpoint 1 cleanly during initial sync (no prune fires, nothing to miss), `EpochPruneWatcher.start()` attaches its listener, and we resume L1 with an explicit warp + mine + interval restart so the deadline crossing is deterministic — the prune now fires while the watcher is live, producing `DATA_WITHHOLDING` for epoch 8 as the test expects. A `getCurrentEpoch < 9` assertion right after pausing fails fast if the timing window ever tightens further. ## Changes - **end-to-end (tests)**: in `data_withholding_slash.test.ts`, pause L1 mining after `removeInitialNode` and before `stopNodes`; resume after `waitForP2PMeshConnectivity` by warping to current wall-clock time, mining one L1 block, and restoring interval mining. Add a fail-fast assertion that we are still in epoch 8 when we pause.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
BEGIN_COMMIT_OVERRIDE
fix(test): warp L1 forward when proposer scan hits EpochNotStable (#22967)
test(e2e): fail epochs tests on proposer-rollup-check-failed (#22965)
fix: grafana switch to aztec_status="proposed" (#22978)
chore: update benchmark scraper (#22984)
test(e2e): migrate simple epoch tests to pipelining (#22973)
chore: remove top-level yarn.lock (#22987)
refactor(archiver)!: unify L2BlockSource checkpoint lookups via query objects (#22933)
fix(sequencer): bounded sweep instead of event scan for governance proposal check (#22989)
fix(docs): allow webapp-tutorial yarn install to populate empty lockfile in CI (#23000)
test(e2e): enable pipelining in l1-reorgs and mbps redistribution tests (#23009)
fix(archiver): restore pending block height metric under pipelining (#22994)
chore(p2p): remove skipped validation result option (#23034)
refactor(p2p)!: remove slow tx collection flow (#22878)
chore(spartan): add next-net-clone environment config (#22995)
chore(sequencer): add context to proposer-rollup-check-failed logs (#23071)
test(e2e): wait for archiver sync before asserting pipelining (#22997)
refactor(node-rpc)!: remove deprecated AztecNode methods and L2BlockSource tip helpers (#22934)
feat(p2p): detect and track announce IP changes at runtime (#22405)
test: mark tx_stats_bench 10 TPS as flake-retryable on merge-train/spartan (#23083)
fix(sequencer): bind vote-only multicalls to target slot under pipelining (#23090)
feat(sequencer): build optimistically across pruning epoch boundary (#23056)
fix(sequencer): use chainTipsOverride.pending for log context (#23098)
test(e2e): relax post-boundary slot assertion in epochs_proof_at_boundary (#23108)
fix(bb-prover): pool long-lived bb verifier processes instead of spawning per-call (#23093)
fix(sequencer): anchor fee asset price modifier to predicted parent (#23113)
chore: error log when L1 head timestamp drifts (#22947)
fix(sequencer): override full parent checkpoint cell in pipelined simulation (#23073)
test(e2e): enable pipelining on missed l1 slot test (#23068)
fix: more robust metrics reporting in IRM monitor (#23038)
fix: preserve LMDB slashing protection (#23145)
test(e2e): enable pipelining on p2p tests (#23070)
fix(archiver): move L2 tips cache refresh out of write transactions (#23110)
test(e2e): fix data_withholding_slash flake by freezing L1 across restart (#23162)
END_COMMIT_OVERRIDE