
run_check_if_parent_block_is_last_block cancels ~67% of building jobs on non-datacenter hardware #909

@ajlone73-oss

GitHub Issue Draft — flashbots/rbuilder

Summary

The continuous last_block_number check in run_check_if_parent_block_is_last_block() cancels approximately 67% of building jobs on consumer/VPS hardware. The root cause is a timing mismatch between Reth's database commit latency and the 100ms check interval.

Environment

  • rbuilder develop branch (commit 55bbd32, also tested on 80ebfc8)
  • Reth 1.9.3 (commit 27a8c0f5, same as rbuilder's pinned version)
  • Tested on both integrated (reth-rbuilder) and standalone modes
  • Hardware: Intel i5-12600K / 64GB RAM / NVMe (integrated) and AMD EPYC 4244P / 64GB RAM / NVMe VPS (standalone)
  • Lighthouse v8.1.2 as CL

The Problem

Logs show constant cancellations:

INFO Cancelling building job reason="last block number" last_block_number=24753843 block=24753845
INFO Cancelling building job reason="last block number" last_block_number=24753843 block=24753846

In these logs last_block_number trails the block being built by 2-3 blocks, i.e. it sits 1-2 blocks behind the expected parent. Measured bid success rates:

| Mode | Bid Rate | Hardware |
|---|---|---|
| Integrated | 32.6% | i5-12600K, 64GB, consumer NVMe |
| Standalone | 31% | EPYC 4244P, 64GB, datacenter VPS |

Config tuning attempted with no improvement: root_hash_sparse_trie_version (v1, v2, vexp), root_hash_threads (0, 4, 6), faster_finalize (true/false).

Root Cause

run_check_if_parent_block_is_last_block() in crates/rbuilder/src/live_builder/building/mod.rs polls every 100ms:

const CHECK_LAST_BLOCK_INTERVAL: Duration = Duration::from_millis(100);

let last_block_number = provider.last_block_number()?;
if last_block_number + 1 != block_ctx.block() {
    block_cancellation.cancel();
}

last_block_number() returns the highest block committed to Reth's MDBX database on disk — not the latest block processed in memory. Reth's database write pipeline introduces 200ms-2s latency between "Block added to canonical chain" and "Canonical chain committed" (consistent with paradigmxyz/reth#8307).

The CL sends payload_attributes at slot boundary. rbuilder starts building for block N+1. The 100ms check fires, asks Reth for last_block_number(), gets N-2 (disk commit still in progress), sees N-2+1 != N+1, and cancels.
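
The sequence above can be replayed with a small standalone sketch. The helper names `should_cancel` and `parent_lag` and the plain `u64` arguments are illustrative assumptions, not rbuilder's API; only the comparison itself mirrors the excerpt quoted earlier.

```rust
/// Mirrors the comparison in the quoted check: the job survives only when the
/// provider's last block is exactly the parent of the block being built.
/// Hypothetical helper for illustration, not part of rbuilder.
fn should_cancel(last_block_number: u64, building_block: u64) -> bool {
    last_block_number + 1 != building_block
}

/// How many blocks the provider's answer sits behind the expected parent
/// (0 means the provider already reports the parent).
fn parent_lag(last_block_number: u64, building_block: u64) -> u64 {
    building_block
        .saturating_sub(1)
        .saturating_sub(last_block_number)
}
```

Fed with the values from the first log line, `should_cancel(24753843, 24753845)` is true and `parent_lag` is 1; once the provider reports 24753844 the same job would be left alone.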

The sparse trie is not involved — the building job is cancelled before the trie does any work. The trie operates on a fixed parent block reference set at job start.

Why BuilderNet Doesn't See This

Datacenter hardware with NVMe RAID arrays and 256GB+ RAM keeps MDBX commit latency under 100ms — within the check interval. Consumer and VPS hardware cannot match this.
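
A toy model of that threshold, as a sketch only. It assumes the first poll happens one interval after the job starts and that the parent's commit completes a fixed `commit_latency` after the job starts; neither assumption has been checked against the rbuilder loop, and `first_stale_poll` is a hypothetical name.

```rust
use std::time::Duration;

// Same value as the constant quoted from building/mod.rs.
const CHECK_LAST_BLOCK_INTERVAL: Duration = Duration::from_millis(100);

/// Returns the time of the first poll that would still observe a stale
/// last_block_number, or None if the commit lands before that poll.
/// Assumption: first poll at one interval after job start.
fn first_stale_poll(commit_latency: Duration) -> Option<Duration> {
    if commit_latency > CHECK_LAST_BLOCK_INTERVAL {
        // The commit is still in flight when the first poll fires.
        Some(CHECK_LAST_BLOCK_INTERVAL)
    } else {
        None
    }
}
```

Under this model an 80ms commit is never observed as stale, while anything in the reported 200ms-2s range is caught by the very first poll.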

Fix

Commenting out the spawn_blocking call that launches the continuous check eliminates the cancellations:

// crates/rbuilder/src/live_builder/building/mod.rs, lines ~110-117
// BEFORE:
        {
            let provider = self.provider.clone();
            let block_ctx = block_ctx.clone();
            let block_cancellation = block_cancellation.clone();
            tokio::task::spawn_blocking(move || {
                run_check_if_parent_block_is_last_block(provider, block_ctx, block_cancellation);
            });
        }

// AFTER: commented out entirely

Results

| Machine | Before | After |
|---|---|---|
| Integrated (i5-12600K) | 32.6% | 100% (937/937 bids, 0 cancellations) |
| Standalone (EPYC 4244P) | 31% | 97% (862 bids, 0 cancellations, 26 IPC header misses) |
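
For anyone re-deriving the percentages, a minimal sketch of the arithmetic. It assumes the 26 IPC header misses are the only slots without a bid in the standalone run, and that every slot without a bid in the "before" runs was a cancellation; the function names are illustrative.

```rust
/// Percentage of slots that produced a bid.
fn bid_rate_percent(bids: u64, misses: u64) -> f64 {
    let total = bids + misses;
    if total == 0 {
        return 0.0;
    }
    100.0 * bids as f64 / total as f64
}

/// Share of jobs lost, given a measured bid rate in percent.
/// Assumes every slot without a bid was a cancelled job.
fn cancelled_share_percent(bid_rate: f64) -> f64 {
    100.0 - bid_rate
}
```

With these, 862 bids and 26 misses give roughly 97.07%, and a 32.6% bid rate corresponds to 67.4% of jobs lost, which is where the ~67% in the title comes from.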

Safety Analysis

The check's purpose is reorg detection during building. Safety considerations:

  1. Parent header is already validated before building starts in wait_for_block_header(). This prevents building on the wrong chain.
  2. If a reorg occurs during building, the relay rejects the block (parent hash mismatch). No funds at risk — just a wasted building cycle.
  3. Mainnet reorgs are extremely rare (~1-2 per month).
  4. The max_time_to_build timeout still applies — building jobs have a natural deadline.

The cost of the check (67% of building jobs cancelled on non-datacenter hardware) vastly exceeds the benefit (catching ~1-2 reorgs/month).
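
To put the two numbers on the same scale, a rough back-of-the-envelope sketch. It assumes 12-second mainnet slots, a 30-day month, and one building job per slot; the reorg count is the ~1-2 per month estimate from the list above, and the function names are illustrative.

```rust
// Ethereum mainnet slot time.
const SECONDS_PER_SLOT: u64 = 12;

/// Number of slots in a period of `days` days.
fn slots_per_month(days: u64) -> u64 {
    days * 24 * 60 * 60 / SECONDS_PER_SLOT
}

/// Percentage of slots in the period that are affected by a reorg.
fn reorg_share_percent(reorgs_per_month: u64, days: u64) -> f64 {
    100.0 * reorgs_per_month as f64 / slots_per_month(days) as f64
}
```

A 30-day month has 216,000 slots, so even 2 reorgs touch under 0.001% of building jobs, compared with the ~67% currently cancelled.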

Suggested Improvement

Rather than removing the check entirely, consider one of:

  1. Replace last_block_number() with best_block_number() — returns the in-memory head rather than disk-committed state. This would make the check hardware-independent.

  2. Add a grace period at startup — wait for last_block_number() to catch up before starting the check loop, with a configurable timeout.

  3. Make the check configurable — add a disable_last_block_check config option so non-datacenter operators can opt out.
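
A minimal sketch of how options 2 and 3 could combine into one decision function. Everything here is hypothetical: `LastBlockCheckConfig`, `CheckOutcome`, and `evaluate_check` do not exist in rbuilder, and the `grace_period` field is an assumed name. Option 1 would amount to passing the result of best_block_number() instead of last_block_number() as `observed_block`.

```rust
use std::time::Duration;

/// Hypothetical config; neither field exists in rbuilder today.
#[derive(Clone, Copy, Debug)]
struct LastBlockCheckConfig {
    /// Option 3: lets operators opt out of the check entirely.
    disable_last_block_check: bool,
    /// Option 2: how long a lagging provider is tolerated after job start.
    grace_period: Duration,
}

#[derive(Debug, PartialEq, Eq)]
enum CheckOutcome {
    KeepBuilding,
    Cancel,
}

/// Decides what one poll should do. `elapsed` is the time since the job started.
fn evaluate_check(
    cfg: LastBlockCheckConfig,
    elapsed: Duration,
    observed_block: u64,
    building_block: u64,
) -> CheckOutcome {
    if cfg.disable_last_block_check {
        return CheckOutcome::KeepBuilding;
    }
    if observed_block + 1 == building_block {
        // Provider reports exactly the parent: nothing to do.
        return CheckOutcome::KeepBuilding;
    }
    // A provider that is merely behind may still be committing to disk,
    // so tolerate it until the grace period runs out.
    if observed_block + 1 < building_block && elapsed < cfg.grace_period {
        return CheckOutcome::KeepBuilding;
    }
    // Either the chain has moved past the parent, or the lag outlasted the grace period.
    CheckOutcome::Cancel
}
```

The design choice worth noting is the asymmetry: a provider that is ahead of the parent means the job is already stale and is cancelled immediately, while a provider that is behind only triggers cancellation after the grace period.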

Impact

This affects any rbuilder operator running on hardware where Reth's MDBX commit takes >100ms. This likely includes most solo validators and small operators — exactly the audience rbuilder is designed to serve as an open-source block builder.
