Fix witness#96
Open
On1x wants to merge 383 commits into
Conversation
…gress
- Added check for active sync with peers before suppressing block inventory
- Trigger sync proactively if behind and peer advertises blocks with no sync running
- Logged warning when initiating sync to break inventory gate deadlock
- Maintained existing debug logs for skipped inventory broadcasts when behind
- Prevented node from being stuck despite connected peers having needed blocks

… applied blocks
- Remove blocks from peers' ids_of_items_to_get if already on chain to avoid stale re-requests
- Detect peers with fully empty sync lists post-cleanup and send final synopsis to confirm sync completion
- Add safety net to auto-clear stuck we_need_sync_items_from_peer flags after 30 seconds of inactivity
- Clean up ids_of_items_being_processed when blocks are already applied via fork switch to unblock sync
- Send final synopsis to peers whose sync lists become empty after block acceptance through broadcast mechanisms

- Add auto-clear for `we_need_sync_items_from_peer` if sync lists remain empty for 30+ seconds
- Implement stale `ids_of_items_to_get` cleanup to remove blocks already applied via fork switch
- Fix handling of blocks in `_most_recent_blocks_accepted` to clean up sync state properly
- Introduce last-resort safety net in `terminate_inactive_connections_loop()` for stuck sync flags
- Add gap/fork block sync stall recovery to handle missing parent and fork switch race conditions
- Trigger sync restart on gap blocks deferred to `fork_db` when no sync is active
- Break inventory gate deadlock by starting sync if no sync in progress and broadcast inventory suppressed
- Add emergency consensus checks to prevent false recovery triggers during head advancement
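A minimal sketch of the 30-second auto-clear safety net described in the commits above. The flag and list names follow the commit messages; the peer struct and timing source are hypothetical stand-ins, not the real `peer_connection` class.

```cpp
#include <chrono>
#include <deque>
#include <string>

struct peer_sync_state {
    bool we_need_sync_items_from_peer = true;
    std::deque<std::string> ids_of_items_to_get;
    std::deque<std::string> ids_of_items_being_processed;
    std::chrono::steady_clock::time_point last_sync_activity{};
};

// Last-resort clear: nothing left to fetch or process for 30+ seconds,
// yet the flag still claims we are waiting on this peer.
void auto_clear_stuck_sync_flag(peer_sync_state& peer)
{
    using namespace std::chrono;
    const bool lists_empty = peer.ids_of_items_to_get.empty() &&
                             peer.ids_of_items_being_processed.empty();
    if (peer.we_need_sync_items_from_peer && lists_empty &&
        steady_clock::now() - peer.last_sync_activity > seconds(30))
        peer.we_need_sync_items_from_peer = false;
}
```

In the commits this check lives in `terminate_inactive_connections_loop()`, so it runs periodically even if no new messages arrive from the peer.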
- Change database return value from false to true in failure case to reflect acceptance
- Ensure _most_recent_blocks_accepted updated only if block is accepted
- Introduce block_was_accepted flag to track block acceptance status consistently
- Modify block handling to reset unlinkable_block_strikes only if block is accepted
- Update inventory fetch logic to conditionally trigger advertise inventory loop
- Broadcast validated blocks only when they are accepted by the client
- Improve hard fork handling by disconnecting peers with outdated client versions
- Add detailed disconnect reason and exception for peers lagging behind due to forks

…eer role
- Defined senior engineer role focused on precise implementation speed and correctness
- Specified rules emphasizing minimal and traceable code changes
- Outlined expected agent behavior including progress reporting and blocker flagging
- Detailed workflow steps for understanding, planning, implementing, and verifying changes
- Established output format requirements for plan, changes made, and verification
- Clarified constraints on what must and must not be done during task execution
- Set success criteria ensuring correctness, traceability, and no unintended impact
…skip the fork-switch logic and fall through directly to apply_block()
…notification for in-sync peers
…0 && gap <= 2. Added DLT-specific minority fork detection that works even during emergency mode. Added a DLT-mode-specific check at the top of maybe_produce_block(): if db._dlt_mode && chain().is_syncing(), return not_synced immediately.
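A self-contained sketch of the DLT-mode guard mentioned above. The names (`dlt_mode`, `syncing`, `not_synced`) follow the commit message; the structs are stand-ins for the real database and chain objects, not the plugin's actual code.

```cpp
enum class block_production_condition { produced, not_synced };

struct db_state    { bool dlt_mode = false; };
struct chain_state { bool syncing  = false; };

block_production_condition maybe_produce_block(const db_state& db, const chain_state& chain)
{
    // In DLT mode, refuse to produce a block while the node is still syncing.
    if (db.dlt_mode && chain.syncing)
        return block_production_condition::not_synced;
    return block_production_condition::produced;  // the real production checks would follow here
}
```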
… lock contention no longer prevents seed reconnection and timer reset.
… status does not catch sync changes)
…eued block IDs

The outbound `on_fetch_blockchain_item_ids` handler called `start_synchronizing_with_peer()` when a peer was ahead, which unconditionally clears `ids_of_items_to_get`. This races with inbound inventory responses: block IDs are populated then immediately wiped, stalling the fetch loop with nothing to request.

- Guard both `start_synchronizing_with_peer()` call sites: skip if `ids_of_items_to_get` is already populated
- Wrap `fetch_sync_items_loop()` body in try-catch to prevent silent loop termination on peer disconnect or delegate exceptions
- Wrap `send_message()` in `request_sync_items_from_peer()` and `request_sync_item_from_peer()` with cleanup on failure
- Add stale `_active_sync_requests` cleanup (entries pending >30s)
- Add auto-restart safety net for crashed fetch loop in `terminate_inactive_connections_loop()`
- Reduce competing-fork sync spam threshold from 50 to 10 strikes
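A sketch of the two guards from this commit, using simplified stand-ins for `peer_connection` and the node; only `ids_of_items_to_get` and the function names come from the commit message.

```cpp
#include <deque>
#include <exception>
#include <iostream>
#include <memory>
#include <string>

struct peer_connection {
    std::deque<std::string> ids_of_items_to_get;   // block IDs queued for fetching
};
using peer_connection_ptr = std::shared_ptr<peer_connection>;

struct node_sketch {
    void start_synchronizing_with_peer(const peer_connection_ptr&) {
        // clears ids_of_items_to_get and requests a fresh synopsis (omitted)
    }

    // Call-site guard: do not restart synchronization (which wipes the queue)
    // if inventory responses have already populated ids_of_items_to_get.
    void on_peer_reports_ahead(const peer_connection_ptr& peer) {
        if (!peer->ids_of_items_to_get.empty())
            return;                                // keep the queued IDs; the fetch loop drains them
        start_synchronizing_with_peer(peer);
    }

    // Each fetch-loop iteration is wrapped so an exception from a peer
    // disconnect or a delegate call cannot silently kill the loop.
    void fetch_sync_items_loop_iteration() {
        try {
            // request the next batch of queued block IDs (omitted)
        } catch (const std::exception& e) {
            std::cerr << "fetch_sync_items_loop iteration failed: " << e.what() << "\n";
        }
    }
};
```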
- Introduce cooldown timer and counter to prevent thundering herd on sync restarts
- Implement increasing cooldown delay from 5s up to 10s for repeated failures
- Skip duplicate sync restart triggers within cooldown window
- Reset backoff counter on successful sync block processing
- Clear accumulated sync blocks and reset peer state on full resync
- Ensure missing blocks are re-requested after deferred resize events
- Add brief sleep before full resync to allow in-flight messages to arrive

- Replace multiple concurrent sync restarts with a single resync() call protected by a progressive cooldown
- Implement cooldown timings: starts at 5s and increases to max 10s, resets on successful block acceptance
- Ensure full peer and global sync state reset in resync() to clear stale requests and received items
- Fix issue where unlinkable blocks during sync gaps caused permanent missing blocks by clearing received sync items
- Suppress thundering herd effect by skipping duplicate restarts within cooldown interval
- Trigger start_synchronizing_with_peer() when node is stuck with no sync and head >30s behind real time
- Adjust broadcast inventory suppression behavior in DLT mode to prevent deadlock situations
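A sketch of the progressive resync cooldown described in these two commits, assuming the 5s-to-10s bounds from the messages; the class and member names are hypothetical.

```cpp
#include <algorithm>
#include <chrono>

class resync_backoff {
    using clock = std::chrono::steady_clock;
    clock::time_point _last_restart{};
    int _failures = 0;

public:
    // Returns true if enough time has passed since the previous restart;
    // the required delay grows from 5s toward a 10s ceiling on repeated failures.
    bool may_restart_now() {
        const auto cooldown = std::chrono::seconds(std::min(5 + _failures, 10));
        const auto now = clock::now();
        if (_last_restart != clock::time_point{} && now - _last_restart < cooldown)
            return false;                 // still inside the cooldown window: skip duplicate restart
        _last_restart = now;
        ++_failures;
        return true;
    }

    // Called when a sync block is accepted; clears the backoff so the next
    // stall starts again from the minimum 5 second delay.
    void on_sync_progress() { _failures = 0; }
};
```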
…blocks
- Prevent blanking signing_key if it is already null (public_key_type())
- Ensure condition only applies when signing_key is valid
- Improve logging accuracy for emergency and max missed blocks scenarios

- Sort peers by block number descending for better initial selection
- Retry connection multiple times to handle transient server rejections
- Streamline download phase with per-peer retry upon failure or checksum mismatch
- Add robust error handling to try next peer on failures during download or verification
- Verify snapshot checksum by streaming file to avoid large memory usage
- Clean up temporary files on failure before attempting next peer
- Log progress and errors clearly during all phases of snapshot download
- Rename snapshot file to final path only after successful verification
- Ensure fallback to next available peer until all trusted peers are exhausted
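A condensed sketch of the peer-fallback download loop described above. The `trusted_peer` struct and the two callbacks are hypothetical stand-ins for the plugin's real download and streaming-checksum routines.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

struct trusted_peer { std::string endpoint; uint32_t best_block_num = 0; };

bool fetch_best_snapshot(std::vector<trusted_peer> peers,
                         const std::string& final_path,
                         const std::function<bool(const trusted_peer&, const std::string&)>& download,
                         const std::function<bool(const std::string&)>& verify_checksum_streaming)
{
    // Prefer peers advertising the highest block number.
    std::sort(peers.begin(), peers.end(),
              [](const trusted_peer& a, const trusted_peer& b) { return a.best_block_num > b.best_block_num; });

    const std::string tmp_path = final_path + ".part";
    for (const auto& peer : peers) {
        if (!download(peer, tmp_path) || !verify_checksum_streaming(tmp_path)) {
            std::remove(tmp_path.c_str());                       // clean up before trying the next peer
            continue;
        }
        std::rename(tmp_path.c_str(), final_path.c_str());       // rename only after verification succeeds
        return true;
    }
    return false;                                                 // all trusted peers exhausted
}
```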
- Unpack Phase 2 snapshot info from peer response
- Update best snapshot block number, checksum, and compressed size
- Ensure metadata reflects most current peer snapshot state before download
- Maintain snapshot size validation against maximum allowed size

- Allow block fetching from peers not busy with block requests while IDs are still fetched
- Add guard to skip starting sync if an ID request is already pending for a peer
- Log more precise peer request status distinguishing block vs ID request busy state
- Prevent duplicate synchronization attempts when new peer connects with ongoing sync
- Ensure high_block_num is included in blockchain synopsis to prevent disconnect in DLT mode
- Fix edge case where block range midpoint step skips the reference high block number in synopsis

…tching
- Add guard in get_blockchain_synopsis() to always include the high_block_num in the synopsis
- Prevent "invalid response" disconnects caused by missing head block in peer synopsis validation
- Implement duplicate sync-start guards in start_synchronizing_with_peer() and new_peer_just_added()
- Skip redundant sync initiation if ID requests are already pending for a peer
- Modify fetch_sync_items_loop() to allow block fetching once enough block IDs are collected
- Remove blocking on peer idle state and differentiate "busy with blocks" vs "skipped-other" diagnostics
- Introduce GRAPHENE_NET_MIN_BLOCK_IDS_TO_PREFETCH constant set to 10000 for concurrent fetching threshold
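A sketch of the `get_blockchain_synopsis()` guard described in these commits: after building the usual exponentially spaced list, always append `high_block_num` so the peer's validation finds the head block. Real synopses carry block ids; plain block numbers are used here only to keep the sketch self-contained.

```cpp
#include <cstdint>
#include <vector>

std::vector<uint32_t> get_blockchain_synopsis(uint32_t low_block_num, uint32_t high_block_num)
{
    std::vector<uint32_t> synopsis;
    uint32_t step = 1;
    for (uint32_t num = low_block_num; num < high_block_num; num += step, step *= 2)
        synopsis.push_back(num);

    // Guard: the doubling step above can jump past high_block_num, which made
    // peers reject the synopsis as an "invalid response"; include it explicitly.
    if (synopsis.empty() || synopsis.back() != high_block_num)
        synopsis.push_back(high_block_num);
    return synopsis;
}
```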
- Add transaction count to the block generation log message for better visibility
- Include the number of transactions in the debug capture when broadcasting block
- Enhance debug information during block production with transaction details

…point
- Added logic to determine the originating peer endpoint from active connections
- Passed the identified peer endpoint to handle_block instead of an empty optional
- Enhanced tracking of items being processed to improve block acceptance handling

- Set we_need_sync_items_from_peer to false to fix synchronization logic
- Update last stale block check position upon receiving a new block in p2p plugin
- Improve tracking of last stale head block for sync status detection

…entation
- Detail full activation process and guards in database update_global_dynamic_data
- Explain emergency mode deactivation criteria and hybrid witness schedule override
- Describe witness block production behavior and master/follower roles in emergency
- Document fork database deterministic tie-breaking for emergency block conflicts
- Clarify last irreversible block advancement caps during emergency mode
- Outline startup recovery steps to repair emergency witness schedule
- Describe P2P plugin guards for stale sync detection and emergency resync logic
- Explain witness guard key restoration behavior specific to emergency mode
- Include snapshot plugin handling of emergency state fields during import
- Provide full system state diagram and component interaction mapping
- Summarize all emergency mode guards, safety invariants, and versioning exclusions
…spam)

When a DLT slave processes a burst of sync blocks (e.g. 21 blocks in 80ms), two reinforcing message storms stall block reception from the master:

Sending side: send_sync_block_to_node_delegate() notifies EACH in-sync peer about EACH accepted block via fetch_next_batch_of_item_ids_from_peer(), generating a full get_blockchain_synopsis() per call. This creates O(N*M) synopsis computations that flood the event loop.

Receiving side: When multiple peers send fetch_blockchain_item_ids_message requests simultaneously (triggered by our outbound notifications), we respond to each by calling the expensive get_block_ids(), generating redundant computations.

Fix sending side: Add per-peer 5-second cooldown on in-sync notifications in send_sync_block_to_node_delegate(). Also skip if a synopsis request is already pending for the peer (item_ids_requested_from_peer).

Fix receiving side: Add per-peer 5-second rate limit on get_block_ids() responses in on_fetch_blockchain_item_ids_message() for in-sync peers. Critical: only skip if our head block number hasn't changed since the last response -- if we've accepted new blocks, the peer needs updated IDs and we must respond normally even within the cooldown.

New fields in peer_connection:
- last_in_sync_notification_sent: cooldown guard for outbound notifications
- last_fetch_item_ids_response_time: rate-limit guard for inbound responses
- last_fetch_item_ids_response_head_num: prevents rate-limit from hiding new blocks when head has advanced since last response
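A sketch of the receiving-side rate limit described in this commit. The field names follow the commit message; the peer struct itself is a stand-in, not the real `peer_connection` class.

```cpp
#include <chrono>
#include <cstdint>

struct peer_state {
    std::chrono::steady_clock::time_point last_fetch_item_ids_response_time{};
    uint32_t last_fetch_item_ids_response_head_num = 0;
};

// Decide whether to recompute get_block_ids() for an in-sync peer's
// fetch_blockchain_item_ids request.
bool should_answer_item_ids_request(peer_state& peer, uint32_t our_head_block_num)
{
    using namespace std::chrono;
    const auto now = steady_clock::now();
    const bool within_cooldown =
        peer.last_fetch_item_ids_response_time != steady_clock::time_point{} &&
        now - peer.last_fetch_item_ids_response_time < seconds(5);

    // Critical exception: if our head advanced since the last response, the
    // peer needs the new IDs, so the cooldown must not suppress the reply.
    if (within_cooldown && our_head_block_num == peer.last_fetch_item_ids_response_head_num)
        return false;

    peer.last_fetch_item_ids_response_time = now;
    peer.last_fetch_item_ids_response_head_num = our_head_block_num;
    return true;
}
```

The sending-side cooldown follows the same pattern with `last_in_sync_notification_sent`, plus an extra skip while `item_ids_requested_from_peer` is pending.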
…ss peer exchange
- Introduce set_isolated_peers(bool) to enable isolated-peers mode in dlt_p2p_node
- When enabled, only allow outbound connections to seed nodes and reject inbound from others
- Suppress outbound peer exchange requests and respond with empty replies inbound
- Log activation of isolated-peers mode
- Add command-line option "p2p-isolated-peers" for configuring the isolated mode in p2p plugin
- Close and clean up non-seed inbound connections when isolated mode is active
- Prevent peer exchange processing and periodic peer exchange when in isolated-peers mode
- Check if the request body is empty in webserver plugin
- Set response body to "empty request body" if empty
- Set HTTP status to 400 Bad Request
- Send HTTP response immediately to client
- Avoid further processing when body is empty to improve error handling
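A generic sketch of the empty-body rejection; the request/response types are hypothetical stand-ins, not the plugin's actual webserver interfaces.

```cpp
#include <string>

struct http_response {
    int status = 200;
    std::string body;
};

// Returns true if the request was rejected so the caller can skip further processing.
bool reject_if_empty_body(const std::string& request_body, http_response& response)
{
    if (!request_body.empty())
        return false;
    response.status = 400;                    // Bad Request
    response.body   = "empty request body";
    return true;                              // caller sends the response immediately
}
```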
- Add is_isolated_peers() method to p2p_plugin to check isolation status
- Modify witness plugin to skip seed reconnect when peers are isolated
- Ensure blocks do not trigger unnecessary seed reconnections in isolation
- Improve logging to reflect conditional reconnect behavior
…k detection

Enhance production loop diagnostics to detect root cause of silent block production failures observed in p71 logs where mad-max missed slots for 5+ minutes without meaningful error messages.

Add slot=0 streak logging at 1s/2.5s/15s/30s intervals with full context: head_block_time vs now delta, NTP drift, next_slot_time, catching_up status, dlt_syncing status, and shuffled witness schedule. Detect if head_block_time is in the future (blocks with future timestamp applied).

Add not_my_turn streak tracking to detect when witness is in schedule but slots continuously go to other witnesses. Warning at 125s consecutive with last scheduled witness name and configured witness list.

Enhance WITNESS-WATCHDOG log to include not_my_turn_streak count and last scheduled witness for post-incident analysis.

- Track last applied block number to identify missed blocks
- Connect applied_block signal handler to monitor missed slots
- Detect if any missed slots belonged to our scheduled witnesses
- Dump full plugin diagnostic state when our witness misses a slot
- Log detailed witness status including keys and chain sync state
- Limit inspected missed slots to 100 for performance reasons
- Fix loop iteration type when listing top scheduled witnesses

- Add warnings at slot zero streak threshold 3 to reduce noise and capture meaningful gaps
- Log detailed slot zero streak info including next scheduled witness and ownership status
- Include next witness and ownership data in logs at thresholds 10, 60, and 120 for better diagnostics
- Improve forced NTP resync log with witness context to aid drift investigations
- Update prolonged stall and critical stall logs with shuffled witness info and next witness ownership status
…s block production

The HF12 distressed-network path (participation < 33%) returned low_participation immediately, preventing the minority fork detection below from ever running. When this node holds majority witnesses and some are offline (blanked keys, missed slots), the participation rate drops below threshold and triggers a self-reinforcing deadlock: all majority nodes stop producing, participation stays low, network stalls.

Replace the early return with a warning log so execution falls through to the fork_db-based minority fork detection, which accurately determines whether the node is isolated or merely experiencing low participation from offline witnesses.

…pation deadlock

Two changes to prevent silent block production halt:

1. Remove premature `return low_participation` in HF12 distressed-network path. The participation rate heuristic causes false positives when multiple witnesses are offline, blocking production before the precise minority fork detection (fork_db scan) can run. Replace with a warning log so execution falls through to the fork_db-based check.

2. Add brute-force production recovery in the watchdog. When production has been silent beyond the threshold but the node is clearly operational (head advancing, FORWARD mode, peers connected, valid keys in schedule), force-reset every flag that could silently block production: _production_enabled, _minority_fork_recovering, _catchup_after_pause, currently_syncing. This covers any safety gate stuck due to race conditions or edge cases not yet diagnosed.

Add clear_catchup_flag() to p2p_plugin and clear_catchup_after_pause() to dlt_p2p_node to support the watchdog recovery.
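A sketch of the brute-force watchdog recovery from point 2. The flag names mirror the commit message, but the surrounding class and the plugin stubs are hypothetical stand-ins for the witness plugin internals.

```cpp
#include <atomic>

struct p2p_plugin_stub  { void clear_catchup_flag() {} };   // clears _catchup_after_pause in the p2p node
struct chain_stub       { void clear_syncing()      {} };   // clears currently_syncing in the chain

struct witness_watchdog_sketch {
    std::atomic<bool> _production_enabled{false};
    std::atomic<bool> _minority_fork_recovering{true};
    p2p_plugin_stub   _p2p;
    chain_stub        _chain;

    // Invoked only when the node looks healthy (head advancing, FORWARD mode,
    // peers connected, our keys in schedule) yet production has been silent
    // past the watchdog threshold.
    void force_clear_production_blockers() {
        _production_enabled = true;            // re-arm production
        _minority_fork_recovering = false;     // drop stale fork-recovery latch
        _p2p.clear_catchup_flag();
        _chain.clear_syncing();
    }
};
```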
- Change _block_processing_paused from bool to std::atomic<bool> for thread-safe access
- Change _catchup_after_pause from bool to std::atomic<bool> for safe concurrent reads and writes
- Add comments explaining atomic usage and thread ownership for these flags

- Updated clear_catchup_after_pause() to reset _catchup_after_pause and _block_processing_paused
- Improved witness watchdog recovery by handling snapshot/hot-reload pause failures
- Enhanced block processing resumption after pause events to prevent stuck state

- Add clear_syncing() method to delegate interface to clear currently_syncing flag
- Implement clear_syncing() in p2p_plugin that calls chain.clear_syncing()
- Clear the delegate's syncing flag on every SYNC→FORWARD transition to avoid deadlock
- Prevent indefinite witness production stall caused by stuck syncing flag
- Add detailed comments explaining the deadlock scenario and fix rationale

- Add clear_syncing() method to dlt_p2p_delegate as a pure virtual function
- Implement dlt_delegate::clear_syncing() to call chain.clear_syncing()
- Call _delegate->clear_syncing() in transition_to_forward() before early return
- Fix issue where currently_syncing flag remained true preventing block production
- Prevents 570-second block production pause after scheduled snapshot
- Ensures witness plugin no longer blocked due to stale syncing state
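A sketch of the SYNC→FORWARD fix from these two commits; the delegate and mode names follow the commit messages, everything else is a stand-in for the real dlt_p2p_node.

```cpp
enum class p2p_mode { SYNC, FORWARD };

struct dlt_p2p_delegate {
    virtual ~dlt_p2p_delegate() = default;
    virtual void clear_syncing() = 0;   // clears the chain's currently_syncing flag
};

struct dlt_p2p_node_sketch {
    p2p_mode _mode = p2p_mode::SYNC;
    dlt_p2p_delegate* _delegate = nullptr;

    void transition_to_forward() {
        // Clear the delegate's syncing flag before any early return, so a
        // stale currently_syncing flag can no longer block witness production.
        if (_delegate)
            _delegate->clear_syncing();
        if (_mode == p2p_mode::FORWARD)
            return;                       // early return on repeated transition
        _mode = p2p_mode::FORWARD;
    }
};
```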
- Add detailed note explaining the difference and necessity of two partition guards: 'low_participation' and 'minority_fork'
- Clarify scenarios each guard protects against and why both must be active
- Describe operator override option 'enable-stale-production' to bypass low participation check
- Modify witness plugin logic to return 'low_participation' condition when witness participation is below 33%, preventing concurrent chain building in network partitions
- Update comments to emphasize safe stopping of production under low participation and guide for legitimate outage handling with override setting

…tation
- Provide detailed explanation of each field in DLT P2P stats output
- Describe node-level summary including status, fork, head, lib, peers, conn, paused, and uptime
- Document per-peer statistics with lifecycle states, exchange status, spam strikes, and flags
- Include disconnected and banned peer states with reconnection and ban policies
- Outline common scenarios and recommended operator actions
- Add quick reference for enumerations, lifecycle states, thresholds, and constants
- Supply both English and Russian versions for broader accessibility

…ate cooldown description
- Added detailed Scenario 6 explaining peer exchange rate-limiting log message
- Described the mechanism enforcing 600-second cooldown between peer exchange requests
- Clarified rationale and expected behavior of rate-limiting in peer communication
- Updated `PEER_EXCHANGE_COOLDOWN_SEC` reference to specify cooldown applies per peer and enforced by both sides
- Included guidance on interpreting rate-limited log messages and troubleshooting frequent occurrences
- Update documentation to reflect sliding window rate-limit mechanism
- Change cooldown from fixed 10 minutes to max 3 requests per 5 minutes
- Explain local tracking of request count and window start per peer
- Clarify how rate-limit response indicates remaining wait time
- Add new constants for max requests and sliding window duration

fix(p2p): implement sliding window rate-limiting for peer exchanges
- Replace fixed cooldown with sliding window (3 requests per 5 minutes)
- Add counters and timestamps for request counts and window start
- Modify is_peer_exchange_rate_limited() to check sliding window limits
- Update request recording to reset or increment counters based on timing
- Send accurate wait_seconds in rate-limit responses based on window
- Mark peers locally as rate-limited upon receiving rate-limit response
- Prevent periodic exchanges during active rate-limit window

…echanism
- Replace fixed 10-minute cooldown with sliding window rate limit (3 requests per 5 minutes)
- Adjust documentation to explain new rate limiting logic and variables
- Add out-of-line definitions for new constexpr members: PEER_EXCHANGE_MAX_REQUESTS and PEER_EXCHANGE_WINDOW_SEC
- Update peer exchange handling to prevent ephemeral port propagation with new limits
- Clarify expected behavior and recommendations regarding frequent rate limit messages
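A sketch of the sliding-window peer-exchange limit described in these commits, using the constants named in the messages; the per-peer tracking structure is a stand-in.

```cpp
#include <chrono>
#include <cstdint>

static constexpr uint32_t PEER_EXCHANGE_MAX_REQUESTS = 3;
static constexpr uint32_t PEER_EXCHANGE_WINDOW_SEC   = 300;   // 5 minutes

struct peer_exchange_state {
    std::chrono::steady_clock::time_point window_start{};
    uint32_t requests_in_window = 0;
};

// Returns 0 if the request is allowed, otherwise the seconds the peer should wait.
uint32_t peer_exchange_wait_seconds(peer_exchange_state& peer)
{
    using namespace std::chrono;
    const auto now = steady_clock::now();

    // Start a new window if none is active or the previous one has expired.
    if (peer.window_start == steady_clock::time_point{} ||
        now - peer.window_start >= seconds(PEER_EXCHANGE_WINDOW_SEC)) {
        peer.window_start = now;
        peer.requests_in_window = 0;
    }

    if (peer.requests_in_window >= PEER_EXCHANGE_MAX_REQUESTS) {
        const auto elapsed = duration_cast<seconds>(now - peer.window_start).count();
        return PEER_EXCHANGE_WINDOW_SEC - static_cast<uint32_t>(elapsed);  // accurate wait_seconds
    }

    ++peer.requests_in_window;
    return 0;
}
```

A peer receiving a non-zero wait value marks itself rate-limited locally and skips periodic exchanges until the window reopens.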
- Introduce slot hijack counter to track consecutive hijacked blocks
- Detect when committee produces blocks in our witness slots during emergencies
- Log warnings for initial hijacks and periodically during ongoing hijacks
- Reset hijack counter upon successful block production by our witness
- Expose hijack count in witness debug strings for watchdog diagnostics
- Improve visibility of emergency master interference with witness slots
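A sketch of the slot-hijack counter described above. The witness and block inputs are simplified stand-ins, and the "once per minute" periodic warning is approximated here as every 20 blocks, which assumes roughly 3-second block intervals.

```cpp
#include <cstdint>
#include <iostream>
#include <string>

struct hijack_tracker {
    uint32_t consecutive_hijacks = 0;

    // Called for every block applied while emergency mode is active.
    void on_block(const std::string& scheduled_witness,
                  const std::string& actual_producer,
                  bool we_own_scheduled_witness)
    {
        if (we_own_scheduled_witness && actual_producer != scheduled_witness) {
            ++consecutive_hijacks;         // committee/master produced in our slot
            if (consecutive_hijacks <= 3 || consecutive_hijacks % 20 == 0)
                std::cerr << "emergency master hijacked our slot for witness "
                          << scheduled_witness << " (streak " << consecutive_hijacks << ")\n";
        } else if (we_own_scheduled_witness) {
            consecutive_hijacks = 0;       // our witness produced its own slot again
        }
    }
};
```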
…ction readiness
- Added is_snapshot_in_progress() API to snapshot plugin for accurate snapshot status
- Added snapshot plugin dependency to witness plugin with APPBASE_PLUGIN_REQUIRES
- Removed _production_enabled cached flag and replaced with direct database and plugin queries
- Updated production readiness checks to query chain.syncing, snapshot in progress, emergency mode, participation rate, and minority fork recovery state freshly each tick
- Refined watchdog recovery to clear blocking conditions without setting cached flags
- Enhanced diagnostic logging to show actual skip flags and live state instead of cached booleans
- Simplified recovery logic and eliminated race conditions caused by stale cached flags
- Maintained full backward compatibility and identical production behavior with improved correctness

- Added snapshot::snapshot_plugin to witness plugin dependencies
- Query snapshot plugin's is_snapshot_in_progress() to defer block production during snapshot creation
- Replace _production_enabled flag setting with _production_skip_flags containing skip_undo_history_check bit
- Derive production "should_be_producing" state dynamically using live DB properties and emergency consensus state
- Refined production enablement logic to bypass stale-production and sync checks only when override flags set
- Improved watchdog conditions based on live state and fresh sync time checks, removing cached _production_enabled reliance
- Updated legacy and emergency master production handling to query sync fresh each tick without cached flags
- Adjust block production gating to check snapshot plugin flag before P2P catchup after pause
- Updated slot hijack and missed slot detection condition to remove _production_enabled requirement
- Revised diagnostic output format to show skip_flags instead of prod_enabled for clarity
- Documented snapshot plugin API usage and integration details for witness plugin block production flow

…ader
- Added snapshot plugin include and dependency registration in witness.cpp
- Declared plugin_for_each_dependency method in witness plugin class
- Avoided exposing snapshot plugin headers in witness.hpp to reduce coupling
- Updated CMakeLists.txt to include snapshot plugin headers privately
- Adjusted plugin dependencies to be registered via function in implementation file

- Fix variable usage to refer to the correct hijack state and aslot values
- Simplify check for scheduled witnesses by removing empty vector check
- Update log messages to use format string with named arguments
- Log first 3 hijacks and then once per minute to reduce log flooding
- Reset hijack counter when our witness produces the expected block slot
…lization
- Change snapshot_load_callback to accept snapshot file path argument
- Pass snapshot path to callback during chain plugin startup
- Move database hardfork initialization to after snapshot loading
- Adjust logging to show snapshot path passed at runtime
- Always register snapshot_load_callback for auto recovery use
- Handle snapshot load exceptions with shared memory cleanup

…ync task
- Add flag_guard struct to manage snapshot_in_progress flag with explicit release method
- Release flag and resume P2P processing after DB read lock drops in snapshot callback
- Log message confirms DB read completion and background compression/file writing
- Safely resume P2P if create_snapshot throws before callback execution
- Use memory_order_release for atomic flag updates to ensure proper synchronization
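A sketch of the `flag_guard` described above: it owns the snapshot-in-progress flag, supports an explicit early release once the DB read lock has dropped, and still clears the flag if `create_snapshot` throws before the callback runs. The struct name comes from the commit message; the surrounding plugin is not shown.

```cpp
#include <atomic>

struct flag_guard {
    std::atomic<bool>& flag;
    bool released = false;

    explicit flag_guard(std::atomic<bool>& f) : flag(f) {
        flag.store(true, std::memory_order_release);
    }

    // Called from the snapshot callback after the DB read completes, so P2P
    // processing can resume while compression and file writing continue.
    void release() {
        if (!released) {
            flag.store(false, std::memory_order_release);
            released = true;
        }
    }

    // Safety net: if create_snapshot throws before the callback runs, the
    // destructor still drops the flag and lets P2P resume.
    ~flag_guard() { release(); }
};
```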
- Add catch block for shared_memory_corruption_exception during block acceptance
- Log the corruption details with block number
- Trigger chain auto-recovery to prevent node from getting stuck
- Return rejected result after initiating recovery to retry processing
- Prevent exception from falling through to generic handler causing silent rejection
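A sketch of the new catch block described above. The exception and result types are stand-ins named after the commit message, not the real chain API, and the stub unconditionally simulates a corruption failure.

```cpp
#include <cstdint>
#include <iostream>
#include <stdexcept>

struct shared_memory_corruption_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};
enum class accept_result { accepted, rejected };

struct database_stub {
    void apply_block(uint32_t) { throw shared_memory_corruption_exception("mapped file damaged"); }
    void trigger_auto_recovery() { /* wipe, re-import snapshot, DLT replay (omitted) */ }
};

accept_result accept_block(database_stub& db, uint32_t block_num)
{
    try {
        db.apply_block(block_num);
        return accept_result::accepted;
    } catch (const shared_memory_corruption_exception& e) {
        // Log the corruption with the block number, kick off auto-recovery,
        // and reject so the block is retried instead of silently dropped.
        std::cerr << "shared memory corruption at block " << block_num << ": " << e.what() << "\n";
        db.trigger_auto_recovery();
        return accept_result::rejected;
    }
}
```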
- Include p2p_plugin in chain plugin dependencies and headers
- Pause all P2P database consumers before closing the corrupted database to prevent data corruption
- Mark syncing state to defer witness block production during recovery
- Set snapshot path and trigger full database recovery (wipe, import, DLT replay)
- Resume P2P block processing after the database is fully rebuilt
- Add error handling and logging for pause/resume operations on P2P plugin

- Added PRIVATE include directory for p2p plugin headers in chain plugin
- Updated target_include_directories with new path in CMakeLists.txt
- Ensured proper separation of PUBLIC and PRIVATE include paths
- Adjusted CMakeLists.txt formatting for consistency

- Replace deprecated graphene_p2p_plugin with graphene::p2p
- Adjust CMake target dependencies accordingly
- Maintain target include directories without changes
No description provided.