Skip to content

fix(publish): cap concurrent WebSocket connections with semaphore#1781

Closed
mw2000 wants to merge 1 commit intomainfrom
mw2000/fix-ws-listener-dos
Closed

fix(publish): cap concurrent WebSocket connections with semaphore#1781
mw2000 wants to merge 1 commit intomainfrom
mw2000/fix-ws-listener-dos

Conversation

@mw2000
Copy link
Copy Markdown
Contributor

@mw2000 mw2000 commented Mar 29, 2026

Builds on the handshake timeout and spawned-handshake improvements from #1252.

Summary

  • Adds a tokio::sync::Semaphore to bound the number of concurrent WebSocket connections (handshaking + active) to a configurable limit (default 8192)
  • Adds on_connection_rejected to the PublisherMetrics trait so rejected connections are observable
  • Adds Listener::with_max_connections builder method to override the default limit

Problem

There is no bound on the number of concurrently spawned tasks in the WebSocket listener. Under heavy connection pressure, each new TCP connection spawns a task that lives for up to HANDSHAKE_TIMEOUT (10s) even if it never completes the upgrade. At sustained rates this can exhaust memory and task-scheduler resources before timeouts reclaim anything.

Fix

Guard tokio::spawn with an Arc<Semaphore> using try_acquire_owned. The permit is held for the entire lifetime of the spawned task (handshake + broadcast), so it is released when the client disconnects. Connections beyond the limit are dropped immediately with a debug log and a metric callback.

Testing

All 25 tests pass (cargo test -p base-builder-publish), including a new listener_rejects_connections_beyond_limit test that verifies the semaphore rejects excess connections.

@cb-heimdall
Copy link
Copy Markdown
Collaborator

cb-heimdall commented Mar 29, 2026

🟡 Heimdall Review Status

Requirement Status More Info
Reviews 🟡 0/1
Denominator calculation
Show calculation
1 if user is bot 0
1 if user is external 0
2 if repo is sensitive 0
From .codeflow.yml 1
Additional review requirements
Show calculation
Max 0
0
From CODEOWNERS 0
Global minimum 0
Max 1
1
1 if commit is unverified 1
Sum 2

@vercel
Copy link
Copy Markdown

vercel bot commented Mar 29, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
base Ignored Ignored Preview Apr 8, 2026 0:53am

Request Review

Comment thread crates/builder/publish/src/listener.rs Outdated
@mw2000 mw2000 changed the title fix(publish): spawn WebSocket handshake to prevent listener DoS fix(builder): spawn WebSocket handshake to prevent listener DoS Mar 30, 2026
@mw2000 mw2000 marked this pull request as ready for review March 30, 2026 06:38
Comment thread crates/builder/publish/src/listener.rs
@mw2000 mw2000 changed the title fix(builder): spawn WebSocket handshake to prevent listener DoS fix(builder): spawn WebSocket handshake Mar 30, 2026
@mw2000 mw2000 requested review from danyalprout and niran March 31, 2026 20:12
@mw2000 mw2000 force-pushed the mw2000/fix-ws-listener-dos branch from ced8b30 to 709113e Compare April 8, 2026 00:25
Comment thread crates/builder/publish/src/listener.rs Outdated
@mw2000 mw2000 changed the title fix(builder): spawn WebSocket handshake fix(publish): cap concurrent WebSocket connections with semaphore Apr 8, 2026
@mw2000 mw2000 force-pushed the mw2000/fix-ws-listener-dos branch from 709113e to b79d937 Compare April 8, 2026 00:30
@mw2000 mw2000 removed request for danyalprout and niran April 8, 2026 00:30
@mw2000 mw2000 force-pushed the mw2000/fix-ws-listener-dos branch from b79d937 to 315e018 Compare April 8, 2026 00:36
@mw2000 mw2000 marked this pull request as draft April 8, 2026 00:49
Add a tokio::sync::Semaphore to bound the number of concurrent
handshake + broadcast tasks. Without a limit, heavy connection
pressure can exhaust memory and task-scheduler resources before
handshake timeouts reclaim anything. Connections beyond the limit are
dropped immediately with an on_connection_rejected metric callback.

Builds on the handshake timeout and spawned-handshake improvements
from #1252.
@mw2000 mw2000 force-pushed the mw2000/fix-ws-listener-dos branch from 315e018 to aa3b6d3 Compare April 8, 2026 00:53
///
/// Defaults to [`DEFAULT_MAX_CONNECTIONS`] (8192).
pub const fn with_max_connections(mut self, max_connections: usize) -> Self {
self.max_connections = max_connections;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: with_max_connections(0) will create a Semaphore::new(0) that rejects every connection unconditionally. Consider validating the input or using NonZeroUsize:

Suggested change
self.max_connections = max_connections;
pub const fn with_max_connections(mut self, max_connections: NonZeroUsize) -> Self {
self.max_connections = max_connections.get();
self

Alternatively, a runtime assert!(max_connections > 0) or saturating to 1 would also work. Not a blocker, but a zero-capacity listener is almost certainly a misconfiguration.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're using this for the unit test

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 8, 2026

Review Summary

Solid PR that addresses a real DoS vector. The semaphore-based approach is the right tool for the job — try_acquire_owned is non-blocking and the owned permit naturally ties the connection lifetime to the semaphore slot.

What looks good

  • Semaphore placement: Acquiring before tokio::spawn means the limit covers both handshaking and active connections, which is the correct granularity.
  • debug! log level: Previous review flagged warn! as a secondary DoS vector under flood conditions — addressed correctly.
  • Default body on on_connection_rejected: Since all PublisherMetrics impls are internal to this crate, the default no-op is fine and avoids a breaking change to the trait.
  • Test coverage: The new test exercises the reject path with with_max_connections(1) and verifies the metric counter.

Finding

  • with_max_connections(0) is silently accepted — creates a listener that rejects 100% of connections. Consider using NonZeroUsize or adding a runtime assertion. (minor, posted inline)

@mw2000
Copy link
Copy Markdown
Contributor Author

mw2000 commented Apr 8, 2026

Upon some further review, we will hold off/close this PR

The main reason is the number of connections on mainnet are too low (~30) since the builder mostly sits behind websocket proxy.

@mw2000 mw2000 closed this Apr 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants