refactor: improve HTTP/2 GOAWAY and network error retry resilience#4594
refactor: improve HTTP/2 GOAWAY and network error retry resilience#4594deepak1556 wants to merge 1 commit intomainfrom
Conversation
215b937 to
cd9dfed
Compare
There was a problem hiding this comment.
Pull request overview
This PR improves resilience in the networking layer by adding targeted retry behavior for specific transient failures (HTTP/2 graceful GOAWAY and select Electron net:: errors) and by refining which network errors qualify for a one-time retry.
Changes:
- Add a transparent retry in
NodeFetchFetcherwhen an HTTP/2 GOAWAY with code 0 (NO_ERROR) is detected. - Add “disconnect + retry once” handling for specific transient Electron
net::...errors infetchWithFallbacks, with new telemetry for retry outcomes. - Update
canRetryOnceNetworkErrorto recognize retryable Electronnet::...errors by message (in addition to code) and refactor the list into aSet.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/platform/networking/vscode-node/test/fetcherServiceCrash.spec.ts | Adds tests covering net::ERR_NETWORK_CHANGED retry behavior on the same fetcher. |
| src/platform/networking/node/test/nodeFetchGoaway.spec.ts | Adds unit tests for GOAWAY (code 0) detection helper. |
| src/platform/networking/node/nodeFetchFetcher.ts | Retries fetch() once when a graceful GOAWAY error is detected; exports detection helper. |
| src/platform/networking/node/fetcherFallback.ts | Introduces transient Electron error retry helper + telemetry and reuses it for crash retry path. |
| src/platform/networking/common/networking.ts | Expands one-time retry classification to include Electron net::... errors by message and uses a Set. |
cd9dfed to
b58ac62
Compare
- Retry graceful HTTP/2 GOAWAY (code 0) transparently in node-fetcher before surfacing as an error. - Retry transient electron errors (ERR_NETWORK_CHANGED, etc.) on the same fetcher with a fresh connection in fetchWithFallbacks before falling back to a different fetcher. - Fix canRetryOnceNetworkError to also check error .message for electron net:: errors, which were previously silently missed on the networkRequest path.
b58ac62 to
2aacd10
Compare
| @@ -435,19 +435,23 @@ function networkRequest( | |||
| } | |||
|
|
|||
| export function canRetryOnceNetworkError(reason: any) { | |||
There was a problem hiding this comment.
The retry behavior in canRetryOnceNetworkError changed (now supports matching reason.message), but there are no unit tests covering the accepted/rejected cases (e.g. matching by reason.code vs reason.message, and the ERR_* vs net::ERR_* forms). Adding focused tests for this helper would help prevent regressions in networkRequest retry behavior.
| export function canRetryOnceNetworkError(reason: any) { | |
| /** | |
| * Returns true if the given error describes a network condition where we should | |
| * attempt a single retry of the request. | |
| * | |
| * This helper supports both: | |
| * - Node.js / undici error codes (via `reason.code`, e.g. "ECONNRESET", "ETIMEDOUT", "ERR_HTTP2_GOAWAY_SESSION") | |
| * - Electron-style error messages (via `reason.message`, e.g. "net::ERR_CONNECTION_RESET") | |
| * | |
| * When adding or changing the entries in {@link retryableNetworkErrors}, unit tests | |
| * should verify the accepted and rejected cases for both `reason.code` and | |
| * `reason.message`, including the `ERR_*` vs `net::ERR_*` forms. | |
| */ | |
| export function canRetryOnceNetworkError(reason: { code?: string; message?: string } | null | undefined) { |
chrmarti
left a comment
There was a problem hiding this comment.
Left a few questions, thanks!
| * These occur during server-side connection draining | ||
| * and are safe to retry on a new connection. | ||
| */ | ||
| export function isGracefulGoawayError(e: any): boolean { |
There was a problem hiding this comment.
It seems GOAWAY is on the session level and comes with a Last-Stream-ID that indicates which requests might still have been processed. Only those with later ids can be retried safely if I understand correctly. Undici doesn't seem to handle or surface Last-Stream-ID currently.
| return await super.fetch(url, options); | ||
| } catch (e) { | ||
| if (isGracefulGoawayError(e)) { | ||
| await this.disconnectAll(); |
There was a problem hiding this comment.
Do we need to disconnect all? nodejs/undici#3140 seems to suggest that this is done automatically. Does still doing this here risk cancelling newly issued requests? (E.g. when several requests of a session get the same GOAWAY, could they step on each other here?)
| const fetcherId = fetcher.getUserAgentLibrary(); | ||
| logService.info(`FetcherService: ${fetcherId} hit ${reason}, retrying after disconnect...`); | ||
| try { | ||
| await fetcher.disconnectAll(); |
There was a problem hiding this comment.
We disconnectAll for all 3 error types we retry here? Could this make requests step on each other as some are being retried while later ones come here to first disconnect all?
Retry graceful HTTP/2 GOAWAY (code 0) transparently in node-fetcher before surfacing as an error.
Retry transient electron errors (ERR_NETWORK_CHANGED, etc.) on the same fetcher with a fresh connection in fetchWithFallbacks before falling back to a different fetcher.
Fix canRetryOnceNetworkError to also check error .message for electron net:: errors, which were previously silently missed on the networkRequest path.