Skip to content

refactor: improve HTTP/2 GOAWAY and network error retry resilience#4594

Open
deepak1556 wants to merge 1 commit intomainfrom
robo/refactor_transient_error_retries
Open

refactor: improve HTTP/2 GOAWAY and network error retry resilience#4594
deepak1556 wants to merge 1 commit intomainfrom
robo/refactor_transient_error_retries

Conversation

@deepak1556
Copy link
Contributor

  • Retry graceful HTTP/2 GOAWAY (code 0) transparently in node-fetcher before surfacing as an error.

  • Retry transient electron errors (ERR_NETWORK_CHANGED, etc.) on the same fetcher with a fresh connection in fetchWithFallbacks before falling back to a different fetcher.

  • Fix canRetryOnceNetworkError to also check error .message for electron net:: errors, which were previously silently missed on the networkRequest path.

@deepak1556 deepak1556 added this to the 1.113.0 milestone Mar 22, 2026
@deepak1556 deepak1556 requested a review from chrmarti March 22, 2026 06:11
@deepak1556 deepak1556 self-assigned this Mar 22, 2026
@deepak1556 deepak1556 force-pushed the robo/refactor_transient_error_retries branch from 215b937 to cd9dfed Compare March 22, 2026 10:58
@deepak1556 deepak1556 marked this pull request as ready for review March 22, 2026 10:58
Copilot AI review requested due to automatic review settings March 22, 2026 10:58
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves resilience in the networking layer by adding targeted retry behavior for specific transient failures (HTTP/2 graceful GOAWAY and select Electron net:: errors) and by refining which network errors qualify for a one-time retry.

Changes:

  • Add a transparent retry in NodeFetchFetcher when an HTTP/2 GOAWAY with code 0 (NO_ERROR) is detected.
  • Add “disconnect + retry once” handling for specific transient Electron net::... errors in fetchWithFallbacks, with new telemetry for retry outcomes.
  • Update canRetryOnceNetworkError to recognize retryable Electron net::... errors by message (in addition to code) and refactor the list into a Set.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/platform/networking/vscode-node/test/fetcherServiceCrash.spec.ts Adds tests covering net::ERR_NETWORK_CHANGED retry behavior on the same fetcher.
src/platform/networking/node/test/nodeFetchGoaway.spec.ts Adds unit tests for GOAWAY (code 0) detection helper.
src/platform/networking/node/nodeFetchFetcher.ts Retries fetch() once when a graceful GOAWAY error is detected; exports detection helper.
src/platform/networking/node/fetcherFallback.ts Introduces transient Electron error retry helper + telemetry and reuses it for crash retry path.
src/platform/networking/common/networking.ts Expands one-time retry classification to include Electron net::... errors by message and uses a Set.

@deepak1556 deepak1556 force-pushed the robo/refactor_transient_error_retries branch from cd9dfed to b58ac62 Compare March 22, 2026 11:27
@deepak1556 deepak1556 requested a review from Copilot March 22, 2026 11:27
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

- Retry graceful HTTP/2 GOAWAY (code 0) transparently in node-fetcher
  before surfacing as an error.

- Retry transient electron errors (ERR_NETWORK_CHANGED, etc.) on the
  same fetcher with a fresh connection in fetchWithFallbacks before
  falling back to a different fetcher.

- Fix canRetryOnceNetworkError to also check error .message for electron
  net:: errors, which were previously silently missed on the networkRequest
  path.
@deepak1556 deepak1556 force-pushed the robo/refactor_transient_error_retries branch from b58ac62 to 2aacd10 Compare March 22, 2026 14:05
@deepak1556 deepak1556 modified the milestones: 1.113.0, 1.114.0 Mar 22, 2026
@deepak1556 deepak1556 requested a review from Copilot March 22, 2026 14:22
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

@@ -435,19 +435,23 @@ function networkRequest(
}

export function canRetryOnceNetworkError(reason: any) {
Copy link

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry behavior in canRetryOnceNetworkError changed (now supports matching reason.message), but there are no unit tests covering the accepted/rejected cases (e.g. matching by reason.code vs reason.message, and the ERR_* vs net::ERR_* forms). Adding focused tests for this helper would help prevent regressions in networkRequest retry behavior.

Suggested change
export function canRetryOnceNetworkError(reason: any) {
/**
* Returns true if the given error describes a network condition where we should
* attempt a single retry of the request.
*
* This helper supports both:
* - Node.js / undici error codes (via `reason.code`, e.g. "ECONNRESET", "ETIMEDOUT", "ERR_HTTP2_GOAWAY_SESSION")
* - Electron-style error messages (via `reason.message`, e.g. "net::ERR_CONNECTION_RESET")
*
* When adding or changing the entries in {@link retryableNetworkErrors}, unit tests
* should verify the accepted and rejected cases for both `reason.code` and
* `reason.message`, including the `ERR_*` vs `net::ERR_*` forms.
*/
export function canRetryOnceNetworkError(reason: { code?: string; message?: string } | null | undefined) {

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

@chrmarti chrmarti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few questions, thanks!

* These occur during server-side connection draining
* and are safe to retry on a new connection.
*/
export function isGracefulGoawayError(e: any): boolean {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems GOAWAY is on the session level and comes with a Last-Stream-ID that indicates which requests might still have been processed. Only those with later ids can be retried safely if I understand correctly. Undici doesn't seem to handle or surface Last-Stream-ID currently.

return await super.fetch(url, options);
} catch (e) {
if (isGracefulGoawayError(e)) {
await this.disconnectAll();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to disconnect all? nodejs/undici#3140 seems to suggest that this is done automatically. Does still doing this here risk cancelling newly issued requests? (E.g. when several requests of a session get the same GOAWAY, could they step on each other here?)

const fetcherId = fetcher.getUserAgentLibrary();
logService.info(`FetcherService: ${fetcherId} hit ${reason}, retrying after disconnect...`);
try {
await fetcher.disconnectAll();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We disconnectAll for all 3 error types we retry here? Could this make requests step on each other as some are being retried while later ones come here to first disconnect all?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants