Fix infinite 401 retry loop when app logs resubscribe fails#7055
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
|
We detected some changes at Caution DO NOT create changesets for features which you do not wish to be included in the public changelog of the next CLI release. |
Coverage report
Test suite run success3938 tests passing in 1514 suites. Report generated by 🧪jest coverage report action from 59476a1 |
craigmichaelmartin
left a comment
There was a problem hiding this comment.
Review assisted by pair-review
Code Review from River 🌊Overall assessment: Targeted, correct fix for the immediate symptom. Should ship — it'll reduce load by ~99% for this failure mode. But a follow-up is needed. What this PR gets right
What remains unfixed
Observability contextFrom Observe data (last 1h), the App Management GraphQL endpoint is receiving ~22K 401 requests/hour — all of them are |
dmerand
left a comment
There was a problem hiding this comment.
Dropping in a bug that looks like it needs addressing.
| @@ -109,17 +110,26 @@ export interface AppLogsOptions { | |||
|
|
|||
| export const handleFetchAppLogsError = async ( | |||
There was a problem hiding this comment.
resubscribeFailed is not enough to drive the new counter reset logic. A false value currently means either "resubscribe succeeded" or "this was some other error path, like 429/500, so no resubscribe happened". All three callers reset their consecutive-failure counter on false, which means unrelated errors between two 401 resubscribe failures can clear the counter and prevent the cutoff from triggering when it should.
Either return a distinct state for successful resubscribe vs. non-401 errors, or have the callers reset only when a fresh token is actually returned. This affects [[file:packages/app/src/cli/services/app-logs/dev/poll-app-logs.ts:107-117]], [[file:packages/app/src/cli/services/app-logs/logs-command/render-json-logs.ts:44-54]], and [[file:packages/app/src/cli/services/app-logs/logs-command/ui/components/hooks/usePollAppLogs.ts:169-179]].
There was a problem hiding this comment.
Resolved — replaced resubscribeFailed: boolean with a tri-state resubscribeResult: 'succeeded' | 'failed' | 'not_attempted'. Callers now only reset the consecutive-failure counter on 'succeeded' and leave it unchanged on 'not_attempted' (429/500/other non-401 errors), so unrelated errors no longer clear the counter.
When the JWT for app log polling expires, handleFetchAppLogsError calls onResubscribe() to get a fresh token. If onResubscribe() itself throws (network error, expired session), the exception propagates unhandled, causing the dev polling loop to retry every 5s with the same expired JWT — an infinite 401 loop. Wrap onResubscribe() in a try-catch so failures fall back to a 60s retry interval instead of propagating. This is a 12x reduction in 401 rate per affected client and allows self-healing when the underlying issue resolves.
45f916a to
9f2f644
Compare
9f2f644 to
59476a1
Compare
|
River's assessment is wrong in one thing, |

Summary
onResubscribe()inhandleFetchAppLogsErrorwith try-catch so resubscribe failures use a 60s retry interval instead of propagating as unhandled exceptionsdev/poll-app-logs.ts(catch block retried every 5s with the same expired JWT)logs-command/render-json-logs.tslogs-command/usePollAppLogs.tspoll-app-logsandrender-json-logsWHY are these changes introduced?
WHAT is this pull request doing?
How to test your changes?
Post-release steps
Measuring impact
How do we know this change was effective? Please choose one:
Checklist