fix: event delivery race on resume + stale worker detection after relaunch#559
fix: event delivery race on resume + stale worker detection after relaunch#559
Conversation
… to _sessions Two fixes from PR #559 review: 1. Stale-file threshold: replaced 30s fixed threshold with app-start-time comparison. Only downgrades to eager resume if events.jsonl was last written BEFORE the app started — meaning the CLI finished during the restart window. Long-running tools that are silent >30s are no longer misclassified as idle. 2. Event delivery race (full fix): state is now published to _sessions BEFORE ResumeSessionAsync so that HandleSessionEvent's isCurrentState check passes for events arriving during the resume RPC. Previously, early events were dropped because _sessions[name] != state. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Re-review summaryCI status: Prior findings status
FindingsNo new substantive regressions found in the current diff. Test coverageThe updated test suite now covers the key structural regressions that surfaced during review:
A fully behavioral end-to-end fallback-resume test would still be a nice future improvement, but I do not see a current blocking correctness issue in this revision. Recommendation✅ Approve The original event-delivery race fix is now wired through correctly, and the follow-up regressions raised during review have been addressed. |
Addresses PR #559 re-review findings: 1. Removed the app-start-time stale-file heuristic entirely. Any override of IsSessionStillProcessing risks killing legitimate long-running tools that started before the relaunch and went quiet. The existing watchdog (30s/120s/600s tiers) handles finished-during-restart correctly. 2. Populated history into state.Info BEFORE pre-publishing to _sessions. This ensures event dedup logic has persisted entries to match against when replayed resume-time events arrive. Without this, empty History caused duplicate tool/message entries in the UI. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use OnEvent in ResumeSessionConfig to prevent event delivery race (root cause of empty-response workers — PolyPilot bug, not SDK) - Pre-publish state to _sessions before resume RPC so HandleSessionEvent passes isCurrentState check for early events - Populate history before pre-publish for event dedup correctness - Rehydrate history on session-ID remap (synchronized via InvokeOnUI) - Preserve previous placeholder state on resume failure - Add use-local-sdk.sh for building against Copilot SDK source - Add structural regression test for OnEvent usage Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
a65ad6a to
2a8f176
Compare
Restore MCP/skill loading for resume paths and serialize post-resume history finalization on the UI thread. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The lazy-resume fallback path (session not found / corrupt) creates a fresh session via BuildFreshSessionConfig + CreateSessionAsync, but the old .On() call was removed when switching to ResumeSessionConfig.OnEvent. The fresh SessionConfig has no OnEvent property, so the fallback session was deaf — no events delivered. Fix: explicitly call copilotSession.On() in the fallback branch, matching the pattern used in worker revival and MCP reload paths. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What
Two reliability fixes found during live monitoring.
1. Event delivery race on ResumeSessionAsync (root cause of empty-response workers)
The bug: PolyPilot called
.On(handler)AFTERResumeSessionAsyncreturned. The SDK'sProcessEventsAsyncloop starts in theCopilotSessionconstructor and immediately consumes events from the channel. Events dispatched between the resume RPC response and the.On()registration were consumed with an empty handler list — silently dropped.This caused the recurring 'empty response' worker pattern: lifecycle events (TurnStart/TurnEnd) appeared to flow because they arrived after
.On(), but content events (MessageDelta) sent immediately aftersession.resumewere lost in the race window. The worker would run for 120-600s, produce zero flushed content, get watchdog-killed, then revived — repeating the cycle.The fix: Pass the event handler via
ResumeSessionConfig.OnEvent, which the SDK registers BEFORE sending thesession.resumeRPC. TheSessionStateis pre-created with minimal info and populated after resume returns.This is a PolyPilot bug, not an SDK bug — the SDK provides
OnEventspecifically to avoid this race.2. Stale events.jsonl detection after relaunch
After a relaunch,
IsSessionStillProcessingchecksevents.jsonlfor non-terminal events. If the CLI finished a turn during the relaunch window,session.idlewas NOT written (ephemeral), so the last event looks active. Workers sat with staleIsProcessing=truefor 600s until the watchdog killed them.Fix: Added a 30s mtime freshness check — if
events.jsonlhasn't been written to in 30s, the CLI is done. Downgrades from deferred processing to eager resume.Testing
ServerManagerTestsunrelated)ResumeSessionAsync_UsesOnEventInConfig_NotPostResumeOn