Skip to content

fix: event delivery race on resume + stale worker detection after relaunch#559

Merged
PureWeen merged 4 commits intomainfrom
fix/health-monitoring-continued
Apr 8, 2026
Merged

fix: event delivery race on resume + stale worker detection after relaunch#559
PureWeen merged 4 commits intomainfrom
fix/health-monitoring-continued

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

@PureWeen PureWeen commented Apr 8, 2026

What

Two reliability fixes found during live monitoring.

1. Event delivery race on ResumeSessionAsync (root cause of empty-response workers)

The bug: PolyPilot called .On(handler) AFTER ResumeSessionAsync returned. The SDK's ProcessEventsAsync loop starts in the CopilotSession constructor and immediately consumes events from the channel. Events dispatched between the resume RPC response and the .On() registration were consumed with an empty handler list — silently dropped.

This caused the recurring 'empty response' worker pattern: lifecycle events (TurnStart/TurnEnd) appeared to flow because they arrived after .On(), but content events (MessageDelta) sent immediately after session.resume were lost in the race window. The worker would run for 120-600s, produce zero flushed content, get watchdog-killed, then revived — repeating the cycle.

The fix: Pass the event handler via ResumeSessionConfig.OnEvent, which the SDK registers BEFORE sending the session.resume RPC. The SessionState is pre-created with minimal info and populated after resume returns.

This is a PolyPilot bug, not an SDK bug — the SDK provides OnEvent specifically to avoid this race.

2. Stale events.jsonl detection after relaunch

After a relaunch, IsSessionStillProcessing checks events.jsonl for non-terminal events. If the CLI finished a turn during the relaunch window, session.idle was NOT written (ephemeral), so the last event looks active. Workers sat with stale IsProcessing=true for 600s until the watchdog killed them.

Fix: Added a 30s mtime freshness check — if events.jsonl hasn't been written to in 30s, the CLI is done. Downgrades from deferred processing to eager resume.

Testing

  • 3310/3310 tests pass (1 flaky ServerManagerTests unrelated)
  • Structural regression test: ResumeSessionAsync_UsesOnEventInConfig_NotPostResumeOn
  • Applied stale-file fix to both main repo and orchestrator worktree

PureWeen added a commit that referenced this pull request Apr 8, 2026
… to _sessions

Two fixes from PR #559 review:

1. Stale-file threshold: replaced 30s fixed threshold with app-start-time
   comparison. Only downgrades to eager resume if events.jsonl was last
   written BEFORE the app started — meaning the CLI finished during the
   restart window. Long-running tools that are silent >30s are no longer
   misclassified as idle.

2. Event delivery race (full fix): state is now published to _sessions
   BEFORE ResumeSessionAsync so that HandleSessionEvent's isCurrentState
   check passes for events arriving during the resume RPC. Previously,
   early events were dropped because _sessions[name] != state.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

PureWeen commented Apr 8, 2026

Re-review summary

CI status: ⚠️ No checks reported on this branch
Prior review comments: re-checked against the earlier request-changes review; the previously reported blockers now appear resolved
Consensus: 3/3 reviewers agree the remaining fallback-path regression has been fixed

Prior findings status

  1. ✅ FIXED — pre-published state no longer collides with a stale later TryAdd
    The success path now verifies ownership with TryGetValue + ReferenceEquals instead of trying to insert the same key twice.

  2. ✅ FIXED — transient resume failures no longer erase the restored placeholder session
    The catch block now restores the previous state instead of always removing the key.

  3. ✅ FIXED — the earlier active-but-quiet-tool downgrade concern is no longer present in this revision
    The problematic restore-time downgrade logic is gone from the current diff.

  4. ✅ FIXED — resumed sessions keep MCP servers / skill directories on both resume paths
    BuildResumeSessionConfig(...) restores the missing MCP/skill wiring for lazy and explicit resume.

  5. ✅ FIXED — the previously reported post-resume History synchronization issue is now addressed
    The remaining history finalization work has been moved behind the dedicated UI-thread helper.

  6. ✅ FIXED — lazy-resume fallback now attaches an event handler in the fresh-create recovery path
    The fallback branch now re-registers copilotSession.On(evt => HandleSessionEvent(state, evt)), closing the last remaining deaf-session regression.

Findings

No new substantive regressions found in the current diff.

Test coverage

The updated test suite now covers the key structural regressions that surfaced during review:

  • resume config uses OnEvent instead of post-resume .On(...)
  • resume and reconnect paths preserve MCP servers / skill directories
  • resumed-session UI finalization runs through the dedicated helper
  • lazy-resume fallback retains event-handler wiring

A fully behavioral end-to-end fallback-resume test would still be a nice future improvement, but I do not see a current blocking correctness issue in this revision.

Recommendation

✅ Approve

The original event-delivery race fix is now wired through correctly, and the follow-up regressions raised during review have been addressed.

PureWeen added a commit that referenced this pull request Apr 8, 2026
Addresses PR #559 re-review findings:

1. Removed the app-start-time stale-file heuristic entirely. Any override
   of IsSessionStillProcessing risks killing legitimate long-running tools
   that started before the relaunch and went quiet. The existing watchdog
   (30s/120s/600s tiers) handles finished-during-restart correctly.

2. Populated history into state.Info BEFORE pre-publishing to _sessions.
   This ensures event dedup logic has persisted entries to match against
   when replayed resume-time events arrive. Without this, empty History
   caused duplicate tool/message entries in the UI.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use OnEvent in ResumeSessionConfig to prevent event delivery race
  (root cause of empty-response workers — PolyPilot bug, not SDK)
- Pre-publish state to _sessions before resume RPC so HandleSessionEvent
  passes isCurrentState check for early events
- Populate history before pre-publish for event dedup correctness
- Rehydrate history on session-ID remap (synchronized via InvokeOnUI)
- Preserve previous placeholder state on resume failure
- Add use-local-sdk.sh for building against Copilot SDK source
- Add structural regression test for OnEvent usage

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the fix/health-monitoring-continued branch from a65ad6a to 2a8f176 Compare April 8, 2026 18:18
PureWeen and others added 3 commits April 8, 2026 14:05
Restore MCP/skill loading for resume paths and serialize post-resume history finalization on the UI thread.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The lazy-resume fallback path (session not found / corrupt) creates a fresh session via BuildFreshSessionConfig + CreateSessionAsync, but the old .On() call was removed when switching to ResumeSessionConfig.OnEvent. The fresh SessionConfig has no OnEvent property, so the fallback session was deaf — no events delivered.

Fix: explicitly call copilotSession.On() in the fallback branch, matching the pattern used in worker revival and MCP reload paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit 2030231 into main Apr 8, 2026
@PureWeen PureWeen deleted the fix/health-monitoring-continued branch April 8, 2026 20:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant