Ephemeral/JIT runner reports "lost communication" despite successful job completion — no grace period before broker disconnect

Ephemeral (one-time-use) JIT runners that successfully complete a job are frequently reported by GitHub as "The self-hosted runner lost communication with the server," even though:

  1. The worker process exits with code 100 (success)
  2. CompleteJobAsync succeeds — the listener logs "finish job request for job {id} with result: Succeeded"
  3. The broker confirms the state change — "Received job status event. JobState: Online"
  4. The runner cleanly deletes its session and exits with return code 0

GitHub then shows one or more of these symptoms:
  - The job stays stuck as "queued" or "in_progress" for 10-30+ minutes before eventually updating
  - The UI shows "The self-hosted runner lost communication with the server" despite the job having completed successfully
  - In some cases, the workflow_job.completed webhook is never delivered

Root Cause

In `Runner.cs` line 576, after the one-time-use job completes, the message queue is cancelled immediately with zero grace period:

```
  // Runner.cs, lines 570-576
  Task completeTask = await Task.WhenAny(getNextMessage, jobDispatcher.RunOnceJobCompleted.Task);
  if (completeTask == jobDispatcher.RunOnceJobCompleted.Task)
  {
      runOnceJobCompleted = true;
      Trace.Info("Job has finished at backend, the runner will exit since it is running under onetime use mode.");
      Trace.Info("Stop message queue looping.");
      messageQueueLoopTokenSource.Cancel();  // <-- immediate teardown, no grace period
```

This cancels the in-flight broker long-poll (GET broker.actions.githubusercontent.com/message), which severs the TCP connection. GitHub's broker health monitoring detects the disconnect and flags "runner lost communication" before GitHub's internal pipeline service has propagated the job completion to the webhook/UI systems.

The race is between two independent GitHub backend systems:
  1. Pipeline service — received CompleteJobAsync, knows the job succeeded
  2. Broker health monitor — sees TCP disconnect, flags the runner as lost

When the broker health monitor wins the race, the job is marked as failed/lost despite having completed.
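
The ordering argument can be written down as a toy model. This is only a sketch of the hypothesis above, not a description of GitHub's backend: `RaceModel`, `JobOutcome`, and both delay values are hypothetical names and numbers chosen for illustration, and nothing here has been measured.

```csharp
using System;

public enum JobOutcome { Succeeded, LostCommunication }

public static class RaceModel
{
    // Hypothetical model: all durations are measured from the moment the
    // runner finishes the job. The disconnect can only be observed after the
    // runner cancels the long-poll, i.e. after any delay before teardown.
    public static JobOutcome Resolve(
        TimeSpan completionPropagation,  // time for the completion to reach webhook/UI
        TimeSpan disconnectDetection,    // time for the health monitor to notice the disconnect
        TimeSpan delayBeforeTeardown)    // how long the runner waits before cancelling
    {
        TimeSpan disconnectSeenAt = delayBeforeTeardown + disconnectDetection;
        return disconnectSeenAt < completionPropagation
            ? JobOutcome.LostCommunication
            : JobOutcome.Succeeded;
    }

    public static void Main()
    {
        TimeSpan propagation = TimeSpan.FromSeconds(2);       // assumed, not measured
        TimeSpan detection = TimeSpan.FromMilliseconds(100);  // assumed, not measured

        // Immediate teardown: the disconnect is seen first.
        Console.WriteLine(Resolve(propagation, detection, TimeSpan.Zero));
        // Teardown delayed past the propagation time: the completion is seen first.
        Console.WriteLine(Resolve(propagation, detection, TimeSpan.FromSeconds(5)));
    }
}
```

Under this model the outcome depends only on whether the disconnect is observed before the completion has propagated, which is why delaying the teardown changes the result.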

Evidence

Runner diagnostic logs showing the complete successful flow immediately followed by the forced disconnect:

```
  [08:03:33Z INFO] Worker finished for job f19fed06-... Code: 100
  [08:03:33Z INFO] finish job request for job f19fed06-... with result: Succeeded
  [08:03:33Z INFO] Job X Build completed with result: Succeeded
  [08:03:33Z INFO] JobCompleted Notification
  [08:03:33Z INFO] Received job status event. JobState: Online        ← GitHub acknowledged completion
  [08:03:33Z INFO] Fire signal for one time used runner.
  [08:03:33Z INFO] Job has finished at backend...
  [08:03:33Z INFO] Stop message queue looping.
  [08:03:33Z WARN] GET request to broker.actions.githubusercontent.com/message... has been cancelled.
  [08:03:33Z ERR ] TaskCanceledException: The operation was canceled.  ← Broker sees disconnect
  [08:03:33Z INFO] Job request f19fed06-... processed succeed.
  [08:03:33Z INFO] Deleting Runner Session...
  [08:03:33Z INFO] Runner execution has finished with return code 0
```

All timestamps are identical (08:03:33Z) — there is zero delay between completion acknowledgment and broker teardown.
GitHub UI result: "The self-hosted runner lost communication with the server" — despite the job completing successfully.
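
To check the same timing on other runs, the gap between the acknowledgment and the cancelled long-poll can be computed from the log. The following is a rough sketch: `LogGapSketch` is a hypothetical helper, and it assumes the abbreviated `[HH:mm:ssZ LEVEL]` prefix used in the excerpt above. Real diagnostic logs may carry a fuller timestamp, so the regular expression would need adjusting.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class LogGapSketch
{
    // Matches the abbreviated "[HH:mm:ssZ LEVEL]" prefix from the excerpt.
    private static readonly Regex Prefix = new Regex(@"^\s*\[(\d{2}:\d{2}:\d{2})Z\s");

    // Returns the time of day of the first line containing `marker`, or null.
    private static TimeSpan? TimeOfFirst(string[] lines, string marker)
    {
        foreach (string line in lines)
        {
            if (!line.Contains(marker)) continue;
            Match m = Prefix.Match(line);
            if (m.Success)
            {
                return TimeSpan.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture);
            }
        }
        return null;
    }

    // Gap between the first occurrences of two markers.
    // Does not handle logs that cross midnight.
    public static TimeSpan? GapBetween(string[] lines, string firstMarker, string secondMarker)
    {
        TimeSpan? first = TimeOfFirst(lines, firstMarker);
        TimeSpan? second = TimeOfFirst(lines, secondMarker);
        if (first == null || second == null) return null;
        return second.Value - first.Value;
    }

    public static void Main()
    {
        string[] lines =
        {
            "  [08:03:33Z INFO] Received job status event. JobState: Online",
            "  [08:03:33Z INFO] Stop message queue looping.",
            "  [08:03:33Z WARN] GET request to broker.actions.githubusercontent.com/message... has been cancelled.",
        };
        Console.WriteLine(GapBetween(lines, "Received job status event", "has been cancelled"));
    }
}
```

Because the excerpt's timestamps have one-second resolution, a zero gap only shows that the teardown followed the acknowledgment within the same second.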

Reproduction

  1. Use ephemeral JIT runners (Ephemeral: true, UseV2Flow: true)
  2. Run any short job (under 60s makes the race more likely)
  3. Check the runner logs: they show successful completion followed immediately by the broker disconnect
  4. Check the GitHub UI: the job shows "lost communication" or stays stuck in queued/in_progress

Tested on runner versions 2.331.0 and 2.333.0; both show the same behavior.

Proposed Fix

Add a brief grace period (e.g., 5 seconds) before cancelling the message queue, allowing GitHub's backend systems to propagate the completion:

```
  runOnceJobCompleted = true;
  Trace.Info("Job has finished at backend, the runner will exit since it is running under onetime use mode.");

  // Grace period: keep broker connection alive so GitHub's backend can
  // propagate job completion before seeing the runner disconnect.
  await Task.Delay(TimeSpan.FromSeconds(5));

  Trace.Info("Stop message queue looping.");
  messageQueueLoopTokenSource.Cancel();
```

This keeps the broker connection open while GitHub's pipeline service syncs the completion status to its webhook and UI systems. The 5-second value is a heuristic: it narrows the race window but does not guarantee ordering.
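
One possible refinement is to make the grace period itself cancellable, so that an operator-initiated shutdown is not held up by the wait. The following is a standalone sketch using only `System.Threading` types. `GracePeriodSketch`, `WaitGracePeriodAsync`, and the `shutdownToken` parameter are hypothetical names for illustration; how such a token would be wired into `Runner.cs` is an assumption and has not been checked against the runner source.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class GracePeriodSketch
{
    // Waits up to `grace` before the caller tears down the message loop.
    // Returns true if the full grace period elapsed, false if shutdown was
    // requested first (so the wait never delays an explicit shutdown).
    public static async Task<bool> WaitGracePeriodAsync(TimeSpan grace, CancellationToken shutdownToken)
    {
        try
        {
            await Task.Delay(grace, shutdownToken);
            return true;
        }
        catch (OperationCanceledException)
        {
            return false;
        }
    }

    public static async Task Main()
    {
        using var messageQueueLoopTokenSource = new CancellationTokenSource();
        using var shutdown = new CancellationTokenSource();  // hypothetical shutdown signal

        // Stand-in for the broker long-poll: it completes only when cancelled.
        Task longPoll = Task.Delay(Timeout.InfiniteTimeSpan, messageQueueLoopTokenSource.Token);

        // Short grace period for the demo; the proposal above uses 5 seconds.
        bool elapsed = await WaitGracePeriodAsync(TimeSpan.FromMilliseconds(200), shutdown.Token);
        Console.WriteLine($"Grace period elapsed: {elapsed}; long-poll still open: {!longPoll.IsCompleted}");

        // Only now tear down the message loop.
        messageQueueLoopTokenSource.Cancel();
        try { await longPoll; } catch (OperationCanceledException) { }
        Console.WriteLine($"Long-poll cancelled: {longPoll.IsCanceled}");
    }
}
```

The long-poll stand-in stays pending for the whole grace period and is cancelled only afterwards, which is the ordering the proposed change is meant to produce.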

Environment
  - Runner versions: 2.331.0, 2.333.0
  - OS: Ubuntu 24.04
  - Mode: Ephemeral JIT runner (--jitconfig, org-level, V2 broker flow)
  - Scale: hundreds of jobs/day; the issue affects roughly 5-10% of runs

Related Issues

  - #3539 — same error message, different root cause (resource starvation)
  - #3981 — same error message, references #2624
  - #2040 — runner shutdown/stoppage, open since 2022
