Ephemeral/JIT runner reports "lost communication" despite successful job completion — no grace period before broker disconnect #4309

@cameronnewman

Description

Ephemeral (one-time-use) JIT runners that successfully complete a job are frequently reported by GitHub as "The self-hosted runner lost communication with the server," even though:

  1. The worker process exits with code 100 (success)
  2. CompleteJobAsync succeeds — the listener logs "finish job request for job {id} with result: Succeeded"
  3. The broker confirms the state change — "Received job status event. JobState: Online"
  4. The runner cleanly deletes its session and exits with return code 0

The GitHub UI then either:

  • Shows the job stuck as "queued" or "in_progress" for 10-30+ minutes before eventually updating
  • Shows "The self-hosted runner lost communication with the server" despite the job having completed successfully
  • In some cases, never delivers the workflow_job.completed webhook

Root Cause

In Runner.cs line 576, after the one-time-use job completes, the message queue is cancelled immediately with zero grace period:

  // Lines 570-576
  Task completeTask = await Task.WhenAny(getNextMessage, jobDispatcher.RunOnceJobCompleted.Task);
  if (completeTask == jobDispatcher.RunOnceJobCompleted.Task)
  {
      runOnceJobCompleted = true;
      Trace.Info("Job has finished at backend, the runner will exit since it is running under onetime use mode.");
      Trace.Info("Stop message queue looping.");
      messageQueueLoopTokenSource.Cancel();  // <-- immediate teardown, no grace period
      // ... (rest of the block omitted)
  }

This cancels the in-flight broker long-poll (GET broker.actions.githubusercontent.com/message), which severs the TCP connection. Judging from the observed behavior, GitHub's broker health monitoring detects the disconnect and flags "runner lost communication" before GitHub's internal pipeline service has propagated the job completion to the webhook/UI systems.

The race is between two independent GitHub backend systems:

  1. Pipeline service — received CompleteJobAsync, knows the job succeeded
  2. Broker health monitor — sees TCP disconnect, flags the runner as lost

When the broker health monitor wins the race, the job is marked as failed/lost despite having completed.
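The race can be written down as a toy timing model. This is only a sketch of the hypothesis above: the class name, parameter names, and delays are hypothetical, and nothing here reflects how GitHub's backend is actually implemented.

```csharp
using System;

// Toy model of the suspected race. All names and delays are hypothetical.
public static class CompletionRace
{
    // Returns true when the (assumed) broker health monitor would observe the
    // disconnect before the (assumed) pipeline service has propagated completion.
    public static bool BrokerWinsRace(
        TimeSpan propagationDelay, // time for completion to reach webhook/UI systems
        TimeSpan detectionDelay,   // time for the broker to notice the disconnect
        TimeSpan gracePeriod)      // how long the runner keeps polling after completion
    {
        // The disconnect can only be observed once the grace period has elapsed.
        TimeSpan disconnectObservedAt = gracePeriod + detectionDelay;
        return disconnectObservedAt < propagationDelay;
    }
}
```

With a grace period of zero, any propagation delay longer than the detection delay loses the race. A grace period longer than the worst-case propagation delay removes the race in this model.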

Evidence

Runner diagnostic logs showing the complete successful flow immediately followed by the forced disconnect:

  [08:03:33Z INFO] Worker finished for job f19fed06-... Code: 100
  [08:03:33Z INFO] finish job request for job f19fed06-... with result: Succeeded
  [08:03:33Z INFO] Job X Build completed with result: Succeeded
  [08:03:33Z INFO] JobCompleted Notification
  [08:03:33Z INFO] Received job status event. JobState: Online        ← GitHub acknowledged completion
  [08:03:33Z INFO] Fire signal for one time used runner.
  [08:03:33Z INFO] Job has finished at backend...
  [08:03:33Z INFO] Stop message queue looping.
  [08:03:33Z WARN] GET request to broker.actions.githubusercontent.com/message... has been cancelled.
  [08:03:33Z ERR ] TaskCanceledException: The operation was canceled.  ← Broker sees disconnect
  [08:03:33Z INFO] Job request f1918d06-... processed succeed.
  [08:03:33Z INFO] Deleting Runner Session...
  [08:03:33Z INFO] Runner execution has finished with return code 0

All timestamps fall within the same second (08:03:33Z), so there is no measurable delay between completion acknowledgment and broker teardown.
GitHub UI result: "The self-hosted runner lost communication with the server", despite the job completing successfully.
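To check other runs for the same pattern, a small helper can measure the gap between the completion acknowledgment and the cancelled broker poll. This is a sketch that assumes the abbreviated `[HH:mm:ssZ LEVEL] message` format used in the excerpt above. The `TeardownGap` class is hypothetical, real diagnostic logs may carry a fuller timestamp prefix, and runs that cross midnight are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class TeardownGap
{
    // Matches lines such as "[08:03:33Z INFO] Stop message queue looping."
    private static readonly Regex LinePattern = new Regex(
        @"^\s*\[(?<ts>\d{2}:\d{2}:\d{2})Z\s+(?<level>\w+)\s*\]\s*(?<msg>.*)$");

    // Returns the time between "Received job status event" and the first later
    // "has been cancelled" line, or null if either line is missing.
    public static TimeSpan? Measure(IEnumerable<string> lines)
    {
        TimeSpan? acknowledged = null;
        foreach (string line in lines)
        {
            Match m = LinePattern.Match(line);
            if (!m.Success) continue;

            TimeSpan ts = TimeSpan.ParseExact(
                m.Groups["ts"].Value, @"hh\:mm\:ss", CultureInfo.InvariantCulture);
            string msg = m.Groups["msg"].Value;

            if (msg.StartsWith("Received job status event", StringComparison.Ordinal))
                acknowledged = ts;
            else if (acknowledged.HasValue && msg.Contains("has been cancelled"))
                return ts - acknowledged.Value;
        }
        return null;
    }
}
```

The log timestamps have one-second resolution, so a result of `TimeSpan.Zero` means "within the same second", not literally zero.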

Reproduction

  • Use ephemeral JIT runners (Ephemeral: true, UseV2Flow: true)
  • Run any short job (< 60s makes the race more likely)
  • Observe that the runner logs show successful completion followed by an immediate broker disconnect
  • The GitHub UI shows "lost communication" or stays stuck in queued/in_progress

Tested on runner versions 2.331.0 and 2.333.0 — same behavior.

Proposed Fix

Add a brief grace period (e.g., 5 seconds) before cancelling the message queue, allowing GitHub's backend systems to propagate the completion:

  runOnceJobCompleted = true;
  Trace.Info("Job has finished at backend, the runner will exit since it is running under onetime use mode.");

  // Grace period: keep broker connection alive so GitHub's backend can
  // propagate job completion before seeing the runner disconnect.
  await Task.Delay(TimeSpan.FromSeconds(5));

  Trace.Info("Stop message queue looping.");
  messageQueueLoopTokenSource.Cancel();

This ensures the broker connection remains healthy while GitHub's pipeline service syncs the completion status to its webhook and UI systems.
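One possible refinement is to make the grace period interruptible, so that a runner that is being shut down does not wait the full delay. The sketch below is standalone and is not the runner's code: `GracefulStop`, `CancelAfterGraceAsync`, and the `shutdownToken` parameter are hypothetical names, and only `messageQueueLoopTokenSource` mirrors the variable in the excerpt.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class GracefulStop
{
    // Waits up to gracePeriod, then cancels the message-queue token source.
    // If shutdownToken fires first, the wait ends early; the queue is cancelled
    // either way. Returns true only if the full grace period elapsed.
    public static async Task<bool> CancelAfterGraceAsync(
        CancellationTokenSource messageQueueLoopTokenSource,
        TimeSpan gracePeriod,
        CancellationToken shutdownToken)
    {
        bool fullGrace = true;
        try
        {
            await Task.Delay(gracePeriod, shutdownToken);
        }
        catch (OperationCanceledException)
        {
            // Shutdown was requested during the grace period.
            fullGrace = false;
        }
        finally
        {
            messageQueueLoopTokenSource.Cancel();
        }
        return fullGrace;
    }
}
```

The trade-off is that every one-time-use runner takes up to the grace period longer to exit, which matters for fleets that bill or scale per runner-second.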

Environment

  • Runner versions: 2.331.0, 2.333.0
  • OS: Ubuntu 24.04
  • Mode: Ephemeral JIT runner (--jitconfig, org-level, V2 broker flow)
  • Scale: ~hundreds of jobs/day, issue affects ~5-10% of runs
