[upgrade-lifecycle] High-severity upgrade lifecycle failures: corrupt marker startup abort + rollback ACK before execution #13306

@github-actions

Findings

1) Corrupted .update-marker can prevent agent startup after power loss/disk-full during upgrade

Priority: P1 (common path: any managed/CLI upgrade)

Platform: Linux, Windows, macOS

Location:

  • internal/pkg/agent/application/upgrade/step_mark.go:162-166
  • internal/pkg/agent/application/upgrade/upgrade.go:174
  • internal/pkg/agent/cmd/run.go:171-175
  • internal/pkg/agent/cmd/run.go:805-809
  • internal/pkg/agent/cmd/run.go:231-233
  • internal/pkg/agent/application/upgrade/step_mark.go:215-217

Evidence:

  • Upgrade marker write uses non-atomic direct write via os.WriteFile:
    • markUpgrade: ... if err := writeFile(markerPath, markerBytes, 0600); err != nil ...
    • NewUpgrader injects markUpgradeProvider(UpdateActiveCommit, os.WriteFile).
  • Startup hard-fails if marker is unreadable:
    • handleUpgrade() calls upgrade.LoadMarker(paths.Data()) and returns error on unmarshal/read failure.
    • runElasticAgentCritical() collects that error and returns it before normal startup.
  • Marker unmarshal error path:
    • yaml.Unmarshal(markerBytes, &marker) returns error directly.

What is wrong:
os.WriteFile truncates the target file before writing the new bytes, so a crash, power loss, or full disk during the marker rewrite can leave a truncated or corrupt marker. On the next start, the agent aborts startup instead of recovering from the corruption.

Why it matters:
Upgrade is a normal operation path. A single interruption can leave the agent on that host unavailable until someone manually repairs its internal state, and the impact multiplies across Fleet-managed upgrades at scale.

Suggested fix direction:

  • Use atomic marker persistence in upgrade marking path (temp file + fsync + rename, reusing existing safe marker-writing primitives where possible).
  • In startup handleUpgrade, treat malformed marker as recoverable with explicit degraded handling (e.g., move aside invalid marker + continue, while surfacing error state).

Test direction (failing test to add):

  • Add a startup-path test in internal/pkg/agent/cmd/run_test.go (or equivalent) that places malformed .update-marker in data/ and asserts startup does not abort critical path.
  • Add upgrade marker write interruption/corruption test around markUpgrade behavior in step_mark_test.go.

2) Manual rollback action can be ACKed before rollback outcome is known (silent remote failure)

Priority: P0 (rollback path after failed upgrade)

Platform: Linux, Windows, macOS

Location:

  • internal/pkg/agent/application/upgrade/manual_rollback.go:62-83
  • internal/pkg/agent/application/upgrade/rollback_other.go:30-37
  • internal/pkg/agent/application/upgrade/rollback_windows.go:85-92
  • internal/pkg/agent/application/coordinator/coordinator.go:893-896
  • internal/pkg/agent/cmd/watch.go:91-104

Evidence:

  • Rollback path starts watcher and returns success immediately:
    • InvokeWatcher(..., "--rollback", ...) then return nil, nil.
  • Watcher process completion/error is only logged asynchronously:
    • go func(){ if err := cmd.Wait(); err != nil { log... } }().
  • Coordinator ACKs rollback action unconditionally once upgrade call returns:
    • if uOpts.rollback { return c.upgradeMgr.AckAction(...) }.
  • Actual rollback execution can fail later in watcher command path with non-zero exit:
    • watch --rollback calls rollback(...); failures exit with errorRollbackFailed.

What is wrong:
The control plane can receive rollback success ACK before rollback execution has actually succeeded.

Why it matters:
In a real failed-upgrade incident, Fleet/operator may see rollback as complete while the host remains broken, delaying remediation and masking production impact.

Suggested fix direction:

  • Delay ACK until the rollback outcome is confirmed, for example through a synchronous wait, an IPC status file, or an explicit completion signal from the watcher.
  • Propagate rollback execution failure back to action result instead of log-only async failure.

Test direction (failing test to add):

  • Extend internal/pkg/agent/application/upgrade/manual_rollback_test.go + coordinator unit tests to simulate: watcher starts successfully, rollback execution fails immediately; assert action is not ACKed as success.

Priority ranking

  1. P0: rollback ACK-before-outcome (manual_rollback + coordinator ACK path) — directly impacts incident recovery correctness.
  2. P1: corrupt marker startup abort (markUpgrade write + handleUpgrade hard-fail) — common upgrade path under crash/power-loss conditions.

Upgrade paths audited and found safe

  • Watcher startup handshake waits for UPG_WATCHING marker state with timeout (internal/pkg/agent/application/upgrade/watcher.go).
  • Watcher takeover path uses lock-based coordination with timeout (internal/pkg/agent/application/upgrade/watcher.go).
  • Symlink rotation in relink path uses safer rotate primitive (internal/pkg/agent/application/upgrade/step_relink.go).
  • Existing marker access retry handling on Windows reduces transient lock races (internal/pkg/agent/application/upgrade/marker_access_windows.go).

Note

🔒 Integrity filtering filtered 1 item

Integrity filtering activated and filtered the following item during workflow execution.
This happens when a tool call accesses a resource that does not meet the required integrity or secrecy level of the workflow.

  • issue:elastic/elastic-agent#unknown (search_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".)

  • expires on Apr 1, 2026, 12:16 AM UTC
