[upgrade-lifecycle] High-severity upgrade lifecycle failures: corrupt marker startup abort + rollback ACK before execution #13306

## Findings

### 1) Corrupted `.update-marker` can prevent agent startup after power loss/disk-full during upgrade

**Priority:** P1 (common path: any managed/CLI upgrade)

**Platform:** Linux, Windows, macOS

**Location:**
- `internal/pkg/agent/application/upgrade/step_mark.go:162-166`
- `internal/pkg/agent/application/upgrade/upgrade.go:174`
- `internal/pkg/agent/cmd/run.go:171-175`
- `internal/pkg/agent/cmd/run.go:805-809`
- `internal/pkg/agent/cmd/run.go:231-233`
- `internal/pkg/agent/application/upgrade/step_mark.go:215-217`

**Evidence:**
- The upgrade marker is written directly and non-atomically via `os.WriteFile`:
  - `markUpgrade: ... if err := writeFile(markerPath, markerBytes, 0600); err != nil ...`
  - `NewUpgrader` injects `markUpgradeProvider(UpdateActiveCommit, os.WriteFile)`.
- Startup hard-fails if marker is unreadable:
  - `handleUpgrade()` calls `upgrade.LoadMarker(paths.Data())` and returns error on unmarshal/read failure.
  - `runElasticAgentCritical()` collects that error and returns it before normal startup.
- Marker unmarshal error path:
  - `yaml.Unmarshal(markerBytes, &marker)` returns error directly.

**What is wrong:**
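A crash, power loss, or disk-full condition during the marker rewrite can leave a truncated or corrupt marker, because `os.WriteFile` truncates the existing file before writing the new contents. On the next start, the agent aborts startup instead of recovering from the corrupt marker.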

**Why it matters:**
Upgrade is a normal operation path. A single interruption can leave the host agent unavailable until manual repair of internal state, impacting Fleet-managed upgrades at scale.

**Suggested fix direction:**
- Use atomic marker persistence in upgrade marking path (temp file + fsync + rename, reusing existing safe marker-writing primitives where possible).
- In startup `handleUpgrade`, treat malformed marker as recoverable with explicit degraded handling (e.g., move aside invalid marker + continue, while surfacing error state).

**Test direction (failing test to add):**
- Add a startup-path test in `internal/pkg/agent/cmd/run_test.go` (or equivalent) that places a malformed `.update-marker` in `data/` and asserts that startup does **not** abort the critical path.
- Add upgrade marker write interruption/corruption test around `markUpgrade` behavior in `step_mark_test.go`.

---

### 2) Manual rollback action can be ACKed before rollback outcome is known (silent remote failure)

**Priority:** P0 (rollback path after failed upgrade)

**Platform:** Linux, Windows, macOS

**Location:**
- `internal/pkg/agent/application/upgrade/manual_rollback.go:62-83`
- `internal/pkg/agent/application/upgrade/rollback_other.go:30-37`
- `internal/pkg/agent/application/upgrade/rollback_windows.go:85-92`
- `internal/pkg/agent/application/coordinator/coordinator.go:893-896`
- `internal/pkg/agent/cmd/watch.go:91-104`

**Evidence:**
- Rollback path starts watcher and returns success immediately:
  - `InvokeWatcher(..., "--rollback", ...)` then `return nil, nil`.
- Watcher process completion/error is only logged asynchronously:
  - `go func(){ if err := cmd.Wait(); err != nil { log... } }()`.
- Coordinator ACKs rollback action unconditionally once upgrade call returns:
  - `if uOpts.rollback { return c.upgradeMgr.AckAction(...) }`.
- Actual rollback execution can fail later in watcher command path with non-zero exit:
  - `watch --rollback` calls `rollback(...)`; failures exit with `errorRollbackFailed`.

**What is wrong:**
The control plane can receive a rollback success ACK before the rollback has actually executed, let alone succeeded.

**Why it matters:**
In a real failed-upgrade incident, Fleet/operator may see rollback as complete while the host remains broken, delaying remediation and masking production impact.

**Suggested fix direction:**
- Delay the ACK until the rollback outcome is confirmed (for example via a synchronous wait, an IPC status file, or an explicit completion signal from the watcher).
- Propagate rollback execution failure back to action result instead of log-only async failure.

**Test direction (failing test to add):**
- Extend `internal/pkg/agent/application/upgrade/manual_rollback_test.go` + coordinator unit tests to simulate: watcher starts successfully, rollback execution fails immediately; assert action is **not** ACKed as success.

## Priority ranking

1. **P0:** rollback ACK-before-outcome (`manual_rollback` + coordinator ACK path) — directly impacts incident recovery correctness.
2. **P1:** corrupt marker startup abort (`markUpgrade` write + `handleUpgrade` hard-fail) — common upgrade path under crash/power-loss conditions.

## Upgrade paths audited and found safe

- Watcher startup handshake waits for `UPG_WATCHING` marker state with timeout (`internal/pkg/agent/application/upgrade/watcher.go`).
- Watcher takeover path uses lock-based coordination with timeout (`internal/pkg/agent/application/upgrade/watcher.go`).
- Symlink rotation in relink path uses safer rotate primitive (`internal/pkg/agent/application/upgrade/step_relink.go`).
- Existing marker access retry handling on Windows reduces transient lock races (`internal/pkg/agent/application/upgrade/marker_access_windows.go`).





---
[What is this?](https://ela.st/github-ai-tools) | [From workflow: Sweeper: Upgrade and Rollback Lifecycle](https://github.com/elastic/elastic-agent/actions/runs/23518190942)

> - [x] expires on Apr 1, 2026, 12:16 AM UTC
