Note
🔒 Integrity filtering filtered 1 item
Integrity filtering activated and filtered the following item during workflow execution.
This happens when a tool call accesses a resource that does not meet the required integrity or secrecy level of the workflow.
issue:elastic/elastic-agent#unknown (search_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".)
Findings
1) Corrupted `.update-marker` can prevent agent startup after power loss/disk-full during upgrade
Priority: P1 (common path: any managed/CLI upgrade)
Platform: Linux, Windows, macOS
Location:
- internal/pkg/agent/application/upgrade/step_mark.go:162-166
- internal/pkg/agent/application/upgrade/upgrade.go:174
- internal/pkg/agent/cmd/run.go:171-175
- internal/pkg/agent/cmd/run.go:805-809
- internal/pkg/agent/cmd/run.go:231-233
- internal/pkg/agent/application/upgrade/step_mark.go:215-217
Evidence:
- Plain `os.WriteFile`: `markUpgrade: ... if err := writeFile(markerPath, markerBytes, 0600); err != nil ...`
- `NewUpgrader` injects `markUpgradeProvider(UpdateActiveCommit, os.WriteFile)`.
- `handleUpgrade()` calls `upgrade.LoadMarker(paths.Data())` and returns an error on unmarshal/read failure.
- `runElasticAgentCritical()` collects that error and returns it before normal startup.
- `yaml.Unmarshal(markerBytes, &marker)` returns the error directly.
What is wrong:
A crash/power loss/disk-full during marker rewrite can leave a truncated/corrupt marker. On next start, the agent aborts startup instead of recovering from marker corruption.
Why it matters:
Upgrade is a normal operation path. A single interruption can leave the host agent unavailable until manual repair of internal state, impacting Fleet-managed upgrades at scale.
Suggested fix direction:
- Use atomic marker persistence in the upgrade marking path (temp file + fsync + rename), reusing existing safe marker-writing primitives where possible.
- In startup `handleUpgrade`, treat a malformed marker as recoverable with explicit degraded handling (e.g., move the invalid marker aside and continue, while surfacing the error state).
Test direction (failing test to add):
- Add a startup-path test in `internal/pkg/agent/cmd/run_test.go` (or equivalent) that places a malformed `.update-marker` in `data/` and asserts startup does not abort the critical path.
- Add an upgrade marker write interruption/corruption test around `markUpgrade` behavior in `step_mark_test.go`.

2) Manual rollback action can be ACKed before rollback outcome is known (silent remote failure)
Priority: P0 (rollback path after failed upgrade)
Platform: Linux, Windows, macOS
Location:
- internal/pkg/agent/application/upgrade/manual_rollback.go:62-83
- internal/pkg/agent/application/upgrade/rollback_other.go:30-37
- internal/pkg/agent/application/upgrade/rollback_windows.go:85-92
- internal/pkg/agent/application/coordinator/coordinator.go:893-896
- internal/pkg/agent/cmd/watch.go:91-104
Evidence:
- The rollback path starts the watcher and returns success immediately: `InvokeWatcher(..., "--rollback", ...)` then `return nil, nil`.
- Watcher process completion/error is only logged asynchronously: `go func(){ if err := cmd.Wait(); err != nil { log... } }()`.
- The coordinator ACKs the rollback action unconditionally once the upgrade call returns: `if uOpts.rollback { return c.upgradeMgr.AckAction(...) }`.
- Actual rollback execution can fail later in the watcher command path with a non-zero exit: `watch --rollback` calls `rollback(...)`; failures exit with `errorRollbackFailed`.
What is wrong:
The control plane can receive rollback success ACK before rollback execution has actually succeeded.
Why it matters:
In a real failed-upgrade incident, Fleet/operator may see rollback as complete while the host remains broken, delaying remediation and masking production impact.
Suggested fix direction:
- Delay the ACK until the rollback outcome is confirmed (synchronous wait, an IPC status file, or an explicit completion signal from the watcher).
- Propagate rollback execution failure back to the action result instead of a log-only async failure.
Test direction (failing test to add):
- Extend `internal/pkg/agent/application/upgrade/manual_rollback_test.go` plus coordinator unit tests to simulate: watcher starts successfully, rollback execution fails immediately; assert the action is not ACKed as success.

Priority ranking
- P0: rollback ACK before outcome is known (`manual_rollback` + coordinator ACK path) — directly impacts incident recovery correctness.
- P1: corrupt marker startup abort (`markUpgrade` write + `handleUpgrade` hard-fail) — common upgrade path under crash/power-loss conditions.

Upgrade paths audited and found safe
- Watcher startup handshake waits for the `UPG_WATCHING` marker state with a timeout (internal/pkg/agent/application/upgrade/watcher.go).
- The watcher takeover path uses lock-based coordination with a timeout (internal/pkg/agent/application/upgrade/watcher.go).
- Symlink rotation in the relink path uses the safer rotate primitive (internal/pkg/agent/application/upgrade/step_relink.go).
- Existing marker access retry handling on Windows reduces transient lock races (internal/pkg/agent/application/upgrade/marker_access_windows.go).
From workflow: Sweeper: Upgrade and Rollback Lifecycle