[fleet-enrollment-resilience] False re-enrollment loop due to Fleet URL host/full-URL mismatch #13301

@github-actions

Findings

1. Re-enrollment decision compares incompatible URL formats and can trigger perpetual re-enrollment

Severity: High

Location:

  • internal/pkg/agent/cmd/container.go:1163-1166
  • internal/pkg/agent/application/enroll/options.go:74-77
  • internal/pkg/remote/client.go:64-73

Evidence:

  • Re-enroll gate uses:
    • storedConfig.Fleet.Client.GetHosts() and then slices.Contains(storedFleetHosts, setupCfg.Fleet.URL) in container.go.
  • During enrollment, EnrollOptions.RemoteConfig() calls remote.NewConfigFromURL(e.URL).
  • NewConfigFromURL stores c.Host = u.Host and c.Protocol = u.Scheme, i.e., host-only storage (fleet:8220), while setup typically provides a full URL ((fleet/redacted) or (fleet/redacted)).

This is a direct host-vs-full-URL comparison, so equivalent endpoints can compare unequal.
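
The mismatch can be reproduced in isolation. The following is a minimal sketch, not the agent's actual code: storedHostFromURL and rawContains are hypothetical stand-ins for the host-only storage and the raw membership check described above, and https://fleet:8220 is an assumed example URL (the real value is redacted in this report).

```go
package main

import (
	"net/url"
	"slices"
)

// storedHostFromURL is a hypothetical stand-in for host-only storage:
// it keeps only u.Host and drops the scheme and path.
func storedHostFromURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	return u.Host, nil
}

// rawContains is a hypothetical stand-in for the re-enroll gate:
// a plain string membership test with no normalization.
func rawContains(storedHosts []string, setupURL string) bool {
	return slices.Contains(storedHosts, setupURL)
}
```

With the assumed URL, storedHostFromURL("https://fleet:8220") yields fleet:8220, and rawContains([]string{"fleet:8220"}, "https://fleet:8220") is false even though both values name the same endpoint.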

Failure scenario (realistic):
A Kubernetes/container deployment restarts with FLEET_URL=(fleet/redacted). The stored config from the previous successful enrollment contains host fleet:8220. shouldFleetEnroll returns true on every restart, so the agent re-enrolls repeatedly instead of reusing its existing enrollment.

Why it matters:

  • Can cause repeated enrollment churn and unstable managed identity behavior.
  • Can leave orphaned/stale agent records server-side and increase Fleet control-plane load.
  • Directly impacts enrollment resilience during routine pod/node restarts and cluster migrations.

Suggested fix direction:
Normalize both sides before comparison in shouldFleetEnroll:

  • Parse setupCfg.Fleet.URL and compare canonical host:port against stored hosts.
  • Normalize trailing slash and default ports (443/80) consistently.
  • Optionally include protocol comparison separately using canonicalized values.
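
One possible shape for this normalization is sketched below. It is an illustration under assumptions, not the proposed patch: canonicalHostPort and hostsContainURL are hypothetical helper names, the stored protocol is assumed to be available as a plain string, and the URL path is deliberately ignored (which may not be appropriate if Fleet is served under a path prefix).

```go
package main

import (
	"errors"
	"net"
	"net/url"
	"slices"
	"strings"
)

// canonicalHostPort (hypothetical helper) reduces either a full URL or a
// bare host[:port] to a lowercase "host:port" string. defaultScheme is
// applied when raw carries no scheme of its own.
func canonicalHostPort(raw, defaultScheme string) (string, error) {
	if !strings.Contains(raw, "://") {
		// A bare "fleet:8220" would otherwise parse as scheme "fleet".
		raw = defaultScheme + "://" + raw
	}
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	port := u.Port()
	if port == "" {
		// Fill in the default port implied by the scheme.
		switch u.Scheme {
		case "https":
			port = "443"
		case "http":
			port = "80"
		default:
			return "", errors.New("no port and no known default for scheme")
		}
	}
	// The path (including any trailing slash) is intentionally dropped.
	return net.JoinHostPort(strings.ToLower(u.Hostname()), port), nil
}

// hostsContainURL (hypothetical helper) reports whether setupURL names the
// same host:port as any stored host after both sides are canonicalized.
func hostsContainURL(storedHosts []string, storedProtocol, setupURL string) bool {
	want, err := canonicalHostPort(setupURL, storedProtocol)
	if err != nil {
		return false
	}
	return slices.ContainsFunc(storedHosts, func(h string) bool {
		got, err := canonicalHostPort(h, storedProtocol)
		return err == nil && got == want
	})
}
```

In this sketch an unparsable value on either side is treated as "no match", so the gate would fall back to re-enrolling; whether that is the right failure mode is a design decision for the actual fix. Protocol comparison is left out and could be added as a separate check.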

Failing test to add:

  • Package: internal/pkg/agent/cmd
  • Test name: TestShouldFleetEnroll_NormalizedURLDoesNotReenroll
  • Scenario: stored Fleet host is fleet:8220 (with protocol https in stored client config), setup URL is (fleet/redacted).
  • Expected: shouldFleetEnroll(...) == false.
  • Current behavior: evaluates to true due to raw string mismatch.

Priority ranking

  1. Unrecoverable / repeated enrollment state churn: URL normalization mismatch in re-enrollment gate (finding above).

Communication paths audited and found resilient in this pass

  • Liveness ?failon=degraded handling in internal/pkg/agent/application/monitoring/liveness.go correctly maps degraded/failed state to HTTP 500 when coordinator state indicates unhealthy.
  • Check-in retry pacing uses bounded jitter backoff in the retrier path (internal/pkg/fleetapi/acker/retrier/retrier.go), avoiding tight retry loops.

Notes

I filtered out lower-confidence candidates and only reported the verified high-severity issue above.

Note

🔒 Integrity filtering filtered 2 items

Integrity filtering activated and filtered the following items during workflow execution.
This happens when a tool call accesses a resource that does not meet the required integrity or secrecy level of the workflow.

  • issue:elastic/elastic-agent#unknown (search_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".)
  • resource:list_label (list_label: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".)

From workflow: Sweeper: Fleet Enrollment and Communication Resilience

  • expires on Mar 31, 2026, 9:25 PM UTC
