[fleet-enrollment-resilience] Trailing-slash Fleet URL mismatch can trigger container re-enrollment loops #14872

@github-actions

Findings

1. shouldFleetEnroll compares full Fleet URL strings without canonical normalization

Priority: P0 (unrecoverable re-enrollment loop risk)

Location

  • internal/pkg/agent/cmd/container.go:1212-1229
  • internal/pkg/agent/cmd/container.go:1237-1241
  • Re-enroll trigger site: internal/pkg/agent/cmd/container.go:304-309

Evidence

At re-enroll decision time, the code checks:

  • matchedFullURL := slices.Contains(storedFleetHosts, setupCfg.Fleet.URL)
  • matchedHostOnly := slices.Contains(storedFleetHosts, setupFleetHost)
  • if !matchedFullURL && !matchedHostOnly { return true, nil }

This is a raw string comparison of full URLs. In the post-policy-update layout, stored hosts are full URLs (covered by the test case in internal/pkg/agent/cmd/container_test.go:509-534), so the host-only fallback does not apply.

A concrete mismatch case that currently re-enrolls:

  • stored host: (host1/redacted)
  • setup FLEET_URL: (host1/redacted)

These are semantically the same endpoint, but matchedFullURL is false and matchedHostOnly is also false (stored value is full URL, not host-only), so the agent decides enrollment is required.
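The mismatch can be reproduced in isolation with a minimal sketch. Everything here is illustrative: `rawMatch` is a hypothetical stand-in for the quoted comparison, the way the host-only value is derived is an assumption (the real derivation of `setupFleetHost` is not shown in this report), and the `fleet.example.com` URLs are placeholders for the redacted values.

```go
package main

import (
	"fmt"
	"net/url"
	"slices"
)

// rawMatch is a hypothetical stand-in for the comparison quoted above:
// an exact string match on the full URL, or on a host-only form.
func rawMatch(storedFleetHosts []string, setupURL string) bool {
	matchedFullURL := slices.Contains(storedFleetHosts, setupURL)

	// Assumption: the host-only form is the host:port part of the setup URL.
	setupFleetHost := setupURL
	if u, err := url.Parse(setupURL); err == nil && u.Host != "" {
		setupFleetHost = u.Host
	}
	matchedHostOnly := slices.Contains(storedFleetHosts, setupFleetHost)

	return matchedFullURL || matchedHostOnly
}

func main() {
	// Post-policy-update layout: the stored host is a full URL.
	stored := []string{"https://fleet.example.com:8220"}

	fmt.Println(rawMatch(stored, "https://fleet.example.com:8220"))  // identical string
	fmt.Println(rawMatch(stored, "https://fleet.example.com:8220/")) // trailing slash only
}
```

With a full URL stored, the second call matches neither branch, which is the condition under which the quoted code returns `true` for "enrollment required".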

What is wrong

Equivalent Fleet endpoints can be treated as different due to non-canonical URL comparison (trailing slash, and similarly canonicalization-sensitive forms like explicit default ports).

Why it matters

In container mode, this decision is used at startup (runContainerCmd), so equivalent-but-differently-formatted Fleet URLs can repeatedly trigger re-enrollment on restarts. This causes credential churn and enrollment instability in normal operations (e.g., user-provided env var formatting drift), matching the re-enrollment loop bug class.

Suggested fix direction

Normalize both stored and setup Fleet URLs before comparison (parse + canonicalize scheme/host/port/path-slash) and compare canonical endpoint identity instead of raw strings. Keep protocol-change checks for pre-policy-update host-only layout.
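One possible shape for such a helper, as a sketch only. The names `canonicalFleetURL` and `sameFleetEndpoint` are hypothetical and do not exist in the repository as far as this report shows. The sketch assumes that "default port" means the scheme default (443 for https, 80 for http) and that a trailing slash on the path carries no meaning for a Fleet endpoint. Host-only values fail the absolute-URL check and fall back to raw comparison, so the existing host-only handling would stay responsible for them.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// canonicalFleetURL (hypothetical) reduces an absolute URL to
// scheme://host[:port][/path] with lowercase scheme and host, the scheme's
// default port dropped, and trailing slashes trimmed from the path.
func canonicalFleetURL(raw string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "", err
	}
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("not an absolute URL: %q", raw)
	}

	scheme := strings.ToLower(u.Scheme)
	host := strings.ToLower(u.Hostname())
	port := u.Port()

	// Assumption: only scheme-default ports are treated as implicit.
	if (scheme == "https" && port == "443") || (scheme == "http" && port == "80") {
		port = ""
	}

	hostport := host
	if port != "" {
		hostport = net.JoinHostPort(host, port)
	} else if strings.Contains(host, ":") {
		hostport = "[" + host + "]" // bare IPv6 literal needs brackets
	}

	return scheme + "://" + hostport + strings.TrimRight(u.Path, "/"), nil
}

// sameFleetEndpoint (hypothetical) compares canonical forms and falls back
// to raw string equality when either value is not an absolute URL.
func sameFleetEndpoint(a, b string) bool {
	ca, errA := canonicalFleetURL(a)
	cb, errB := canonicalFleetURL(b)
	if errA != nil || errB != nil {
		return a == b
	}
	return ca == cb
}

func main() {
	fmt.Println(sameFleetEndpoint("https://fleet.example.com:8220", "https://fleet.example.com:8220/"))
	fmt.Println(sameFleetEndpoint("https://fleet.example.com", "https://fleet.example.com:443"))
	fmt.Println(sameFleetEndpoint("http://fleet.example.com", "https://fleet.example.com"))
}
```

The scheme stays part of the canonical form, so an http-to-https change is still reported as a different endpoint rather than being normalized away.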


Suggested Actions

  • Add canonical URL normalization helper used by shouldFleetEnroll before host matching.
  • Add a regression test for stored (host1/redacted) vs setup (host1/redacted), expecting shouldFleetEnroll == false.
  • Add regression tests for default-port normalization ((host1/redacted) vs (host1/redacted), (host1/redacted) vs (host1/redacted)) in the post-policy-update full-URL layout.

Communication paths audited and found resilient

  • Check-in elapsed-time handling uses monotonic-safe time.Since in internal/pkg/fleetapi/checkin_cmd.go.
  • Unauthorized scheduler switch/reset behavior is covered by internal/pkg/agent/application/gateway/fleet/fleet_gateway_test.go:704-785.
  • Liveness failon=degraded|failed semantics are covered in internal/pkg/agent/application/monitoring/liveness.go and liveness_test.go.
  • Enrollment retry/backoff path in internal/pkg/agent/application/enroll/enroll.go handles transient network/server classes with backoff.

From workflow: Sweeper: Fleet Enrollment and Communication Resilience

  • expires on Jun 18, 2026, 10:43 AM UTC
