
Workload cluster upgrade e2e tests flake on new check for Cluster's Available condition #6068

@nojnhuh

Description


Which jobs are flaky:

https://storage.googleapis.com/k8s-triage/index.html?pr=1&text=Remote%20connection%20probe%20failed&job=azure.*workload-upgrade

e.g. https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-provider-azure-e2e-workload-upgrade-1-31-1-32-main/2014150140197081088

{Timed out after 300.000s.
Failed to verify Cluster Available condition for k8s-upgrade-and-conformance-rmrxf7/k8s-upgrade-and-conformance-1fx44p
The function passed to Eventually failed at /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.11.5/framework/cluster_helpers.go:457 with:
The Available condition on the Cluster should be set to true; message: * RemoteConnectionProbe: Remote connection probe failed, probe last succeeded at 2026-01-22T02:16:41Z
Expected
    <v1.ConditionStatus>: False
to equal
    <v1.ConditionStatus>: True failed [FAILED] Timed out after 300.000s.
Failed to verify Cluster Available condition for k8s-upgrade-and-conformance-rmrxf7/k8s-upgrade-and-conformance-1fx44p
The function passed to Eventually failed at /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.11.5/framework/cluster_helpers.go:457 with:
The Available condition on the Cluster should be set to true; message: * RemoteConnectionProbe: Remote connection probe failed, probe last succeeded at 2026-01-22T02:16:41Z
Expected
    <v1.ConditionStatus>: False
to equal
    <v1.ConditionStatus>: True
In [It] at: /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.11.5/framework/cluster_helpers.go:463 @ 01/22/26 02:22:33.833
}

Which tests are flaky:

Testgrid link:

Reason for failure (if possible):

This check was added to the CAPI e2e test framework in kubernetes-sigs/cluster-api#12111.

It's not yet clear whether there's a real problem in CAPZ or whether something just needs to be tuned for the tests. The --remote-connection-grace-period command-line argument to capi-controller-manager is the only relevant setting I think we can control; the timeouts in the test itself are hardcoded.
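
For context, the check added in kubernetes-sigs/cluster-api#12111 boils down to polling the Cluster object on the management cluster until its Available condition reports True. The following is a minimal sketch of that kind of check, not the framework's actual code from cluster_helpers.go: the kubeconfig path, namespace/name, poll interval, and the assumption that the Available condition is read from the cluster.x-k8s.io/v1beta2 status.conditions are all placeholders/assumptions for illustration.

```go
package e2e_test

import (
	"context"
	"fmt"
	"testing"
	"time"

	. "github.com/onsi/gomega"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// TestClusterAvailable polls the workload Cluster object on the management
// cluster and fails if its Available condition does not become True within
// the timeout. This mirrors the shape of the framework check, not its exact code.
func TestClusterAvailable(t *testing.T) {
	g := NewWithT(t)
	ctx := context.Background()

	// Hypothetical setup: a controller-runtime client against the management
	// cluster's kubeconfig (path is a placeholder).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/mgmt-kubeconfig")
	g.Expect(err).NotTo(HaveOccurred())
	mgmtClient, err := client.New(cfg, client.Options{})
	g.Expect(err).NotTo(HaveOccurred())

	// Placeholder identifiers for the Cluster created by the test.
	key := client.ObjectKey{
		Namespace: "k8s-upgrade-and-conformance-rmrxf7",
		Name:      "k8s-upgrade-and-conformance-1fx44p",
	}

	// The 300s timeout matches the framework's hardcoded value seen in the
	// failure above; the 10s poll interval is arbitrary.
	g.Eventually(func(g Gomega) {
		// Read the Cluster as unstructured; assumes the v1beta2 API, where the
		// Available condition is surfaced in status.conditions.
		u := &unstructured.Unstructured{}
		u.SetGroupVersionKind(schema.GroupVersionKind{
			Group: "cluster.x-k8s.io", Version: "v1beta2", Kind: "Cluster",
		})
		g.Expect(mgmtClient.Get(ctx, key, u)).To(Succeed())

		conditions, _, err := unstructured.NestedSlice(u.Object, "status", "conditions")
		g.Expect(err).NotTo(HaveOccurred())

		found := false
		for _, c := range conditions {
			cond, ok := c.(map[string]interface{})
			if !ok {
				continue
			}
			if cond["type"] == "Available" {
				found = true
				g.Expect(cond["status"]).To(Equal("True"),
					fmt.Sprintf("Available is not True: %v", cond["message"]))
			}
		}
		g.Expect(found).To(BeTrue(), "Available condition not found on Cluster")
	}, 5*time.Minute, 10*time.Second).Should(Succeed())
}
```

When the RemoteConnectionProbe message in the failure above is reported, this assertion keeps failing until the probe succeeds again or the 300s window runs out, which is why a transiently unreachable workload API server during the upgrade shows up as this flake.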

Anything else we need to know:

  • links to go.k8s.io/triage appreciated
  • links to specific failures in spyglass appreciated

/kind flake

