If any clusters are offline, bundle status can get stuck #594

@philomory

Description

If any clusters are offline/unavailable, the status of Bundles that get deployed to those clusters can get stuck with misleading/confusing error messages.

Steps to reproduce:

  1. Create a Fleet workspace and add at least two clusters. Make one of them easy to start and stop, e.g. a single-node k3s cluster. We'll call that cluster "A".
  2. Create a ClusterGroup that contains at least two of your clusters, including cluster A. We'll call this group "G".
  3. Create a git repository containing the following code:
    # test-bundle/fleet.yaml
    defaultNamespace: test-bundle
    helm:
      chart: ./chart
      releaseName: test-bundle
    
    # test-bundle/chart/Chart.yaml
    apiVersion: v2
    name: test-bundle
    version: 0.0.1
    
    # test-bundle/chart/templates/deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: test
    spec:
      selector:
        matchLabels:
          app: test
      template:
        metadata:
          labels:
            app: test
        spec:
          containers:
            - name: test
              image: paulbouwer/hello-kubernetes:1.10.1
              lifecycle:
                preStop:
                  exec:
                    command: ["sleep", "2"]
  4. Create a GitRepo object pointing to this repository that applies to ClusterGroup "G". We'll call this GitRepo "R".
  5. Let Fleet deploy the bundle from R to all clusters in G. Wait for them all to be Ready.
  6. Push the following change (which introduces an error) to the git repository R:
    --- a/test-bundle/chart/templates/deployment.yaml
    +++ b/test-bundle/chart/templates/deployment.yaml
    @@ -20,6 +20,6 @@ spec:
             - name: test
               image: paulbouwer/hello-kubernetes:1.10.1
               lifecycle:
    -            preStop:
    +            preStart:
                   exec:
                     command: ["sleep", "2"]
    
  7. Wait for Fleet to attempt to deploy this change. The GitRepo, Bundle, and all relevant BundleDeployments should end up in the state ErrApplied, with an error message similar to: error validating "": error validating data: ValidationError(Deployment.spec.template.spec.containers[0].lifecycle): unknown field "preStart" in io.k8s.api.core.v1.Lifecycle
  8. Shut off cluster A (stop the only node in the cluster).
  9. Wait until Cluster A shows as offline under Cluster Management in Rancher.
  10. Push the following change (which neither fixes the error nor introduces new errors) to the git repository for GitRepo R:
    --- a/test-bundle/chart/templates/deployment.yaml
    +++ b/test-bundle/chart/templates/deployment.yaml
    @@ -18,7 +18,7 @@ spec:
         spec:
           containers:
             - name: test
    -          image: paulbouwer/hello-kubernetes:1.10.1
    +          image: paulbouwer/hello-kubernetes:1.10
               lifecycle:
                 preStart:
                   exec:
    
  11. Wait until the GitRepo UI in Fleet shows that it has picked up the new commit. The GitRepo, Bundle, and all relevant BundleDeployments will still be in the same error state as before.
  12. Push the following change (which fixes the initial error) to the git repository for GitRepo R:
    --- a/test-bundle/chart/templates/deployment.yaml
    +++ b/test-bundle/chart/templates/deployment.yaml
    @@ -20,6 +20,6 @@ spec:
             - name: test
               image: paulbouwer/hello-kubernetes:1.10
               lifecycle:
    -            preStart:
    +            preStop:
                   exec:
                     command: ["sleep", "2"]
    
  13. Wait for the GitRepo UI in Fleet to show that it has picked up the new commit. Shortly afterwards, all BundleDeployments except the one belonging to A will update to show that they are in a "Ready" state. However, the BundleDeployment for A will remain in the original error state, as will the Bundle and GitRepo.
  14. Observe that, in Fleet's UI, GitRepo R still displays an error message similar to Error validating "": error validating data: ValidationError(Deployment.spec.template.spec.containers[0].lifecycle): unknown field "preStart" in io.k8s.api.core.v1.Lifecycle, even though the actual repository no longer contains any reference to a preStart field.

It is worth noting that if step 10 is skipped, so that the commit in step 12 (which fixes the error) is the first commit to the repo after cluster A goes offline, then after step 12 the BundleDeployment for A will go to a "Wait Applied" state rather than being stuck in the error state.
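
For anyone reproducing this, the ClusterGroup "G" from step 2 and the GitRepo "R" from step 4 could look roughly like the sketch below. Everything specific in it is an assumption rather than something taken from the original setup: the `fleet-default` namespace, the `env: test` label used to select clusters, the `main` branch, and the placeholder repository URL all need to be replaced with your own values.

```yaml
# Hypothetical ClusterGroup "G": selects clusters by a label you assign yourself.
apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterGroup
metadata:
  name: g
  namespace: fleet-default        # assumed workspace namespace
spec:
  selector:
    matchLabels:
      env: test                   # assumed label; cluster A and at least one other cluster must carry it
---
# Hypothetical GitRepo "R": deploys the test-bundle directory to every cluster in G.
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: r
  namespace: fleet-default        # must be the same workspace as the ClusterGroup
spec:
  repo: https://example.com/your-org/your-repo   # placeholder URL
  branch: main                    # assumed branch name
  paths:
    - test-bundle
  targets:
    - name: g
      clusterGroup: g
```

Targeting the group by name keeps the reproduction independent of individual cluster names, so only cluster A's availability changes between steps.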

Metadata

Projects

Status

📋 Backlog
