
[bug] Fresh installation fails with CrashLoopBackOff due to race conditions during startup #12702

@DayyanALi

Description


Environment

  • How do you deploy Kubeflow Pipelines (KFP)?
    Local cluster using kind

  • KFP version:
    v2.5.0 (also reproduced on the master branch)

  • KFP SDK version:
    kfp 2.15.2
    kfp-pipeline-spec 2.15.2
    kfp-server-api 2.15.2

  • Kubernetes: kind cluster v1.31.0 (Docker Desktop with WSL2 backend on Windows 11)

Steps to reproduce

Deploy Kubeflow Pipelines

export PIPELINE_VERSION=2.5.0

kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/dev?ref=$PIPELINE_VERSION" 

Observe that multiple pods enter CrashLoopBackOff and accumulate many restarts before eventually recovering.

(Screenshot: pod status showing CrashLoopBackOff and restart counts)
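
The restarts can be observed with standard kubectl commands while the stack converges (this assumes the default kubeflow namespace used by these manifests):

# Watch pod status and restart counts
kubectl get pods -n kubeflow -w

# Inspect why a given pod crashed (substitute an actual pod name)
kubectl describe pod <pod-name> -n kubeflow
kubectl logs <pod-name> -n kubeflow --previous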

Expected result

All pods should start successfully on the first attempt without entering CrashLoopBackOff.

Materials and reference

I found similar existing issues, such as #6277, #11355, and #12034.

Root cause analysis:

Race conditions: Components like metadata-grpc start before MySQL is fully ready to accept connections. Similarly, metadata-writer starts before metadata-grpc, and ml-pipeline-persistenceagent / ml-pipeline-scheduledworkflow start before the ml-pipeline API server.

Lack of explicit dependency ordering: The manifests do not enforce startup dependencies, which allows components to start before their required services are ready.
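
For illustration only, the missing ordering could be expressed as a wait-style init container added via a kustomize patch. The sketch below assumes the metadata gRPC Deployment is named metadata-grpc-deployment and that MySQL is reachable as the mysql Service on port 3306 (as in the default manifests); the busybox image and the nc wait loop are my own placeholder choices, not the project's fix:

# Hypothetical kustomize strategic-merge patch (sketch): gate metadata-grpc on MySQL reachability
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metadata-grpc-deployment
spec:
  template:
    spec:
      initContainers:
        # Block startup until a TCP connection to the mysql Service succeeds
        - name: wait-for-mysql
          image: busybox:1.36
          command:
            - sh
            - -c
            - until nc -w 2 mysql 3306 </dev/null; do echo "waiting for mysql"; sleep 2; done

Similar gates would apply to the other pairs called out above (metadata-writer on metadata-grpc, persistenceagent and scheduledworkflow on the ml-pipeline API server).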


Impacted by this bug? Give it a 👍.
