What's the issue?
If you launch a pod with init containers via PipesK8sClient.run and the pod encounters an error in an init container, Pipes will wait indefinitely (or rather, until the default one-day timeout) for the pod to become ready, even though it never will be.
What did you expect to happen?
I expect Dagster to raise a DagsterK8sError and fail the job.
How to reproduce?
1. Create a pod spec with an init_container that exits 1.
2. Launch that pod via PipesK8sClient.run.
3. When the pod reaches status Init:Error, observe that the Dagster run continues without detecting the error.
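A minimal sketch of an asset that reproduces this, assuming a PipesK8sClient resource bound under the key k8s_pipes_client, a hypothetical image name, and that base_pod_spec accepts snake_case pod-spec keys such as init_containers:

```python
from dagster import AssetExecutionContext, Definitions, asset
from dagster_k8s import PipesK8sClient


@asset
def failing_init_asset(context: AssetExecutionContext, k8s_pipes_client: PipesK8sClient):
    # base_pod_spec is merged into the spec of the pod that Pipes launches.
    # The init container below exits 1, which puts the pod into Init:Error;
    # the main container never starts, so Pipes waits until its timeout.
    return k8s_pipes_client.run(
        context=context,
        image="my-pipes-image:latest",  # hypothetical image running dagster-pipes
        base_pod_spec={
            "init_containers": [
                {
                    "name": "failing-init",
                    "image": "busybox",
                    "command": ["sh", "-c", "exit 1"],
                }
            ],
        },
    ).get_materialize_result()


defs = Definitions(
    assets=[failing_init_asset],
    resources={"k8s_pipes_client": PipesK8sClient()},
)
```

With this asset, the pod enters Init:Error almost immediately, but the run keeps polling until the timeout.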
Dagster version
1.10.11
Deployment type
Dagster Helm chart
Deployment details
No response
Additional information
In PR #24313 I added a lot of error-handling code for K8s pods on the Pipes side. It handles recoverable errors by waiting for a retry, and raises DagsterK8sError for unrecoverable errors such as RunContainerError, ErrImagePull, etc.
However, if the init container fails and the pod goes into Init:Error, Dagster does not pick up on that unrecoverable error and does not kill the pod.
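For reference, the failure is visible on the pod itself: with the official Kubernetes Python client, a failed init container shows up under init_container_statuses (separate from container_statuses) as a terminated state with a non-zero exit code, which kubectl renders as the pod status Init:Error. A small sketch, with hypothetical pod and namespace names:

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Hypothetical pod and namespace names.
pod = core.read_namespaced_pod(name="my-pipes-pod", namespace="default")

# Failed init containers live in a separate status list from the app containers.
for status in pod.status.init_container_statuses or []:
    terminated = status.state.terminated
    if terminated is not None and terminated.exit_code != 0:
        # kubectl shows this pod as "Init:Error".
        print(
            f"init container {status.name} failed: "
            f"reason={terminated.reason}, exit_code={terminated.exit_code}"
        )
```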
Kubernetes docs do not do a good job of publishing all possible container statuses, which is a bit annoying. I can't find a comprehensive list anywhere. The current (unrecoverable) list we watch for is this:
```python
elif state.waiting.reason in [
    KubernetesWaitingReasons.ErrImagePull,
    KubernetesWaitingReasons.ImagePullBackOff,
    KubernetesWaitingReasons.CrashLoopBackOff,
    KubernetesWaitingReasons.RunContainerError,
]:
```
According to the docs, we may simply need to add Init:Error to this list.
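For illustration, a sketch of what that change might amount to. Note that Init:Error is the pod status kubectl renders when an init container fails, so whether it ever appears verbatim as a container waiting reason needs verification; the fix may instead have to inspect init container statuses, as in the sketch above.

```python
# Internal module path shown for illustration only.
from dagster_k8s.client import KubernetesWaitingReasons

# Proposed extension of the unrecoverable-reason list from the snippet above.
# "Init:Error" is a raw string here because dagster_k8s has no
# KubernetesWaitingReasons constant for it today; a real fix would add one.
UNRECOVERABLE_WAITING_REASONS = [
    KubernetesWaitingReasons.ErrImagePull,
    KubernetesWaitingReasons.ImagePullBackOff,
    KubernetesWaitingReasons.CrashLoopBackOff,
    KubernetesWaitingReasons.RunContainerError,
    "Init:Error",  # proposed addition from this issue
]
```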
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.