Description
Contact Details [Optional]
No response
System Information
Your ZenML client version (0.75.1) does not match the server version (0.75.0). This version mismatch might lead to errors or unexpected behavior.
To disable this warning message, set the environment variable ZENML_DISABLE_CLIENT_SERVER_MISMATCH_WARNING=True
ZENML_LOCAL_VERSION: 0.75.1
ZENML_SERVER_VERSION: 0.75.0
ZENML_SERVER_DATABASE: sqlite
ZENML_SERVER_DEPLOYMENT_TYPE: kubernetes
ZENML_CONFIG_DIR: /home/mathisz/.config/zenml
ZENML_LOCAL_STORE_DIR: /home/mathisz/.config/zenml/local_stores
ZENML_SERVER_URL: http://10.96.30.187:80
ZENML_ACTIVE_REPOSITORY_ROOT: None
PYTHON_VERSION: 3.12.7
ENVIRONMENT: native
SYSTEM_INFO: {'os': 'linux', 'linux_distro': 'ubuntu', 'linux_distro_like': 'debian', 'linux_distro_version': '24.10'}
ACTIVE_WORKSPACE: default
ACTIVE_STACK: minikube_stack
ACTIVE_USER: admin
TELEMETRY_STATUS: enabled
ANALYTICS_CLIENT_ID: c4456867-5677-40c9-be45-a7da71776e50
ANALYTICS_USER_ID: 4be157a6-5bd1-4012-a096-9268af7da1ae
ANALYTICS_SERVER_ID: 98c512f8-d132-4ec1-9304-8420cec7a407
INTEGRATIONS: ['airflow', 'bitbucket', 'kaniko', 'kubernetes', 'pigeon', 's3']
PACKAGES: {'gitpython': '3.1.44', 'mako': '1.3.9', 'markupsafe': '3.0.2', 'pymysql': '1.1.1', 'pyyaml': '6.0.2', 'sqlalchemy-utils': '0.41.2', 'aiobotocore': '2.21.1',
'aiohappyeyeballs': '2.6.1', 'aiohttp': '3.11.14', 'aioitertools': '0.12.0', 'aiosignal': '1.3.2', 'alembic': '1.8.1', 'annotated-types': '0.7.0', 'argparse': '1.4.0',
'asttokens': '3.0.0', 'attrs': '25.3.0', 'aws-profile-manager': '0.7.3', 'bcrypt': '4.0.1', 'boto3': '1.37.1', 'botocore': '1.37.1', 'cachetools': '5.5.2', 'certifi':
'2025.1.31', 'charset-normalizer': '3.4.1', 'click': '8.1.3', 'cloudpickle': '2.2.1', 'comm': '0.2.2', 'configparser': '7.2.0', 'decorator': '5.2.1', 'distro': '1.9.0',
'docker': '7.1.0', 'executing': '2.2.0', 'frozenlist': '1.5.0', 'fsspec': '2025.3.0', 'gitdb': '4.0.12', 'google-auth': '2.38.0', 'greenlet': '3.1.1', 'idna': '3.10',
'ipython': '9.0.2', 'ipython-pygments-lexers': '1.1.1', 'ipywidgets': '8.1.5', 'jedi': '0.19.2', 'jmespath': '1.0.1', 'jupyterlab-widgets': '3.0.13', 'kubernetes':
'25.3.0', 'markdown-it-py': '3.0.0', 'matplotlib-inline': '0.1.7', 'mdurl': '0.1.2', 'multidict': '6.2.0', 'oauthlib': '3.2.2', 'packaging': '24.2', 'parso': '0.8.4',
'passlib': '1.7.4', 'pexpect': '4.9.0', 'pip': '24.2', 'prompt-toolkit': '3.0.50', 'propcache': '0.3.0', 'psutil': '7.0.0', 'ptyprocess': '0.7.0', 'pure-eval': '0.2.3',
'pyasn1': '0.6.1', 'pyasn1-modules': '0.4.1', 'pydantic': '2.8.2', 'pydantic-core': '2.20.1', 'pydantic-settings': '2.8.1', 'pygments': '2.19.1', 'python-dateutil':
'2.9.0.post0', 'python-dotenv': '1.0.1', 'requests': '2.32.3', 'requests-oauthlib': '2.0.0', 'rich': '13.9.4', 'rsa': '4.9', 's3fs': '2025.3.0', 's3transfer': '0.11.3',
'setuptools': '76.1.0', 'six': '1.17.0', 'smmap': '5.0.2', 'sqlalchemy': '2.0.39', 'sqlmodel': '0.0.18', 'stack-data': '0.6.3', 'traitlets': '5.14.3',
'typing-extensions': '4.12.2', 'urllib3': '2.3.0', 'wcwidth': '0.2.13', 'websocket-client': '1.8.0', 'widgetsnbextension': '4.0.13', 'wrapt': '1.17.2', 'yarl': '1.18.3',
'zenml': '0.75.1', 'autocommand': '2.2.2', 'backports.tarfile': '1.2.0', 'importlib-metadata': '8.0.0', 'inflect': '7.3.1', 'jaraco.collections': '5.1.0',
'jaraco.context': '5.3.0', 'jaraco.functools': '4.0.1', 'jaraco.text': '3.12.1', 'more-itertools': '10.3.0', 'platformdirs': '4.2.2', 'tomli': '2.0.1', 'typeguard':
'4.3.0', 'wheel': '0.43.0', 'zipp': '3.19.2'}
CURRENT STACK
Name: minikube_stack
ID: 885dc2e7-fb0e-4fdc-bf61-715e4c113f67
User: admin / 4be157a6-5bd1-4012-a096-9268af7da1ae
Workspace: default / 7290ef1a-3c21-46a2-b51c-7120a37a8560
ARTIFACT_STORE: minio_store
Name: minio_store
ID: 907c406d-e972-4480-baac-d780d1914afc
Type: artifact_store
Flavor: s3
Configuration: {'authentication_secret': 'minio_access_secrets', 'path': 's3://zenml-artifacts-bucket', 'client_kwargs': {'endpoint_url': 'http://10.106.179.99:9000',
'region_name': 'us-east-1'}}
User: admin / 4be157a6-5bd1-4012-a096-9268af7da1ae
Workspace: default / 7290ef1a-3c21-46a2-b51c-7120a37a8560
ORCHESTRATOR: minikube_orc
Name: minikube_orc
ID: a12f7c32-bafb-4069-95a6-2ce60a68d8c7
Type: orchestrator
Flavor: kubernetes
Configuration: {'synchronous': True, 'kubernetes_context': 'minikube', 'kubernetes_namespace': 'zenml'}
User: admin / 4be157a6-5bd1-4012-a096-9268af7da1ae
Workspace: default / 7290ef1a-3c21-46a2-b51c-7120a37a8560
CONTAINER_REGISTRY: minikube_container_registry
Name: minikube_container_registry
ID: c0aef8ee-5370-449e-ade6-962b7eb0fabf
Type: container_registry
Flavor: default
Configuration: {'uri': '10.109.25.75:80'}
User: admin / 4be157a6-5bd1-4012-a096-9268af7da1ae
Workspace: default / 7290ef1a-3c21-46a2-b51c-7120a37a8560
What happened?
To test what would happen if this occurred in production, I scheduled steps that exhaust the memory of the Kubernetes cluster. As a result, some of the pipeline steps finished correctly, while others remain pending forever. The pods of the pending steps have the status "OOMKilled", but it does not look like they will ever be retried.
I use minikube for local testing. My cluster has about 14 GB of memory. This is the pipeline code I used:
import time

from zenml import pipeline, step


@step
def wait():
    # Allocate 4 GiB and hold it for 10 seconds.
    _allocated_space = bytearray(4 * 1024**3)
    time.sleep(10)


@pipeline(enable_cache=False)
def simple_ml_pipeline():
    # Ten steps with no dependencies between them.
    for _ in range(10):
        wait()


if __name__ == "__main__":
    simple_ml_pipeline()
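As a rough sanity check of why this workload exhausts the cluster, the arithmetic can be sketched as follows. It assumes that "about 14 GB" means 14 × 10^9 bytes and that all ten steps hold their allocation at the same time, which the log output suggests but which is not guaranteed. The variable names are illustrative only, and the estimate ignores the memory used by Kubernetes itself and by the rest of each pod.

```python
# Back-of-the-envelope memory estimate for the pipeline above.
GIB = 1024**3

step_bytes = 4 * GIB          # size of the bytearray allocated in each step
num_steps = 10                # number of wait() invocations in the pipeline
cluster_bytes = 14 * 10**9    # assumption: "about 14 GB" read as decimal gigabytes

# Total memory needed if every step holds its allocation simultaneously.
total_requested = step_bytes * num_steps

# Upper bound on how many steps can hold their allocation at once.
max_concurrent = cluster_bytes // step_bytes

print(f"Total requested: {total_requested / GIB:.0f} GiB")
print(f"Cluster memory:  {cluster_bytes / GIB:.1f} GiB")
print(f"Steps that fit concurrently (upper bound): {max_concurrent}")
```

Under these assumptions, only a few of the ten steps can fit at any one time, so some pods are expected to be killed.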
Reproduction steps
- Set up a minikube cluster, including S3 storage etc.
- Run the pipeline code I provided above
- See the pipeline status in the dashboard
Relevant log output
Initiating a new run for the pipeline: simple_ml_pipeline.
Reusing existing build 0eb6867e-85d9-4ec0-bd3d-a3d8ad4070e2 for stack minikube_stack.
Archiving pipeline code directory: /home/mathisz/Documents/zenml/code. If this is taking longer than you expected, make sure your source root is set correctly by running zenml init, and that it does not contain unnecessarily huge files.
Uploading code to s3://zenml-artifacts-bucket/code_uploads/4597a02447001fe6d390c4b9489e95cce10cc605.tar.gz (Size: 11.86 KiB).
Code upload finished.
Caching is disabled by default for simple_ml_pipeline.
Using a build:
Image(s): 10.109.25.75:80/zenml@sha256:0c30bd89534b31ef98c5c7afbea504bc44dcf742645875513571ba42780669d0
Using user: admin
Using stack: minikube_stack
artifact_store: minio_store
orchestrator: minikube_orc
container_registry: minikube_container_registry
Dashboard URL for Pipeline Run: http://10.96.30.187:80/runs/0971ca8e-e57b-4057-9de8-3e54a72168ae
Waiting for Kubernetes orchestrator pod...
Kubernetes orchestrator pod started.
Waiting for pod of step wait_3 to start...
Waiting for pod of step wait_8 to start...
Waiting for pod of step wait_7 to start...
Waiting for pod of step wait_4 to start...
Waiting for pod of step wait_2 to start...
Waiting for pod of step wait_9 to start...
Waiting for pod of step wait_6 to start...
Waiting for pod of step wait_10 to start...
Waiting for pod of step wait to start...
Waiting for pod of step wait_5 to start...
<frozen runpy>:128: RuntimeWarning: 'zenml.entrypoints.entrypoint' found in sys.modules after import of package 'zenml.entrypoints', but prior to execution of 'zenml.entrypoints.entrypoint'; this may result in unpredictable behaviour
<frozen runpy>:128: RuntimeWarning: 'zenml.entrypoints.entrypoint' found in sys.modules after import of package 'zenml.entrypoints', but prior to execution of 'zenml.entrypoints.entrypoint'; this may result in unpredictable behaviour
<frozen runpy>:128: RuntimeWarning: 'zenml.entrypoints.entrypoint' found in sys.modules after import of package 'zenml.entrypoints', but prior to execution of 'zenml.entrypoints.entrypoint'; this may result in unpredictable behaviour
<frozen runpy>:128: RuntimeWarning: 'zenml.entrypoints.entrypoint' found in sys.modules after import of package 'zenml.entrypoints', but prior to execution of 'zenml.entrypoints.entrypoint'; this may result in unpredictable behaviour
<frozen runpy>:128: RuntimeWarning: 'zenml.entrypoints.entrypoint' found in sys.modules after import of package 'zenml.entrypoints', but prior to execution of 'zenml.entrypoints.entrypoint'; this may result in unpredictable behaviour
<frozen runpy>:128: RuntimeWarning: 'zenml.entrypoints.entrypoint' found in sys.modules after import of package 'zenml.entrypoints', but prior to execution of 'zenml.entrypoints.entrypoint'; this may result in unpredictable behaviour
<frozen runpy>:128: RuntimeWarning: 'zenml.entrypoints.entrypoint' found in sys.modules after import of package 'zenml.entrypoints', but prior to execution of 'zenml.entrypoints.entrypoint'; this may result in unpredictable behaviour
<frozen runpy>:128: RuntimeWarning: 'zenml.entrypoints.entrypoint' found in sys.modules after import of package 'zenml.entrypoints', but prior to execution of 'zenml.entrypoints.entrypoint'; this may result in unpredictable behaviour
<frozen runpy>:128: RuntimeWarning: 'zenml.entrypoints.entrypoint' found in sys.modules after import of package 'zenml.entrypoints', but prior to execution of 'zenml.entrypoints.entrypoint'; this may result in unpredictable behaviour
<frozen runpy>:128: RuntimeWarning: 'zenml.entrypoints.entrypoint' found in sys.modules after import of package 'zenml.entrypoints', but prior to execution of 'zenml.entrypoints.entrypoint'; this may result in unpredictable behaviour
Downloading code from artifact store path s3://zenml-artifacts-bucket/code_uploads/4597a02447001fe6d390c4b9489e95cce10cc605.tar.gz.
Downloading code from artifact store path s3://zenml-artifacts-bucket/code_uploads/4597a02447001fe6d390c4b9489e95cce10cc605.tar.gz.
Code download finished.
Downloading code from artifact store path s3://zenml-artifacts-bucket/code_uploads/4597a02447001fe6d390c4b9489e95cce10cc605.tar.gz.
Code download finished.
Downloading code from artifact store path s3://zenml-artifacts-bucket/code_uploads/4597a02447001fe6d390c4b9489e95cce10cc605.tar.gz.
Code download finished.
Downloading code from artifact store path s3://zenml-artifacts-bucket/code_uploads/4597a02447001fe6d390c4b9489e95cce10cc605.tar.gz.
Code download finished.
Downloading code from artifact store path s3://zenml-artifacts-bucket/code_uploads/4597a02447001fe6d390c4b9489e95cce10cc605.tar.gz.
Downloading code from artifact store path s3://zenml-artifacts-bucket/code_uploads/4597a02447001fe6d390c4b9489e95cce10cc605.tar.gz.
Downloading code from artifact store path s3://zenml-artifacts-bucket/code_uploads/4597a02447001fe6d390c4b9489e95cce10cc605.tar.gz.
Code download finished.
Downloading code from artifact store path s3://zenml-artifacts-bucket/code_uploads/4597a02447001fe6d390c4b9489e95cce10cc605.tar.gz.
Code download finished.
Downloading code from artifact store path s3://zenml-artifacts-bucket/code_uploads/4597a02447001fe6d390c4b9489e95cce10cc605.tar.gz.
Code download finished.
Code download finished.
Code download finished.
Code download finished.
Step wait_2 has started.
Step wait_10 has started.
Step wait_7 has started.
Step wait_6 has started.
Step wait_8 has started.
Step wait_9 has started.
Step wait_3 has started.
Step wait_4 has started.
Step wait_5 has started.
Step wait has started.
Exception in thread Thread-5 (_run_node):
Traceback (most recent call last):
File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.12/threading.py", line 1012, in run
self._target(*self._args, **self._kwargs)
File "/opt/venv/lib/python3.12/site-packages/zenml/orchestrators/dag_runner.py", line 126, in _run_node
self.run_fn(node)
File "/opt/venv/lib/python3.12/site-packages/zenml/integrations/kubernetes/orchestrators/kubernetes_orchestrator_entrypoint.py", line 155, in run_step_on_kubernetes
kube_utils.wait_pod(
File "/opt/venv/lib/python3.12/site-packages/zenml/integrations/kubernetes/orchestrators/kube_utils.py", line 284, in wait_pod
raise RuntimeError(f"Pod {namespace}:{pod_name} failed.")
RuntimeError: Pod zenml:simple-ml-pipeline-805804a073925f65ce6cc758b06159cb-wait-4 failed.
Exception in thread Thread-8 (_run_node):
Traceback (most recent call last):
File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.12/threading.py", line 1012, in run
Exception in thread Thread-9 (_run_node):
Traceback (most recent call last):
File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self._target(*self._args, **self._kwargs)
File "/opt/venv/lib/python3.12/site-packages/zenml/orchestrators/dag_runner.py", line 126, in _run_node
self.run_fn(node)
File "/opt/venv/lib/python3.12/site-packages/zenml/integrations/kubernetes/orchestrators/kubernetes_orchestrator_entrypoint.py", line 155, in run_step_on_kubernetes
kube_utils.wait_pod(
File "/opt/venv/lib/python3.12/site-packages/zenml/integrations/kubernetes/orchestrators/kube_utils.py", line 284, in wait_pod
self.run()
File "/usr/local/lib/python3.12/threading.py", line 1012, in run
self._target(*self._args, **self._kwargs)
File "/opt/venv/lib/python3.12/site-packages/zenml/orchestrators/dag_runner.py", line 126, in _run_node
raise RuntimeError(f"Pod {namespace}:{pod_name} failed.")
RuntimeError: Pod zenml:simple-ml-pipeline-805804a073925f65ce6cc758b06159cb-wait-7 failed.
self.run_fn(node)
File "/opt/venv/lib/python3.12/site-packages/zenml/integrations/kubernetes/orchestrators/kubernetes_orchestrator_entrypoint.py", line 155, in run_step_on_kubernetes
kube_utils.wait_pod(
File "/opt/venv/lib/python3.12/site-packages/zenml/integrations/kubernetes/orchestrators/kube_utils.py", line 284, in wait_pod
raise RuntimeError(f"Pod {namespace}:{pod_name} failed.")
RuntimeError: Pod zenml:simple-ml-pipeline-805804a073925f65ce6cc758b06159cb-wait-8 failed.
Exception in thread Thread-2 (_run_node):
Traceback (most recent call last):
File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.12/threading.py", line 1012, in run
self._target(*self._args, **self._kwargs)
File "/opt/venv/lib/python3.12/site-packages/zenml/orchestrators/dag_runner.py", line 126, in _run_node
self.run_fn(node)
File "/opt/venv/lib/python3.12/site-packages/zenml/integrations/kubernetes/orchestrators/kubernetes_orchestrator_entrypoint.py", line 155, in run_step_on_kubernetes
kube_utils.wait_pod(
File "/opt/venv/lib/python3.12/site-packages/zenml/integrations/kubernetes/orchestrators/kube_utils.py", line 284, in wait_pod
raise RuntimeError(f"Pod {namespace}:{pod_name} failed.")
RuntimeError: Pod zenml:simple-ml-pipeline-805804a073925f65ce6cc758b06159cb-wait-10 failed.
Exception in thread Thread-7 (_run_node):
Traceback (most recent call last):
File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.12/threading.py", line 1012, in run
self._target(*self._args, **self._kwargs)
File "/opt/venv/lib/python3.12/site-packages/zenml/orchestrators/dag_runner.py", line 126, in _run_node
self.run_fn(node)
File "/opt/venv/lib/python3.12/site-packages/zenml/integrations/kubernetes/orchestrators/kubernetes_orchestrator_entrypoint.py", line 155, in run_step_on_kubernetes
kube_utils.wait_pod(
File "/opt/venv/lib/python3.12/site-packages/zenml/integrations/kubernetes/orchestrators/kube_utils.py", line 284, in wait_pod
raise RuntimeError(f"Pod {namespace}:{pod_name} failed.")
RuntimeError: Pod zenml:simple-ml-pipeline-805804a073925f65ce6cc758b06159cb-wait-6 failed.
Step wait_3 has finished in 14.328s.
Step wait_9 has finished in 16.941s.
Step wait_5 has finished in 14.976s.
Step wait_2 has finished in 17.543s.
Step wait has finished in 11.359s.
Pod of step wait_9 completed.
Pod of step wait_3 completed.
Pod of step wait_5 completed.
Pod of step wait_2 completed.
Pod of step wait completed.
Orchestration pod completed.
Code of Conduct
- I agree to follow this project's Code of Conduct