You're very close: the root cause is now visible from your diagnostics.
Your container has CUDA 11.8, but inside Kubernetes your Spark executor cannot see the host's NVIDIA driver libraries, which produces:


```
cudaErrorInsufficientDriver
```

Even though the node has driver 580.105.08 (CUDA 13.0), the executor pod fails to load libcuda.so from the host.
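You can confirm this directly from the executor. A quick sketch, assuming a hypothetical pod name `spark-exec-1` (substitute your actual executor pod):

```shell
# Hypothetical executor pod name; substitute your own.
POD=spark-exec-1

# libcuda.so.1 must be resolvable inside the container's linker cache:
kubectl exec "$POD" -- ldconfig -p | grep libcuda

# If nvidia-smi fails here, the host driver was never injected into
# the container, which is exactly what yields cudaErrorInsufficientDriver.
kubectl exec "$POD" -- nvidia-smi
```

If the `grep` comes back empty while `nvidia-smi` works fine on the node itself, the problem is on the injection path (device plugin / container runtime), not in your image.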
The error in your nvidia-device-plugin pod confirms this:

```
failed to inject CDI devices: unresolvable CDI devices runtime.nvidia.com/gpu=all
unknown
```

This means the device plugin is not correctly injecting the GPU devices, driver libraries, and CDI spec into your pods.
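The `unresolvable CDI devices` error usually means the CDI spec on the node is missing or stale. A minimal sketch of regenerating and checking it with the NVIDIA Container Toolkit, run on the GPU node (paths assume a default install):

```shell
# Regenerate the CDI spec on the GPU node so that
# runtime.nvidia.com/gpu=all resolves again.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Verify the spec now enumerates the devices the plugin
# reported as unresolvable.
nvidia-ctk cdi list
```

After regenerating the spec, restart the nvidia-device-plugin pods so they pick it up.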


**What you need to fix**

Below is a checklist of what to adjust.


1. Ensure your contai…
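The checklist above is truncated, but independent of its exact items, the executor pod must explicitly request the GPU resource so the device plugin injects the devices and driver libraries at all. A minimal sketch of the relevant part of a pod spec (container name is illustrative):

```yaml
# Minimal sketch: without this resource request, the device plugin
# never injects the GPU or the driver libraries into the container.
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: spark-executor
      resources:
        limits:
          nvidia.com/gpu: 1
```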

Answer selected by sameerz