Can't make it work on kubernetes #13851
I have a Kubernetes cluster with two nodes: a control plane node (no GPU) and a worker node with 4 Tesla P100s. When I go directly to the worker node and submit my job on the server itself, it runs fine and uses the GPUs. The configuration is below:

However, when I try to submit the job via the control plane node:

I am getting:

I have literally tried every possible version to build my container and I still get the same error. Here is an example of my Dockerfile:

My worker node:

Also, I can see that the GPU is exposed to the cluster:

Could you advise me what the issue is here, and how to debug exactly what is mismatched?
The issue is caused by a CUDA version mismatch between your Kubernetes executor container, the Spark RAPIDS plugin, and the GPU driver on your node. Your worker node reports:
Your container is built from:
But the RAPIDS plugin you are using:

When running directly on the node, Spark picks up the host's CUDA libraries, so everything works. The real fix is to rebuild the Spark executor image from a CUDA 11.8 base image, which matches the RAPIDS 23.10 release:

Rebuild the image and redeploy the executors; the GPU runtime will then match both RAPIDS and the host driver, resolving the error.
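As a sketch of such a rebuild, a matching executor image could look like the following. The base image tag, Java version, file paths, and RAPIDS jar name are assumptions to adapt to your actual Spark distribution:

```dockerfile
# Hypothetical executor image: CUDA 11.8 runtime to match the RAPIDS 23.10 plugin.
# Adjust the tag and paths to whatever your cluster actually runs.
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

# A JRE for the Spark executor (assumption: Spark 3.x on Java 17).
RUN apt-get update && apt-get install -y --no-install-recommends \
        openjdk-17-jre-headless && \
    rm -rf /var/lib/apt/lists/*

# Copy your existing Spark distribution and the RAPIDS plugin jar
# (placeholder paths and jar version; match your own layout).
COPY spark/ /opt/spark/
COPY rapids-4-spark_2.12-23.10.0.jar /opt/spark/jars/

ENV SPARK_HOME=/opt/spark
ENTRYPOINT ["/opt/spark/entrypoint.sh"]
```

The key point is only the `FROM` line: the CUDA userspace libraries baked into the image must not be newer than what the RAPIDS plugin was built against.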
You're very close — the root cause is now visible from your diagnostics.
Your container correctly has CUDA 11.8, but inside Kubernetes your Spark executor is not seeing the host's NVIDIA driver libraries, leading to:
Even though the node has driver 580.105.08 (CUDA 13.0), the executor pod fails to load `libcuda.so` from the host. The error in your `nvidia-device-plugin` pod confirms this:

This means the device plugin is not correctly passing the GPUs, driver libraries, or CDI spec to your pods.
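Before touching Spark at all, it can help to confirm that GPU scheduling works with a bare pod. This is a sketch; the pod name and image tag are illustrative, not from the thread:

```yaml
# Minimal GPU smoke-test pod (names/tags are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # resource advertised by the nvidia-device-plugin
```

If `kubectl logs gpu-smoke-test` shows the driver and GPUs, the device-plugin path is healthy and the problem is in the Spark image; if it fails with the same error, fix the device plugin and container toolkit first.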
What you need to fix
Below is a checklist of what to adjust.
1. Ensure your contai…