Can't make it work on kubernetes #13851
I have a Kubernetes cluster with two nodes: a control plane node (no GPU) and a worker node with 4 Tesla P100s. When I go directly to the worker node and submit my job on the server itself, it runs fine and uses the GPUs. The configuration is below:

However, when I try to submit the job via the control plane node:

I am getting:

I have literally tried every possible version to build my container and I still get the same error. Here is an example of my Dockerfile:

My worker node:

Also, I can see that the GPU is exposed to the cluster:

Could you advise me what the issue is here, and how to debug exactly what is mismatched?
The issue is caused by a CUDA version mismatch between your Kubernetes executor container, the Spark RAPIDS plugin, and the GPU driver on your node. Your worker node reports:
Your container is built from:
But the RAPIDS plugin you are using:

When running directly on the node, Spark picks up the host's CUDA libraries, so everything works. The real fix is to rebuild the Spark executor image from a CUDA 11.8 base image, which matches the RAPIDS 23.10 release:

Rebuild the image and redeploy the executors; the GPU runtime will then match both RAPIDS and the host driver, resolving the error.
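As a sketch of such a rebuild, a matching executor image could look like the following. The base image tag, Java version, file paths, and RAPIDS jar name are assumptions to adapt to your actual Spark distribution:

```dockerfile
# Hypothetical executor image: CUDA 11.8 runtime to match the RAPIDS 23.10 plugin.
# Adjust the tag and paths to whatever your cluster actually runs.
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

# A JRE for the Spark executor (assumption: Spark 3.x on Java 17).
RUN apt-get update && apt-get install -y --no-install-recommends \
        openjdk-17-jre-headless && \
    rm -rf /var/lib/apt/lists/*

# Copy your existing Spark distribution and the RAPIDS plugin jar
# (placeholder paths and jar version; match your own layout).
COPY spark/ /opt/spark/
COPY rapids-4-spark_2.12-23.10.0.jar /opt/spark/jars/

ENV SPARK_HOME=/opt/spark
ENTRYPOINT ["/opt/spark/entrypoint.sh"]
```

The key point is only the `FROM` line: the CUDA userspace libraries baked into the image must not be newer than what the RAPIDS plugin was built against.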
You're very close — the root cause is now visible from your diagnostics.
Your container correctly has CUDA 11.8, but inside Kubernetes your Spark executor is not seeing the host's NVIDIA driver libraries, leading to:
Even though the node has driver 580.105.08 (CUDA 13.0), the executor pod fails to load `libcuda.so` from the host. The error in your `nvidia-device-plugin` pod confirms this:

This means the device plugin is not correctly passing the GPUs, driver libraries, or CDI spec to your pods.
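Before touching Spark at all, it can help to confirm that GPU scheduling works with a bare pod. This is a sketch; the pod name and image tag are illustrative, not from the thread:

```yaml
# Minimal GPU smoke-test pod (names/tags are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # resource advertised by the nvidia-device-plugin
```

If `kubectl logs gpu-smoke-test` shows the driver and GPUs, the device-plugin path is healthy and the problem is in the Spark image; if it fails with the same error, fix the device plugin and container toolkit first.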
What you need to fix
Below is a checklist of what to adjust.
1. Ensure your contai…