GPUs require special drivers and software that are not pre-installed on Dataproc clusters by default. This initialization action installs NVIDIA GPU drivers on the master and worker nodes of a Dataproc cluster.
Default versions of CUDA, the NVIDIA kernel driver, cuDNN, and NCCL are selected following NVIDIA's guidance, similar to the NVIDIA Deep Learning Frameworks Support Matrix.
Specifying a supported value for the `cuda-version` metadata variable selects compatible values for the driver, cuDNN, and NCCL from the script's internal matrix. Default CUDA versions are typically:
- Dataproc 1.5: 11.6.2
- Dataproc 2.0: 12.1.1
- Dataproc 2.1: 12.4.1
- Dataproc 2.2 & 2.3: 12.6.3
(Note: The script supports a wider range of specific versions.
Refer to internal arrays in install_gpu_driver.sh for the full matrix.)
Example Tested Configurations (Illustrative):
| CUDA | Full Version | Driver | cuDNN | NCCL | Tested Dataproc Image Versions |
|---|---|---|---|---|---|
| 11.8 | 11.8.0 | 525.147.05 | 9.5.1.17 | 2.21.5 | 2.0, 2.1 (Debian 10/11, Ubuntu); 2.2 (Debian 11, Ubuntu 22.04) |
| 12.0 | 12.0.1 | 525.147.05 | 8.8.1.3 | 2.16.5 | 2.0, 2.1 (Debian 10/11, Ubuntu); 2.2 (Debian 11/12, Rocky 9, Ubuntu 22.04) |
| 12.4 | 12.4.1 | 550.135 | 9.1.0.70 | 2.23.4 | 2.1 (Debian 10/11, Ubuntu 20.04); 2.2+ (Debian 11/12, Rocky 9, Ubuntu 22.04) |
| 12.6 | 12.6.3 | 550.142 | 9.6.0.74 | 2.23.4 | 2.1 (Debian 11, Ubuntu 20.04); 2.2+ (Debian 11/12, Rocky 9, Ubuntu 22.04) |
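The version selection can be pictured as a small table keyed by CUDA version. The sketch below is illustrative only: the array names are hypothetical, and the values are copied from the table above; `install_gpu_driver.sh` holds the authoritative matrix.

```shell
# Hypothetical sketch of the script's internal compatibility lookup.
# Values copied from the illustrative table; see install_gpu_driver.sh
# for the authoritative matrix.
declare -A DRIVER_FOR_CUDA=(
  ["11.8.0"]="525.147.05"
  ["12.0.1"]="525.147.05"
  ["12.4.1"]="550.135"
  ["12.6.3"]="550.142"
)
declare -A CUDNN_FOR_CUDA=(
  ["11.8.0"]="9.5.1.17"
  ["12.0.1"]="8.8.1.3"
  ["12.4.1"]="9.1.0.70"
  ["12.6.3"]="9.6.0.74"
)
cuda_version="${1:-12.6.3}"
echo "driver=${DRIVER_FOR_CUDA[$cuda_version]} cudnn=${CUDNN_FOR_CUDA[$cuda_version]}"
# → driver=550.142 cudnn=9.6.0.74 (when run without arguments)
```

Running it with `12.4.1` as the argument prints the driver and cuDNN versions from the corresponding table row.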
Note: Secure Boot is only supported on Dataproc 2.2+ images.
Supported Operating Systems:
- Debian 10, 11, 12
- Ubuntu 18.04, 20.04, 22.04 LTS
- Rocky Linux 8, 9
This initialization action will install NVIDIA GPU drivers and the CUDA toolkit. Optional components like cuDNN, NCCL, and PyTorch can be included via metadata.
- Use the `gcloud` command to create a new cluster with this initialization action. The following command creates a new cluster named `<CLUSTER_NAME>` and installs the default GPU drivers (the GPU agent is enabled by default).

  ```bash
  REGION=<region>
  CLUSTER_NAME=<cluster_name>
  DATAPROC_IMAGE_VERSION=<image_version> # e.g., 2.2-debian12

  gcloud dataproc clusters create ${CLUSTER_NAME} \
      --region ${REGION} \
      --image-version ${DATAPROC_IMAGE_VERSION} \
      --master-accelerator type=nvidia-tesla-t4,count=1 \
      --worker-accelerator type=nvidia-tesla-t4,count=2 \
      --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh \
      --scopes https://www.googleapis.com/auth/monitoring.write # For the GPU agent
  ```
- Use the `gcloud` command to create a new cluster specifying a custom CUDA version and providing direct HTTP/HTTPS URLs for the driver and CUDA `.run` files. This example also disables the GPU agent.

  ```bash
  REGION=<region>
  CLUSTER_NAME=<cluster_name>
  DATAPROC_IMAGE_VERSION=<image_version> # e.g., 2.2-ubuntu22
  MY_DRIVER_URL="https://us.download.nvidia.com/XFree86/Linux-x86_64/550.90.07/NVIDIA-Linux-x86_64-550.90.07.run"
  MY_CUDA_URL="https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run"

  gcloud dataproc clusters create ${CLUSTER_NAME} \
      --region ${REGION} \
      --image-version ${DATAPROC_IMAGE_VERSION} \
      --master-accelerator type=nvidia-tesla-t4,count=1 \
      --worker-accelerator type=nvidia-tesla-t4,count=2 \
      --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh \
      --metadata gpu-driver-url=${MY_DRIVER_URL},cuda-url=${MY_CUDA_URL},install-gpu-agent=false
  ```
- To create a cluster with Multi-Instance GPU (MIG) enabled (e.g., for NVIDIA A100 GPUs), use this `install_gpu_driver.sh` script for the base driver installation, and additionally specify `gpu/mig.sh` as a startup script.

  ```bash
  REGION=<region>
  CLUSTER_NAME=<cluster_name>
  DATAPROC_IMAGE_VERSION=<image_version> # e.g., 2.2-rocky9

  gcloud dataproc clusters create ${CLUSTER_NAME} \
      --region ${REGION} \
      --image-version ${DATAPROC_IMAGE_VERSION} \
      --worker-machine-type a2-highgpu-1g \
      --worker-accelerator type=nvidia-tesla-a100,count=1 \
      --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh \
      --properties "dataproc:startup.script.uri=gs://goog-dataproc-initialization-actions-${REGION}/gpu/mig.sh" \
      --metadata MIG_CGI='1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb' # Example MIG profiles
  ```
When `install_gpu_driver.sh` is used as a customization script for building custom Dataproc images (e.g., with tools from the GoogleCloudDataproc/custom-images repository such as `generate_custom_image.py`), some configuration must be deferred.
- The image building tool should pass `--metadata invocation-type=custom-images` to the temporary instance used during image creation.
- This instructs `install_gpu_driver.sh` to install drivers and tools but defer Hadoop/Spark-specific configuration to the first boot of an instance created from the custom image. This deferral is handled via a systemd service (`dataproc-gpu-config.service`).
- End users creating clusters from such a custom image do not set the `invocation-type` metadata.
Example command for generate_custom_image.py (simplified):
```bash
python generate_custom_image.py \
    --customization-script gs://<your-bucket>/gpu/install_gpu_driver.sh \
    --metadata invocation-type=custom-images,cuda-version=12.6
    # ... plus other generate_custom_image.py arguments and desired metadata ...
```

This script configures YARN, Dataproc's default resource manager, for GPU awareness:

- It registers `yarn.io/gpu` as a resource type.
- It configures the `LinuxContainerExecutor` and cgroups for GPU isolation.
- It installs a GPU discovery script (`getGpusResources.sh`) for Spark, which caches results to minimize `nvidia-smi` calls.
- Spark default configurations in `/etc/spark/conf/spark-defaults.conf` are updated with GPU-related properties (e.g., `spark.executor.resource.gpu.amount`), and the RAPIDS Spark plugin (`com.nvidia.spark.SQLPlugin`) is commonly configured.
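For illustration, the Spark-side result can be pictured as a `spark-defaults.conf` fragment like the following; the property values and the discovery-script path shown here are representative, not guaranteed to match exactly what the script writes:

```
spark.executor.resource.gpu.amount=1
spark.executor.resource.gpu.discoveryScript=/usr/lib/spark/scripts/gpu/getGpusResources.sh
spark.task.resource.gpu.amount=0.25
spark.plugins=com.nvidia.spark.SQLPlugin
```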
This script can install NVIDIA cuDNN, a GPU-accelerated library for deep neural networks.
- If `include-pytorch=yes` is specified or `cudnn-version` is provided, a compatible version of cuDNN is selected and installed based on the determined CUDA version.
- To install a specific version of cuDNN, use the `cudnn-version` metadata parameter (e.g., `--metadata cudnn-version=8.9.7.29`). Consult the cuDNN Archive and your deep learning framework's documentation for CUDA compatibility. The script may use `libcudnn` packages or tarball installations.
Example cuDNN Version Mapping (Illustrative):
| cuDNN Major.Minor | Example Full Version | Compatible CUDA Versions (General) |
|---|---|---|
| 8.6 | 8.6.0.163 | 10.2, 11.x |
| 8.9 | 8.9.7.29 | 11.x, 12.x |
| 9.x | e.g., 9.6.0.74 | 12.x |
This script accepts the following metadata parameters:
- `install-gpu-agent`: `true|false`. Default: `true`. Installs the GPU monitoring agent. Requires the `https://www.googleapis.com/auth/monitoring.write` scope.
- `cuda-version`: (Optional) Desired CUDA version (e.g., `11.8`, `12.4.1`). Overrides the default CUDA selection.
- `cuda-url`: (Optional) HTTP/HTTPS URL to a specific CUDA toolkit `.run` file (e.g., `https://developer.download.nvidia.com/.../cuda_12.4.1_..._linux.run`). Fetched using `curl`. Overrides `cuda-version` and the default selection.
- `gpu-driver-version`: (Optional) NVIDIA driver version (e.g., `550.90.07`). Overrides the default compatible driver selection.
- `gpu-driver-url`: (Optional) HTTP/HTTPS URL to a specific NVIDIA driver `.run` file (e.g., `https://us.download.nvidia.com/.../NVIDIA-Linux-x86_64-...run`). Fetched using `curl`. Overrides `gpu-driver-version`.
- `gpu-driver-provider`: (Optional) `OS|NVIDIA`. Default: `NVIDIA`. Determines the preference for OS-provided vs. NVIDIA-direct drivers. The script often prioritizes `.run` files or source builds for reliability.
- `cudnn-version`: (Optional) cuDNN version (e.g., `8.9.7.29`).
- `nccl-version`: (Optional) NCCL version.
- `include-pytorch`: (Optional) `yes|no`. Default: `no`. If `yes`, installs PyTorch, Numba, RAPIDS, and PySpark in a Conda environment.
- `gpu-conda-env`: (Optional) Name for the PyTorch Conda environment. Default: `dpgce`.
- `container-runtime`: (Optional) E.g., `docker`, `containerd`, `crio`. For NVIDIA Container Toolkit configuration. Auto-detected if not specified.
- `http-proxy`: (Optional) Proxy address and port for HTTP requests (e.g., `your-proxy.com:3128`).
- `https-proxy`: (Optional) Proxy address and port for HTTPS requests (e.g., `your-proxy.com:3128`). Defaults to `http-proxy` if not set.
- `proxy-uri`: (Optional) A single proxy URI for both HTTP and HTTPS. Overridden by `http-proxy` or `https-proxy` if they are set.
- `no-proxy`: (Optional) Comma- or space-separated list of hosts/domains that bypass the proxy. Defaults include localhost, the metadata server, and Google APIs. User-provided values are appended to the defaults.
- `http-proxy-pem-uri`: (Optional) A `gs://` path to the PEM-encoded CA certificate file for the proxy specified in `http-proxy`/`https-proxy`. Required if the proxy uses TLS with a certificate not in the default system trust store. The certificate will be added to the system, Java, and Conda trust stores, and proxy connections will use HTTPS.
- `invocation-type`: (For custom images) Set to `custom-images` by image building tools. Not typically set by end users creating clusters.
- Secure Boot signing parameters: used if Secure Boot is enabled and you need to sign kernel modules built from source:
  - `private_secret_name=<your-private-key-secret-name>`
  - `public_secret_name=<your-public-cert-secret-name>`
  - `secret_project=<your-gcp-project-id>`
  - `secret_version=<your-secret-version>`
  - `modulus_md5sum=<md5sum-of-your-mok-key-modulus>`
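For illustration, the signing parameters are supplied as ordinary cluster metadata key/value pairs; the secret names and project in this fragment are placeholders:

```
--metadata private_secret_name=mok-signing-key,public_secret_name=mok-signing-cert,secret_project=my-project,secret_version=1,modulus_md5sum=<md5sum-of-your-mok-key-modulus>
```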
This script includes robust support for environments requiring an HTTP/HTTPS proxy:
- Configuration: Use the `http-proxy`, `https-proxy`, or `proxy-uri` metadata to specify your proxy server (`host:port`).
- Custom CA certificates: If your proxy uses a custom CA (e.g., self-signed), provide the CA certificate in PEM format via the `http-proxy-pem-uri` metadata (as a `gs://` path). The script will:
  - Install the CA into the system trust store (`update-ca-certificates` or `update-ca-trust`).
  - Add the CA to the Java cacerts trust store.
  - Configure Conda to use the system trust store.
  - Switch proxy communications to use HTTPS.
- Tool configuration: The script automatically configures `curl`, `apt`, `dnf`, `gpg`, and Java to use the specified proxy settings, and the custom CA if provided.
- Bypass: The `no-proxy` metadata allows specifying hosts that bypass the proxy. Defaults include `localhost`, the metadata server, `.google.com`, and `.googleapis.com` to ensure essential services function correctly.
- Verification: The script performs connection tests to the proxy and attempts to reach external sites (google.com, nvidia.com) through the proxy to validate the configuration before proceeding with downloads.
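The append-to-defaults behavior for `no-proxy` can be sketched in a few lines of bash. This is a hypothetical illustration, not the script's actual code, and the default host list shown is representative only:

```shell
# Hypothetical sketch of merging user-supplied no-proxy entries with the
# script's defaults (actual implementation and default list may differ).
default_no_proxy="localhost,127.0.0.1,metadata.google.internal,.google.com,.googleapis.com"
user_no_proxy="repo.corp.example.com .internal.example.net"  # comma- or space-separated

# Normalize separators to commas and append to the defaults.
merged_no_proxy="${default_no_proxy},$(echo "${user_no_proxy}" | tr ' ' ',')"
export NO_PROXY="${merged_no_proxy}" no_proxy="${merged_no_proxy}"
echo "${NO_PROXY}"
```

The result is that Google API endpoints always bypass the proxy even when a user supplies their own bypass list.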
When the script needs to build NVIDIA kernel modules from source (e.g., using NVIDIA's open-gpu-kernel-modules repository, or if pre-built OS packages are not suitable), special considerations apply if Secure Boot is enabled.
- Secure Boot active: Locally compiled modules must be signed with a key trusted by the system's UEFI firmware.
- MOK key signing: Provide the Secure Boot signing metadata parameters (listed above) to use keys stored in GCP Secret Manager. The public MOK certificate must be enrolled in your base image's UEFI keystore. See `GoogleCloudDataproc/custom-images/examples/secure-boot/create-key-pair.sh` for guidance on key creation and management.
- Disabling Secure Boot (insecure workaround): You can pass the `--no-shielded-secure-boot` flag to `gcloud dataproc clusters create`. This allows unsigned modules to load but disables Secure Boot's protections.
- Error indication: If a kernel module fails to load due to signature issues while Secure Boot is active, check `/var/log/nvidia-installer.log` or `dmesg` output for errors such as "Operation not permitted" or messages about signature verification failure.
For environments requiring NVIDIA drivers to be signed for Secure Boot, especially when operating behind an HTTP/S proxy, you must first build a custom Dataproc image. This process uses tools from the GoogleCloudDataproc/custom-images repository, specifically the scripts within the examples/secure-boot/ directory.
Base Image: Typically Dataproc 2.2-debian12 or newer.
Process Overview:
1. Clone the `custom-images` repository:

   ```bash
   git clone https://github.com/GoogleCloudDataproc/custom-images.git
   cd custom-images
   ```

2. Configure the build: Set up `env.json` with your project, network, and bucket details. See `examples/secure-boot/env.json.sample` in the `custom-images` repo.

3. Prepare signing keys: Ensure Secure Boot signing keys are available in GCP Secret Manager. Use `examples/secure-boot/create-key-pair.sh` from the `custom-images` repo to create and manage them.

4. Build the Docker image for the builder environment:

   ```bash
   docker build -t dataproc-secure-boot-builder:latest .
   ```

5. Run image generation: Use `generate_custom_image.py` within the Docker container, typically orchestrated by `examples/secure-boot/pre-init.sh`. The core customization script, `examples/secure-boot/install_gpu_driver.sh`, handles driver installation, proxy setup, and module signing.
   - Refer to the Secure Boot example documentation for detailed `docker run` commands and metadata requirements (proxy settings, secret names, etc.).
Once you have successfully built a custom image with signed drivers, you can create a Dataproc cluster with Secure Boot enabled.
Important: To launch a Dataproc cluster with the --shielded-secure-boot flag and have NVIDIA drivers function correctly, you MUST use a custom image created through the process detailed above. Standard Dataproc images do not contain the necessary signed modules.
Network and Cluster Setup:
To create the cluster in a private network environment with a Secure Web Proxy, use the scripts from the GoogleCloudDataproc/cloud-dataproc repository:
1. Clone the `cloud-dataproc` repository:

   ```bash
   git clone https://github.com/GoogleCloudDataproc/cloud-dataproc.git
   cd cloud-dataproc/gcloud
   ```

2. Configure the environment:
   - Copy `env.json.sample` to `env.json`.
   - Edit `env.json` with your project details, making sure to specify the custom image name and any proxy details needed for a private network. Example:

     ```json
     {
       "PROJECT_ID": "YOUR_GCP_PROJECT_ID",
       "REGION": "us-west4",
       "ZONE": "us-west4-a",
       "BUCKET": "YOUR_STAGING_BUCKET",
       "TEMP_BUCKET": "YOUR_TEMP_BUCKET",
       "CUSTOM_IMAGE_NAME": "YOUR_BUILT_IMAGE_NAME",
       "PURPOSE": "secure-boot-cluster",
       // Add these for a private, proxied environment
       "PRIVATE_RANGE": "10.43.79.0/24",
       "SWP_RANGE": "10.44.79.0/24",
       "SWP_IP": "10.43.79.245",
       "SWP_PORT": "3128",
       "SWP_HOSTNAME": "swp.your-project.example.com"
       // ... other variables as needed
     }
     ```

   - Set `CUSTOM_IMAGE_NAME` to the image you built in the `custom-images` process.

3. Create the private environment and cluster: This script sets up the VPC, subnets, and Secure Web Proxy, and then creates the Dataproc cluster using the custom image. The `--shielded-secure-boot` flag is handled internally by the scripts when a `CUSTOM_IMAGE_NAME` is provided.

   ```bash
   bash bin/create-dpgce-private
   ```
Verification:

- SSH into the master node of the created cluster.
- Check driver status: `sudo nvidia-smi`
- Verify the module signature: `sudo modinfo nvidia | grep signer` (should show your custom CA).
- Check for errors: `dmesg | grep -iE "Secure Boot|NVRM|nvidia"`
- Once the cluster has been created, you can access the Dataproc cluster and verify that the NVIDIA drivers installed successfully:

  ```bash
  sudo nvidia-smi
  ```

- If the CUDA toolkit was installed, verify the compiler:

  ```bash
  /usr/local/cuda/bin/nvcc --version
  ```

- If you installed the GPU collection service (`install-gpu-agent=true`, the default), verify it with:

  ```bash
  sudo systemctl status gpu-utilization-agent.service
  ```

  (The service should be `active (running)`.)
For more information about GPU support, see the Dataproc documentation.
The GPU monitoring agent (installed when install-gpu-agent=true) automatically
collects and sends GPU utilization and memory usage metrics to Cloud Monitoring.
The agent is based on code from the
ml-on-gcp/gcp-gpu-utilization-metrics
repository. The create_gpu_metrics.py script mentioned in older
documentation is no longer used by this initialization action, as the agent
handles metric creation and reporting.
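Conceptually, the agent periodically samples per-GPU utilization and memory and reports them as time series. The sketch below is illustrative only (not the agent's actual code): a canned CSV line stands in for real `nvidia-smi --query-gpu=... --format=csv,noheader` output.

```shell
# Illustrative sketch of the kind of sample a monitoring agent extracts
# from nvidia-smi CSV output; the sample line below is canned test data.
sample="0, 35, 1234, 15360"  # index, utilization.gpu [%], memory.used [MiB], memory.total [MiB]
IFS=', ' read -r idx util mem_used mem_total <<<"${sample}"
echo "gpu${idx}: utilization=${util}% memory=${mem_used}MiB/${mem_total}MiB"
# → gpu0: utilization=35% memory=1234MiB/15360MiB
```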
- Installation failures: Examine the initialization action log on the affected node, typically `/var/log/dataproc-initialization-script-0.log` (or a similar name if multiple init actions are used).
- Network/proxy issues: If using a proxy, double-check the `http-proxy`, `https-proxy`, `proxy-uri`, `no-proxy`, and `http-proxy-pem-uri` metadata settings. Ensure the proxy allows access to NVIDIA domains, GitHub, and package repositories. Check the init action log for `curl` errors or proxy test failures.
- GPU agent issues: If the agent was installed (`install-gpu-agent=true`), check its service logs using `sudo journalctl -u gpu-utilization-agent.service`.
- Driver load or Secure Boot problems: Review `dmesg` output and `/var/log/nvidia-installer.log` for errors related to module loading or signature verification.
- "Points written too frequently" (GPU agent): This was a known issue with older versions of the `report_gpu_metrics.py` service. The current script and agent versions aim to mitigate this. If encountered, check the agent logs.
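A small helper can speed up the log triage above. This is a hypothetical sketch (the function name and failure-marker list are illustrative, not part of the script):

```shell
# Hypothetical triage helper: count lines in a log file that contain
# common failure markers (the pattern list is illustrative, not exhaustive).
count_failures() {
  grep -icE "error|failed|not permitted|signature" "$1" 2>/dev/null || true
}
```

For example, `count_failures /var/log/dataproc-initialization-script-0.log` prints the number of suspicious lines; a non-zero count suggests where to read in detail.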
- This initialization script installs NVIDIA GPU drivers on all nodes on which a GPU is detected. If no GPUs are present on a node, most GPU-specific installation steps are skipped.
- Performance and caching:
  - The script extensively caches downloaded artifacts (drivers, CUDA `.run` files) and compiled components (kernel modules, NCCL, Conda environments) to a GCS bucket, typically specified by the `dataproc-temp-bucket` cluster property or metadata. Downloads and cache operations are proxy-aware.
  - First run / cache warming: Initial runs on new configurations (OS, kernel, or driver version combinations) that require source compilation (e.g., for NCCL or kernel modules when no suitable pre-compiled version is available) can be time-consuming.
    - On small instances (e.g., 2-core nodes), this process can take up to 150 minutes.
    - To avoid long startup times on production clusters, it is highly recommended to "pre-warm" the GCS cache by running the script once on a temporary, larger instance (e.g., a single-node, 32-core machine) with your target OS and GPU configuration. This builds and caches the necessary components; subsequent cluster creations using the same cache bucket are significantly faster (e.g., the init action might take 12-20 minutes on a large instance for the initial build, and much less on later nodes that hit the cache).
  - Security benefit of caching: When the script finds and uses cached, pre-built artifacts, it often bypasses the need to install build tools (e.g., `gcc`, `kernel-devel`, `make`) on the cluster nodes, reducing the attack surface of the resulting instances.
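A pre-warm run can be as simple as creating a throwaway single-node cluster that shares the same temp bucket as your production clusters; the cluster name, machine type, accelerator, and metadata in this illustrative command are placeholders to adapt:

```bash
REGION=<region>

gcloud dataproc clusters create prewarm-gpu-cache \
    --region ${REGION} \
    --single-node \
    --master-machine-type n1-standard-32 \
    --master-accelerator type=nvidia-tesla-t4,count=1 \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh \
    --metadata cuda-version=12.6
```

Delete the cluster once the init action completes; later clusters that use the same cache bucket reuse the built artifacts.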
- SSHD configuration is hardened by default by the script.
- The script includes logic to manage APT sources and GPG keys for Debian-based systems, including handling of archived backports repositories to ensure dependencies can be met.
- Tested primarily with Dataproc 2.0+ images. Support for older Dataproc 1.5 images is limited.