Refactor: Improve Proxy Handling and Secure Boot in GPU Install Script #1374
base: main
Changes from 10 commits
f2cee3d
ba91cf7
3f5811e
897684d
bded17b
184970d
8fce880
7ab81df
739fb91
3e1122a
46dedef
7902695
4dab2cb
6a01b50
e37b8eb
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2,8 +2,8 @@ | |
|
|
||
| GPUs require special drivers and software which are not pre-installed on | ||
| [Dataproc](https://cloud.google.com/dataproc) clusters by default. | ||
| - This initialization action installs GPU driver for NVIDIA GPUs on master and | ||
| - worker nodes in a Dataproc cluster. | ||
| + This initialization action installs GPU driver for NVIDIA GPUs on -m node(s) and | ||
| + -w nodes in a Dataproc cluster. | ||
cjac marked this conversation as resolved.
|
||
|
|
||
| ## Default versions | ||
|
|
||
|
|
@@ -15,6 +15,7 @@ Specifying a supported value for the `cuda-version` metadata variable | |
| will select compatible values for Driver, cuDNN, and NCCL from the script's | ||
| internal matrix. Default CUDA versions are typically: | ||
|
|
||
| * Dataproc 1.5: `11.6.2` | ||
|
||
| * Dataproc 2.0: `12.1.1` | ||
| * Dataproc 2.1: `12.4.1` | ||
| * Dataproc 2.2 & 2.3: `12.6.3` | ||
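The defaults listed above can be read as a simple lookup from image version to CUDA version. The sketch below only restates that list; the function name `default_cuda_version` is hypothetical and is not taken from `install_gpu_driver.sh`, whose internal arrays remain the authoritative source.

```bash
# Hypothetical helper: maps a Dataproc image version to the default CUDA
# version listed in this README. Illustration only; this is not the
# script's internal lookup.
default_cuda_version() {
  case "$1" in
    1.5)     echo "11.6.2" ;;
    2.0)     echo "12.1.1" ;;
    2.1)     echo "12.4.1" ;;
    2.2|2.3) echo "12.6.3" ;;
    *)       echo "unknown"; return 1 ;;
  esac
}

# Example: print the default for a 2.2 image
default_cuda_version 2.2
```

Passing a `cuda-version` metadata value overrides this default, as described above.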
|
|
@@ -26,10 +27,12 @@ Refer to internal arrays in `install_gpu_driver.sh` for the full matrix.)* | |
|
|
||
| CUDA | Full Version | Driver | cuDNN | NCCL | Tested Dataproc Image Versions | ||
| -----| ------------ | --------- | --------- | -------| --------------------------- | ||
| - 11.8 | 11.8.0 | 525.147.05| 9.5.1.17 | 2.21.5 | 2.0, 2.1 (Debian/Ubuntu/Rocky); 2.2 (Ubuntu 22.04) | ||
| - 12.0 | 12.0.1 | 525.147.05| 8.8.1.3 | 2.16.5 | 2.0, 2.1 (Debian/Ubuntu/Rocky); 2.2 (Rocky 9, Ubuntu 22.04) | ||
| - 12.4 | 12.4.1 | 550.135 | 9.1.0.70 | 2.23.4 | 2.1 (Ubuntu 20.04, Rocky 8); Dataproc 2.2+ | ||
| - 12.6 | 12.6.3 | 550.142 | 9.6.0.74 | 2.23.4 | 2.1 (Ubuntu 20.04, Rocky 8); Dataproc 2.2+ | ||
| + 11.8 | 11.8.0 | 525.147.05| 9.5.1.17 | 2.21.5 | 2.0, 2.1 (Debian 10/11, Ubuntu); 2.2 (Debian 11, Ubuntu 22.04) | ||
| + 12.0 | 12.0.1 | 525.147.05| 8.8.1.3 | 2.16.5 | 2.0, 2.1 (Debian 10/11, Ubuntu); 2.2 (Debian 11/12, Rocky 9, Ubuntu 22.04) | ||
cjac marked this conversation as resolved.
| + 12.4 | 12.4.1 | 550.135 | 9.1.0.70 | 2.23.4 | 2.1 (Debian 10/11, Ubuntu 20.04); 2.2+ (Debian 11/12, Rocky 9, Ubuntu 22.04) | ||
| + 12.6 | 12.6.3 | 550.142 | 9.6.0.74 | 2.23.4 | 2.1 (Debian 11, Ubuntu 20.04); 2.2+ (Debian 11/12, Rocky 9, Ubuntu 22.04) | ||
|
|
||
| *Note: Secure Boot is only supported on Dataproc 2.2+ images.* | ||
|
|
||
| **Supported Operating Systems:** | ||
|
|
||
|
|
@@ -191,20 +194,19 @@ This script accepts the following metadata parameters: | |
| * `cudnn-version`: (Optional) Specify cuDNN version (e.g., `8.9.7.29`). | ||
| * `nccl-version`: (Optional) Specify NCCL version. | ||
| * `include-pytorch`: (Optional) `yes`|`no`. Default: `no`. | ||
| - If `yes`, installs PyTorch, TensorFlow, RAPIDS, and PySpark in a Conda | ||
| + If `yes`, installs PyTorch, Numba, RAPIDS, and PySpark in a Conda | ||
cjac marked this conversation as resolved.
| environment. | ||
cjac marked this conversation as resolved.
| * `gpu-conda-env`: (Optional) Name for the PyTorch Conda environment. | ||
| Default: `dpgce`. | ||
| * `container-runtime`: (Optional) E.g., `docker`, `containerd`, `crio`. | ||
| For NVIDIA Container Toolkit configuration. Auto-detected if not specified. | ||
| - * `http-proxy`: (Optional) URL of an HTTP proxy for downloads. | ||
| + * `http-proxy`: (Optional) Proxy address and port for HTTP requests (e.g., `your-proxy.com:3128`). | ||
| + * `https-proxy`: (Optional) Proxy address and port for HTTPS requests (e.g., `your-proxy.com:3128`). Defaults to `http-proxy` if not set. | ||
| + * `proxy-uri`: (Optional) A single proxy URI for both HTTP and HTTPS. Overridden by `http-proxy` or `https-proxy` if they are set. | ||
| + * `no-proxy`: (Optional) Comma- or space-separated list of hosts/domains that bypass the proxy. Defaults include localhost, the metadata server, and Google APIs. User-provided values are appended to the defaults. | ||
| * `http-proxy-pem-uri`: (Optional) A `gs://` path to the | ||
| - PEM-encoded certificate file used by the proxy specified in | ||
| - `http-proxy`. This is needed if the proxy uses TLS and its | ||
| - certificate is not already trusted by the cluster's default trust | ||
| - store (e.g., if it's a self-signed certificate or signed by an | ||
| - internal CA). The script will install this certificate into the | ||
| - system and Java trust stores. | ||
| + PEM-encoded CA certificate file for the proxy specified in | ||
| + `http-proxy`/`https-proxy`. Required if the proxy uses TLS with a certificate not in the default system trust store. This certificate will be added to the system, Java, and Conda trust stores, and proxy connections will use HTTPS. | ||
cjac marked this conversation as resolved.
|
||
| * `invocation-type`: (For Custom Images) Set to `custom-images` by image | ||
| building tools. Not typically set by end-users creating clusters. | ||
| * **Secure Boot Signing Parameters:** Used if Secure Boot is enabled and | ||
|
|
@@ -217,6 +219,20 @@ This script accepts the following metadata parameters: | |
| modulus_md5sum=<md5sum-of-your-mok-key-modulus> | ||
| ``` | ||
|
|
||
| ### Enhanced Proxy Support | ||
|
|
||
| This script includes robust support for environments requiring an HTTP/HTTPS proxy: | ||
|
|
||
| * **Configuration:** Use the `http-proxy`, `https-proxy`, or `proxy-uri` metadata to specify your proxy server (host:port). | ||
| * **Custom CA Certificates:** If your proxy uses a custom CA (e.g., self-signed), provide the CA certificate in PEM format via the `http-proxy-pem-uri` metadata (as a `gs://` path). The script will: | ||
| * Install the CA into the system trust store (`update-ca-certificates` or `update-ca-trust`). | ||
| * Add the CA to the Java cacerts trust store. | ||
| * Configure Conda to use the system trust store. | ||
| * Switch proxy communications to use HTTPS. | ||
| * **Tool Configuration:** The script automatically configures `curl`, `apt`, `dnf`, `gpg`, and Java to use the specified proxy settings and custom CA if provided. | ||
| * **Bypass:** The `no-proxy` metadata allows specifying hosts to bypass the proxy. Defaults include `localhost`, the metadata server, `.google.com`, and `.googleapis.com` to ensure essential services function correctly. | ||
| * **Verification:** The script performs connection tests to the proxy and attempts to reach external sites (google.com, nvidia.com) through the proxy to validate the configuration before proceeding with downloads. | ||
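The precedence rules for the proxy metadata can be sketched as below. This is an illustration of the behavior documented above, not code from `install_gpu_driver.sh`: the function names are hypothetical, and the exact default bypass entries (in particular `metadata.google.internal` and `169.254.169.254` for the metadata server) are assumptions.

```bash
# Hypothetical helpers illustrating the documented precedence.
# Arguments: $1 = proxy-uri, $2 = http-proxy, $3 = https-proxy (any may be empty).

resolve_http_proxy() {
  # http-proxy overrides proxy-uri when it is set
  if [ -n "$2" ]; then echo "$2"; else echo "$1"; fi
}

resolve_https_proxy() {
  # https-proxy wins; otherwise fall back to http-proxy, then proxy-uri
  if [ -n "$3" ]; then echo "$3"
  elif [ -n "$2" ]; then echo "$2"
  else echo "$1"
  fi
}

merge_no_proxy() {
  # Assumed defaults: localhost, metadata server, Google domains
  defaults="localhost,metadata.google.internal,169.254.169.254,.google.com,.googleapis.com"
  # Normalize comma- or space-separated user input to single commas
  user=$(printf '%s' "$1" | tr ' ' ',' | tr -s ',' | sed 's/^,//; s/,$//')
  # User-provided values are appended to the defaults
  if [ -n "$user" ]; then echo "${defaults},${user}"; else echo "$defaults"; fi
}

# Example: only proxy-uri is set, so it is used for both protocols
resolve_http_proxy "your-proxy.com:3128" "" ""
resolve_https_proxy "your-proxy.com:3128" "" ""
```

Keeping the defaults first and appending user entries mirrors the "appended to the defaults" wording, so a user-supplied list cannot accidentally drop the metadata server from the bypass list.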
|
||
|
|
||
| ### Loading Built Kernel Module & Secure Boot | ||
|
|
||
| When the script needs to build NVIDIA kernel modules from source (e.g., using | ||
|
|
@@ -238,6 +254,82 @@ not suitable), special considerations apply if Secure Boot is enabled. | |
| or `dmesg` output for errors like "Operation not permitted" or messages | ||
| related to signature verification failure. | ||
|
|
||
| ## Building Custom Images with Secure Boot and Proxy Support | ||
|
|
||
| For environments requiring NVIDIA drivers to be signed for Secure Boot, especially when operating behind an HTTP/S proxy, you must first build a custom Dataproc image. This process uses tools from the [GoogleCloudDataproc/custom-images](https://github.com/GoogleCloudDataproc/custom-images) repository, specifically the scripts within the `examples/secure-boot/` directory. | ||
|
|
||
| **Base Image:** Typically Dataproc 2.2-debian12 or newer. | ||
|
|
||
| **Process Overview:** | ||
|
|
||
| 1. **Clone `custom-images` Repository:** | ||
| ```bash | ||
| git clone https://github.com/GoogleCloudDataproc/custom-images.git | ||
| cd custom-images | ||
| ``` | ||
|
|
||
| 2. **Configure Build:** Set up `env.json` with your project, network, and bucket details. See the `examples/secure-boot/env.json.sample` in the `custom-images` repo. | ||
|
|
||
| 3. **Prepare Signing Keys:** Ensure Secure Boot signing keys are available in GCP Secret Manager. Use `examples/secure-boot/create-key-pair.sh` from the `custom-images` repo to create/manage these. | ||
|
|
||
| 4. **Build Docker Image:** Build the builder environment: `docker build -t dataproc-secure-boot-builder:latest .` | ||
|
|
||
| 5. **Run Image Generation:** Use `generate_custom_image.py` within the Docker container, typically orchestrated by `examples/secure-boot/pre-init.sh`. The core customization script `examples/secure-boot/install_gpu_driver.sh` handles driver installation, proxy setup, and module signing. | ||
|
|
||
| * Refer to the [Secure Boot example documentation](https://github.com/GoogleCloudDataproc/custom-images/tree/master/examples/secure-boot) for detailed `docker run` commands and metadata requirements (proxy settings, secret names, etc.). | ||
|
|
||
| ### Launching a Cluster with the Secure Boot Custom Image | ||
|
|
||
| Once you have successfully built a custom image with signed drivers, you can create a Dataproc cluster with Secure Boot enabled. | ||
|
|
||
| **Important:** To launch a Dataproc cluster with the `--shielded-secure-boot` flag and have NVIDIA drivers function correctly, you MUST use a custom image created through the process detailed above. Standard Dataproc images do not contain the necessary signed modules. | ||
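For orientation, a direct `gcloud` invocation that combines a custom image with Secure Boot might look like the sketch below. The cluster name, region, and accelerator type are placeholder assumptions, and the helper prints the command rather than running it so it can be reviewed first. The helper scripts referenced in this README may pass additional flags.

```bash
# Placeholder values (assumptions); substitute your own.
CLUSTER_NAME="secure-boot-gpu"
REGION="us-west4"
PROJECT_ID="YOUR_GCP_PROJECT_ID"
CUSTOM_IMAGE_NAME="YOUR_BUILT_IMAGE_NAME"

print_create_command() {
  # Prints the command instead of executing it.
  echo gcloud dataproc clusters create "$CLUSTER_NAME" \
    --project="$PROJECT_ID" \
    --region="$REGION" \
    --image="projects/${PROJECT_ID}/global/images/${CUSTOM_IMAGE_NAME}" \
    --worker-accelerator=type=nvidia-tesla-t4,count=1 \
    --shielded-secure-boot
}

print_create_command
```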
|
|
||
| **Network and Cluster Setup:** | ||
|
|
||
| To create the cluster in a private network environment with a Secure Web Proxy, use the scripts from the [GoogleCloudDataproc/cloud-dataproc](https://github.com/GoogleCloudDataproc/cloud-dataproc) repository: | ||
|
|
||
| 1. **Clone `cloud-dataproc` Repository:** | ||
| ```bash | ||
| git clone https://github.com/GoogleCloudDataproc/cloud-dataproc.git | ||
| cd cloud-dataproc/gcloud | ||
| ``` | ||
|
|
||
| 2. **Configure Environment:** | ||
| * Copy `env.json.sample` to `env.json`. | ||
| * Edit `env.json` with your project details, ensuring you specify the custom image name and any necessary proxy details if you intend to run in a private network. Example: | ||
| ```json | ||
| { | ||
| "PROJECT_ID": "YOUR_GCP_PROJECT_ID", | ||
| "REGION": "us-west4", | ||
| "ZONE": "us-west4-a", | ||
| "BUCKET": "YOUR_STAGING_BUCKET", | ||
| "TEMP_BUCKET": "YOUR_TEMP_BUCKET", | ||
| "CUSTOM_IMAGE_NAME": "YOUR_BUILT_IMAGE_NAME", | ||
| "PURPOSE": "secure-boot-cluster", | ||
| // Add these for a private, proxied environment | ||
|
Contributor
`//` may not be valid
||
| "PRIVATE_RANGE": "10.43.79.0/24", | ||
| "SWP_RANGE": "10.44.79.0/24", | ||
| "SWP_IP": "10.43.79.245", | ||
| "SWP_PORT": "3128", | ||
| "SWP_HOSTNAME": "swp.your-project.example.com" | ||
| // ... other variables as needed | ||
| } | ||
| ``` | ||
| * Set `CUSTOM_IMAGE_NAME` to the image you built in the `custom-images` process. | ||
|
|
||
| 3. **Create the Private Environment and Cluster:** | ||
| This script sets up the VPC, subnets, Secure Web Proxy, and then creates the Dataproc cluster using the custom image. The `--shielded-secure-boot` flag is handled internally by the scripts when a `CUSTOM_IMAGE_NAME` is provided. | ||
|
Contributor
Add a note: "make sure that you are in the right directory ...
||
| ```bash | ||
| bash bin/create-dpgce-private | ||
| ``` | ||
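Before running the create script, it can help to confirm that the working directory and `env.json` are in the expected state. The following pre-flight check is a hypothetical sketch, assuming the layout described in the steps above (`env.json` and `bin/create-dpgce-private` both under `cloud-dataproc/gcloud`). It also flags `//` comment lines, since strict JSON has no comment syntax and some parsers reject them.

```bash
# Hypothetical pre-flight check; not part of the cloud-dataproc scripts.
# Usage: preflight [directory]   (defaults to the current directory)
preflight() {
  dir="${1:-.}"
  if [ ! -f "$dir/env.json" ]; then
    echo "env.json not found in $dir" >&2
    return 1
  fi
  if [ ! -f "$dir/bin/create-dpgce-private" ]; then
    echo "bin/create-dpgce-private not found in $dir" >&2
    return 1
  fi
  # Report any lines that start with // (printed with line numbers on stderr)
  if grep -n '^[[:space:]]*//' "$dir/env.json" >&2; then
    echo "env.json contains // comment lines; remove them if your JSON parser rejects comments" >&2
    return 2
  fi
  echo "ok"
}
```

From `cloud-dataproc/gcloud`, one possible use is `preflight . && bash bin/create-dpgce-private`.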
|
|
||
| **Verification:** | ||
|
|
||
| 1. SSH into the -m node of the created cluster. | ||
| 2. Check driver status: `sudo nvidia-smi` | ||
| 3. Verify module signature: `sudo modinfo nvidia | grep signer` (should show your custom CA). | ||
| 4. Check for errors: `dmesg | grep -iE "Secure Boot|NVRM|nvidia"` | ||
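The signature check in step 3 can be wrapped in a small helper that fails when no signer is reported. This is a sketch: the function name `module_signer` is hypothetical, and it assumes `modinfo` prints a `signer:` field for signed modules.

```bash
# Hypothetical helper: reads `modinfo` output on stdin, prints the signer,
# and returns non-zero if no signer field is present.
module_signer() {
  signer=$(sed -n 's/^signer:[[:space:]]*//p' | head -n 1)
  [ -n "$signer" ] || return 1
  echo "$signer"
}
```

On a cluster node this could be used as `sudo modinfo nvidia | module_signer || echo "nvidia module is not signed"`.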
|
|
||
| ### Verification | ||
|
|
||
| 1. Once the cluster has been created, you can access the Dataproc cluster and | ||
|
|
@@ -280,6 +372,7 @@ handles metric creation and reporting. | |
| * **Installation Failures:** Examine the initialization action log on the | ||
| affected node, typically `/var/log/dataproc-initialization-script-0.log` | ||
| (or a similar name if multiple init actions are used). | ||
| * **Network/Proxy Issues:** If using a proxy, double-check the `http-proxy`, `https-proxy`, `proxy-uri`, `no-proxy`, and `http-proxy-pem-uri` metadata settings. Ensure the proxy allows access to NVIDIA domains, GitHub, and package repositories. Check the init action log for curl errors or proxy test failures. | ||
|
||
| * **GPU Agent Issues:** If the agent was installed (`install-gpu-agent=true`), | ||
| check its service logs using `sudo journalctl -u gpu-utilization-agent.service`. | ||
| * **Driver Load or Secure Boot Problems:** Review `dmesg` output and | ||
|
|
@@ -298,7 +391,7 @@ handles metric creation and reporting. | |
| * The script extensively caches downloaded artifacts (drivers, CUDA `.run` | ||
| files) and compiled components (kernel modules, NCCL, Conda environments) | ||
| to a GCS bucket. This bucket is typically specified by the | ||
| - `dataproc-temp-bucket` cluster property or metadata. | ||
| + `dataproc-temp-bucket` cluster property or metadata. Downloads and cache operations are proxy-aware. | ||
|
||
| * **First Run / Cache Warming:** Initial runs on new configurations (OS, | ||
| kernel, or driver version combinations) that require source compilation | ||
| (e.g., for NCCL or kernel modules when no pre-compiled version is | ||
|
|
@@ -324,4 +417,4 @@ handles metric creation and reporting. | |
| Debian-based systems, including handling of archived backports repositories | ||
| to ensure dependencies can be met. | ||
| * Tested primarily with Dataproc 2.0+ images. Support for older Dataproc | ||
| - 1.5 images is limited. | ||
| + 1.5 images is limited. | ||
nit: need to replace all variables like `$BUILD_ID` and others with `${BUILD_ID}`