Draft — changes from 10 commits
33 changes: 33 additions & 0 deletions cloudbuild/cloudbuild.yaml
@@ -108,6 +108,39 @@ steps:
- 'CLOUDSDK_COMPUTE_REGION=us-central1'
- 'CLOUDSDK_CONTAINER_CLUSTER=init-actions-presubmit'

# Run presubmit tests in parallel for 2.3 Debian image
- name: 'gcr.io/cloud-builders/kubectl'
id: 'dataproc-2.3-debian12-tests'
waitFor: ['gcr-push']
entrypoint: 'bash'
args: ['cloudbuild/run-presubmit-on-k8s.sh', 'gcr.io/$PROJECT_ID/init-actions-image:$BUILD_ID', '$BUILD_ID', '2.3-debian12']
> **Contributor comment:** nit: need to replace all variables like `$BUILD_ID` and others with `${BUILD_ID}`
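Applied to the step above, the reviewer's suggestion would look roughly like this (a sketch of the same step, with every substitution written in braced form):

```yaml
# Same presubmit step, with Cloud Build substitutions written as ${VAR}
# rather than $VAR, per the review comment above.
- name: 'gcr.io/cloud-builders/kubectl'
  id: 'dataproc-2.3-debian12-tests'
  waitFor: ['gcr-push']
  entrypoint: 'bash'
  args: ['cloudbuild/run-presubmit-on-k8s.sh', 'gcr.io/${PROJECT_ID}/init-actions-image:${BUILD_ID}', '${BUILD_ID}', '2.3-debian12']
  env:
    - 'COMMIT_SHA=${COMMIT_SHA}'
    - 'CLOUDSDK_COMPUTE_REGION=us-central1'
    - 'CLOUDSDK_CONTAINER_CLUSTER=init-actions-presubmit'
```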

env:
- 'COMMIT_SHA=$COMMIT_SHA'
- 'CLOUDSDK_COMPUTE_REGION=us-central1'
- 'CLOUDSDK_CONTAINER_CLUSTER=init-actions-presubmit'

# Run presubmit tests in parallel for 2.3 Rocky Linux image
- name: 'gcr.io/cloud-builders/kubectl'
id: 'dataproc-2.3-rocky9-tests'
waitFor: ['gcr-push']
entrypoint: 'bash'
args: ['cloudbuild/run-presubmit-on-k8s.sh', 'gcr.io/$PROJECT_ID/init-actions-image:$BUILD_ID', '$BUILD_ID', '2.3-rocky9']
env:
- 'COMMIT_SHA=$COMMIT_SHA'
- 'CLOUDSDK_COMPUTE_REGION=us-central1'
- 'CLOUDSDK_CONTAINER_CLUSTER=init-actions-presubmit'

# Run presubmit tests in parallel for 2.3 Ubuntu image
- name: 'gcr.io/cloud-builders/kubectl'
id: 'dataproc-2.3-ubuntu22-tests'
waitFor: ['gcr-push']
entrypoint: 'bash'
args: ['cloudbuild/run-presubmit-on-k8s.sh', 'gcr.io/$PROJECT_ID/init-actions-image:$BUILD_ID', '$BUILD_ID', '2.3-ubuntu22']
env:
- 'COMMIT_SHA=$COMMIT_SHA'
- 'CLOUDSDK_COMPUTE_REGION=us-central1'
- 'CLOUDSDK_CONTAINER_CLUSTER=init-actions-presubmit'

# Delete Docker image from GCR
- name: 'gcr.io/cloud-builders/gcloud'
args: ['container', 'images', 'delete', 'gcr.io/$PROJECT_ID/init-actions-image:$BUILD_ID']
125 changes: 109 additions & 16 deletions gpu/README.md
@@ -2,8 +2,8 @@

GPUs require special drivers and software which are not pre-installed on
[Dataproc](https://cloud.google.com/dataproc) clusters by default.
This initialization action installs GPU driver for NVIDIA GPUs on master and
worker nodes in a Dataproc cluster.
This initialization action installs GPU driver for NVIDIA GPUs on -m node(s) and
-w nodes in a Dataproc cluster.

## Default versions

Expand All @@ -15,6 +15,7 @@ Specifying a supported value for the `cuda-version` metadata variable
will select compatible values for Driver, cuDNN, and NCCL from the script's
internal matrix. Default CUDA versions are typically:

* Dataproc 1.5: `11.6.2`
> **Contributor comment (medium):** Adding Dataproc 1.5 CUDA version to the default CUDA versions list improves the documentation's completeness and helps users understand the supported versions for older Dataproc images.

* Dataproc 2.0: `12.1.1`
* Dataproc 2.1: `12.4.1`
* Dataproc 2.2 & 2.3: `12.6.3`
@@ -26,10 +27,12 @@ Refer to internal arrays in `install_gpu_driver.sh` for the full matrix.)*

CUDA | Full Version | Driver | cuDNN | NCCL | Tested Dataproc Image Versions
-----| ------------ | --------- | --------- | -------| ---------------------------
11.8 | 11.8.0 | 525.147.05| 9.5.1.17 | 2.21.5 | 2.0, 2.1 (Debian/Ubuntu/Rocky); 2.2 (Ubuntu 22.04)
12.0 | 12.0.1 | 525.147.05| 8.8.1.3 | 2.16.5 | 2.0, 2.1 (Debian/Ubuntu/Rocky); 2.2 (Rocky 9, Ubuntu 22.04)
12.4 | 12.4.1 | 550.135 | 9.1.0.70 | 2.23.4 | 2.1 (Ubuntu 20.04, Rocky 8); Dataproc 2.2+
12.6 | 12.6.3 | 550.142 | 9.6.0.74 | 2.23.4 | 2.1 (Ubuntu 20.04, Rocky 8); Dataproc 2.2+
11.8 | 11.8.0 | 525.147.05| 9.5.1.17 | 2.21.5 | 2.0, 2.1 (Debian 10/11, Ubuntu); 2.2 (Debian 11, Ubuntu 22.04)
12.0 | 12.0.1 | 525.147.05| 8.8.1.3 | 2.16.5 | 2.0, 2.1 (Debian 10/11, Ubuntu); 2.2 (Debian 11/12, Rocky 9, Ubuntu 22.04)
12.4 | 12.4.1 | 550.135 | 9.1.0.70 | 2.23.4 | 2.1 (Debian 10/11, Ubuntu 20.04); 2.2+ (Debian 11/12, Rocky 9, Ubuntu 22.04)
12.6 | 12.6.3 | 550.142 | 9.6.0.74 | 2.23.4 | 2.1 (Debian 11, Ubuntu 20.04); 2.2+ (Debian 11/12, Rocky 9, Ubuntu 22.04)

*Note: Secure Boot is only supported on Dataproc 2.2+ images.*
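For example, pinning a CUDA release from the table at cluster-creation time might look like the following sketch. The cluster name, bucket path, and accelerator type are placeholders, and the accelerator flags are illustrative only; the `cuda-version` metadata is the knob documented above:

```bash
gcloud dataproc clusters create my-gpu-cluster \
    --region=us-central1 \
    --master-accelerator type=nvidia-tesla-t4 \
    --worker-accelerator type=nvidia-tesla-t4,count=1 \
    --initialization-actions=gs://my-bucket/gpu/install_gpu_driver.sh \
    --metadata=cuda-version=12.4.1
```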

**Supported Operating Systems:**

@@ -191,20 +194,19 @@ This script accepts the following metadata parameters:
* `cudnn-version`: (Optional) Specify cuDNN version (e.g., `8.9.7.29`).
* `nccl-version`: (Optional) Specify NCCL version.
* `include-pytorch`: (Optional) `yes`|`no`. Default: `no`.
If `yes`, installs PyTorch, TensorFlow, RAPIDS, and PySpark in a Conda
If `yes`, installs PyTorch, Numba, RAPIDS, and PySpark in a Conda
environment.
* `gpu-conda-env`: (Optional) Name for the PyTorch Conda environment.
Default: `dpgce`.
* `container-runtime`: (Optional) E.g., `docker`, `containerd`, `crio`.
For NVIDIA Container Toolkit configuration. Auto-detected if not specified.
* `http-proxy`: (Optional) URL of an HTTP proxy for downloads.
* `http-proxy`: (Optional) Proxy address and port for HTTP requests (e.g., `your-proxy.com:3128`).
* `https-proxy`: (Optional) Proxy address and port for HTTPS requests (e.g., `your-proxy.com:3128`). Defaults to `http-proxy` if not set.
* `proxy-uri`: (Optional) A single proxy URI for both HTTP and HTTPS. Overridden by `http-proxy` or `https-proxy` if they are set.
* `no-proxy`: (Optional) Comma or space-separated list of hosts/domains to bypass the proxy. Defaults include localhost, metadata server, and Google APIs. User-provided values are appended to the defaults.
* `http-proxy-pem-uri`: (Optional) A `gs://` path to the
PEM-encoded certificate file used by the proxy specified in
`http-proxy`. This is needed if the proxy uses TLS and its
certificate is not already trusted by the cluster's default trust
store (e.g., if it's a self-signed certificate or signed by an
internal CA). The script will install this certificate into the
system and Java trust stores.
PEM-encoded CA certificate file for the proxy specified in
`http-proxy`/`https-proxy`. Required if the proxy uses TLS with a certificate not in the default system trust store. This certificate will be added to the system, Java, and Conda trust stores, and proxy connections will use HTTPS.
* `invocation-type`: (For Custom Images) Set to `custom-images` by image
building tools. Not typically set by end-users creating clusters.
* **Secure Boot Signing Parameters:** Used if Secure Boot is enabled and
@@ -217,6 +219,20 @@ This script accepts the following metadata parameters:
modulus_md5sum=<md5sum-of-your-mok-key-modulus>
```

### Enhanced Proxy Support

This script includes robust support for environments requiring an HTTP/HTTPS proxy:

* **Configuration:** Use the `http-proxy`, `https-proxy`, or `proxy-uri` metadata to specify your proxy server (host:port).
* **Custom CA Certificates:** If your proxy uses a custom CA (e.g., self-signed), provide the CA certificate in PEM format via the `http-proxy-pem-uri` metadata (as a `gs://` path). The script will:
* Install the CA into the system trust store (`update-ca-certificates` or `update-ca-trust`).
* Add the CA to the Java cacerts trust store.
* Configure Conda to use the system trust store.
* Switch proxy communications to use HTTPS.
* **Tool Configuration:** The script automatically configures `curl`, `apt`, `dnf`, `gpg`, and Java to use the specified proxy settings and custom CA if provided.
* **Bypass:** The `no-proxy` metadata allows specifying hosts to bypass the proxy. Defaults include `localhost`, the metadata server, `.google.com`, and `.googleapis.com` to ensure essential services function correctly.
* **Verification:** The script performs connection tests to the proxy and attempts to reach external sites (google.com, nvidia.com) through the proxy to validate the configuration before proceeding with downloads.
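As an illustration of the bypass behavior, the default list plus user-provided entries could be assembled roughly as follows. This is a hypothetical sketch, not the script's actual code; the variable names and example hosts are invented:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of combining no-proxy defaults with user metadata.
# Defaults mirror the documentation above; USER_NO_PROXY stands in for
# the value of the 'no-proxy' cluster metadata (comma- or space-separated).
DEFAULT_NO_PROXY="localhost,127.0.0.1,metadata.google.internal,.google.com,.googleapis.com"
USER_NO_PROXY="repo.internal.example.com mirror.internal.example.com"

# Normalize separators to commas and append user entries to the defaults.
NO_PROXY="${DEFAULT_NO_PROXY},$(echo "${USER_NO_PROXY}" | tr ' ' ',')"
export NO_PROXY no_proxy="${NO_PROXY}"
echo "${NO_PROXY}"
```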
> **Contributor comment on lines +228 to +242 (medium):** The new 'Enhanced Proxy Support' section clearly outlines the capabilities and configuration options for proxy usage, including custom CA certificates and tool integration. This detailed explanation is highly beneficial for users operating in proxied environments.


### Loading Built Kernel Module & Secure Boot

When the script needs to build NVIDIA kernel modules from source (e.g., using
@@ -238,6 +254,82 @@ not suitable), special considerations apply if Secure Boot is enabled.
or `dmesg` output for errors like "Operation not permitted" or messages
related to signature verification failure.

## Building Custom Images with Secure Boot and Proxy Support

For environments requiring NVIDIA drivers to be signed for Secure Boot, especially when operating behind an HTTP/S proxy, you must first build a custom Dataproc image. This process uses tools from the [GoogleCloudDataproc/custom-images](https://github.com/GoogleCloudDataproc/custom-images) repository, specifically the scripts within the `examples/secure-boot/` directory.

**Base Image:** Typically Dataproc 2.2-debian12 or newer.

**Process Overview:**

1. **Clone `custom-images` Repository:**
```bash
git clone https://github.com/GoogleCloudDataproc/custom-images.git
cd custom-images
```

2. **Configure Build:** Set up `env.json` with your project, network, and bucket details. See the `examples/secure-boot/env.json.sample` in the `custom-images` repo.

3. **Prepare Signing Keys:** Ensure Secure Boot signing keys are available in GCP Secret Manager. Use `examples/secure-boot/create-key-pair.sh` from the `custom-images` repo to create/manage these.

4. **Build Docker Image:** Build the builder environment: `docker build -t dataproc-secure-boot-builder:latest .`

5. **Run Image Generation:** Use `generate_custom_image.py` within the Docker container, typically orchestrated by `examples/secure-boot/pre-init.sh`. The core customization script `examples/secure-boot/install_gpu_driver.sh` handles driver installation, proxy setup, and module signing.

* Refer to the [Secure Boot example documentation](https://github.com/GoogleCloudDataproc/custom-images/tree/master/examples/secure-boot) for detailed `docker run` commands and metadata requirements (proxy settings, secret names, etc.).
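For orientation, a direct invocation of the generator (outside Docker) might look roughly like the sketch below. The flag names are taken from the `custom-images` repository's documentation at time of writing and may differ; verify them against `python generate_custom_image.py --help`, and treat the image name, zone, bucket, and proxy host as placeholders:

```bash
python generate_custom_image.py \
    --image-name "dataproc-2-2-deb12-secure-boot" \
    --dataproc-version "2.2-debian12" \
    --customization-script "examples/secure-boot/install_gpu_driver.sh" \
    --zone "us-west4-a" \
    --gcs-bucket "gs://YOUR_STAGING_BUCKET" \
    --metadata "http-proxy=your-proxy.example.com:3128"
```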

### Launching a Cluster with the Secure Boot Custom Image

Once you have successfully built a custom image with signed drivers, you can create a Dataproc cluster with Secure Boot enabled.

**Important:** To launch a Dataproc cluster with the `--shielded-secure-boot` flag and have NVIDIA drivers function correctly, you MUST use a custom image created through the process detailed above. Standard Dataproc images do not contain the necessary signed modules.
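For instance, a minimal creation command using such an image might look like this sketch (project, image, region, and accelerator values are placeholders; consult the `gcloud dataproc clusters create` reference for the exact flag behavior):

```bash
gcloud dataproc clusters create secure-boot-cluster \
    --region=us-west4 \
    --image=projects/YOUR_GCP_PROJECT_ID/global/images/YOUR_BUILT_IMAGE_NAME \
    --shielded-secure-boot \
    --master-accelerator type=nvidia-tesla-t4 \
    --worker-accelerator type=nvidia-tesla-t4,count=1
```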

**Network and Cluster Setup:**

To create the cluster in a private network environment with a Secure Web Proxy, use the scripts from the [GoogleCloudDataproc/cloud-dataproc](https://github.com/GoogleCloudDataproc/cloud-dataproc) repository:

1. **Clone `cloud-dataproc` Repository:**
```bash
git clone https://github.com/GoogleCloudDataproc/cloud-dataproc.git
cd cloud-dataproc/gcloud
```

2. **Configure Environment:**
* Copy `env.json.sample` to `env.json`.
* Edit `env.json` with your project details, ensuring you specify the custom image name and any necessary proxy details if you intend to run in a private network. Example:
```json
{
"PROJECT_ID": "YOUR_GCP_PROJECT_ID",
"REGION": "us-west4",
"ZONE": "us-west4-a",
"BUCKET": "YOUR_STAGING_BUCKET",
"TEMP_BUCKET": "YOUR_TEMP_BUCKET",
"CUSTOM_IMAGE_NAME": "YOUR_BUILT_IMAGE_NAME",
"PURPOSE": "secure-boot-cluster",
// Add these for a private, proxied environment
      // Contributor comment: "may not be valid"
"PRIVATE_RANGE": "10.43.79.0/24",
"SWP_RANGE": "10.44.79.0/24",
"SWP_IP": "10.43.79.245",
"SWP_PORT": "3128",
"SWP_HOSTNAME": "swp.your-project.example.com"
// ... other variables as needed
}
```
* Set `CUSTOM_IMAGE_NAME` to the image you built in the `custom-images` process.

3. **Create the Private Environment and Cluster:**
This script sets up the VPC, subnets, Secure Web Proxy, and then creates the Dataproc cluster using the custom image. The `--shielded-secure-boot` flag is handled internally by the scripts when a `CUSTOM_IMAGE_NAME` is provided. Make sure you are in the `cloud-dataproc/gcloud` directory (from step 1) before running the command.
> **Contributor comment:** Add a note: "make sure that you are in the right directory ...

```bash
bash bin/create-dpgce-private
```

**Verification:**

1. SSH into the -m node of the created cluster.
2. Check driver status: `sudo nvidia-smi`
3. Verify module signature: `sudo modinfo nvidia | grep signer` (should show your custom CA).
4. Check for errors: `dmesg | grep -iE "Secure Boot|NVRM|nvidia"`

### Verification

1. Once the cluster has been created, you can access the Dataproc cluster and
@@ -280,6 +372,7 @@ handles metric creation and reporting.
* **Installation Failures:** Examine the initialization action log on the
affected node, typically `/var/log/dataproc-initialization-script-0.log`
(or a similar name if multiple init actions are used).
* **Network/Proxy Issues:** If using a proxy, double-check the `http-proxy`, `https-proxy`, `proxy-uri`, `no-proxy`, and `http-proxy-pem-uri` metadata settings. Ensure the proxy allows access to NVIDIA domains, GitHub, and package repositories. Check the init action log for curl errors or proxy test failures.
> **Contributor comment (medium):** Adding a specific troubleshooting entry for 'Network/Proxy Issues' is a valuable addition. It guides users directly to relevant metadata settings and log checks, which will significantly aid in diagnosing connectivity problems.

* **GPU Agent Issues:** If the agent was installed (`install-gpu-agent=true`),
check its service logs using `sudo journalctl -u gpu-utilization-agent.service`.
* **Driver Load or Secure Boot Problems:** Review `dmesg` output and
@@ -298,7 +391,7 @@ handles metric creation and reporting.
* The script extensively caches downloaded artifacts (drivers, CUDA `.run`
files) and compiled components (kernel modules, NCCL, Conda environments)
to a GCS bucket. This bucket is typically specified by the
`dataproc-temp-bucket` cluster property or metadata.
`dataproc-temp-bucket` cluster property or metadata. Downloads and cache operations are proxy-aware.
> **Contributor comment (medium):** Clarifying that 'Downloads and cache operations are proxy-aware' in the 'Performance & Caching' section is important. It assures users that the caching mechanism will function correctly even when a proxy is configured.

* **First Run / Cache Warming:** Initial runs on new configurations (OS,
kernel, or driver version combinations) that require source compilation
(e.g., for NCCL or kernel modules when no pre-compiled version is
@@ -324,4 +417,4 @@ handles metric creation and reporting.
Debian-based systems, including handling of archived backports repositories
to ensure dependencies can be met.
* Tested primarily with Dataproc 2.0+ images. Support for older Dataproc
1.5 images is limited.