feat: Enhanced Secure Boot GPU Image Building with Proxy Support #120
Draft
cjac wants to merge 10 commits into GoogleCloudDataproc:main
…Images
This commit introduces significant enhancements to the custom image building framework, primarily focused on supporting environments requiring HTTP/HTTPS proxy egress. This is critical for enterprise use cases with strict network policies. Additionally, it improves the robustness and reusability of the Secure Boot image generation process.
Key Changes:
1. **Integrated Proxy Setup:**
- Introduced `startup_script/gce-proxy-setup.sh` to handle system-wide proxy configuration on the builder VM based on instance metadata (`http-proxy`, `no-proxy`, `http-proxy-pem-uri`). This includes settings for apt/dnf, GPG, Java, and Conda.
- `custom_image_utils/shell_script_generator.py` now conditionally uploads `gce-proxy-setup.sh` only if `http-proxy` metadata is present.
- `startup_script/run.sh` executes `gce-proxy-setup.sh` before the user's customization script if proxy metadata is provided.
- `gce-proxy-setup.sh` is designed to be idempotent.
2. **Refactored Host vs. Container Setup:**
- `examples/secure-boot/build-and-run-podman.sh` now handles all host-side operations: sourcing environment, service account creation/validation, IAM bindings, and SA key generation (`key.json`).
- `examples/secure-boot/build-current-images.sh` now runs entirely within the container, consuming the mounted `key.json` via `GOOGLE_APPLICATION_CREDENTIALS`.
- Removed `gcloud config set` calls from scripts run inside the container, relying on the activated SA and per-command `--project` flags where needed.
3. **Improved Build Script Logic:**
- `examples/secure-boot/pre-init.sh` now uses a unique temporary directory per image version (e.g., `/tmp/2.1-debian11`) to prevent conflicts during concurrent builds in the screen session.
- Added `--project-id` to all `generate_custom_image.py` calls in `pre-init.sh`.
- `custom_image_utils/shell_script_generator.py` now includes `--project={project_id}` in more `gcloud` calls within the generated workflow script.
- Enhanced `examples/secure-boot/create-key-pair.sh` for more robust Secure Boot key handling and secret management.
- Added `VmDnsSetting=ZonalOnly` to instance metadata to address DNS warnings.
4. **New Base Test Script:**
- Added `examples/secure-boot/no-customization.sh` to test the creation of base secure boot images without further customizations, including disk usage logging.
These changes provide a more reliable and flexible framework for building Dataproc custom images, especially for users in environments with network proxies and Secure Boot requirements.
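The conditional upload described in item 1 could look roughly like the following. This is a minimal sketch, not the code in `custom_image_utils/shell_script_generator.py`: the helper names `parse_metadata` and `startup_sources`, and the `key=value,key=value` metadata format, are assumptions made for illustration.

```python
from typing import Dict, List


def parse_metadata(flag_value: str) -> Dict[str, str]:
    """Parse a 'key=value,key=value' string into a dict (assumed format)."""
    result: Dict[str, str] = {}
    for item in flag_value.split(","):
        if not item.strip():
            continue  # ignore empty segments such as a trailing comma
        key, _, value = item.partition("=")
        result[key.strip()] = value.strip()
    return result


def startup_sources(metadata: Dict[str, str]) -> List[str]:
    """List the startup scripts to upload to the builder VM."""
    sources = ["startup_script/run.sh"]
    # The proxy script is only needed when http-proxy metadata is present.
    if metadata.get("http-proxy"):
        sources.append("startup_script/gce-proxy-setup.sh")
    return sources


if __name__ == "__main__":
    print(startup_sources(parse_metadata("http-proxy=10.0.0.2:3128")))
```

Keeping the decision in one small function means builds without a proxy upload nothing extra, which matches the "only if `http-proxy` metadata is present" behavior described above.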
This commit refactors how GPG keys for external repositories are imported within the `install_gpu_driver.sh` script. A new function, `import_gpg_keys`, is introduced in `install_gpu_driver.sh` to provide a consistent method for fetching keys from URLs or keyservers, handling potential proxy configurations, and importing them into specified keyring files.
Key Changes:
- **New `import_gpg_keys` Function:** Added a robust function to download and import GPG keys, supporting both `--key-url` and `--key-id` arguments, with keyserver fallback and basic proxy awareness for `curl`.
- **Updated Repository Setup:** All functions responsible for adding package repositories (e.g., `add_repo_nvidia_container_toolkit`, `add_repo_cuda`, `clean_up_sources_lists`) have been updated to use the new `import_gpg_keys` function, simplifying and standardizing key management.
- **Conda Package Order:** Minor reordering of packages in the `conda_pkg_list` for Debian 10 in the `install_pytorch` function.
- **Indentation Cleanup:** Fixed minor indentation in the `set_proxy` `default_no_proxy_list`.
This refactoring improves the clarity, maintainability, and robustness of GPG key handling during the GPU driver and related software installation process.
This commit refactors the `set_proxy` function to provide more granular and flexible control over HTTP and HTTPS proxy configurations based on instance metadata.
**Key Enhancements:**
1. **Attribute Prioritization:** The function now reads and respects the following metadata attributes in order of specificity:
* `http-proxy`: For setting HTTP_PROXY.
* `https-proxy`: For setting HTTPS_PROXY.
* `proxy-uri`: As a fallback for either HTTP_PROXY or HTTPS_PROXY if the specific attributes are not set.
2. **Independent Configuration:** HTTP and HTTPS proxies can now be configured to different endpoints if both `http-proxy` and `https-proxy` are provided.
3. **Conditional Environment Variables:** The `HTTP_PROXY` and `HTTPS_PROXY` environment variables (and their lowercase counterparts) are only set if a corresponding value is derived from the metadata. They are unset otherwise.
4. **Clean `/etc/environment` Updates:** Existing proxy-related lines in `/etc/environment` are now removed before new ones are added, preventing duplicates.
5. **Tool Configuration:**
* Package manager (apt/dnf) proxy settings are based on the first available value from `http-proxy` or `https-proxy` metadata.
* GnuPG's `dirmngr.conf` is only configured with an `http-proxy` if the `http-proxy` or `proxy-uri` metadata is provided.
6. **Dynamic Scheme Change:** When `http-proxy-pem-uri` is provided and the certificate is processed, the function updates the relevant configurations (environment variables, package manager settings, dirmngr) to use the `https://` scheme for the proxy connections.
This refined logic allows users to precisely define their proxy setup, accommodating environments with distinct proxies for HTTP and HTTPS traffic, while maintaining backward compatibility with the single `proxy-uri` attribute.
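The attribute prioritization above can be sketched in Python, purely as an illustration of the fallback rules; the real `set_proxy` is a shell function in `install_gpu_driver.sh`. The function name `resolve_proxy_env` is hypothetical, and the sketch assumes metadata values are given as `host:port` without a scheme.

```python
from typing import Dict


def resolve_proxy_env(metadata: Dict[str, str]) -> Dict[str, str]:
    """Return the proxy environment variables implied by instance metadata."""
    # proxy-uri is only a fallback for whichever specific attribute is missing.
    fallback = metadata.get("proxy-uri") or None
    http_value = metadata.get("http-proxy") or fallback
    https_value = metadata.get("https-proxy") or fallback

    # Assumption: a provided PEM certificate URI switches the scheme to https://.
    scheme = "https" if metadata.get("http-proxy-pem-uri") else "http"

    env: Dict[str, str] = {}
    for name, value in (("HTTP_PROXY", http_value), ("HTTPS_PROXY", https_value)):
        if not value:
            continue  # nothing derived from metadata: leave the variable unset
        url = f"{scheme}://{value}"
        env[name] = url
        env[name.lower()] = url  # lowercase counterpart
    return env


if __name__ == "__main__":
    print(resolve_proxy_env({"proxy-uri": "10.0.0.2:3128"}))
```

With only `proxy-uri` set, both variables resolve to the same endpoint, which is the backward-compatible case; supplying `http-proxy` and `https-proxy` together yields two independent endpoints.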
7827ba4 to c439451
c439451 to 60b1359
60b1359 to c5b639d
@google-ai-code-reviewer please review
c5b639d to 7e02628
This Pull Request introduces comprehensive updates to the Dataproc custom image build process, focusing on robust support for Secure Boot with NVIDIA GPUs, especially in environments requiring all egress traffic to pass through an HTTP/S proxy.
## Key Changes:
1. **README.md Overhaul:**
* The `examples/secure-boot/README.md` is completely revamped to provide a comprehensive guide. This includes features, prerequisites, detailed configuration steps for `env.json`, and clear examples for building images both with and without proxy configurations. It also adds sections on usage, verification, key scripts, and a reference to the `cloud-dataproc` repository for environment setup.
2. **Podman Build Orchestration (`examples/secure-boot/build-and-run-podman.sh`):**
* A new wrapper script, `build-and-run-podman.sh`, has been introduced to streamline the build process using Podman. This script automates:
* Sourcing environment variables from `lib/env.sh`.
* Service Account configuration and IAM role binding.
* Generation of a local service account key (`key.json`) for use within the container.
* Invocation of `examples/secure-boot/create-key-pair.sh` to manage Secure Boot keys.
* Building the container image using `podman build`.
* Running the `pre-init.sh` script within the Podman container with appropriate volume mounts and environment variables.
* Replaces previous Docker-based examples.
3. **Refactored Libraries (`examples/secure-boot/lib`):**
* Created `env.sh` to load and validate environment variables from a central `env.json`.
* Created `util.sh` to house common shell utility functions for status logging, colored output, and wrapping `gcloud` calls.
* Scripts like `build-current-images.sh`, `create-key-pair.sh`, and `pre-init.sh` have been updated to source these new libraries.
4. **Unified `env.json` Configuration:**
* `examples/secure-boot/env.json.sample` has been updated to include all necessary parameters for both the image build and the network/proxy setup (compatible with `cloud-dataproc` scripts), such as `REGION`, `SWP_IP`, `SWP_PORT`, etc. This allows for a single configuration file.
5. **Improved `examples/secure-boot/pre-init.sh`:**
* Now sources `lib/env.sh` and `lib/util.sh`.
* Uses `python3` consistently.
* Dynamically constructs the `--metadata` flag for `generate_custom_image.py` calls to include proxy settings (`http-proxy`, `http-proxy-pem-uri`) based on variables like `SWP_IP`, `SWP_PORT`, and `PROXY_CERT_GCS_PATH` loaded from `env.json`.
6. **New `startup_script/gce-proxy-setup.sh`:**
* This script is now available to be included in the build VM to configure system-wide proxy settings, package managers, GPG, Conda, and Java, based on metadata.
7. **`custom_image_utils/shell_script_generator.py` Updates:**
* Conditionally includes `gce-proxy-setup.sh` in the build sources if proxy metadata is detected.
* Secure Boot signing key metadata is now added in the Python script by calling `examples/secure-boot/create-key-pair.sh` on the host system where the generator script is executed, rather than within the build VM script.
* Minor cleanups and improved result checking in the generated script.
8. **Synchronization with `initialization-actions`:**
* The `examples/secure-boot/install_gpu_driver.sh` has been heavily updated to synchronize it with the latest version in the `GoogleCloudDataproc/initialization-actions` repository (specifically the `gpu` directory). This brings in substantial improvements for proxy handling, GPG key fetching, Conda/Mamba usage, and driver compilation.
These changes aim to provide a more streamlined, configurable, and robust solution for building Dataproc GPU images compatible with Secure Boot in complex network environments with HTTP/S proxies.
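Item 5 describes building the `--metadata` flag from values in `env.json`. A rough Python equivalent of that logic is shown below as a sketch only; `pre-init.sh` itself is a shell script. The function name `build_metadata_flag` and the `ip:port` formatting of the `http-proxy` value are assumptions.

```python
from typing import List, Mapping


def build_metadata_flag(env: Mapping[str, str]) -> str:
    """Build a --metadata flag carrying proxy settings, or '' without a proxy."""
    pairs: List[str] = []
    ip, port = env.get("SWP_IP"), env.get("SWP_PORT")
    if ip and port:
        pairs.append(f"http-proxy={ip}:{port}")
        cert = env.get("PROXY_CERT_GCS_PATH")
        if cert:
            # The certificate is only meaningful when a proxy is configured.
            pairs.append(f"http-proxy-pem-uri={cert}")
    if not pairs:
        return ""
    return "--metadata=" + ",".join(pairs)


if __name__ == "__main__":
    sample = {"SWP_IP": "10.0.0.2", "SWP_PORT": "3128"}
    print(build_metadata_flag(sample))
```

Returning an empty string when no proxy variables are set lets the same call site serve both the proxy and non-proxy builds.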
7e02628 to d8c160d
- Uncommented the build steps for the `secure-proxy` image in `pre-init.sh`.
- Added build steps for a new `proxy-tf` image based on `secure-proxy` to ensure proxy settings are baked in.
- Increased default disk size for `2.3-debian12` to 50GB for `tf` builds.
- Updated `examples/secure-boot/README.md` to:
- Clarify the different image layers (`secure-boot`, `secure-proxy`, `tf`, `proxy-tf`).
- Explicitly state the prerequisite of running `bin/create-dpgce-private` from the `cloud-dataproc` repo for proxy environments.
- Guide users to use the `build-and-run-podman.sh` script.
- Provide clear cluster creation examples for both `-proxy-tf` (baked-in proxy) and `-tf` (runtime proxy metadata) images.
…ge builds
This change prevents a malformed /etc/boto.cfg file in custom images by ensuring the gcloud core/universe_domain property is set correctly within the image build VM. The issue stemmed from the base image's agent startup scripts inadvertently adding TPC-specific settings in non-TPC environments when the universe domain was not explicitly configured.
This fix introduces a --universe-domain argument to the generate_custom_image.py script (defaulting to googleapis.com) and passes this value as metadata to the builder VM. The run.sh startup script now reads this metadata and explicitly sets the core/universe_domain gcloud property, preventing the boto.cfg corruption.
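The argument handling described in this commit can be sketched with `argparse`. This is an illustrative sketch rather than the code in `generate_custom_image.py`: the helper names are hypothetical, and the metadata key `universe-domain` is an assumption about how the value reaches the builder VM.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with the --universe-domain argument and its default."""
    parser = argparse.ArgumentParser(description="custom image build (sketch)")
    parser.add_argument(
        "--universe-domain",
        default="googleapis.com",
        help="Universe domain to configure on the builder VM.",
    )
    return parser


def universe_metadata(argv: Optional[List[str]] = None) -> str:
    """Return the metadata entry that would be passed to the builder VM."""
    args = build_parser().parse_args(argv)
    # Assumption: the metadata key is named 'universe-domain'.
    return f"universe-domain={args.universe_domain}"


if __name__ == "__main__":
    print(universe_metadata([]))
```

Because the default is always present, the startup script can set `core/universe_domain` unconditionally instead of guessing when the value is missing.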
…2.2+
This change updates the custom image creation pipeline to enable Shielded
Secure Boot for the builder VM by default when the Dataproc version is 2.2
or newer.
Key changes:
- In `custom_image_utils/shell_script_generator.py`, the `generate` method
now parses the `dataproc_version` to determine the major and minor version.
- For Dataproc 2.2+, the `--shielded-secure-boot` flag is added to the
`gcloud compute instances create` command unless `trusted_cert` is
explicitly set to an empty string.
- This ensures that builder VMs are created with Secure Boot enabled for
modern images, facilitating the signing of kernel modules (e.g., GPU drivers)
in a secure environment.
- Added `import re` to support version parsing logic.
- Added `{shielded_secure_boot_flag}` placeholder to the shell script template.
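The version check described above might look like the following. It is a minimal sketch under stated assumptions: the function name `shielded_secure_boot_flag` and the regular expression are illustrative and are not taken from `shell_script_generator.py`.

```python
import re
from typing import Optional


def shielded_secure_boot_flag(dataproc_version: str,
                              trusted_cert: Optional[str] = None) -> str:
    """Return '--shielded-secure-boot' for Dataproc 2.2+, else ''."""
    match = re.match(r"(\d+)\.(\d+)", dataproc_version)
    if match is None:
        return ""  # unparseable version: do not add the flag
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison handles versions such as 2.10 correctly.
    if (major, minor) < (2, 2):
        return ""
    # An explicitly empty trusted_cert opts out of Secure Boot.
    if trusted_cert == "":
        return ""
    return "--shielded-secure-boot"


if __name__ == "__main__":
    for version in ("2.1-debian11", "2.2-debian12", "2.3-debian12"):
        print(version, repr(shielded_secure_boot_flag(version)))
```

Comparing `(major, minor)` as integers avoids the string-comparison trap where `"2.10"` would sort before `"2.2"`.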
- Patch /usr/local/share/google/dataproc/bdutil/bdutil_universe.sh on the image disk to correctly resolve universe_domain, preventing boto.cfg corruption on subsequent cluster boots.
- Clean up any corrupted /etc/boto.cfg generated during the build.
- Enable --shielded-secure-boot for builder VMs on Dataproc 2.2+ by default.
In proxy-egress environments, the proxy must be configured before attempting to install optional components via package managers (like apt for Docker). This change extracts proxy setup into its own function and executes it immediately after downloading the scripts, ensuring all subsequent network calls are routed correctly.
PR Summary: Secure Boot Image Build Enhancements and Proxy Support
This Pull Request introduces comprehensive updates to the Dataproc custom image build process, focusing on robust support for Secure Boot with NVIDIA GPUs, especially in environments requiring all egress traffic to pass through an HTTP/S proxy.
## Key Changes:
1. **README.md Overhaul:**
* `examples/secure-boot/README.md` has been completely rewritten to provide a comprehensive guide. This includes features, prerequisites, detailed configuration steps for `env.json`, and clear examples for building images both with and without proxy configurations. It also adds sections on usage, verification, key scripts, and a reference to the `cloud-dataproc` repository for environment setup.
2. **Podman Build Orchestration (`examples/secure-boot/build-and-run-podman.sh`):**
* A new wrapper script, `build-and-run-podman.sh`, has been introduced to streamline the build process using Podman. This script automates:
* Generation of a local service account key (`key.json`) for use within the container.
* Invocation of `examples/secure-boot/create-key-pair.sh` to manage Secure Boot keys.
* Running the `pre-init.sh` script within the Podman container, mounting necessary volumes and environment variables.
3. **Refactored Libraries (`examples/secure-boot/lib`):**
* `lib/env.sh`: Handles loading and validation of settings from the unified `env.json` file.
* `lib/util.sh`: Provides utilities for colored status messages, `gcloud` command execution with logging, and retry mechanisms.
* Scripts like `build-current-images.sh`, `create-key-pair.sh`, and `pre-init.sh` have been updated to source these new libraries.
4. **Unified `env.json` Configuration:**
* `examples/secure-boot/env.json.sample` has been updated to include all necessary parameters for both the image build and the network/proxy setup (compatible with `cloud-dataproc` scripts), such as `REGION`, `SWP_IP`, `SWP_PORT`, etc. This allows for a single configuration file.
5. **Improved `examples/secure-boot/pre-init.sh`:**
* Now sources `lib/env.sh` and `lib/util.sh`.
* Dynamically constructs the `--metadata` flag for `generate_custom_image.py` calls to include proxy settings (`http-proxy`, `http-proxy-pem-uri`) based on variables like `SWP_IP`, `SWP_PORT`, and `PROXY_CERT_GCS_PATH` loaded from `env.json`.
6. **New `startup_script/gce-proxy-setup.sh`:**
7. **`custom_image_utils/shell_script_generator.py` Updates:**
* Conditionally includes `gce-proxy-setup.sh` in the build sources if proxy metadata is detected.
* Secure Boot signing key metadata is now added in the Python script by calling `examples/secure-boot/create-key-pair.sh` on the host system where the generator script is executed, rather than within the build VM script.
8. **Synchronization with `initialization-actions`:**
* `examples/secure-boot/install_gpu_driver.sh` has been updated to align with the latest version in the `GoogleCloudDataproc/initialization-actions` repository, incorporating numerous fixes and enhancements for proxy handling, GPG key fetching, Conda/Mamba usage, and driver compilation.
These changes aim to provide a more streamlined, configurable, and robust solution for building Dataproc GPU images compatible with Secure Boot in complex network environments with HTTP/S proxies.