feat: Enhanced Secure Boot GPU Image Building with Proxy Support #120

Draft
cjac wants to merge 10 commits into GoogleCloudDataproc:main from LLC-Technologies-Collier:proxy-exercise-2025-11

cjac commented Nov 10, 2025

PR Summary: Secure Boot Image Build Enhancements and Proxy Support

This Pull Request introduces comprehensive updates to the Dataproc custom image build process, focusing on robust support for Secure Boot with NVIDIA GPUs, especially in environments requiring all egress traffic to pass through an HTTP/S proxy.

Key Changes:

  1. README.md Overhaul:

    • The examples/secure-boot/README.md has been completely rewritten to provide a comprehensive guide. This includes features, prerequisites, detailed configuration steps for env.json, and clear examples for building images both with and without proxy configurations. It also adds sections on usage, verification, key scripts, and a reference to the cloud-dataproc repository for environment setup.
  2. Podman Build Orchestration (examples/secure-boot/build-and-run-podman.sh):

    • A new wrapper script, build-and-run-podman.sh, has been introduced to streamline the build process using Podman. This script automates:
      • Service Account configuration and IAM role binding.
      • Generation of a local service account key (key.json) for use within the container.
      • Invocation of examples/secure-boot/create-key-pair.sh to manage Secure Boot keys.
      • Building the Dockerfile image.
      • Running the pre-init.sh script within the Podman container, mounting necessary volumes and environment variables.
  3. Refactored Libraries (examples/secure-boot/lib):

    • Common shell functions have been extracted into:
      • lib/env.sh: Handles loading and validation of settings from the unified env.json file.
      • lib/util.sh: Provides utilities for colored status messages, gcloud command execution with logging, and retry mechanisms.
    • Scripts like build-current-images.sh, create-key-pair.sh, and pre-init.sh have been updated to source these new libraries.
  4. Unified env.json Configuration:

    • examples/secure-boot/env.json.sample has been updated to include all necessary parameters for both the image build and the network/proxy setup (compatible with cloud-dataproc scripts), such as REGION, SWP_IP, SWP_PORT, etc. This allows for a single configuration file.
  5. Improved examples/secure-boot/pre-init.sh:

    • Now sources lib/env.sh and lib/util.sh.
    • Dynamically constructs the --metadata flag for generate_custom_image.py calls to include proxy settings (http-proxy, http-proxy-pem-uri) based on variables like SWP_IP, SWP_PORT, and PROXY_CERT_GCS_PATH loaded from env.json.
  6. New startup_script/gce-proxy-setup.sh:

    • This script is now available to be included in the build VM to configure system-wide proxy settings, package managers, GPG, Conda, and Java, based on metadata.
  7. custom_image_utils/shell_script_generator.py Updates:

    • Conditionally includes gce-proxy-setup.sh in the build sources if proxy metadata is detected.
    • Secure Boot signing key metadata is now added in the Python script by calling examples/secure-boot/create-key-pair.sh on the host system where the generator script is executed, rather than within the build VM script.
    • Minor cleanups and improved result checking in the generated script.
  8. Synchronization with initialization-actions:

    • The examples/secure-boot/install_gpu_driver.sh has been updated to align with the latest version in the GoogleCloudDataproc/initialization-actions repository, incorporating numerous fixes and enhancements for proxy handling, GPG key fetching, Conda/Mamba usage, and driver compilation.

These changes aim to provide a more streamlined, configurable, and robust solution for building Dataproc GPU images compatible with Secure Boot in complex network environments with HTTP/S proxies.
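
The `--metadata` construction described in item 5 can be sketched as follows. This is a minimal Python illustration, not the shell code in `pre-init.sh`: the function name `build_proxy_metadata` is hypothetical, the `ip:port` value format is an assumption, and the sample values are placeholders.

```python
import json


def build_proxy_metadata(env: dict) -> str:
    """Return a comma-separated metadata string, or '' when no proxy is configured.

    Assumption: the proxy value is written as "<SWP_IP>:<SWP_PORT>"; the real
    script may use a different format.
    """
    pairs = []
    ip, port = env.get("SWP_IP"), env.get("SWP_PORT")
    if ip and port:
        pairs.append(f"http-proxy={ip}:{port}")
    cert = env.get("PROXY_CERT_GCS_PATH")
    # Only pass the certificate when a proxy is actually configured.
    if cert and pairs:
        pairs.append(f"http-proxy-pem-uri={cert}")
    return ",".join(pairs)


# Placeholder values standing in for the contents of env.json.
sample_env = json.loads(
    '{"REGION": "us-west4", "SWP_IP": "10.42.0.2", "SWP_PORT": "3128",'
    ' "PROXY_CERT_GCS_PATH": "gs://example-bucket/proxy.pem"}'
)
metadata = build_proxy_metadata(sample_env)
# The flag is omitted entirely when there is nothing to pass.
extra_args = ["--metadata", metadata] if metadata else []
```

Leaving the flag out when no proxy variables are set keeps the non-proxy build path unchanged.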

…Images

This commit introduces significant enhancements to the custom image building framework, primarily focused on supporting environments requiring HTTP/HTTPS proxy egress. This is critical for enterprise use cases with strict network policies. Additionally, it improves the robustness and reusability of the Secure Boot image generation process.

Key Changes:

1.  **Integrated Proxy Setup:**
    -   Introduced `startup_script/gce-proxy-setup.sh` to handle system-wide proxy configuration on the builder VM based on instance metadata (`http-proxy`, `no-proxy`, `http-proxy-pem-uri`). This includes settings for apt/dnf, GPG, Java, and Conda.
    -   `custom_image_utils/shell_script_generator.py` now conditionally uploads `gce-proxy-setup.sh` only if `http-proxy` metadata is present.
    -   `startup_script/run.sh` executes `gce-proxy-setup.sh` before the user's customization script if proxy metadata is provided.
    -   `gce-proxy-setup.sh` is designed to be idempotent.

2.  **Refactored Host vs. Container Setup:**
    -   `examples/secure-boot/build-and-run-podman.sh` now handles all host-side operations: sourcing environment, service account creation/validation, IAM bindings, and SA key generation (`key.json`).
    -   `examples/secure-boot/build-current-images.sh` now runs entirely within the container, consuming the mounted `key.json` via `GOOGLE_APPLICATION_CREDENTIALS`.
    -   Removed `gcloud config set` calls from scripts run inside the container, relying on the activated SA and per-command `--project` flags where needed.

3.  **Improved Build Script Logic:**
    -   `examples/secure-boot/pre-init.sh` now uses a unique temporary directory per image version (e.g., `/tmp/2.1-debian11`) to prevent conflicts during concurrent builds in the screen session.
    -   Added `--project-id` to all `generate_custom_image.py` calls in `pre-init.sh`.
    -   `custom_image_utils/shell_script_generator.py` now includes `--project={project_id}` in more `gcloud` calls within the generated workflow script.
    -   Enhanced `examples/secure-boot/create-key-pair.sh` for more robust Secure Boot key handling and secret management.
    -   Added `VmDnsSetting=ZonalOnly` to instance metadata to address DNS warnings.

4.  **New Base Test Script:**
    -   Added `examples/secure-boot/no-customization.sh` to test the creation of base secure boot images without further customizations, including disk usage logging.

These changes provide a more reliable and flexible framework for building Dataproc custom images, especially for users in environments with network proxies and Secure Boot requirements.
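
The conditional upload of `gce-proxy-setup.sh` described under "Integrated Proxy Setup" might look roughly like the sketch below. The helper names `parse_metadata` and `build_sources` are hypothetical and are not taken from `shell_script_generator.py`; the sketch only assumes that metadata arrives as a `key=value,key=value` string.

```python
PROXY_SETUP_SCRIPT = "startup_script/gce-proxy-setup.sh"


def parse_metadata(metadata: str) -> dict:
    """Split 'k1=v1,k2=v2' into a dict; values may themselves contain '='."""
    pairs = (item.split("=", 1) for item in metadata.split(",") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def build_sources(metadata: str, base_sources: list) -> list:
    """Return the files to upload, adding the proxy script only when needed."""
    sources = list(base_sources)  # copy so the caller's list is not mutated
    if parse_metadata(metadata).get("http-proxy"):
        sources.append(PROXY_SETUP_SCRIPT)
    return sources
```

With this shape, builds that pass no `http-proxy` metadata upload exactly the same files as before.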
cjac self-assigned this Nov 10, 2025
cjac added 2 commits November 20, 2025 23:13
This commit refactors how GPG keys for external repositories are imported within the `install_gpu_driver.sh` script. A new function, `import_gpg_keys`, is introduced in `install_gpu_driver.sh` to provide a consistent method for fetching keys from URLs or keyservers, handling potential proxy configurations, and importing them into specified keyring files.

Key Changes:

-   **New `import_gpg_keys` Function:** Added a robust function to download and import GPG keys, supporting both `--key-url` and `--key-id` arguments, with keyserver fallback and basic proxy awareness for `curl`.
-   **Updated Repository Setup:** All functions responsible for adding package repositories (e.g., `add_repo_nvidia_container_toolkit`, `add_repo_cuda`, `clean_up_sources_lists`) have been updated to use the new `import_gpg_keys` function, simplifying and standardizing key management.
-   **Conda Package Order:** Minor reordering of packages in the `conda_pkg_list` for Debian 10 in `install_pytorch` function.
-   **Indentation Cleanup:** Fixed minor indentation in `set_proxy` default_no_proxy_list.

This refactoring improves the clarity, maintainability, and robustness of GPG key handling during the GPU driver and related software installation process.
This commit refactors the `set_proxy` function to provide more granular and flexible control over HTTP and HTTPS proxy configurations based on instance metadata.

**Key Enhancements:**

1.  **Attribute Prioritization:** The function now reads and respects the following metadata attributes in order of specificity:
    *   `http-proxy`: For setting HTTP_PROXY.
    *   `https-proxy`: For setting HTTPS_PROXY.
    *   `proxy-uri`: As a fallback for either HTTP_PROXY or HTTPS_PROXY if the specific attributes are not set.

2.  **Independent Configuration:** HTTP and HTTPS proxies can now be configured to different endpoints if both `http-proxy` and `https-proxy` are provided.

3.  **Conditional Environment Variables:** The `HTTP_PROXY` and `HTTPS_PROXY` environment variables (and their lowercase counterparts) are only set if a corresponding value is derived from the metadata. They are unset otherwise.

4.  **Clean `/etc/environment` Updates:** Existing proxy-related lines in `/etc/environment` are now removed before new ones are added, preventing duplicates.

5.  **Tool Configuration:**
    *   Package manager (apt/dnf) proxy settings are based on the first available value from `http-proxy` or `https-proxy` metadata.
    *   GnuPG's `dirmngr.conf` is only configured with an `http-proxy` if the `http-proxy` or `proxy-uri` metadata is provided.

6.  **Dynamic Scheme Change:** When `http-proxy-pem-uri` is provided and the certificate is processed, the function updates the relevant configurations (environment variables, package manager settings, dirmngr) to use the `https://` scheme for the proxy connections.

This refined logic allows users to precisely define their proxy setup, accommodating environments with distinct proxies for HTTP and HTTPS traffic, while maintaining backward compatibility with the single `proxy-uri` attribute.
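
The attribute prioritization and the `/etc/environment` cleanup described above can be sketched in Python as follows. The real `set_proxy` is a shell function; the names `resolve_proxies` and `rewrite_environment` are hypothetical, and treating exactly these four variable names as "proxy-related" is an assumption.

```python
PROXY_KEYS = ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy")


def resolve_proxies(metadata: dict) -> dict:
    """Map metadata attributes to proxy environment variables.

    Specific attributes win; `proxy-uri` is only a fallback. A variable is
    absent from the result (that is, left unset) when no value can be derived.
    """
    fallback = metadata.get("proxy-uri")
    http = metadata.get("http-proxy") or fallback
    https = metadata.get("https-proxy") or fallback
    env = {}
    if http:
        env["HTTP_PROXY"] = env["http_proxy"] = http
    if https:
        env["HTTPS_PROXY"] = env["https_proxy"] = https
    return env


def rewrite_environment(lines: list, env: dict) -> list:
    """Drop existing proxy lines, then append the new ones, avoiding duplicates."""
    kept = [ln for ln in lines if ln.split("=", 1)[0].strip() not in PROXY_KEYS]
    return kept + [f"{key}={value}" for key, value in env.items()]
```

For example, metadata containing only `proxy-uri` yields the same endpoint for both HTTP and HTTPS, while metadata containing only `http-proxy` leaves the HTTPS variables unset.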
cjac force-pushed the proxy-exercise-2025-11 branch from 7827ba4 to c439451 on January 20, 2026 03:07
cjac force-pushed the proxy-exercise-2025-11 branch from c439451 to 60b1359 on January 27, 2026 04:56
cjac force-pushed the proxy-exercise-2025-11 branch from 60b1359 to c5b639d on January 27, 2026 06:07
cjac requested a review from vinayakumarb on January 27, 2026 18:28
cjac changed the title from "feat: Enhance Proxy Support and Build Infrastructure for Secure Boot …" to "feat: Enhanced Secure Boot GPU Image Building with Proxy Support" on Jan 27, 2026
cjac commented Jan 27, 2026

@google-ai-code-reviewer please review

cjac force-pushed the proxy-exercise-2025-11 branch from c5b639d to 7e02628 on January 28, 2026 02:47
This Pull Request introduces comprehensive updates to the Dataproc custom image build process, focusing on robust support for Secure Boot with NVIDIA GPUs, especially in environments requiring all egress traffic to pass through an HTTP/S proxy.

## Key Changes:

1.  **README.md Overhaul:**
    *   The `examples/secure-boot/README.md` is completely revamped to provide a comprehensive guide. This includes features, prerequisites, detailed configuration steps for `env.json`, and clear examples for building images both with and without proxy configurations. It also adds sections on usage, verification, key scripts, and a reference to the `cloud-dataproc` repository for environment setup.

2.  **Podman Build Orchestration (`examples/secure-boot/build-and-run-podman.sh`):**
    *   A new wrapper script, `build-and-run-podman.sh`, has been introduced to streamline the build process using Podman. This script automates:
        *   Sourcing environment variables from `lib/env.sh`.
        *   Service Account configuration and IAM role binding.
        *   Generation of a local service account key (`key.json`) for use within the container.
        *   Invocation of `examples/secure-boot/create-key-pair.sh` to manage Secure Boot keys.
        *   Building the container image using `podman build`.
        *   Running the `pre-init.sh` script within the Podman container with appropriate volume mounts and environment variables.
    *   Replaces previous Docker-based examples.

3.  **Refactored Libraries (`examples/secure-boot/lib`):**
    *   Created `env.sh` to load and validate environment variables from a central `env.json`.
    *   Created `util.sh` to house common shell utility functions for status logging, colored output, and wrapping `gcloud` calls.
    *   Scripts like `build-current-images.sh`, `create-key-pair.sh`, and `pre-init.sh` have been updated to source these new libraries.

4.  **Unified `env.json` Configuration:**
    *   `examples/secure-boot/env.json.sample` has been updated to include all necessary parameters for both the image build and the network/proxy setup (compatible with `cloud-dataproc` scripts), such as `REGION`, `SWP_IP`, `SWP_PORT`, etc. This allows for a single configuration file.

5.  **Improved `examples/secure-boot/pre-init.sh`:**
    *   Now sources `lib/env.sh` and `lib/util.sh`.
    *   Uses `python3` consistently.
    *   Dynamically constructs the `--metadata` flag for `generate_custom_image.py` calls to include proxy settings (`http-proxy`, `http-proxy-pem-uri`) based on variables like `SWP_IP`, `SWP_PORT`, and `PROXY_CERT_GCS_PATH` loaded from `env.json`.

6.  **New `startup_script/gce-proxy-setup.sh`:**
    *   This script is now available to be included in the build VM to configure system-wide proxy settings, package managers, GPG, Conda, and Java, based on metadata.

7.  **`custom_image_utils/shell_script_generator.py` Updates:**
    *   Conditionally includes `gce-proxy-setup.sh` in the build sources if proxy metadata is detected.
    *   Secure Boot signing key metadata is now added in the Python script by calling `examples/secure-boot/create-key-pair.sh` on the host system where the generator script is executed, rather than within the build VM script.
    *   Minor cleanups and improved result checking in the generated script.

8.  **Synchronization with `initialization-actions`:**
    *   The `examples/secure-boot/install_gpu_driver.sh` has been heavily updated to synchronize it with the latest version in the `GoogleCloudDataproc/initialization-actions` repository (specifically the `gpu` directory). This brings in substantial improvements for proxy handling, GPG key fetching, Conda/Mamba usage, and driver compilation.

These changes aim to provide a more streamlined, configurable, and robust solution for building Dataproc GPU images compatible with Secure Boot in complex network environments with HTTP/S proxies.
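
The "load and validate" role of `lib/env.sh` can be illustrated with a small Python sketch. This is not the library's code: the function name `load_env` is hypothetical, and which settings are actually required is an assumption (only `REGION` is checked by default here).

```python
import json


def load_env(path, required=("REGION",)):
    """Load env.json and fail early when a required setting is missing or empty.

    Assumption: the set of required keys is illustrative; the real env.sh may
    validate a different list.
    """
    with open(path, encoding="utf-8") as handle:
        env = json.load(handle)
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise ValueError(
            f"env.json is missing required settings: {', '.join(missing)}"
        )
    return env
```

A proxy build could call `load_env("env.json", required=("REGION", "SWP_IP", "SWP_PORT"))` so that a missing proxy setting is reported before any VM is created.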
cjac force-pushed the proxy-exercise-2025-11 branch from 7e02628 to d8c160d on January 28, 2026 02:48
- Uncommented the build steps for the `secure-proxy` image in `pre-init.sh`.
- Added build steps for a new `proxy-tf` image based on `secure-proxy` to ensure proxy settings are baked in.
- Increased default disk size for `2.3-debian12` to 50GB for `tf` builds.
- Updated `examples/secure-boot/README.md` to:
    - Clarify the different image layers (`secure-boot`, `secure-proxy`, `tf`, `proxy-tf`).
    - Explicitly state the prerequisite of running `bin/create-dpgce-private` from the `cloud-dataproc` repo for proxy environments.
    - Guide users to use the `build-and-run-podman.sh` script.
    - Provide clear cluster creation examples for both `-proxy-tf` (baked-in proxy) and `-tf` (runtime proxy metadata) images.
…ge builds

This change prevents a malformed /etc/boto.cfg file in custom images
by ensuring the gcloud core/universe_domain property is set correctly
within the image build VM.

The issue stemmed from the base image's agent startup scripts
inadvertently adding TPC-specific settings in non-TPC environments
when the universe domain was not explicitly configured.

This fix introduces a --universe-domain argument to the
generate_custom_image.py script (defaulting to googleapis.com) and
passes this value as metadata to the builder VM. The run.sh startup
script now reads this metadata and explicitly sets the
core/universe_domain gcloud property, preventing the boto.cfg
corruption.
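
A minimal sketch of the flow described in this commit message, under stated assumptions: the metadata key name `universe-domain` and the helper names are hypothetical, and the `gcloud config set` invocation is one plausible way for the startup script to set the property, not necessarily what `run.sh` does.

```python
import argparse


def parse_args(argv):
    """Parse the new flag, defaulting to the standard universe domain."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--universe-domain", default="googleapis.com")
    return parser.parse_args(argv)


def builder_metadata(args) -> dict:
    """Metadata handed to the builder VM (key name is an assumption)."""
    return {"universe-domain": args.universe_domain}


def set_universe_domain_command(metadata: dict) -> list:
    """Command the startup script could run to set the property explicitly."""
    domain = metadata.get("universe-domain", "googleapis.com")
    return ["gcloud", "config", "set", "core/universe_domain", domain]
```

Because the default is always passed through, the property is set explicitly even when the caller never mentions a universe domain, which is the condition the commit message identifies as triggering the corruption.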
cjac requested a review from Deependra-Patel on February 20, 2026 20:41
cjac added 4 commits February 21, 2026 03:24
…2.2+

This change updates the custom image creation pipeline to enable Shielded
Secure Boot for the builder VM by default when the Dataproc version is 2.2
or newer.

Key changes:
- In `custom_image_utils/shell_script_generator.py`, the `generate` method
  now parses the `dataproc_version` to determine the major and minor version.
- For Dataproc 2.2+, the `--shielded-secure-boot` flag is added to the
  `gcloud compute instances create` command unless `trusted_cert` is
  explicitly set to an empty string.
- This ensures that builder VMs are created with Secure Boot enabled for
  modern images, facilitating the signing of kernel modules (e.g., GPU drivers)
  in a secure environment.
- Added `import re` to support version parsing logic.
- Added `{shielded_secure_boot_flag}` placeholder to the shell script template.
- Patch /usr/local/share/google/dataproc/bdutil/bdutil_universe.sh on the
  image disk to correctly resolve universe_domain, preventing boto.cfg
  corruption on subsequent cluster boots.
- Clean up any corrupted /etc/boto.cfg generated during the build.
- Enable --shielded-secure-boot for builder VMs on Dataproc 2.2+ by default.
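
The version check described in the list above can be sketched as follows. The function name is hypothetical and the regular expression is an assumption about the version string format; the sketch covers only the 2.2+ default described here and does not model how older versions with a `trusted_cert` are handled.

```python
import re


def shielded_secure_boot_flag(dataproc_version: str, trusted_cert=None) -> str:
    """Return the gcloud flag for Dataproc 2.2+ builder VMs, else ''.

    An explicitly empty `trusted_cert` ("") opts out; None means "not set".
    """
    match = re.match(r"(\d+)\.(\d+)", dataproc_version)
    if not match:
        return ""
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison handles both 2.2+ and any later major version.
    if (major, minor) >= (2, 2) and trusted_cert != "":
        return "--shielded-secure-boot"
    return ""
```

Comparing `(major, minor)` as a tuple avoids the string-comparison pitfall where `"2.10"` would sort before `"2.2"`.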
In proxy-egress environments, the proxy must be configured before
attempting to install optional components via package managers (like apt
for Docker). This change extracts proxy setup into its own function and
executes it immediately after downloading the scripts, ensuring all
subsequent network calls are routed correctly.