Add GPU driver installer digest verification as part of confidential GPU driver installation flow by meetrajvala · Pull Request #565 · google/go-tpm-tools

meetrajvala · 2025-04-10T19:41:28Z

Changes:

This PR contains the following changes:

Adds a cos_gpu_installer image digest verification check before launching the installer container. The image reference is now read from a file stored in the OEM partition instead of being resolved at runtime.
Changes preload.sh to add the cos_gpu_installer image digest and image reference files under the OEM partition. The manifests API is used to get the image digest for the given image reference because docker and gcloud are not available in the build container.
Updates the experiment flag to a new one, EnableConfidentialGPUSupport, because the existing EnableGpuDriverInstallation flag is used by trusted space and the two features are better kept under separate flags.
Updates the existing image tests to be more specific to confidential GPU.
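
The digest check described in the first item can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper names compareDigest and verifyDigestFromFile and the file path are hypothetical, and the real OEM partition layout is not shown here.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// compareDigest checks the digest of the pulled installer image against the
// contents of the reference digest file (hypothetical helper).
func compareDigest(actual string, refData []byte) error {
	expected := strings.TrimSpace(string(refData))
	if expected == "" {
		return fmt.Errorf("reference digest is empty")
	}
	if actual != expected {
		return fmt.Errorf("installer digest mismatch: got %q, want %q", actual, expected)
	}
	return nil
}

// verifyDigestFromFile reads the reference digest from refPath (a placeholder
// for a location on the OEM partition) and compares it with the actual digest.
func verifyDigestFromFile(actual, refPath string) error {
	data, err := os.ReadFile(refPath)
	if err != nil {
		return fmt.Errorf("reading reference digest: %w", err)
	}
	return compareDigest(actual, data)
}

func main() {
	// Placeholder path: a missing reference file is reported as an error,
	// so the installer container would not be launched.
	err := verifyDigestFromFile("sha256:0123", "/tmp/does-not-exist/installer_digest")
	fmt.Println("verification error:", err)
}
```

Keeping the comparison separate from the file read makes the comparison testable without touching the filesystem.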

Testing:

Manual testing
Existing image tests for confidential GPU ran successfully.
Unit tests for relevant helper methods

Changes which would be part of follow-up PR:

Measure GPU CC mode status.
Add driver hash file under OEM partition as part of image build process and use it to verify the installed drivers at runtime.
Update/Add relevant image tests

meetrajvala · 2025-04-10T20:00:30Z

/gcbrun

yawangwang

Could you also add what changes you plan to make in the PR description?

yawangwang · 2025-04-15T00:45:00Z

launcher/internal/gpu/driverinstaller.go

+
+func (ccm CCMode) isValid() error {
+	switch ccm {
+	case CCModeOFF, CCModeON:


Why not check DEVTOOLS mode?

As per the NVIDIA doc, DEVTOOLS mode exists, but it is not part of the output of the nvidia-smi conf-compute -f command. nvidia-smi conf-compute -f only returns "CC status ON or OFF". For DEVTOOLS mode, we need to run the nvidia-smi conf-compute -d command, which returns "DevTools Mode: ON or OFF".

Currently, we use the GPUCCMode() function only to check whether CC mode is ON (which doesn't require running nvidia-smi conf-compute -d), so a DEVTOOLS-specific check is not included in this PR. Once we use this function's output for measuring the GPU CC mode, we will extend it for DEVTOOLS mode too. That would be part of the follow-up PR with the measurement-related changes.

I just updated this PR with the DEVTOOLS mode specific changes as well, so we no longer need to make them in the follow-up PR (with the measurement-related changes).
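
For illustration, an isValid check extended to DEVTOOLS might look like the sketch below. The string-backed CCMode type, the constant name CCModeDEVTOOLS and the error text are assumptions; only CCModeOFF, CCModeON and the isValid signature appear in the diff.

```go
package main

import "fmt"

// CCMode is the confidential computing mode reported for the GPU.
// Backing it with a string is an assumption made for this sketch.
type CCMode string

const (
	CCModeOFF      CCMode = "OFF"
	CCModeON       CCMode = "ON"
	CCModeDEVTOOLS CCMode = "DEVTOOLS" // assumed name, not shown in the diff
)

// isValid returns nil for the three known modes and an error otherwise.
func (ccm CCMode) isValid() error {
	switch ccm {
	case CCModeOFF, CCModeON, CCModeDEVTOOLS:
		return nil
	}
	return fmt.Errorf("invalid gpu cc mode: %q", string(ccm))
}

func main() {
	for _, m := range []CCMode{CCModeON, CCModeDEVTOOLS, "UNKNOWN"} {
		fmt.Printf("%s valid: %v\n", m, m.isValid() == nil)
	}
}
```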

yawangwang · 2025-04-15T00:49:56Z

launcher/internal/gpu/driverinstaller.go

+	installerDigest := image.Target().Digest.String()
+	if err := verifyInstallerImageDigest(installerDigest); err != nil {
+		return err
+	}


Consolidate them into one function with image as input.

Also I'd recommend creating a new function verifyCGPUDriverAttestation consisting of multiple runtime verification steps according to the DD. verifyInstallerImageDigest is just step 1 of the verification process.

Consolidate them into one function with image as input.

Moved the installerDigest extraction into the verifyInstallerImageDigest function.

Also I'd recommend creating a new function verifyCGPUDriverAttestation consisting of multiple runtime verification steps according to the DD. verifyInstallerImageDigest is just step 1 of the verification process.

I have not put it under a single verifyCGPUDriverAttestation function because installer digest verification is a pre-installation check, whereas driver digest verification and driver installation verification are post-installation checks.

yawangwang · 2025-04-15T00:59:32Z

launcher/internal/gpu/driverinstaller.go

+
+// GetGPUCCMode executes nvidia-smi to determine the current Confidential Computing (CC) mode status of the GPU.
+// It returns the CC mode ("ON" or "OFF") and an error if the command fails or if the output cannot be parsed.
+func GetGPUCCMode() (CCMode, error) {


Remove the Get prefix unless the underlying implementation issues an HTTP GET request.

I have updated it to QueryCCMode. If I just drop the Get prefix, it would give a lint warning because other packages would call it as gpu.GPUCCMode(), and I cannot use just CCMode as it would clash with the enum.

launcher/image/preload.sh

jkl73 · 2025-04-15T23:29:14Z

Please wait for lgtm from all reviewers to submit

yawangwang · 2025-04-16T00:34:15Z

launcher/internal/gpu/driverinstaller.go

+	} else {
+		return fmt.Errorf("confidential compute is not enabled for the gpu type %s", gpuType)


This is the fail-close pattern: a cc_mode of either OFF or DEVTOOLS will lead to launcher exit. My understanding is that the launcher should measure whatever value CC_mode has, send it to GCA, and let the relying party decide which CC_mode is acceptable in their appraisal policy.

I added these lines while removing the non-confidential GPU tests (which were added for trusted space), but I didn't realize that we need to measure the cc_mode even when it is OFF or DEVTOOLS so that it is reflected in the token. Thanks for catching it!
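
The measure-and-report behaviour discussed here could be sketched as below. The recordFunc hook and reportCCMode are hypothetical names introduced for illustration; the launcher's actual measurement call is not shown in this thread.

```go
package main

import "fmt"

// recordFunc stands in for whatever measurement call the launcher uses.
// It is a hypothetical hook, not an API from this PR.
type recordFunc func(ccMode string) error

// reportCCMode records every recognised mode and fails only on values it
// cannot interpret, leaving the accept/reject decision to the relying party.
func reportCCMode(ccMode string, record recordFunc) error {
	switch ccMode {
	case "ON", "OFF", "DEVTOOLS":
		return record(ccMode)
	default:
		return fmt.Errorf("unrecognised cc mode %q", ccMode)
	}
}

func main() {
	var recorded []string
	record := func(m string) error {
		recorded = append(recorded, m)
		return nil
	}
	for _, m := range []string{"ON", "OFF", "DEVTOOLS"} {
		if err := reportCCMode(m, record); err != nil {
			fmt.Println("error:", err)
		}
	}
	fmt.Println("recorded:", recorded)
}
```

In contrast with the fail-close branch in the diff, an OFF or DEVTOOLS value here is passed on to be measured rather than turned into an error.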

yawangwang · 2025-04-16T00:37:01Z

launcher/internal/gpu/driverinstaller.go

+
+// QueryCCMode executes nvidia-smi to determine the current Confidential Computing (CC) mode status of the GPU.
+// If DEVTOOLS mode is enabled, it would override CC mode as DEVTOOLS. DEVTOOLS mode would be enabled only when CC mode is ON.
+func QueryCCMode() (CCMode, error) {


Could you add unit tests for this method as it's now growing more complicated? Also it would be a good practice to add unit tests for other helper methods.

I was planning to add unit tests for these methods, but we decided that, because most of the helper methods use nvidia-smi commands under the hood, it would be better to cover them via integration tests instead of mocking most of the code (related to containerd and nvidia-smi output). For the other helper functions (which do not rely on nvidia-smi or containerd), I have added unit tests.

You seemed to combine the original parseCCMode back to this method. Why?

Of course integration tests can cover nvidia-smi usage, but unit tests are faster and can help detect issues early. Plus, mocking the result of the nvidia-smi command shouldn't take too much work; all it takes is changing the method signature a bit.

e.g.,

type nvidiaSmiCmdOutput func() ([]byte, error)

func QueryCCMode(fn nvidiaSmiCmdOutput) (CCMode, error) {
	output, err := fn()
	// your implementations...
}

And the caller can invoke QueryCCMode this way:

ccMode, err := QueryCCMode(func() ([]byte, error) {
	return nvidiaSmiCmd.Output()
})
if err != nil {
	...
}

This will make writing unit tests much easier, WDYT?

I have updated the helper functions' signatures to take an anonymous function argument, which makes them more unit testable.

You seemed to combine the original parseCCMode back to this method. Why?

That's right. The previous parseCCMode function was primarily using regex to extract the CC mode based on the 'ON' or 'OFF' values from the nvidia-smi command output. I have since updated the logic to use strings.Contains() directly within the QueryCCMode function. This simplifies the code and removes the need for a separate parsing function.
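
Putting the two points of this thread together (an injected command function and strings.Contains parsing), a testable QueryCCMode might look roughly like the sketch below. It is not the merged implementation: the two-argument signature, the CCModeDEVTOOLS name and the exact output strings ("CC status: ON", "DevTools Mode: ON") are assumptions based on the descriptions earlier in the review.

```go
package main

import (
	"fmt"
	"strings"
)

type CCMode string

const (
	CCModeOFF      CCMode = "OFF"
	CCModeON       CCMode = "ON"
	CCModeDEVTOOLS CCMode = "DEVTOOLS" // assumed name
)

// cmdOutput abstracts running a command so tests can supply canned output.
type cmdOutput func() ([]byte, error)

// QueryCCMode derives the mode from two command outputs: the CC status query
// and the DevTools query. DEVTOOLS is reported only when CC status is ON.
func QueryCCMode(ccStatus, devTools cmdOutput) (CCMode, error) {
	out, err := ccStatus()
	if err != nil {
		return "", fmt.Errorf("querying cc status: %w", err)
	}
	if !strings.Contains(string(out), "ON") {
		if strings.Contains(string(out), "OFF") {
			return CCModeOFF, nil
		}
		return "", fmt.Errorf("unexpected cc status output: %q", out)
	}
	out, err = devTools()
	if err != nil {
		return "", fmt.Errorf("querying devtools mode: %w", err)
	}
	if strings.Contains(string(out), "ON") {
		return CCModeDEVTOOLS, nil
	}
	return CCModeON, nil
}

// canned returns a cmdOutput that yields fixed text, as a unit test would.
func canned(s string) cmdOutput {
	return func() ([]byte, error) { return []byte(s), nil }
}

func main() {
	mode, err := QueryCCMode(canned("CC status: ON"), canned("DevTools Mode: OFF"))
	fmt.Println(mode, err)
}
```

In production code the two arguments would wrap the real nvidia-smi invocations, in the same way as the nvidiaSmiCmd.Output() closure suggested above.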

launcher/internal/gpu/driverinstaller_test.go

yawangwang · 2025-04-17T00:41:12Z

launcher/internal/gpu/driverinstaller.go

launcher/internal/gpu/driverinstaller.go

alexmwu · 2025-04-17T17:05:59Z

launcher/image/preload.sh

+  local manifest_url
+  local image_digest
+
+  if [[ "$image_ref" =~ ^([^/]+)/([^:]+):([^:]+)$ ]]; then


This regex seems like it's a bit overly broad. What about matching the host?

Yes, it is a bit broad. The reason I didn't match the host or other fields is flexibility against installer reference updates (e.g. if a COS update moves the installer image to some other registry).
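
To make the trade-off concrete, here is a Go rendering of the same pattern with an optional host check added. preload.sh itself is a shell script, so this is only an illustration; the allowedHosts list and the manifestURL helper are hypothetical, while the /v2/<name>/manifests/<reference> path follows the standard registry HTTP API.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same shape as the shell regex: host, repository path, tag.
var imageRefRE = regexp.MustCompile(`^([^/]+)/([^:]+):([^:]+)$`)

// allowedHosts is illustrative only; pinning hosts trades flexibility
// (for example, an installer image moving registries) for a stricter check.
var allowedHosts = map[string]bool{"gcr.io": true}

// manifestURL splits an image reference and builds the registry manifest URL.
func manifestURL(imageRef string) (string, error) {
	m := imageRefRE.FindStringSubmatch(imageRef)
	if m == nil {
		return "", fmt.Errorf("invalid image reference %q", imageRef)
	}
	host, repo, tag := m[1], m[2], m[3]
	if !allowedHosts[host] {
		return "", fmt.Errorf("unexpected registry host %q", host)
	}
	return fmt.Sprintf("https://%s/v2/%s/manifests/%s", host, repo, tag), nil
}

func main() {
	// Placeholder reference; the real installer reference is not shown here.
	u, err := manifestURL("gcr.io/example-project/installer:v1")
	fmt.Println(u, err)
}
```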

alexmwu · 2025-04-17T17:09:31Z

launcher/internal/experiments/experiments.go

-	EnableTempFSMount           bool
-	EnableGpuDriverInstallation bool
+	EnableTestFeatureForImage    bool
+	EnableTempFSMount            bool


Please remove this experiment.

Removing this would require removing its usage references. I am planning to address this in a follow-up PR that updates this branch from the main branch and resolves any merge conflicts.

launcher/internal/gpu/driverinstaller.go

alexmwu · 2025-04-17T17:19:57Z

launcher/internal/gpu/driverinstaller.go

+
+// QueryCCMode executes nvidia-smi to determine the current Confidential Computing (CC) mode status of the GPU.
+// If DEVTOOLS mode is enabled, it would override CC mode as DEVTOOLS. DEVTOOLS mode would be enabled only when CC mode is ON.
+func QueryCCMode() (CCMode, error) {


launcher/internal/gpu/driverinstaller.go

launcher/image/preload.sh

alexmwu · 2025-04-18T21:57:52Z

launcher/internal/gpu/driverinstaller.go

-	deviceinfo.T4,
-	deviceinfo.A100_40GB,
-	deviceinfo.A100_80GB,
+var supportedCGPUTypes = []deviceinfo.GPUType{


Why did we remove this? How do we reconcile non-cGPU and cGPU supported devices? I'd prefer to avoid having two images

We would only be supporting confidential GPUs for CS.

yawangwang

LGTM contingent on successful cloud build tests

launcher/internal/gpu/driverinstaller_test.go

launcher/internal/gpu/driverinstaller.go

meetrajvala force-pushed the mhvcgpu1 branch 2 times, most recently from b795f8c to 7a5c0e3 Compare April 10, 2025 19:56

meetrajvala force-pushed the mhvcgpu1 branch 2 times, most recently from 4ec5f82 to e83f417 Compare April 11, 2025 07:02

Add cos_gpu_installer digest verification

377f9cb

meetrajvala force-pushed the mhvcgpu1 branch from e83f417 to 377f9cb Compare April 11, 2025 10:06

meetrajvala requested review from alexmwu, jkl73 and yawangwang April 11, 2025 11:07

yawangwang reviewed Apr 15, 2025

meetrajvala requested a review from yawangwang April 15, 2025 09:38

meetrajvala force-pushed the mhvcgpu1 branch from d7a0cca to d236e4c Compare April 15, 2025 20:23

jkl73 approved these changes Apr 15, 2025

yawangwang reviewed Apr 16, 2025

meetrajvala force-pushed the mhvcgpu1 branch 4 times, most recently from 75e8f5b to 981f262 Compare April 16, 2025 09:51

meetrajvala requested a review from yawangwang April 16, 2025 10:02

meetrajvala force-pushed the mhvcgpu1 branch 3 times, most recently from d816027 to 25652bd Compare April 16, 2025 21:38

yawangwang reviewed Apr 17, 2025

alexmwu reviewed Apr 17, 2025

meetrajvala force-pushed the mhvcgpu1 branch 4 times, most recently from f6c5c8c to 6eac2c9 Compare April 18, 2025 19:42

meetrajvala force-pushed the mhvcgpu1 branch 4 times, most recently from cd8dd05 to 6d96908 Compare April 18, 2025 20:03

meetrajvala requested review from alexmwu and yawangwang April 18, 2025 20:08

meetrajvala force-pushed the mhvcgpu1 branch from 6d96908 to a6de8e6 Compare April 18, 2025 20:26

alexmwu approved these changes Apr 18, 2025

yawangwang approved these changes Apr 18, 2025

address review comments

1ac5451

meetrajvala force-pushed the mhvcgpu1 branch from a6de8e6 to 1ac5451 Compare April 21, 2025 21:32

meetrajvala merged commit 786f494 into cs_cgpu_h100 Apr 22, 2025
11 checks passed
