Add GPU driver installer digest verification as part of confidential GPU driver installation flow #565

Merged
meetrajvala merged 2 commits into cs_cgpu_h100 from mhvcgpu1
Apr 22, 2025

Contributor

@meetrajvala meetrajvala commented Apr 10, 2025

Changes:

This PR contains the following changes:

  • Adds a cos_gpu_installer image digest verification check before launching the installer container. The image reference is now read from a file stored in the OEM partition instead of being resolved at runtime.
  • Changes preload.sh to add the cos_gpu_installer image digest and image reference files under the OEM partition. The registry manifests API is used to get the image digest for the given image reference, because docker and gcloud are not available in the build container.
  • Replaces the experiment flag with a new one named EnableConfidentialGPUSupport. The existing flag, EnableGpuDriverInstallation, is used by trusted space, so the two features are kept under separate feature flags.
  • Updates existing image tests to be more specific to confidential GPU.
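
The digest lookup mentioned in the second bullet can be sketched roughly as follows. The PR does this in preload.sh; the Go version here only illustrates the same kind of registry call. It assumes the registry implements the standard v2 manifests endpoint and returns a Docker-Content-Digest header. The function names and the image reference in main are hypothetical, and a real registry may also require an auth token or additional Accept media types.

```go
package main

import (
	"fmt"
	"net/http"
)

// manifestURL builds the registry v2 manifests endpoint for an image.
// host, repo and tag are the three parts of a reference "host/repo:tag".
func manifestURL(host, repo, tag string) string {
	return fmt.Sprintf("https://%s/v2/%s/manifests/%s", host, repo, tag)
}

// fetchDigest sends a HEAD request to the manifests endpoint and returns
// the value of the Docker-Content-Digest response header.
func fetchDigest(client *http.Client, url string) (string, error) {
	req, err := http.NewRequest(http.MethodHead, url, nil)
	if err != nil {
		return "", err
	}
	// Assumption: the image is published as a Docker v2 schema 2 manifest.
	req.Header.Set("Accept", "application/vnd.docker.distribution.manifest.v2+json")

	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("manifest request failed: %s", resp.Status)
	}
	digest := resp.Header.Get("Docker-Content-Digest")
	if digest == "" {
		return "", fmt.Errorf("response has no Docker-Content-Digest header")
	}
	return digest, nil
}

func main() {
	// Hypothetical reference; only the URL is printed, no request is sent.
	fmt.Println(manifestURL("registry.example.com", "project/installer", "v1"))
}
```

Calling fetchDigest(http.DefaultClient, url) would perform the actual lookup; the result is what would be written to the digest file under the OEM partition.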

Testing:

  • Manual testing
  • Existing image tests for confidential GPU ran successfully.
  • Unit tests for relevant helper methods

Changes which would be part of follow-up PR:

  • Measure GPU CC mode status.
  • Add driver hash file under OEM partition as part of image build process and use it to verify the installed drivers at runtime.
  • Update/Add relevant image tests

@meetrajvala meetrajvala force-pushed the mhvcgpu1 branch 2 times, most recently from b795f8c to 7a5c0e3 Compare April 10, 2025 19:56
@meetrajvala
Contributor Author

/gcbrun

@meetrajvala meetrajvala force-pushed the mhvcgpu1 branch 2 times, most recently from 4ec5f82 to e83f417 Compare April 11, 2025 07:02
Collaborator

@yawangwang yawangwang left a comment

Could you also add what changes you plan to make in the PR description?


func (ccm CCMode) isValid() error {
switch ccm {
case CCModeOFF, CCModeON:
Collaborator

Why not check DEVTOOLS mode?

Contributor Author

@meetrajvala meetrajvala Apr 15, 2025

As per the NVIDIA documentation, DEVTOOLS mode exists, but it is not part of the output of the nvidia-smi conf-compute -f command, which only returns "CC status ON or OFF". To get DEVTOOLS mode, we need to run the nvidia-smi conf-compute -d command, which returns "DevTools Mode: ON or OFF".

Currently, we use the GPUCCMode() function only to check whether CC mode is ON (which does not require running nvidia-smi conf-compute -d), so a DEVTOOLS-specific check is not included in this PR. Once we use the output of GPUCCMode() for measuring the GPU CC mode, we will extend it to cover DEVTOOLS mode too. That would be part of the follow-up PR with the measurement-related changes.

Contributor Author

I just updated this PR with the DEVTOOLS mode changes as well, so we no longer need to make them in the follow-up PR with the measurement-related changes.

Comment on lines +93 to +96
installerDigest := image.Target().Digest.String()
if err := verifyInstallerImageDigest(installerDigest); err != nil {
return err
}
Collaborator

Consolidate them into one function with image as input.

Collaborator

Also, I'd recommend creating a new function, verifyCGPUDriverAttestation, consisting of the multiple runtime verification steps described in the DD. verifyInstallerImageDigest is just step #1 of the verification process.

Contributor Author

Consolidate them into one function with image as input.

Consolidated: the installerDigest lookup now happens inside the verifyInstallerImageDigest function.

Contributor Author

Also, I'd recommend creating a new function, verifyCGPUDriverAttestation, consisting of the multiple runtime verification steps described in the DD. verifyInstallerImageDigest is just step #1 of the verification process.

I have not put these under a single verifyCGPUDriverAttestation function because installer digest verification is a pre-installation check, whereas driver digest verification and driver installation verification are post-installation checks.
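
A minimal sketch of the consolidated pre-installation check discussed in this thread, assuming the expected digest is read from a file under the OEM partition. The digester interface, fakeImage type and readExpected parameter are hypothetical stand-ins so the example runs without containerd; in the PR the digest comes from image.Target().Digest.String().

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// digester is a hypothetical stand-in for the pulled installer image.
type digester interface {
	DigestString() string
}

// fakeImage lets the sketch run without a container runtime.
type fakeImage string

func (f fakeImage) DigestString() string { return string(f) }

// verifyInstallerImageDigest compares the digest of the pulled image with
// the digest recorded at image build time. readExpected abstracts reading
// the digest file, so the check can be exercised without touching disk.
func verifyInstallerImageDigest(img digester, readExpected func() ([]byte, error)) error {
	raw, err := readExpected()
	if err != nil {
		return fmt.Errorf("failed to read expected installer digest: %w", err)
	}
	expected := strings.TrimSpace(string(raw))
	if expected == "" {
		return errors.New("expected installer digest is empty")
	}
	if got := img.DigestString(); got != expected {
		return fmt.Errorf("installer image digest mismatch: got %q, want %q", got, expected)
	}
	return nil
}

func main() {
	// The trailing newline mimics a digest file written by a shell script.
	expected := func() ([]byte, error) { return []byte("sha256:abc\n"), nil }

	fmt.Println(verifyInstallerImageDigest(fakeImage("sha256:abc"), expected))
	fmt.Println(verifyInstallerImageDigest(fakeImage("sha256:def"), expected))
}
```

Keeping this as its own function matches the author's point: it runs before the installer container is launched, while the driver checks can only run afterwards.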


// GetGPUCCMode executes nvidia-smi to determine the current Confidential Computing (CC) mode status of the GPU.
// It returns the CC mode ("ON" or "OFF") and an error if the command fails or if the output cannot be parsed.
func GetGPUCCMode() (CCMode, error) {
Collaborator

Remove the Get prefix unless the underlying implementation issues an HTTP GET request.

Contributor Author

I have renamed it to QueryCCMode. If I only dropped the Get prefix, it would produce a lint warning because other packages would call it as gpu.GPUCCMode(), and I cannot use just CCMode as it would clash with the enum.

Contributor

jkl73 commented Apr 15, 2025

Please wait for lgtm from all reviewers to submit

Comment on lines +181 to +182
} else {
return fmt.Errorf("confidential compute is not enabled for the gpu type %s", gpuType)
Collaborator

This is the fail-closed pattern: a cc_mode of either OFF or DEVTOOLS will cause the launcher to exit. My understanding is that the launcher should measure whatever value cc_mode has, send it to GCA, and let the relying party decide which cc_mode values are acceptable in their appraisal policy.

Contributor Author

I added these lines while removing the non-confidential GPU tests (which were added for trusted space), but I didn't realize that we need to measure cc_mode even when it is OFF or DEVTOOLS so that it is reflected in the token. Thanks for catching it!


// QueryCCMode executes nvidia-smi to determine the current Confidential Computing (CC) mode status of the GPU.
// If DEVTOOLS mode is enabled, it would override CC mode as DEVTOOLS. DEVTOOLS mode would be enabled only when CC mode is ON.
func QueryCCMode() (CCMode, error) {
Collaborator

@yawangwang yawangwang Apr 16, 2025

Could you add unit tests for this method as it's now growing more complicated? Also it would be a good practice to add unit tests for other helper methods.

Contributor Author

I was planning to add the unit tests for these methods but we decided that because most of the helper methods use nvidia-smi commands under the hood, it would be good to cover it via integration tests instead of mocking most of the codelines (related to containerd and nvidia-smi output). For other helper functions (which do not rely on nvidia-smi or containerd), I have added the unit tests.

Collaborator

You seemed to combine the original parseCCMode back to this method. Why?

Of course integration tests can cover nvidia-smi usage, but unit tests are faster and can help detect issues early. Plus, mocking the result of the nvidia-smi command shouldn't take much work; all it takes is changing the method signature a bit.

e.g.,

type nvidiaSmiCmdOutput func() ([]byte, error)

func QueryCCMode(fn nvidiaSmiCmdOutput) (CCMode, error) {
    output, err := fn()
    // your implementations...
}

And the caller can invoke QueryCCMode this way:

ccMode, err := QueryCCMode(func() ([]byte, error) { return nvidiaSmiCmd.Output() })
if err != nil {
...
}

This will make writing unit tests much easier, WDYT?

Contributor

+1

Contributor Author

I have updated the helper functions' signatures to take a function argument, which makes them easier to unit test.

Contributor Author

@meetrajvala meetrajvala Apr 18, 2025

You seemed to combine the original parseCCMode back to this method. Why?

That's right. The previous parseCCMode function was primarily using regex to extract the CC mode based on the 'ON' or 'OFF' values from the nvidia-smi command output. I have since updated the logic to use strings.Contains() directly within the QueryCCMode function. This simplifies the code and removes the need for a separate parsing function.
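
Putting the two suggestions from this thread together (an injected command-output function and strings.Contains parsing), a QueryCCMode along these lines becomes straightforward to unit test. This is a sketch rather than the merged implementation: the two-argument signature, the isOn and fixed helpers, and the sample output strings are assumptions based on the comments above, not captured nvidia-smi output.

```go
package main

import (
	"fmt"
	"strings"
)

type CCMode string

const (
	CCModeOFF      CCMode = "OFF"
	CCModeON       CCMode = "ON"
	CCModeDEVTOOLS CCMode = "DEVTOOLS"
)

// nvidiaSmiCmdOutput abstracts running an nvidia-smi command.
type nvidiaSmiCmdOutput func() ([]byte, error)

// isOn reports whether the output says ON. "OFF" is checked first so that
// an output mentioning both is never read as ON.
func isOn(output []byte) (bool, error) {
	s := string(output)
	switch {
	case strings.Contains(s, "OFF"):
		return false, nil
	case strings.Contains(s, "ON"):
		return true, nil
	default:
		return false, fmt.Errorf("unexpected nvidia-smi output: %q", s)
	}
}

// QueryCCMode returns OFF, ON, or DEVTOOLS. DevTools mode is only queried
// when CC mode is ON, and overrides it when enabled.
func QueryCCMode(ccStatus, devTools nvidiaSmiCmdOutput) (CCMode, error) {
	out, err := ccStatus()
	if err != nil {
		return "", fmt.Errorf("failed to query CC status: %w", err)
	}
	on, err := isOn(out)
	if err != nil {
		return "", err
	}
	if !on {
		return CCModeOFF, nil
	}
	out, err = devTools()
	if err != nil {
		return "", fmt.Errorf("failed to query DevTools mode: %w", err)
	}
	dev, err := isOn(out)
	if err != nil {
		return "", err
	}
	if dev {
		return CCModeDEVTOOLS, nil
	}
	return CCModeON, nil
}

// fixed returns a fake command that always yields the given output.
func fixed(s string) nvidiaSmiCmdOutput {
	return func() ([]byte, error) { return []byte(s), nil }
}

func main() {
	mode, err := QueryCCMode(fixed("CC status: ON"), fixed("DevTools Mode: OFF"))
	fmt.Println(mode, err)
}
```

One caveat with plain substring matching: any other uppercase "ON" or "OFF" in the command output would be picked up too, so the fakes used in tests should mirror real output as closely as possible.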

@meetrajvala meetrajvala force-pushed the mhvcgpu1 branch 4 times, most recently from 75e8f5b to 981f262 Compare April 16, 2025 09:51
@meetrajvala meetrajvala requested a review from yawangwang April 16, 2025 10:02
@meetrajvala meetrajvala force-pushed the mhvcgpu1 branch 3 times, most recently from d816027 to 25652bd Compare April 16, 2025 21:38

local manifest_url
local image_digest

if [[ "$image_ref" =~ ^([^/]+)/([^:]+):([^:]+)$ ]]; then
Contributor

This regex seems like it's a bit overly broad. What about matching the host?

Contributor Author

Yes, it is a bit broad. The reason I didn't match the host or other fields is flexibility against installer reference updates (e.g. if a COS update moves the installer image to some other registry).
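
To make the trade-off in this thread concrete, the sketch below applies the same pattern in Go and shows what it accepts. The pattern is copied from the diff; splitImageRef, hostAllowed, the sample references and the allowlist are hypothetical, and the host check is just one possible way to tighten matching, not something the PR does.

```go
package main

import (
	"fmt"
	"regexp"
)

// imageRefRE is the pattern from the diff: host, repository, tag.
var imageRefRE = regexp.MustCompile(`^([^/]+)/([^:]+):([^:]+)$`)

// splitImageRef splits "host/repo:tag" using the broad pattern.
func splitImageRef(ref string) (host, repo, tag string, ok bool) {
	m := imageRefRE.FindStringSubmatch(ref)
	if m == nil {
		return "", "", "", false
	}
	return m[1], m[2], m[3], true
}

// hostAllowed is an optional extra check against a list of known registries.
func hostAllowed(host string, allowed []string) bool {
	for _, a := range allowed {
		if host == a {
			return true
		}
	}
	return false
}

func main() {
	allowed := []string{"gcr.io"} // hypothetical allowlist
	refs := []string{
		"gcr.io/cos-cloud/cos-gpu-installer:v1", // typical reference shape
		"any thing/x y:z",                       // accepted by the broad pattern
		"installer:v1",                          // rejected: no host part
	}
	for _, ref := range refs {
		host, repo, tag, ok := splitImageRef(ref)
		fmt.Printf("%q ok=%v host=%q repo=%q tag=%q allowed=%v\n",
			ref, ok, host, repo, tag, ok && hostAllowed(host, allowed))
	}
}
```

An allowlist gives up some of the flexibility the author wants: if the installer image moves to another registry, the list has to be updated along with the reference.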

EnableTempFSMount bool
EnableGpuDriverInstallation bool
EnableTestFeatureForImage bool
EnableTempFSMount bool
Contributor

Please remove this experiment.

Contributor Author

@meetrajvala meetrajvala Apr 18, 2025

Removing this would require removing its usages as well. I am planning to address this in a follow-up PR that updates this branch from the main branch and resolves any merge conflicts.


@meetrajvala meetrajvala force-pushed the mhvcgpu1 branch 4 times, most recently from f6c5c8c to 6eac2c9 Compare April 18, 2025 19:42
deviceinfo.T4,
deviceinfo.A100_40GB,
deviceinfo.A100_80GB,
var supportedCGPUTypes = []deviceinfo.GPUType{
Contributor

Why did we remove this? How do we reconcile non-cGPU and cGPU supported devices? I'd prefer to avoid having two images

Contributor Author

We would only be supporting confidential GPUs for CS.

Collaborator

@yawangwang yawangwang left a comment

LGTM contingent on successful cloud build tests

@meetrajvala meetrajvala merged commit 786f494 into cs_cgpu_h100 Apr 22, 2025
11 checks passed