Add GPU driver installer digest verification as part of confidential GPU driver installation flow#565
Conversation
b795f8c to
7a5c0e3
Compare
|
/gcbrun |
4ec5f82 to
e83f417
Compare
yawangwang
left a comment
There was a problem hiding this comment.
Could you also add what changes you plan to make in the PR description?
|
|
||
| func (ccm CCMode) isValid() error { | ||
| switch ccm { | ||
| case CCModeOFF, CCModeON: |
There was a problem hiding this comment.
Why not check DEVTOOLS mode?
There was a problem hiding this comment.
As per the nvidia doc, DEVTOOLS mode is there but to it is not the output of nvidia-smi conf-compute -f command. nvidia-smi conf-compute -f only return "CC status ON or OFF" as an output. For devtools mode, we need to run nvidia-smi conf-comute -d command which would return "DevTools Mode: ON or OFF".
Currently, we use GPUCCMode() function to only check if CC mode is ON or not (which doesn't require to run nvidia-smi conf-compute -d), so DEVTOOLS mode specific check is not included in this PR, but once we use this GPUCCMode() function's output for the measurement GPU CC mode, we will extend this for DEVTOOLS mode too. This is would be part of follow-up PR having measurement related changes.
There was a problem hiding this comment.
I just updated this PR with DEVTOOLS mode specific changes as well. now we do not have to make it in follow-up PR (of measurement related changes).
| installerDigest := image.Target().Digest.String() | ||
| if err := verifyInstallerImageDigest(installerDigest); err != nil { | ||
| return err | ||
| } |
There was a problem hiding this comment.
Consolidate them into one function with image as input.
There was a problem hiding this comment.
Also I'd recommend creating a new function verifyCGPUDriverAttestation consisting of multiple runtime verification steps according to the DD. verifyInstallerImageDigest is just step#1 of of the verification process.
There was a problem hiding this comment.
Consolidate them into one function with image as input.
Consolidated the installerDigest in verifyInstallerImageDigest function.
There was a problem hiding this comment.
Also I'd recommend creating a new function
verifyCGPUDriverAttestationconsisting of multiple runtime verification steps according to the DD.verifyInstallerImageDigestis just step#1 of of the verification process.
I have not added it under one verifyCGPUDriverAttestation function because installer digest verification is pre-installation check but driver digest verification and driver installation verification are post installation checks.
|
|
||
| // GetGPUCCMode executes nvidia-smi to determine the current Confidential Computing (CC) mode status of the GPU. | ||
| // It returns the CC mode ("ON" or "OFF") and an error if the command fails or if the output cannot be parsed. | ||
| func GetGPUCCMode() (CCMode, error) { |
There was a problem hiding this comment.
Remove Get prefix unless the underlying implementations issue a http GET reqeust
There was a problem hiding this comment.
I have updated it to QueryCCMode. If I drop the Get prefix, it would give an lint warning because of other packages would call it gpu.GPUCCMode() and I can not keep just CCMode as it would clash with enum.
|
Please wait for lgtm from all reviewers to submit |
| } else { | ||
| return fmt.Errorf("confidential compute is not enabled for the gpu type %s", gpuType) |
There was a problem hiding this comment.
This is the fail-close pattern: either cc_mode is OFF or DEVTOOLS will lead to launcher exit. My understanding is launcher should measure whatever value of the CC_mode, send them to GCA, and let the relying party to decide what CC_mode is acceptable in their appraisal policy.
There was a problem hiding this comment.
I added this lines while removing the non confidential GPU related tests (which were added for trusted space) but I didn't realize that we need to measure the cc_mode even in the case of OFF or DEVTOOLS for reflecting it into the token. Thanks for catching it !
|
|
||
| // QueryCCMode executes nvidia-smi to determine the current Confidential Computing (CC) mode status of the GPU. | ||
| // If DEVTOOLS mode is enabled, it would override CC mode as DEVTOOLS. DEVTOOLS mode would be enabled only when CC mode is ON. | ||
| func QueryCCMode() (CCMode, error) { |
There was a problem hiding this comment.
Could you add unit tests for this method as it's now growing more complicated? Also it would be a good practice to add unit tests for other helper methods.
There was a problem hiding this comment.
I was planning to add the unit tests for these methods but we decided that because most of the helper methods use nvidia-smi commands under the hood, it would be good to cover it via integration tests instead of mocking most of the codelines (related to containerd and nvidia-smi output). For other helper functions (which do not rely on nvidia-smi or containerd), I have added the unit tests.
There was a problem hiding this comment.
You seemed to combine the original parseCCMode back to this method. Why?
Ofc integration tests can cover nvidia-smi usages, but unit tests are faster and can help early detect any issues. Plus, mocking the result of nvidia-smi command shouldn't take too much work, all it takes is change the method signature a bit.
e.g.,
type nvidiaSmiCmdOutput func() ([]byte, error)
func QueryCCMode(fn nvidiaSmiCmdOutput) (CCMode, error) {
output, err := fn()
// your implementations...
}
And the caller can invoke QueryCCMode this way:
ccMode, err := QueryCCMode(func() ([]byte, error) { return nvidiaSmiCmd.Output() })
if err != nil {
...
}
This will make writing unit tests much easier, WDYT?
There was a problem hiding this comment.
I have updated the helper functions signature with anonymous function argument to make it more unit testable.
There was a problem hiding this comment.
You seemed to combine the original
parseCCModeback to this method. Why?
That's right. The previous parseCCMode function was primarily using regex to extract the CC mode based on the 'ON' or 'OFF' values from the nvidia-smi command output. I have since updated the logic to use strings.Contains() directly within the QueryCCMode function. This simplifies the code and removes the need for a separate parsing function.
75e8f5b to
981f262
Compare
d816027 to
25652bd
Compare
|
|
||
| // QueryCCMode executes nvidia-smi to determine the current Confidential Computing (CC) mode status of the GPU. | ||
| // If DEVTOOLS mode is enabled, it would override CC mode as DEVTOOLS. DEVTOOLS mode would be enabled only when CC mode is ON. | ||
| func QueryCCMode() (CCMode, error) { |
There was a problem hiding this comment.
You seemed to combine the original parseCCMode back to this method. Why?
Ofc integration tests can cover nvidia-smi usages, but unit tests are faster and can help early detect any issues. Plus, mocking the result of nvidia-smi command shouldn't take too much work, all it takes is change the method signature a bit.
e.g.,
type nvidiaSmiCmdOutput func() ([]byte, error)
func QueryCCMode(fn nvidiaSmiCmdOutput) (CCMode, error) {
output, err := fn()
// your implementations...
}
And the caller can invoke QueryCCMode this way:
ccMode, err := QueryCCMode(func() ([]byte, error) { return nvidiaSmiCmd.Output() })
if err != nil {
...
}
This will make writing unit tests much easier, WDYT?
| local manifest_url | ||
| local image_digest | ||
|
|
||
| if [[ "$image_ref" =~ ^([^/]+)/([^:]+):([^:]+)$ ]]; then |
There was a problem hiding this comment.
This regex seems like it's a bit overly broad. What about matching the host?
There was a problem hiding this comment.
Yes, it is bit broader. Reason I didn't match host or other fields is for flexibility against the installer reference updates (e.g. if cos updates moves the installer image to some other registry).
| EnableTempFSMount bool | ||
| EnableGpuDriverInstallation bool | ||
| EnableTestFeatureForImage bool | ||
| EnableTempFSMount bool |
There was a problem hiding this comment.
Please remove this experiment.
There was a problem hiding this comment.
Removing this would require to remove its usage references. I am planning to address this in follow-up PR which updates this branch with main branch and will resolve any merge conflicts.
|
|
||
| // QueryCCMode executes nvidia-smi to determine the current Confidential Computing (CC) mode status of the GPU. | ||
| // If DEVTOOLS mode is enabled, it would override CC mode as DEVTOOLS. DEVTOOLS mode would be enabled only when CC mode is ON. | ||
| func QueryCCMode() (CCMode, error) { |
f6c5c8c to
6eac2c9
Compare
cd8dd05 to
6d96908
Compare
| deviceinfo.T4, | ||
| deviceinfo.A100_40GB, | ||
| deviceinfo.A100_80GB, | ||
| var supportedCGPUTypes = []deviceinfo.GPUType{ |
There was a problem hiding this comment.
Why did we remove this? How do we reconcile non-cGPU and cGPU supported devices? I'd prefer to avoid having two images
There was a problem hiding this comment.
We would only be supporting confidential GPUs for CS.
yawangwang
left a comment
There was a problem hiding this comment.
LGTM contingent on successful cloud build tests
Changes:
This PR contains the following changes:
cos_gpu_installerimage digest verification check before launching the installer container. For image reference, it now refers to the file stored in OEM partition instead of finding the image reference at runtime.preload.shfile to add thecos_gpu_installerimage digest and image reference files under OEM partition. Here manifests API is used to get the image digest (for given image reference) because we do not havedockerandgcloudavailable in the build container.EnableConfidentialGPUSupport) asEnableGpuDriverInstallation(existing flag) is being used by trusted space and it would be good to maintain these under different feature flags.Testing:
Changes which would be part of follow-up PR: