feat(gcp): add GPU guest accelerator support for GCP nodepools #1952
Conversation
Add support for attaching NVIDIA GPUs to GCP compute instances via the `guest_accelerator` block. GCP requires explicit GPU type and count configuration, unlike other providers where GPU-enabled instance types automatically include GPUs.
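For reference, on the GCP side this maps to Terraform's `guest_accelerator` block on `google_compute_instance`. A minimal hand-written sketch is below; it is illustrative only and not taken from Claudie's actual templates (those live in berops/claudie-config):

```hcl
resource "google_compute_instance" "gpu_node" {
  name         = "gpu-node"        # illustrative name
  machine_type = "n1-standard-4"   # N1 family supports attached GPUs
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "ubuntu-os-cloud/ubuntu-2204-lts"
    }
  }

  network_interface {
    network = "default"
  }

  # Explicit GPU type and count, as described above.
  guest_accelerator {
    type  = "nvidia-tesla-v100"
    count = 1
  }

  # GCP requires terminating (not live-migrating) instances with GPUs
  # during host maintenance.
  scheduling {
    on_host_maintenance = "TERMINATE"
  }
}
```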
Walkthrough

Adds GPU count/type fields to `MachineSpec` (preserving the deprecated field), enforces GCP-specific validation requiring a GPU type when GPU count > 0, updates the protobuf/CRD/schema and manifest mapping, and adjusts the autoscaler to use the new `NvidiaGpuCount` field name.
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 1
In `@docs/input-manifest/providers/gcp.md`:
- Around line 98-115: Fix the hyphenation in the `nvidia-tesla-v100` description
to "high-performance training" and update the machine-type guidance: remove any
mention of N2/N2D as GPU-capable, state that attached GPUs should use
`n1-standard-*` or `n1-highmem-*` machine types (i.e., N1 supports GPUs), and
add a clarified note that GPU-optimized families such as A2, G2, A3, A4, G4 come
with pre-attached GPU configurations rather than supporting arbitrary attached
GPUs; keep the `nvidia-tesla-*` GPU type list as-is but ensure the "GPU
Availability" and "GPU Instance Limitations" warnings reflect this distinction.
🧹 Nitpick comments (3)
Makefile (1)
95-102: Fail fast in `kind-deploy` and scope rollout to updated deployments.

As written, a failed `kubectl set image` inside the loop can be masked, and the final `kubectl rollout status deployment -n $(KIND_NAMESPACE)` can block on unrelated deployments in the namespace. Consider failing fast and rolling out per deployment with a timeout.

♻️ Suggested update

```diff
 kind-deploy: kind-load-images
 	@echo " --- updating deployments in $(KIND_NAMESPACE) namespace --- "
-	@for svc in ansibler builder claudie-operator kube-eleven kuber manager terraformer; do \
+	@set -e; \
+	for svc in ansibler builder claudie-operator kube-eleven kuber manager terraformer; do \
 		echo " --- updating $$svc deployment --- "; \
 		kubectl set image deployment/$$svc $$svc=ghcr.io/berops/claudie/$$svc:$(REV) -n $(KIND_NAMESPACE); \
+		kubectl rollout status deployment/$$svc -n $(KIND_NAMESPACE) --timeout=5m; \
 	done
-	@echo " --- waiting for rollouts to complete --- "
-	@kubectl rollout status deployment -n $(KIND_NAMESPACE)
+	@echo " --- rollouts completed --- "
```

internal/api/manifest/validate_node_pool.go (1)
145-170: Clarify error text for deprecated GPU count usage.

If users still set `nvidiaGpu` (deprecated), the current message may be confusing. Consider naming both fields.

🛠️ Suggested tweak

```diff
-return fmt.Errorf("nvidiaGpuType is required for GCP when nvidiaGpuCount > 0")
+return fmt.Errorf("nvidiaGpuType is required for GCP when nvidiaGpuCount (or deprecated nvidiaGpu) > 0")
```

internal/api/manifest/validate_test.go (1)
332-440: Consider adding a test case for GCP with the deprecated `NvidiaGpu` field.

The test covers the new `NvidiaGpuCount` field well for GCP and validates backward compatibility for non-GCP providers using the deprecated `NvidiaGpu` field. However, there's a gap: what happens when a GCP nodepool uses the deprecated `NvidiaGpu` field without specifying `NvidiaGpuType`?

If the validation logic correctly considers both fields when determining GPU presence, this scenario should also fail for GCP. Adding this test case would ensure the deprecated field path is validated consistently for GCP as well.

Suggested additional test case

```diff
 r.NoError(hetznerNodepoolDeprecatedGpu.Validate(hetznerManifest), "Non-GCP nodepool with deprecated nvidiaGpu but no type should pass validation")
+
+// Test case 6: GCP nodepool with deprecated nvidiaGpu field but no type - should fail (GCP requires type regardless of which field is used)
+gcpNodepoolDeprecatedGpuNoType := &DynamicNodePool{
+	Name:       "gpu-np-dep",
+	ServerType: "n1-standard-4",
+	Image:      "ubuntu-2204",
+	Count:      1,
+	ProviderSpec: ProviderSpec{
+		Name:   "gcp-1",
+		Region: "us-central1",
+		Zone:   "us-central1-a",
+	},
+	MachineSpec: &MachineSpec{
+		NvidiaGpu: 1, // Using deprecated field
+	},
+}
+r.Error(gcpNodepoolDeprecatedGpuNoType.Validate(gcpManifest), "GCP nodepool with deprecated nvidiaGpu but no type should fail validation")
 }
```
Actionable comments posted: 1
In `@internal/api/manifest/validate_node_pool.go`:
- Around line 161-169: The GPU-type presence check currently treats
whitespace-only strings as valid; update the validation in validate_node_pool
(where gpuCount is computed from d.MachineSpec.NvidiaGpuCount / NvidiaGpu) to
trim d.MachineSpec.NvidiaGpuType (e.g., using strings.TrimSpace) before testing
emptiness and return the same error when the trimmed value is empty; reference
the gpuCount variable and d.MachineSpec.NvidiaGpuType to locate the check and
ensure whitespace-only values are rejected.
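The fix requested above can be sketched as follows. The struct layout, field types, and function shape are assumptions based on the field names mentioned in this PR, not the actual Claudie source:

```go
package main

import (
	"fmt"
	"strings"
)

// MachineSpec mirrors the fields discussed in the review;
// the exact struct layout is an assumption.
type MachineSpec struct {
	NvidiaGpu      int32 // deprecated alias for NvidiaGpuCount
	NvidiaGpuCount int32
	NvidiaGpuType  string
}

// validateGCPGpuConfig sketches the validation the review asks for:
// the effective GPU count considers both the new and the deprecated
// field, and the type check rejects whitespace-only values.
func validateGCPGpuConfig(m *MachineSpec) error {
	if m == nil {
		return nil
	}
	gpuCount := m.NvidiaGpuCount
	if gpuCount == 0 {
		gpuCount = m.NvidiaGpu // fall back to the deprecated field
	}
	// TrimSpace ensures a value like "   " is treated as empty.
	if gpuCount > 0 && strings.TrimSpace(m.NvidiaGpuType) == "" {
		return fmt.Errorf("nvidiaGpuType is required for GCP when nvidiaGpuCount (or deprecated nvidiaGpu) > 0")
	}
	return nil
}

func main() {
	// Whitespace-only type must be rejected.
	fmt.Println(validateGCPGpuConfig(&MachineSpec{NvidiaGpuCount: 1, NvidiaGpuType: "   "}) != nil)
	// Deprecated field alone still triggers the check.
	fmt.Println(validateGCPGpuConfig(&MachineSpec{NvidiaGpu: 1}) != nil)
	// Valid configuration passes.
	fmt.Println(validateGCPGpuConfig(&MachineSpec{NvidiaGpuCount: 1, NvidiaGpuType: "nvidia-tesla-v100"}) == nil)
}
```

Trimming before the emptiness test means the same error path covers both missing and whitespace-only values, so no second error message is needed.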
Despire
left a comment
LGTM 👍
Please also replace the old field here: https://github.com/berops/claudie/blob/master/docs/autoscaling/autoscaling.md
After that, we can merge.
Summary
Closes #1845
This PR adds GPU guest accelerator support for GCP nodepools, allowing users to attach NVIDIA GPUs to GCP compute instances via Claudie's InputManifest.
Changes
- Added `nvidiaGpuCount` (renamed from `nvidiaGpu`) and `nvidiaGpuType` fields to `MachineSpec`
- Updated the `MachineSpec` struct with the new GPU fields and backward compatibility for `nvidiaGpu`
- Added `validateGCPGpuConfig()` to ensure GCP nodepools with GPUs have the required type specified
- Documented the `nvidiaGpuCount` and `nvidiaGpuType` fields
- Updated the autoscaler to use the `NvidiaGpuCount` field name
- Added a `kind-deploy` target for easier local testing

Example Usage
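The original example block did not survive extraction; a minimal sketch of a GCP nodepool using the new fields might look like the following. Field placement within the InputManifest is an assumption based on the PR description, and the pool name is illustrative:

```yaml
# Hypothetical InputManifest fragment illustrating the new fields.
nodePools:
  dynamic:
    - name: gpu-pool
      providerSpec:
        name: gcp-1
        region: us-central1
        zone: us-central1-a
      count: 1
      serverType: n1-standard-4   # N1 machine types support attached GPUs
      image: ubuntu-2204
      machineSpec:
        nvidiaGpuCount: 1
        nvidiaGpuType: nvidia-tesla-v100
```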
Backward Compatibility
- The `nvidiaGpu` field is preserved as a deprecated alias for `nvidiaGpuCount`

Template Changes
GCP template changes implemented in berops/claudie-config#22
Summary by CodeRabbit
- New Features
- Documentation
- Validation
- Tests