
feat(gcp): add GPU guest accelerator support for GCP nodepools #1952

Merged

samuelstolicny merged 9 commits into master from feature/1845-gcp-gpu-guest-accelerator on Feb 9, 2026
Conversation

@samuelstolicny (Contributor) commented Jan 27, 2026

Summary

Closes #1845

This PR adds GPU guest accelerator support for GCP nodepools, allowing users to attach NVIDIA GPUs to GCP compute instances via Claudie's InputManifest.

Changes

  • Proto: Added nvidiaGpuCount (renamed from nvidiaGpu) and nvidiaGpuType fields to MachineSpec
  • Manifest: Updated MachineSpec struct with new GPU fields and backward compatibility for nvidiaGpu
  • Validation: Added validateGCPGpuConfig() to ensure GCP nodepools with GPUs have the required type specified (see the sketch after this list)
  • CRD: Regenerated with new nvidiaGpuCount and nvidiaGpuType fields
  • Autoscaler: Updated to use NvidiaGpuCount field name
  • Documentation: Added GPU configuration docs for GCP provider
  • Makefile: Added kind-deploy target for easier local testing
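
For illustration, a minimal sketch of what the validateGCPGpuConfig() check could look like, based only on the behavior described in this PR; the type, field, and provider names are assumptions, not the exact code in internal/api/manifest:

```go
package manifest

import "fmt"

// MachineSpec mirrors the GPU fields described in this PR; this is a sketch,
// not the actual struct from internal/api/manifest.
type MachineSpec struct {
	NvidiaGpu      int64  // deprecated alias, kept for backward compatibility
	NvidiaGpuCount int64  // preferred field
	NvidiaGpuType  string // required on GCP whenever GPUs are requested
}

// validateGCPGpuConfig illustrates the rule described above: GCP nodepools
// that request GPUs must also specify the GPU type, because GCP attaches
// GPUs through an explicit guest accelerator configuration.
// The "gcp" provider string is an assumption made for this sketch.
func validateGCPGpuConfig(cloudProvider string, m *MachineSpec) error {
	if m == nil || cloudProvider != "gcp" {
		return nil
	}
	if m.NvidiaGpuCount > 0 && m.NvidiaGpuType == "" {
		return fmt.Errorf("nvidiaGpuType is required for GCP when nvidiaGpuCount > 0")
	}
	return nil
}
```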

Example Usage

```yaml
nodePools:
  dynamic:
    - name: gpu-workers
      providerSpec:
        name: gcp-provider
        region: europe-west1
        zone: europe-west1-b
      count: 1
      serverType: n1-standard-4
      image: ubuntu-2204-lts
      machineSpec:
        nvidiaGpuCount: 1
        nvidiaGpuType: nvidia-tesla-t4
```

Backward Compatibility

  • The old nvidiaGpu field is preserved as a deprecated alias for nvidiaGpuCount (see the sketch after this list)
  • Non-GCP providers can still use GPU count without specifying type
  • Proto field numbers maintained for wire compatibility
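
To make the fallback concrete, here is a rough sketch of how the deprecated field could be resolved when mapping the manifest; the helper name is hypothetical, and the real logic lives in internal/api/manifest/utils.go:

```go
package manifest

// resolveGpuCount is a hypothetical helper illustrating the compatibility
// rule above: the new nvidiaGpuCount field wins, and the deprecated
// nvidiaGpu field is only consulted when the new field is unset, so existing
// manifests keep working without changes.
func resolveGpuCount(nvidiaGpuCount, nvidiaGpu int64) int64 {
	if nvidiaGpuCount > 0 {
		return nvidiaGpuCount
	}
	return nvidiaGpu
}
```

The resolved count is what gets written into the proto MachineSpec and later read by the autoscaler adapter.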

Template Changes

GCP template changes implemented in berops/claudie-config#22

Summary by CodeRabbit

  • New Features

    • GCP now supports NVIDIA GPUs with configurable GPU count and explicit GPU type; deprecated GPU field retained for compatibility.
    • Local "kind" cluster deploy target added to update and rollout workloads.
  • Documentation

    • Added GCP GPU docs, examples, provider guidance and a GPU Operator deployment guide; updated provider matrix.
  • Validation

    • GCP-specific validation requires GPU type when GPUs are requested.
  • Tests

    • Added validation tests covering GCP and non‑GCP GPU scenarios.

Samuel STOLICNY (contractor) added 2 commits January 27, 2026 16:24
Add support for attaching NVIDIA GPUs to GCP compute instances via the guest_accelerator block. Unlike other providers, where GPU-enabled instance types automatically include GPUs, GCP requires the GPU type and count to be configured explicitly.
@coderabbitai (Bot) commented Jan 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds GPU count/type fields to MachineSpec (preserving deprecated field), enforces GCP-specific validation requiring GPU type when GPU count > 0, updates protobuf/CRD/schema and manifest mapping, adjusts autoscaler to use NvidiaGpuCount, expands GPU docs/examples, and adds a Makefile kind-deploy target.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Build & CI: Makefile | Added kind-deploy PHONY target; introduced KIND_CLUSTER and KIND_NAMESPACE; updated kind-load-images to use --name $(KIND_CLUSTER); kind-deploy updates images in $(KIND_NAMESPACE) and waits for rollouts. |
| API Schema & CRD: proto/spec/nodepool.proto, manifests/claudie/crd/claudie.io_inputmanifests.yaml, internal/api/manifest/manifest.go | Replaced nvidiaGpu with nvidiaGpuCount and added nvidiaGpuType; retained deprecated nvidiaGpu; added memory in proto; CRD updated to include new fields and bumped kubeone version. |
| Manifest Mapping: internal/api/manifest/utils.go | Prefer NvidiaGpuCount with fallback to deprecated NvidiaGpu; populate NvidiaGpuCount and NvidiaGpuType into the public MachineSpec. |
| Validation & Tests: internal/api/manifest/validate_node_pool.go, internal/api/manifest/validate_test.go | Added GCP-specific validation: if provider is GCP and GPU count > 0, require NvidiaGpuType; added unit tests covering GCP/non-GCP and deprecated-field fallback cases. |
| Autoscaler Integration: services/autoscaler-adapter/node_manager/... | Autoscaler/node-manager now reads NvidiaGpuCount (overrides cached GPU count when > 0); minor comment updates. |
| Documentation & Examples: README.md, docs/input-manifest/api-reference.md, docs/input-manifest/gpu-example.md, docs/input-manifest/providers/gcp.md, docs/autoscaling/autoscaling.md | Documented new GPU fields and GCP requirements; added GCP GPU examples and GPU Operator deployment steps; marked GCP as GPU-supported; updated parameter name in autoscaling docs. |
| Manifests / Images: manifests/claudie/kustomization.yaml, manifests/testing-framework/kustomization.yaml | Bumped image tags across kustomizations. |
| Misc (small): internal/api/manifest/manifest.go (comments), services/autoscaler-adapter/node_manager/utils.go | Comment and minor references updated to reflect NvidiaGpuCount naming. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
participant User as User (submits manifest)
participant API as API Server
participant Validator as Manifest Validator
participant Mapper as Manifest Mapper
participant Cluster as Cloud Provider (GCP/AWS)
participant Autoscaler as Autoscaler Adapter

User->>API: Submit InputManifest (machineSpec w/ GPU fields)
API->>Validator: Validate nodepools
Validator->>Validator: If provider==GCP and gpuCount>0 require gpuType
Validator-->>API: Validation result (ok / error)
API->>Mapper: Map manifest -> protobuf MachineSpec (nvidiaGpuCount, nvidiaGpuType)
Mapper-->>Cluster: Create/update nodepool
Cluster-->>Autoscaler: Nodes report capacity (includes GPU count)
Autoscaler->>Autoscaler: Compute scaling using nvidiaGpuCount
```

Suggested labels

test-set-autoscaling, documentation

🚥 Pre-merge checks: ✅ 5 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and specifically describes the main change: adding GPU guest accelerator support for GCP nodepools, which aligns with the substantial refactoring of GPU fields across the codebase. |
| Linked Issues check | ✅ Passed | The PR successfully implements all core objectives from issue #1845: extended MachineSpec with nvidiaGpuCount/nvidiaGpuType, added GCP-specific validation, maintained backward compatibility with the deprecated nvidiaGpu field, updated documentation with GPU configuration examples, and ensured proto/CRD compatibility. |
| Out of Scope Changes check | ✅ Passed | The Makefile kind-deploy target and image tag updates in kustomization files are minor infrastructure improvements that support the GPU implementation testing and deployment workflow, while remaining aligned with the overall PR objectives. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |


@samuelstolicny marked this pull request as ready for review January 28, 2026 08:39
@samuelstolicny added the feature and refresh-docs labels Jan 28, 2026
@coderabbitai (Bot) left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@docs/input-manifest/providers/gcp.md`:
- Around line 98-115: Fix the hyphenation in the `nvidia-tesla-v100` description
to "high-performance training" and update the machine-type guidance: remove any
mention of N2/N2D as GPU-capable, state that attached GPUs should use
`n1-standard-*` or `n1-highmem-*` machine types (i.e., N1 supports GPUs), and
add a clarified note that GPU-optimized families such as A2, G2, A3, A4, G4 come
with pre-attached GPU configurations rather than supporting arbitrary attached
GPUs; keep the `nvidia-tesla-*` GPU type list as-is but ensure the "GPU
Availability" and "GPU Instance Limitations" warnings reflect this distinction.
🧹 Nitpick comments (3)
Makefile (1)

95-102: Fail fast in kind-deploy and scope rollout to updated deployments.

Line 97–102: as written, a failed kubectl set image inside the loop can be masked, and the final kubectl rollout status deployment -n $(KIND_NAMESPACE) can block on unrelated deployments in the namespace. Consider failing fast and rolling out per deployment with a timeout.

♻️ Suggested update
```diff
 kind-deploy: kind-load-images
 	@echo " --- updating deployments in $(KIND_NAMESPACE) namespace --- "
-	@for svc in ansibler builder claudie-operator kube-eleven kuber manager terraformer; do \
+	@set -e; \
+	for svc in ansibler builder claudie-operator kube-eleven kuber manager terraformer; do \
 		echo " --- updating $$svc deployment --- "; \
 		kubectl set image deployment/$$svc $$svc=ghcr.io/berops/claudie/$$svc:$(REV) -n $(KIND_NAMESPACE); \
+		kubectl rollout status deployment/$$svc -n $(KIND_NAMESPACE) --timeout=5m; \
 	done
-	@echo " --- waiting for rollouts to complete --- "
-	@kubectl rollout status deployment -n $(KIND_NAMESPACE)
+	@echo " --- rollouts completed --- "
```
internal/api/manifest/validate_node_pool.go (1)

145-170: Clarify error text for deprecated GPU count usage.

If users still set nvidiaGpu (deprecated), the current message may be confusing. Consider naming both fields.

🛠️ Suggested tweak
```diff
-		return fmt.Errorf("nvidiaGpuType is required for GCP when nvidiaGpuCount > 0")
+		return fmt.Errorf("nvidiaGpuType is required for GCP when nvidiaGpuCount (or deprecated nvidiaGpu) > 0")
```
internal/api/manifest/validate_test.go (1)

332-440: Consider adding test case for GCP with deprecated NvidiaGpu field.

The test covers the new NvidiaGpuCount field well for GCP and validates backward compatibility for non-GCP providers using the deprecated NvidiaGpu field. However, there's a gap: what happens when a GCP nodepool uses the deprecated NvidiaGpu field without specifying NvidiaGpuType?

If the validation logic correctly considers both fields when determining GPU presence, this scenario should also fail for GCP. Adding this test case would ensure the deprecated field path is also validated consistently for GCP.

Suggested additional test case
```diff
 	r.NoError(hetznerNodepoolDeprecatedGpu.Validate(hetznerManifest), "Non-GCP nodepool with deprecated nvidiaGpu but no type should pass validation")
+
+	// Test case 6: GCP nodepool with deprecated nvidiaGpu field but no type - should fail (GCP requires type regardless of which field is used)
+	gcpNodepoolDeprecatedGpuNoType := &DynamicNodePool{
+		Name:       "gpu-np-dep",
+		ServerType: "n1-standard-4",
+		Image:      "ubuntu-2204",
+		Count:      1,
+		ProviderSpec: ProviderSpec{
+			Name:   "gcp-1",
+			Region: "us-central1",
+			Zone:   "us-central1-a",
+		},
+		MachineSpec: &MachineSpec{
+			NvidiaGpu: 1, // Using deprecated field
+		},
+	}
+	r.Error(gcpNodepoolDeprecatedGpuNoType.Validate(gcpManifest), "GCP nodepool with deprecated nvidiaGpu but no type should fail validation")
 }
```

@coderabbitai (Bot) left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@internal/api/manifest/validate_node_pool.go`:
- Around line 161-169: The GPU-type presence check currently treats
whitespace-only strings as valid; update the validation in validate_node_pool
(where gpuCount is computed from d.MachineSpec.NvidiaGpuCount / NvidiaGpu) to
trim d.MachineSpec.NvidiaGpuType (e.g., using strings.TrimSpace) before testing
emptiness and return the same error when the trimmed value is empty; reference
the gpuCount variable and d.MachineSpec.NvidiaGpuType to locate the check and
ensure whitespace-only values are rejected.
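
A minimal sketch of the suggested fix, assuming a small helper is acceptable; the helper name is illustrative, and only the strings.TrimSpace approach is prescribed by the comment above:

```go
package manifest

import "strings"

// gpuTypeProvided implements the suggestion above: a whitespace-only
// nvidiaGpuType is treated the same as an empty one and therefore rejected.
func gpuTypeProvided(nvidiaGpuType string) bool {
	return strings.TrimSpace(nvidiaGpuType) != ""
}
```

The existing check would then read along the lines of `if gpuCount > 0 && !gpuTypeProvided(d.MachineSpec.NvidiaGpuType)` and return the same error as before.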

@Despire left a comment


LGTM 👍

Please also replace the old field here: https://github.com/berops/claudie/blob/master/docs/autoscaling/autoscaling.md

After that, we can merge.

@samuelstolicny removed the request for review from jakubhlavacka February 9, 2026 10:16
@samuelstolicny added this pull request to the merge queue Feb 9, 2026
Merged via the queue into master with commit 2a4c922 Feb 9, 2026
@samuelstolicny deleted the feature/1845-gcp-gpu-guest-accelerator branch February 9, 2026 10:20

Labels

feature (New feature), refresh-docs (Trigger automatic update of the latest docs version; a /refresh-docs comment is also a trigger)


Development

Successfully merging this pull request may close these issues.

Feature: Enable GPU acceleration for instances where instance-type does not automatically enable the GPU's

3 participants