
feat(gcp): add GPU guest accelerator support for GCP nodepools #1952

Merged

samuelstolicny merged 9 commits into master from feature/1845-gcp-gpu-guest-accelerator on Feb 9, 2026
Conversation

@samuelstolicny (Contributor) commented Jan 27, 2026

Summary

Closes #1845

This PR adds GPU guest accelerator support for GCP nodepools, allowing users to attach NVIDIA GPUs to GCP compute instances via Claudie's InputManifest.

Changes

  • Proto: Added nvidiaGpuCount (renamed from nvidiaGpu) and nvidiaGpuType fields to MachineSpec
  • Manifest: Updated MachineSpec struct with new GPU fields and backward compatibility for nvidiaGpu
  • Validation: Added validateGCPGpuConfig() to ensure GCP nodepools with GPUs have the required type specified (see the sketch after this list)
  • CRD: Regenerated with new nvidiaGpuCount and nvidiaGpuType fields
  • Autoscaler: Updated to use NvidiaGpuCount field name
  • Documentation: Added GPU configuration docs for GCP provider
  • Makefile: Added kind-deploy target for easier local testing
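
For illustration, a minimal sketch of what the validateGCPGpuConfig() check could look like, based only on the behavior described in this PR; the type, field, and provider names are assumptions, not the exact code in internal/api/manifest:

```go
package manifest

import "fmt"

// MachineSpec mirrors the GPU fields described in this PR; this is a sketch,
// not the actual struct from internal/api/manifest.
type MachineSpec struct {
	NvidiaGpu      int64  // deprecated alias, kept for backward compatibility
	NvidiaGpuCount int64  // preferred field
	NvidiaGpuType  string // required on GCP whenever GPUs are requested
}

// validateGCPGpuConfig illustrates the rule described above: GCP nodepools
// that request GPUs must also specify the GPU type, because GCP attaches
// GPUs through an explicit guest accelerator configuration.
// The "gcp" provider string is an assumption made for this sketch.
func validateGCPGpuConfig(cloudProvider string, m *MachineSpec) error {
	if m == nil || cloudProvider != "gcp" {
		return nil
	}
	if m.NvidiaGpuCount > 0 && m.NvidiaGpuType == "" {
		return fmt.Errorf("nvidiaGpuType is required for GCP when nvidiaGpuCount > 0")
	}
	return nil
}
```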

Example Usage

```yaml
nodePools:
  dynamic:
    - name: gpu-workers
      providerSpec:
        name: gcp-provider
        region: europe-west1
        zone: europe-west1-b
      count: 1
      serverType: n1-standard-4
      image: ubuntu-2204-lts
      machineSpec:
        nvidiaGpuCount: 1
        nvidiaGpuType: nvidia-tesla-t4
```

Backward Compatibility

  • The old nvidiaGpu field is preserved as a deprecated alias for nvidiaGpuCount (see the sketch after this list)
  • Non-GCP providers can still use GPU count without specifying type
  • Proto field numbers maintained for wire compatibility
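
To make the fallback concrete, here is a rough sketch of how the deprecated field could be resolved when mapping the manifest; the helper name is hypothetical, and the real logic lives in internal/api/manifest/utils.go:

```go
package manifest

// resolveGpuCount is a hypothetical helper illustrating the compatibility
// rule above: the new nvidiaGpuCount field wins, and the deprecated
// nvidiaGpu field is only consulted when the new field is unset, so existing
// manifests keep working without changes.
func resolveGpuCount(nvidiaGpuCount, nvidiaGpu int64) int64 {
	if nvidiaGpuCount > 0 {
		return nvidiaGpuCount
	}
	return nvidiaGpu
}
```

The resolved count is what gets written into the proto MachineSpec and later read by the autoscaler adapter.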

Template Changes

GCP template changes implemented in berops/claudie-config#22

Summary by CodeRabbit

  • New Features

    • GCP now supports NVIDIA GPUs with configurable GPU count and explicit GPU type; deprecated GPU field retained for compatibility.
    • Local "kind" cluster deploy target added to update and rollout workloads.
  • Documentation

    • Added GCP GPU docs, examples, provider guidance and a GPU Operator deployment guide; updated provider matrix.
  • Validation

    • GCP-specific validation requires GPU type when GPUs are requested.
  • Tests

    • Added validation tests covering GCP and non‑GCP GPU scenarios.

Samuel STOLICNY (contractor) added 2 commits January 27, 2026 16:24
Add support for attaching NVIDIA GPUs to GCP compute instances via the guest_accelerator block. Unlike other providers, where GPU-enabled instance types automatically include GPUs, GCP requires the GPU type and count to be configured explicitly.
@coderabbitai (Bot) commented Jan 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds GPU count/type fields to MachineSpec (preserving deprecated field), enforces GCP-specific validation requiring GPU type when GPU count > 0, updates protobuf/CRD/schema and manifest mapping, adjusts autoscaler to use NvidiaGpuCount, expands GPU docs/examples, and adds a Makefile kind-deploy target.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Build & CI: Makefile | Added kind-deploy PHONY target; introduced KIND_CLUSTER and KIND_NAMESPACE; updated kind-load-images to use --name $(KIND_CLUSTER); kind-deploy updates images in $(KIND_NAMESPACE) and waits for rollouts. |
| API Schema & CRD: proto/spec/nodepool.proto, manifests/claudie/crd/claudie.io_inputmanifests.yaml, internal/api/manifest/manifest.go | Replaced nvidiaGpu with nvidiaGpuCount and added nvidiaGpuType; retained deprecated nvidiaGpu; added memory in proto; CRD updated to include new fields and bumped kubeone version. |
| Manifest Mapping: internal/api/manifest/utils.go | Prefer NvidiaGpuCount with fallback to deprecated NvidiaGpu; populate NvidiaGpuCount and NvidiaGpuType into the public MachineSpec. |
| Validation & Tests: internal/api/manifest/validate_node_pool.go, internal/api/manifest/validate_test.go | Added GCP-specific validation: if provider is GCP and GPU count > 0, require NvidiaGpuType; added unit tests covering GCP/non-GCP and deprecated-field fallback cases. |
| Autoscaler Integration: services/autoscaler-adapter/node_manager/... | Autoscaler/node-manager now reads NvidiaGpuCount (overrides cached GPU count when > 0); minor comment updates. |
| Documentation & Examples: README.md, docs/input-manifest/api-reference.md, docs/input-manifest/gpu-example.md, docs/input-manifest/providers/gcp.md, docs/autoscaling/autoscaling.md | Documented new GPU fields and GCP requirements; added GCP GPU examples and GPU Operator deployment steps; marked GCP as GPU-supported; updated parameter name in autoscaling docs. |
| Manifests / Images: manifests/claudie/kustomization.yaml, manifests/testing-framework/kustomization.yaml | Bumped image tags across kustomizations. |
| Misc (small): internal/api/manifest/manifest.go (comments), services/autoscaler-adapter/node_manager/utils.go | Comment and minor references updated to reflect NvidiaGpuCount naming. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
participant User as User (submits manifest)
participant API as API Server
participant Validator as Manifest Validator
participant Mapper as Manifest Mapper
participant Cluster as Cloud Provider (GCP/AWS)
participant Autoscaler as Autoscaler Adapter

User->>API: Submit InputManifest (machineSpec w/ GPU fields)
API->>Validator: Validate nodepools
Validator->>Validator: If provider==GCP and gpuCount>0 require gpuType
Validator-->>API: Validation result (ok / error)
API->>Mapper: Map manifest -> protobuf MachineSpec (nvidiaGpuCount, nvidiaGpuType)
Mapper-->>Cluster: Create/update nodepool
Cluster-->>Autoscaler: Nodes report capacity (includes GPU count)
Autoscaler->>Autoscaler: Compute scaling using nvidiaGpuCount
```

Suggested labels

test-set-autoscaling, documentation

🚥 Pre-merge checks: ✅ 5 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and specifically describes the main change: adding GPU guest accelerator support for GCP nodepools, which aligns with the substantial refactoring of GPU fields across the codebase. |
| Linked Issues check | ✅ Passed | The PR successfully implements all core objectives from issue #1845: extended MachineSpec with nvidiaGpuCount/nvidiaGpuType, added GCP-specific validation, maintained backward compatibility with the deprecated nvidiaGpu field, updated documentation with GPU configuration examples, and ensured proto/CRD compatibility. |
| Out of Scope Changes check | ✅ Passed | The Makefile kind-deploy target and image tag updates in kustomization files are minor infrastructure improvements that support the GPU implementation testing and deployment workflow, while remaining aligned with the overall PR objectives. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |


@samuelstolicny marked this pull request as ready for review January 28, 2026 08:39
@samuelstolicny added the feature and refresh-docs labels Jan 28, 2026
@coderabbitai (Bot) left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@docs/input-manifest/providers/gcp.md`:
- Around line 98-115: Fix the hyphenation in the `nvidia-tesla-v100` description
to "high-performance training" and update the machine-type guidance: remove any
mention of N2/N2D as GPU-capable, state that attached GPUs should use
`n1-standard-*` or `n1-highmem-*` machine types (i.e., N1 supports GPUs), and
add a clarified note that GPU-optimized families such as A2, G2, A3, A4, G4 come
with pre-attached GPU configurations rather than supporting arbitrary attached
GPUs; keep the `nvidia-tesla-*` GPU type list as-is but ensure the "GPU
Availability" and "GPU Instance Limitations" warnings reflect this distinction.
🧹 Nitpick comments (3)
Makefile (1)

95-102: Fail fast in kind-deploy and scope rollout to updated deployments.

Line 97–102: as written, a failed kubectl set image inside the loop can be masked, and the final kubectl rollout status deployment -n $(KIND_NAMESPACE) can block on unrelated deployments in the namespace. Consider failing fast and rolling out per deployment with a timeout.

♻️ Suggested update
```diff
 kind-deploy: kind-load-images
 	@echo " --- updating deployments in $(KIND_NAMESPACE) namespace --- "
-	@for svc in ansibler builder claudie-operator kube-eleven kuber manager terraformer; do \
+	@set -e; \
+	for svc in ansibler builder claudie-operator kube-eleven kuber manager terraformer; do \
 		echo " --- updating $$svc deployment --- "; \
 		kubectl set image deployment/$$svc $$svc=ghcr.io/berops/claudie/$$svc:$(REV) -n $(KIND_NAMESPACE); \
+		kubectl rollout status deployment/$$svc -n $(KIND_NAMESPACE) --timeout=5m; \
 	done
-	@echo " --- waiting for rollouts to complete --- "
-	@kubectl rollout status deployment -n $(KIND_NAMESPACE)
+	@echo " --- rollouts completed --- "
```
internal/api/manifest/validate_node_pool.go (1)

145-170: Clarify error text for deprecated GPU count usage.

If users still set nvidiaGpu (deprecated), the current message may be confusing. Consider naming both fields.

🛠️ Suggested tweak
```diff
-		return fmt.Errorf("nvidiaGpuType is required for GCP when nvidiaGpuCount > 0")
+		return fmt.Errorf("nvidiaGpuType is required for GCP when nvidiaGpuCount (or deprecated nvidiaGpu) > 0")
```
internal/api/manifest/validate_test.go (1)

332-440: Consider adding test case for GCP with deprecated NvidiaGpu field.

The test covers the new NvidiaGpuCount field well for GCP and validates backward compatibility for non-GCP providers using the deprecated NvidiaGpu field. However, there's a gap: what happens when a GCP nodepool uses the deprecated NvidiaGpu field without specifying NvidiaGpuType?

If the validation logic correctly considers both fields when determining GPU presence, this scenario should also fail for GCP. Adding this test case would ensure the deprecated field path is also validated consistently for GCP.

Suggested additional test case
```diff
 	r.NoError(hetznerNodepoolDeprecatedGpu.Validate(hetznerManifest), "Non-GCP nodepool with deprecated nvidiaGpu but no type should pass validation")
+
+	// Test case 6: GCP nodepool with deprecated nvidiaGpu field but no type - should fail (GCP requires type regardless of which field is used)
+	gcpNodepoolDeprecatedGpuNoType := &DynamicNodePool{
+		Name:       "gpu-np-dep",
+		ServerType: "n1-standard-4",
+		Image:      "ubuntu-2204",
+		Count:      1,
+		ProviderSpec: ProviderSpec{
+			Name:   "gcp-1",
+			Region: "us-central1",
+			Zone:   "us-central1-a",
+		},
+		MachineSpec: &MachineSpec{
+			NvidiaGpu: 1, // Using deprecated field
+		},
+	}
+	r.Error(gcpNodepoolDeprecatedGpuNoType.Validate(gcpManifest), "GCP nodepool with deprecated nvidiaGpu but no type should fail validation")
 }
```

@coderabbitai (Bot) left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@internal/api/manifest/validate_node_pool.go`:
- Around line 161-169: The GPU-type presence check currently treats
whitespace-only strings as valid; update the validation in validate_node_pool
(where gpuCount is computed from d.MachineSpec.NvidiaGpuCount / NvidiaGpu) to
trim d.MachineSpec.NvidiaGpuType (e.g., using strings.TrimSpace) before testing
emptiness and return the same error when the trimmed value is empty; reference
the gpuCount variable and d.MachineSpec.NvidiaGpuType to locate the check and
ensure whitespace-only values are rejected.
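
A minimal sketch of the suggested fix, assuming a small helper is acceptable; the helper name is illustrative, and only the strings.TrimSpace approach is prescribed by the comment above:

```go
package manifest

import "strings"

// gpuTypeProvided implements the suggestion above: a whitespace-only
// nvidiaGpuType is treated the same as an empty one and therefore rejected.
func gpuTypeProvided(nvidiaGpuType string) bool {
	return strings.TrimSpace(nvidiaGpuType) != ""
}
```

The existing check would then read along the lines of `if gpuCount > 0 && !gpuTypeProvided(d.MachineSpec.NvidiaGpuType)` and return the same error as before.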

@Despire left a comment


LGTM 👍

Please also replace the old field here: https://github.com/berops/claudie/blob/master/docs/autoscaling/autoscaling.md

After that, we can merge.

@samuelstolicny removed the request for review from jakubhlavacka February 9, 2026 10:16
@samuelstolicny added this pull request to the merge queue Feb 9, 2026
Merged via the queue into master with commit 2a4c922 Feb 9, 2026
@samuelstolicny deleted the feature/1845-gcp-gpu-guest-accelerator branch February 9, 2026 10:20

Labels

feature (New feature), refresh-docs (Trigger automatic update of the latest docs version; a /refresh-docs comment is also a trigger)


Development

Successfully merging this pull request may close these issues.

Feature: Enable GPU acceleration for instances where instance-type does not automatically enable the GPU's

3 participants