chore: expand machine spec to contain number of gpus #1854
Conversation
Caution: Review failed. The pull request is closed.

Walkthrough

Added optional NVIDIA GPU support across MachineSpec (API, proto, CRD, autoscaler); relaxed cpuCount/memory validations and removed them from CRD required lists; aggregated provider GPU info (AWS/GCP); autoscaler capacity now includes GPUs and respects manifest overrides; docs/CRD text updated; deterministic test sorting and image tag bumps in manifests.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant NP as NodeManager
    participant DNP as DynamicNodePool
    participant TI as typeInfo
    participant RL as ResourceList
    rect rgba(230,240,255,0.20)
        note over NP: GetCapacity(np)
        NP->>DNP: read DynamicNodePool
        alt dnp is nil
            NP-->>RL: return nil
        else
            NP->>TI: lookup typeInfo (provider + instance)
            alt typeInfo missing
                NP-->>RL: return nil
            else
                note over TI: cpu, memory, disk, nvidiaGpus
                NP->>RL: build cpu/memory/storage resources
                alt TI.nvidiaGpus > 0
                    NP->>RL: add `nvidia.com/gpu` = TI.nvidiaGpus
                end
                alt DNP.MachineSpec.NvidiaGpu > 0
                    NP->>RL: override `nvidia.com/gpu` = DNP.MachineSpec.NvidiaGpu
                end
                NP-->>RL: return ResourceList
            end
        end
    end
```
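The GetCapacity flow in the diagram can be sketched in Go. This is a simplified illustration, not Claudie's actual implementation: it uses a plain map in place of Kubernetes' `corev1.ResourceList`, and the type and field names are assumptions.

```go
package main

import "fmt"

// typeInfo mirrors the cached per-instance-type data (illustrative).
type typeInfo struct {
	cpu, memory, disk, nvidiaGpus int64
}

// machineSpec mirrors the manifest-level override (illustrative).
type machineSpec struct {
	nvidiaGpu int64
}

// getCapacity follows the diagram: nil checks first, then the base
// cpu/memory/storage resources, then the GPU entry, with the manifest
// override taking precedence over the cached typeInfo value.
func getCapacity(ti *typeInfo, ms *machineSpec) map[string]int64 {
	if ti == nil {
		return nil // typeInfo missing -> return nil
	}
	rl := map[string]int64{
		"cpu":     ti.cpu,
		"memory":  ti.memory,
		"storage": ti.disk,
	}
	if ti.nvidiaGpus > 0 {
		rl["nvidia.com/gpu"] = ti.nvidiaGpus
	}
	if ms != nil && ms.nvidiaGpu > 0 {
		rl["nvidia.com/gpu"] = ms.nvidiaGpu // manifest override wins
	}
	return rl
}

func main() {
	ti := &typeInfo{cpu: 4, memory: 16, disk: 50, nvidiaGpus: 1}
	fmt.Println(getCapacity(ti, &machineSpec{nvidiaGpu: 2})["nvidia.com/gpu"]) // prints 2
}
```

In the real code the map is a `corev1.ResourceList` keyed by resource names such as `nvidia.com/gpu`.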
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (2)
Actionable comments posted: 0
🧹 Nitpick comments (7)
docs/input-manifest/api-reference.md (1)
262-265: Tighten wording and capitalization; note NVIDIA plugin requirement

Recommend minor edits for consistency and clarity.
```diff
-  - `cpuCount`: specifies the number of cpus used by the `serverType`
-  - `memory`: specifies the memory in GBs used by the `serverType`
-  - `nvidiaGpu`: specifies the number of nvidia GPUs used by the `serverType`
+  - `cpuCount`: number of CPUs for the `serverType`
+  - `memory`: memory in GB for the `serverType`
+  - `nvidiaGpu`: number of NVIDIA GPUs for the `serverType` (requires the NVIDIA device plugin on nodes)
```

internal/api/manifest/utils.go (1)
293-296: Mapping MachineSpec is correct; consider guarding against int→int32 overflow

The casts are fine, but if manifest values ever exceed int32 (unlikely for CPUs/GPUs, but possible for memory if unit semantics change), this would truncate.
Optionally add a bounds check before casting, or constrain/validate the upper bound where you already validate the spec.
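A bounds-checked conversion along these lines would avoid the silent truncation. This is a standalone sketch; the helper name `toInt32` is made up and is not Claudie code.

```go
package main

import (
	"fmt"
	"math"
)

// toInt32 converts an int to int32, reporting overflow instead of
// silently truncating (hypothetical helper for illustration).
func toInt32(v int) (int32, error) {
	if v < math.MinInt32 || v > math.MaxInt32 {
		return 0, fmt.Errorf("value %d overflows int32", v)
	}
	return int32(v), nil
}

func main() {
	if n, err := toInt32(8); err == nil {
		fmt.Println(n) // prints 8
	}
	if _, err := toInt32(1 << 40); err != nil {
		fmt.Println("overflow detected")
	}
}
```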
services/autoscaler-adapter/node_manager/node_manager.go (3)
71-74: Add a nil-check for dnp.Provider to prevent rare NPEs

Unlikely, but if Provider is ever missing, `dnp.Provider.CloudProviderName` will panic.

```diff
-	typeInfo := nm.getTypeInfo(dnp.Provider.CloudProviderName, dnp)
+	if dnp.Provider == nil {
+		return nil
+	}
+	typeInfo := nm.getTypeInfo(dnp.Provider.CloudProviderName, dnp)
```
76-83: Use local dnp variable and fix comment typo

Minor clarity/readability.

```diff
-	// Check if disk is define for the instance.
+	// Check if disk is defined for the instance.
 ...
-		disk = int64(np.GetDynamicNodePool().StorageDiskSize) * 1024 * 1024 * 1024 // Convert to bytes
+		disk = int64(dnp.StorageDiskSize) * 1024 * 1024 * 1024 // Convert to bytes
```
84-89: Verify resource key: prefer EphemeralStorage over Storage for node capacity

Kubernetes node capacity commonly reports ephemeral storage via `corev1.ResourceEphemeralStorage`. Using `ResourceStorage` may not have the intended effect for scheduling/models.

```diff
-	rl[k8sV1.ResourceStorage] = *resource.NewQuantity(disk, resource.DecimalSI)
+	rl[k8sV1.ResourceEphemeralStorage] = *resource.NewQuantity(disk, resource.DecimalSI)
```

If other components rely on `ResourceStorage` here, ignore this change, but please confirm.
128-137: Validation semantics: decide whether zero values should be allowed when provided
`required_with=Memory|CpuCount, gte=0` allows `0` for CPU/Memory if present. If the intent is "only positive when specified," switch to `gt=0`. If `0` is a sentinel for "use provider default," consider pointer fields to disambiguate presence.

Examples:

- Keep int, require positive when present: `validate:"required_with=Memory,gt=0"`
- Use pointers for presence:

```go
CpuCount  *int `validate:"omitempty,gt=0"`
Memory    *int `validate:"omitempty,gt=0"`
NvidiaGpu *int `validate:"omitempty,gte=0"`
```

manifests/claudie/crd/claudie.io_inputmanifests.yaml (1)
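A small standalone sketch (illustrative structs, not Claudie's actual types) shows why pointers disambiguate presence: with plain ints, `0` and "field omitted" look identical, while a nil pointer provably means "not specified".

```go
package main

import "fmt"

// specWithInts cannot tell "cpuCount: 0" apart from an omitted field.
type specWithInts struct {
	CpuCount int
}

// specWithPtrs can: nil means "not specified", &0 means "explicitly 0".
type specWithPtrs struct {
	CpuCount *int
}

func main() {
	var a specWithInts // field omitted -> zero value 0
	fmt.Println(a.CpuCount == 0) // true, indistinguishable from an explicit 0

	var b specWithPtrs // field omitted -> nil
	fmt.Println(b.CpuCount == nil) // true: provably unset

	zero := 0
	c := specWithPtrs{CpuCount: &zero} // explicitly set to 0
	fmt.Println(c.CpuCount != nil && *c.CpuCount == 0) // true
}
```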
304-316: Align CRD validation with code: add `minimum: 0` and fix NVIDIA capitalization

Manifest validation uses `gte=0`. Mirror that in the CRD for consistency.

```diff
 properties:
   cpuCount:
     description: CpuCount specifies the number of CPU cores
       the provided instance type will have.
     type: integer
+    minimum: 0
   memory:
     description: Memory specifies the memory the provided
       instance type will have.
     type: integer
+    minimum: 0
   nvidiaGpu:
-    description: Nvidia specifies the number of NVIDIA GPUs
+    description: NVIDIA specifies the number of NVIDIA GPUs
       the provided instance type will have.
     type: integer
+    minimum: 0
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`proto/pb/spec/nodepool.pb.go` is excluded by `!**/*.pb.go`
📒 Files selected for processing (8)
- `docs/input-manifest/api-reference.md` (1 hunks)
- `internal/api/manifest/manifest.go` (1 hunks)
- `internal/api/manifest/utils.go` (1 hunks)
- `manifests/claudie/crd/claudie.io_inputmanifests.yaml` (1 hunks)
- `manifests/claudie/crd/claudie.io_settings.yaml` (0 hunks)
- `manifests/claudie/crd/kustomization.yaml` (0 hunks)
- `proto/spec/nodepool.proto` (2 hunks)
- `services/autoscaler-adapter/node_manager/node_manager.go` (1 hunks)
💤 Files with no reviewable changes (2)
- manifests/claudie/crd/kustomization.yaml
- manifests/claudie/crd/claudie.io_settings.yaml
🧰 Additional context used
🧬 Code graph analysis (2)
services/autoscaler-adapter/node_manager/node_manager.go (2)
internal/api/manifest/manifest.go (1)
`MachineSpec` (127-137)

proto/pb/spec/nodepool.pb.go (3)

`MachineSpec` (557-564), `MachineSpec` (577-577), `MachineSpec` (592-594)
internal/api/manifest/utils.go (2)
internal/api/manifest/manifest.go (1)
`MachineSpec` (127-137)

proto/pb/spec/nodepool.pb.go (3)

`MachineSpec` (557-564), `MachineSpec` (577-577), `MachineSpec` (592-594)
🔇 Additional comments (4)
proto/spec/nodepool.proto (2)
94-95: Proto field addition LGTM; confirm presence semantics are acceptable

Adding `int32 nvidiaGpu = 3;` is wire-compatible and aligns with usages gating on `> 0`. Proto3 can't distinguish "unset" from "0"; if you ever need that, consider a wrapper type. Otherwise good as-is.

Would you like me to verify all generated bindings are updated and committed (e.g., nodepool.pb.go includes `NvidiaGpu`)? I can provide a quick grep script.
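For reference, the wrapper-type option mentioned above would look roughly like this in the proto. This is a sketch: the message shape and field numbers are illustrative, not the actual Claudie schema.

```proto
syntax = "proto3";

import "google/protobuf/wrappers.proto";

message MachineSpec {
  int32 cpuCount = 1;
  int32 memory = 2;
  // A wrapper type distinguishes "unset" (null) from an explicit 0,
  // which plain int32 cannot do in proto3.
  google.protobuf.Int32Value nvidiaGpu = 3;
}
```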
116-116: No action needed

Formatting-only change.
services/autoscaler-adapter/node_manager/node_manager.go (2)
66-69: Good defensive nil-check for DynamicNodePool

Early return on nil dnp prevents panics when called on static pools.
90-95: GPU capacity entry looks good; call out deployment prerequisite

Adding `nvidia.com/gpu` when `NvidiaGpu > 0` matches extended resource conventions.

Note: Pods will schedule only if the NVIDIA device plugin advertises this resource on the node. Confirm this is documented for users of GPU pools.
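For context, a workload consumes the advertised extended resource by requesting it in its resource limits. A generic Kubernetes example follows (names and image are illustrative; it assumes the NVIDIA device plugin is running on the node):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example   # illustrative name
spec:
  containers:
    - name: cuda-workload
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1   # schedules only on nodes advertising this resource
```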
jakubhlavacka
left a comment
I don't understand why we need to update the InputManifest and remove Settings CRD.
Actionable comments posted: 0
🧹 Nitpick comments (3)
services/autoscaler-adapter/node_manager/utils.go (3)
221-221: Fix typo in function comment.

The comment has a typo in the type name.

```diff
-// getTypeInfoGcp converts []*computepb.MachineTypeto typeInfo map of instances, where keys are instance types.
+// getTypeInfoGcp converts []*computepb.MachineType to typeInfo map of instances, where keys are instance types.
```
263-271: Consider documenting the OCI GPU limitation in the codebase.

While the inline comment explains the lack of a vendor metadata API, consider adding documentation in the user-facing docs or README about this limitation for OCI instances.

Would you like me to help create documentation that explains the GPU limitations for OCI and how users can work around them using the `MachineSpec.NvidiaGpu` field?
280-288: Consider documenting the Azure GPU limitation.

Similar to OCI, the Azure provider lacks a vendor metadata API. This limitation should be documented for users.

Would you like me to help create documentation that explains the GPU limitations for Azure and how users can work around them using the `MachineSpec.NvidiaGpu` field?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- `manifests/claudie/crd/claudie.io_settings.yaml` (1 hunks)
- `manifests/claudie/kustomization.yaml` (1 hunks)
- `manifests/testing-framework/kustomization.yaml` (1 hunks)
- `services/autoscaler-adapter/node_manager/node_manager.go` (2 hunks)
- `services/autoscaler-adapter/node_manager/utils.go` (4 hunks)
✅ Files skipped from review due to trivial changes (1)
- manifests/claudie/kustomization.yaml
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-10T08:04:36.596Z
Learnt from: Despire
PR: berops/claudie#1735
File: manifests/claudie/kustomization.yaml:59-61
Timestamp: 2025-06-10T08:04:36.596Z
Learning: In Claudie's deployment process, different images in the same kustomization.yaml file may have different tags intentionally. This is part of their e2e process where only relevant images are updated when changes propagate, rather than updating all images to the same tag. This selective updating approach is deliberate and not an inconsistency.
Applied to files:
manifests/testing-framework/kustomization.yaml
📚 Learning: 2025-06-10T08:26:54.132Z
Learnt from: Despire
PR: berops/claudie#1735
File: services/ansibler/server/domain/usecases/setup_loadbalancers.go:231-238
Timestamp: 2025-06-10T08:26:54.132Z
Learning: In Claudie's codebase, Role.Settings is always initialized with default values during the desired state creation process in services/manager/internal/service/create_desired_state.go. This means that accessing tg.Role.Settings.EnvoyCds, EnvoyLds, and EnvoyAdminPort will never cause a nil pointer panic since the Settings struct is guaranteed to exist even when no CRD settings are provided.
Applied to files:
manifests/claudie/crd/claudie.io_settings.yaml
🧬 Code graph analysis (1)
services/autoscaler-adapter/node_manager/node_manager.go (2)
internal/api/manifest/manifest.go (1)
`MachineSpec` (127-137)

proto/pb/spec/nodepool.pb.go (3)

`MachineSpec` (557-564), `MachineSpec` (577-577), `MachineSpec` (592-594)
🔇 Additional comments (10)
manifests/testing-framework/kustomization.yaml (2)
93-93: Image tag bump looks good and matches the e2e flow.

Acking the selective image tag update; this aligns with the team's intentional process for kustomizations where only relevant images are bumped.
93-93: Optional: pin testing-framework image by digest. The GHCR API returned a null digest; confirm and add the digest if available.

The GH API query for ghcr.io/berops/claudie/testing-framework (tag 6c3c266-3581) returned "digest": null for all versions, so the manifest digest couldn't be retrieved via that endpoint. If you want reproducible images, obtain the digest and add it to kustomization.yaml:

```diff
 images:
   - name: ghcr.io/berops/claudie/testing-framework
     newTag: 6c3c266-3581
+    # Optional: pin for reproducibility (digest takes precedence over tag)
+    # Replace <sha256> with the actual manifest digest
+    digest: sha256:<sha256>
```

To fetch the digest (authenticate as needed), use one of:
- crane digest ghcr.io/berops/claudie/testing-framework:6c3c266-3581
- docker pull ghcr.io/berops/claudie/testing-framework:6c3c266-3581 && docker image inspect --format='{{index .RepoDigests 0}}' ghcr.io/berops/claudie/testing-framework:6c3c266-3581
services/autoscaler-adapter/node_manager/utils.go (3)
179-183: LGTM! Clear implementation for Hetzner's lack of GPU support.

The addition of the `nvidiaGpus` field with an explicit value of 0 and a helpful comment makes it clear that Hetzner doesn't currently support GPU instances.
192-217: Robust GPU counting implementation for AWS.

Good implementation with proper null checks and manufacturer validation. The logic correctly accumulates GPU counts only for NVIDIA GPUs.
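The counting pattern described can be sketched with plain structs standing in for the AWS SDK's GPU info types. The field names here only loosely mirror the SDK and are assumptions; this is an illustration of the null-check-and-accumulate approach, not Claudie's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// gpuDeviceInfo loosely mirrors the shape of an EC2 GPU device entry
// (illustrative; real SDK fields are optional pointers, hence the nil checks).
type gpuDeviceInfo struct {
	Manufacturer *string
	Count        *int32
}

// countNvidiaGpus accumulates GPU counts only for NVIDIA devices,
// skipping entries with missing manufacturer or count.
func countNvidiaGpus(devices []gpuDeviceInfo) int64 {
	var total int64
	for _, d := range devices {
		if d.Manufacturer == nil || d.Count == nil {
			continue
		}
		if strings.EqualFold(*d.Manufacturer, "NVIDIA") {
			total += int64(*d.Count)
		}
	}
	return total
}

func main() {
	nvidia, amd := "NVIDIA", "AMD"
	one, four := int32(1), int32(4)
	devices := []gpuDeviceInfo{
		{Manufacturer: &nvidia, Count: &four},
		{Manufacturer: &amd, Count: &one},   // ignored: not NVIDIA
		{Manufacturer: &nvidia, Count: nil}, // ignored: missing count
	}
	fmt.Println(countNvidiaGpus(devices)) // prints 4
}
```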
225-254: Well-documented GCP GPU handling with API limitations.

Excellent documentation explaining the API limitations for GPU information retrieval. The fallback mechanism to user-provided values in `MachineSpec.NvidiaGpus` is a good solution. The implementation correctly handles the accelerator data when available.

services/autoscaler-adapter/node_manager/node_manager.go (4)
39-41: LGTM! Well-documented struct field addition.

The addition of the `nvidiaGpus` field with clear documentation is appropriate.
68-77: Good defensive programming with nil checks.

The early return pattern with nil checks prevents potential panics.
92-94: Proper GPU resource allocation.

The implementation correctly adds NVIDIA GPU resources to the Kubernetes ResourceList using the standard `nvidia.com/gpu` resource name.
96-100: Good override mechanism for user-specified GPU counts.

The logic properly prioritizes user-specified GPU counts from `MachineSpec.NvidiaGpu` over cached values, providing flexibility for cloud providers without proper GPU metadata APIs.
1-86: Formatting changes only; no functional impact.

The file contains only formatting and indentation changes to the CRD OpenAPI schema, with no modifications to the actual fields or API surface. The content remains semantically identical.
For autoscaling to work properly, when we pass the template of a node from a node pool we need to tell the autoscaler that it has GPU capacity available, which we currently do not do. It also seems that for most cloud providers supported by Claudie there is no way to programmatically resolve whether an instance has GPUs and whether they are from NVIDIA. For AWS this is not an issue, but for the others it is; for example, Google treats GPUs as a kind of "extension" that you can attach in counts of 1, 2, 4, or 8, and it seems you cannot determine this programmatically just by fetching the instance type, at least not for now. That's why a new field was added.

claudie/services/autoscaler-adapter/node_manager/node_manager.go, lines 92 to 100 in f3b9b6e
Actionable comments posted: 0
🧹 Nitpick comments (1)
docs/autoscaling/autoscaling.md (1)
38-54: Add YAML language to fenced block; fix apostrophe and small copy nits.
- Specify language for syntax highlighting and to satisfy MD040.
- Use “GPUs” (no apostrophe).
- Make closing fence unindented to avoid rendering quirks.
- Clarify comment that the instance type example is AWS-specific.
````diff
-```
+```yaml
 - name: autoscaled
   providerSpec:
     name: aws
     region: eu-central-1
     zone: eu-central-1a
   autoscaler:
     min: 0
     max: 20
-  # GPU machine type name.
+  # GPU-capable instance type (AWS example).
   serverType: g4dn.xlarge
   machineSpec:
-    # explicitly specify how many GPU's the instance type provides.
+    # Explicitly specify how many GPUs the instance type provides.
     nvidiaGpu: 1
   # OS image name
   image: ami-07eef52105e8a2059
- ```
+```
````
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `docs/autoscaling/autoscaling.md` (1 hunks)
- `services/autoscaler-adapter/node_manager/node_manager.go` (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- services/autoscaler-adapter/node_manager/node_manager.go
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/autoscaling/autoscaling.md
38-38: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: check-changes
🔇 Additional comments (1)
docs/autoscaling/autoscaling.md (1)
36-37: Clarify GPU auto-detection behavior and align terminology.

File: docs/autoscaling/autoscaling.md (around lines 36-37). Current wording implies GPUs are never auto-detected; the code attempts detection for some providers but not all (e.g., GCP attachable accelerators or providers without vendor GPU metadata). Use "NVIDIA GPUs", "node pool", and the YAML field `nvidiaGpu` consistently.

```diff
-The custom Claudie-Provider for the [Cluster-Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) does not automatically determine whether the provided instance types have GPU capabilities. If you want autoscaling for a nodepool with GPUs, you must explicitly specify how many GPUs each node in the nodepool has.
+The Claudie provider cannot always auto-detect NVIDIA GPU counts from instance types across all providers. When the provider cannot determine the GPU count (e.g., GCP attachable accelerators or providers without GPU metadata), explicitly set the per-node count via `machineSpec.nvidiaGpu`. On providers where detection is supported (e.g., some AWS GPU families), this field is optional and will override auto-detection if set. If you want autoscaling for a node pool with GPUs, ensure each node's NVIDIA GPU count is known to Claudie.
```
Expands the machine spec to describe the number of GPUs the given instance type has. This will then be used in the Node Template for the cluster-autoscaler.
Partially also implements #1853 where applicable. However, it seems that not all of the providers supported by Claudie let us programmatically determine the number of NVIDIA GPUs attached.

That's why the change in the InputManifest for a dynamic nodepool:
Summary by CodeRabbit
New Features
Documentation
Bug Fixes
API Changes
Tests
Chores