feat(karpenter): update instance type handling logic #19

Merged
anson627 merged 17 commits into main from hbc/karp-instance-type on Mar 7, 2026

@bcho bcho commented Mar 1, 2026

This pull request introduces several improvements and refactors to the Nebius cloud provider integration for Karpenter, focusing on enhanced instance type resolution, better configurability, and improved error handling. The most significant changes include the introduction of a new instanceTypeProvider abstraction for managing instance types and platform presets, support for a configurable maximum pods per node in the Nebius node class, and updates to startup taints in example node pool configurations. Additionally, the codebase now handles quota and capacity errors more robustly and includes dependency updates.

Core provider enhancements and refactoring:

  • Introduced an instanceTypeProvider abstraction (in pkg/cloudproviders/nebius/instancetype) to manage instance type and platform preset resolution, and refactored the Nebius cloud provider (cloudprovider.go) to use this provider for instance type selection, retrieval, and mapping throughout the node lifecycle. This change centralizes logic and improves maintainability. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]

  • Refactored the provider's registration and construction to pass the Kubernetes cluster version, which is now used in resource creation and instance type provider initialization. [1] [2] [3] [4]
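The shape of such an abstraction can be sketched roughly as follows. This is a minimal illustration only, not the code from this PR: the `Provider` interface, `PlatformPreset`, `InstanceType`, `NewStaticProvider`, the `platform-preset` naming scheme, and the sample values are all hypothetical, and the cluster version is merely stored to show where it would be threaded through.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// PlatformPreset is a hypothetical pairing of a platform and a preset,
// together with the resources that the pairing offers.
type PlatformPreset struct {
	Platform  string
	Preset    string
	CPU       int
	MemoryGiB int
	GPUs      int
}

// InstanceType is a hypothetical scheduler-facing view of a preset.
type InstanceType struct {
	Name    string
	MaxPods int
	PlatformPreset
}

// ErrUnknownInstanceType is returned when a name cannot be resolved.
var ErrUnknownInstanceType = errors.New("unknown instance type")

// Provider centralizes instance type listing and name resolution so that
// callers do not each re-implement the mapping.
type Provider interface {
	List() []InstanceType
	Resolve(name string) (InstanceType, error)
}

type staticProvider struct {
	clusterVersion string // assumption: kept for version-dependent decisions
	byName         map[string]InstanceType
}

// NewStaticProvider builds a provider from a fixed preset list.
// The "<platform>-<preset>" name format is an assumption of this sketch.
func NewStaticProvider(clusterVersion string, maxPods int, presets []PlatformPreset) Provider {
	byName := make(map[string]InstanceType, len(presets))
	for _, p := range presets {
		name := p.Platform + "-" + p.Preset
		byName[name] = InstanceType{Name: name, MaxPods: maxPods, PlatformPreset: p}
	}
	return &staticProvider{clusterVersion: clusterVersion, byName: byName}
}

// List returns all known instance types sorted by name.
func (s *staticProvider) List() []InstanceType {
	out := make([]InstanceType, 0, len(s.byName))
	for _, it := range s.byName {
		out = append(out, it)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

// Resolve maps an instance type name back to its platform and preset.
func (s *staticProvider) Resolve(name string) (InstanceType, error) {
	it, ok := s.byName[name]
	if !ok {
		return InstanceType{}, fmt.Errorf("%w: %q", ErrUnknownInstanceType, name)
	}
	return it, nil
}

func main() {
	p := NewStaticProvider("1.31", 110, []PlatformPreset{
		{Platform: "cpu-d3", Preset: "4vcpu-16gb", CPU: 4, MemoryGiB: 16},
	})
	for _, it := range p.List() {
		fmt.Println(it.Name, it.CPU, it.MemoryGiB, it.MaxPods)
	}
}
```

Keeping resolution behind one interface means the create, get, and list paths of the cloud provider can share the same name-to-preset mapping.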

Configurability and CRD/schema updates:

  • Added a new maxPodsPerNode field to the Nebius node class CRD (nebiusnodeclasses.yaml) and corresponding Go struct (nebius.go), allowing users to specify the maximum number of pods per node. This value is now used in instance type resolution and advertised in node capacity. Deep copy logic was updated accordingly. [1] [2] [3]
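As a rough illustration of how an optional field like this might be consumed, the sketch below falls back to a default when the field is unset. The `NodeClassSpec` type, the pointer representation, the `EffectiveMaxPods` helper, and the fallback behaviour are assumptions made for this example; only the `maxPodsPerNode` field name comes from the description above. The value 110 is the kubelet's default pod limit.

```go
package main

import "fmt"

// defaultMaxPods mirrors the kubelet's default maxPods value.
const defaultMaxPods int32 = 110

// NodeClassSpec is a hypothetical, trimmed-down stand-in for the node class spec.
type NodeClassSpec struct {
	// MaxPodsPerNode is optional; nil means "use the default".
	MaxPodsPerNode *int32 `json:"maxPodsPerNode,omitempty"`
}

// EffectiveMaxPods returns the pod capacity to advertise for a node.
// Unset or non-positive values fall back to the default (an assumption
// of this sketch, not necessarily what the PR does).
func EffectiveMaxPods(spec NodeClassSpec) int32 {
	if spec.MaxPodsPerNode == nil || *spec.MaxPodsPerNode <= 0 {
		return defaultMaxPods
	}
	return *spec.MaxPodsPerNode
}

func main() {
	limit := int32(30)
	fmt.Println(EffectiveMaxPods(NodeClassSpec{}))
	fmt.Println(EffectiveMaxPods(NodeClassSpec{MaxPodsPerNode: &limit}))
}
```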

Example and scheduling improvements:

  • Updated Nebius CPU and GPU node pool YAML examples to include a startupTaints section, specifically adding a taint to avoid Cilium-related scheduling issues during node startup. [1] [2]

Error handling and cleanup:

  • Improved quota and capacity error detection by recognizing "insufficient capacity" errors, and made cleanup after quota failures more robust by running it under a timeout context and documenting known limitations and planned follow-ups in code comments. [1] [2]
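The two ideas in this item can be sketched as below. This is an illustration under stated assumptions rather than the PR's implementation: the function names are hypothetical, matching on the substring "quota" is a guess at what quota detection might look like, and string matching is used only because the description mentions recognizing "insufficient capacity" errors by message.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

// IsCapacityOrQuotaMessage reports whether an error message looks like a
// capacity or quota failure. The "quota" substring is an assumption.
func IsCapacityOrQuotaMessage(msg string) bool {
	m := strings.ToLower(msg)
	return strings.Contains(m, "insufficient capacity") || strings.Contains(m, "quota")
}

// IsCapacityOrQuotaError applies the same check to an error value.
func IsCapacityOrQuotaError(err error) bool {
	return err != nil && IsCapacityOrQuotaMessage(err.Error())
}

// CleanupWithTimeout runs cleanup under a fresh, bounded context. It does
// not derive from the request context, which may already be cancelled by
// the time the failure is observed.
func CleanupWithTimeout(timeout time.Duration, cleanup func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return cleanup(ctx)
}

func main() {
	createErr := errors.New("create instance: Insufficient capacity in zone")
	if IsCapacityOrQuotaError(createErr) {
		err := CleanupWithTimeout(2*time.Second, func(ctx context.Context) error {
			// Placeholder for deleting partially created resources.
			select {
			case <-time.After(10 * time.Millisecond):
				return nil
			case <-ctx.Done():
				return ctx.Err()
			}
		})
		fmt.Println("cleanup error:", err)
	}
}
```

Bounding the cleanup keeps a slow or hung delete call from blocking indefinitely, though it does not by itself address running the cleanup in the background.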

Dependency updates:

  • Updated dependencies in go.mod files for both CLI and Karpenter modules, including switching to a custom AKSFlexNode version and adjusting indirect dependencies. [1] [2] [3] [4]

@bcho bcho marked this pull request as ready for review March 4, 2026 21:52
@bcho bcho changed the title WIP: feat(karpenter): update instance type handling logic feat(karpenter): update instance type handling logic Mar 4, 2026
// FIXME: don't leak go routine here
// FIXME: use a better clean up helper to perform the clean up in background
//
// TODO: currently nebius doesn't provide a way for us to check if the capacity exists before real creation.
Member Author


Talked to someone from the Azure Karpenter side; he mentioned they use an in-memory cache to handle such cases by filtering out bad VM sizes. We can implement this as a follow-up.
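One way such a follow-up could look is a small TTL cache that remembers instance types that recently failed on capacity and filters them out of later candidate lists. The sketch below is hypothetical throughout: `UnavailableCache`, its methods, and the expiry policy are illustrative names and choices, not taken from this PR or from the Azure provider.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// UnavailableCache remembers instance types that recently failed to launch
// so they can be skipped until the entry expires.
type UnavailableCache struct {
	mu     sync.Mutex
	ttl    time.Duration
	expiry map[string]time.Time
}

// NewUnavailableCache creates a cache whose entries live for ttl.
func NewUnavailableCache(ttl time.Duration) *UnavailableCache {
	return &UnavailableCache{ttl: ttl, expiry: make(map[string]time.Time)}
}

// MarkUnavailable records a failure for the given instance type name.
func (c *UnavailableCache) MarkUnavailable(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.expiry[name] = time.Now().Add(c.ttl)
}

// IsUnavailable reports whether the name has an unexpired entry.
// Expired entries are removed lazily on lookup.
func (c *UnavailableCache) IsUnavailable(name string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	exp, ok := c.expiry[name]
	if !ok {
		return false
	}
	if !time.Now().Before(exp) {
		delete(c.expiry, name)
		return false
	}
	return true
}

// Filter returns the candidates that are not currently marked unavailable,
// preserving their original order.
func (c *UnavailableCache) Filter(names []string) []string {
	out := make([]string, 0, len(names))
	for _, n := range names {
		if !c.IsUnavailable(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	cache := NewUnavailableCache(3 * time.Minute)
	cache.MarkUnavailable("gpu-large")
	fmt.Println(cache.Filter([]string{"gpu-large", "cpu-small"}))
}
```

A cache like this trades some staleness for fewer doomed create calls; the TTL controls how soon a previously failing type is retried.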

bcho and others added 2 commits March 6, 2026 10:52
# Conflicts:
#	cli/go.mod
#	cli/go.sum
#	karpenter/go.mod
#	karpenter/go.sum
#	plugin/go.mod
#	plugin/go.sum
@anson627 anson627 merged commit a468304 into main Mar 7, 2026
9 checks passed
@bcho bcho deleted the hbc/karp-instance-type branch March 7, 2026 17:51

2 participants