refactor(scheduler): migrate NodeInfo, PodInfo, and plugin resource r…#1146
Conversation
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThe pull request refactors the scheduler's resource representation system, migrating from scalar Resource pointers to vector-based ResourceVector value types with an accompanying ResourceVectorMap for indexed access. This systematic change affects resource allocation, topology calculations, error handling, and logging throughout multiple scheduler components and plugins. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
📊 Performance Benchmark ResultsComparing PR ( Legend
Raw benchmark dataPR branch: Main branch: |
db49cb7 to
4a7b05d
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/scheduler/actions/common/allocate.go (1)
177-190:⚠️ Potential issue | 🟡 MinorFix log wording to match vector payload.
These logs now print
task.ResReqVector(full resource vector), but the message still says “requires … GPUs”. That is misleading during debugging.Suggested patch
- log.InfraLogger.V(6).Infof("Binding Task <%v/%v> to node <%v>, requires: %v GPUs", + log.InfraLogger.V(6).Infof("Binding Task <%v/%v> to node <%v>, requires resources vector: %v", task.Namespace, task.Name, node.Name, task.ResReqVector) - log.InfraLogger.V(6).Infof("Pipelining Task <%v/%v> to node <%v> requires: %v GPUs", + log.InfraLogger.V(6).Infof("Pipelining Task <%v/%v> to node <%v> requires resources vector: %v", task.Namespace, task.Name, node.Name, task.ResReqVector)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/scheduler/actions/common/allocate.go` around lines 177 - 190, Update the log messages that currently print task.ResReqVector but say "requires: ... GPUs" to accurately reflect the payload; change the wording in the binding log (the log right before stmt.Allocate(...) inside the Allocate/Bind function) and in pipelineTaskToNode(...) so they mention "requires resource vector" or "requires resources" (or similar) instead of "requires: %v GPUs", keeping the same %v placeholder for task.ResReqVector.
🧹 Nitpick comments (9)
pkg/scheduler/actions/common/solvers/accumulated_scenario_filters/idle_gpus/idle_gpus.go (1)
154-154: Use a single constant for the GPU vector key.Both Line 154 and Line 177 hardcode
"gpu". Centralizing this key reduces typo/drift risk while vector migration is still in progress.Suggested refactor
const ( idleGpuFilterName = "AccumulatedIdleGpus" nonAccumulatedScenarioBaseError = "accumulatedIdleGpus requires all the filters scenarios using the same instance to be based on the same scenario with accumulation of potential victims. " requiredResourcesDiffError = "The pending task %s didn't appear in the scenario given at the AccumulatedIdleGpus ctor" recordedVictimsDiffError = "The recorded victims should remain the same between the different scenario filtering. %d cache hits, pre update recorded tasks seen %d, post update recorded tasks seen %d" potentialVictimsDiffError = "The list of potential victims for the current scenario should contain the previous list of potential victims. Only %d out %d tasks are the contained in the current scenario" + gpuVectorKey = "gpu" ) @@ - requiredResources = append(requiredResources, pod.ResReqVector.Get(pod.VectorMap.GetIndex("gpu"))) + requiredResources = append(requiredResources, pod.ResReqVector.Get(pod.VectorMap.GetIndex(gpuVectorKey))) @@ - ig.nodesNameToIdleGpus[task.NodeName] += task.AcceptedResourceVector.Get(task.VectorMap.GetIndex("gpu")) + ig.nodesNameToIdleGpus[task.NodeName] += task.AcceptedResourceVector.Get(task.VectorMap.GetIndex(gpuVectorKey))Also applies to: 177-177
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/scheduler/actions/common/solvers/accumulated_scenario_filters/idle_gpus/idle_gpus.go` at line 154, The code hardcodes "gpu" in multiple places (e.g., in the requiredResources append using pod.ResReqVector.Get(pod.VectorMap.GetIndex("gpu")) and the other occurrence at line 177); introduce a single file-level constant (e.g., const gpuVectorKey = "gpu") and replace all hardcoded "gpu" usages with that constant (use pod.VectorMap.GetIndex(gpuVectorKey) wherever needed) so the key is centralized and avoids drift during vector migration.pkg/scheduler/actions/common/solvers/accumulated_scenario_filters/idle_gpus/idle_gpus_test.go (1)
28-28: Prefer per-test vector map instances over a shared package variable.Line 28 introduces mutable shared test state. A per-test map keeps fixtures isolated and avoids hidden coupling as more vector keys are introduced in future tests.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/scheduler/actions/common/solvers/accumulated_scenario_filters/idle_gpus/idle_gpus_test.go` at line 28, Replace the package-level mutable testVectorMap with per-test instances to avoid shared mutable state: remove the var testVectorMap = resource_info.NewResourceVectorMap() and instantiate resource_info.NewResourceVectorMap() inside each test (or a test setup helper) that currently references testVectorMap (search for usages of testVectorMap in idle_gpus_test.go) so each test gets its own fresh ResourceVectorMap fixture.pkg/scheduler/plugins/proportion/proportion_test.go (1)
623-627: Reduce duplicated resource fixture setup forResReq/ResReqVector.The same requirement is repeatedly parsed twice per pod (once for
ResReq, once forResReqVector). Create the requirement once and derive both fields from it to avoid fixture drift.♻️ Example pattern
-ResReq: common_info.BuildResourceRequirements("1", "1G"), -ResReqVector: common_info.BuildResourceRequirements("1", "1G").ToVector(testVectorMap), +req := common_info.BuildResourceRequirements("1", "1G") +ResReq: req, +ResReqVector: req.ToVector(testVectorMap),Also applies to: 634-638, 661-665, 672-676, 686-690, 713-717, 724-728, 738-742, 843-847, 875-879, 881-885, 887-891, 921-925, 927-931, 958-962, 964-968
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/scheduler/plugins/proportion/proportion_test.go` around lines 623 - 627, The test fixtures duplicate parsing by calling common_info.BuildResourceRequirements twice for ResReq and ResReqVector; fix by creating a single requirement variable (e.g., req := common_info.BuildResourceRequirements("2", "2G")) and reuse it for both fields so ResReq: req and ResReqVector: req.ToVector(testVectorMap); apply the same refactor to all similar blocks that build resources (lines referenced in the comment) to avoid fixture drift and keep testVectorMap usage consistent.pkg/scheduler/plugins/proportion/utils/utils.go (1)
24-24: Add a GoDoc comment for exportedQuantifyVector.Please add a short GoDoc comment starting with
QuantifyVector ....As per coding guidelines "Add GoDoc-style comments for exported functions and types."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/scheduler/plugins/proportion/utils/utils.go` at line 24, QuantifyVector is missing a GoDoc comment; add a GoDoc-style comment directly above the exported function QuantifyVector that begins with "QuantifyVector ..." and briefly describes what the function does, its parameters (vec and vectorMap) and the returned rs.ResourceQuantities to satisfy the project's documentation guidelines.pkg/scheduler/plugins/proportion/proportion.go (1)
221-239: RefactorgetResourcesto single-pass accumulation and nil-skip.Current implementation allocates an intermediate slice and then re-iterates. A single pass is simpler and safer if any
AcceptedResourceVectoris nil.Suggested patch
func getResources(ignoreReallocatedTasks bool, pods ...*pod_info.PodInfo) resource_info.ResourceVector { - var vectors []resource_info.ResourceVector + var total resource_info.ResourceVector for _, task := range pods { if ignoreReallocatedTasks && pod_status.IsActiveAllocatedStatus(task.Status) { continue } - vectors = append(vectors, task.AcceptedResourceVector) - } - - if len(vectors) == 0 { - return nil - } - - total := vectors[0].Clone() - for _, vec := range vectors[1:] { + vec := task.AcceptedResourceVector + if vec == nil { + continue + } + if total == nil { + total = vec.Clone() + continue + } total.Add(vec) } return total }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/scheduler/plugins/proportion/proportion.go` around lines 221 - 239, The getResources function currently builds an intermediate slice then sums it; instead, iterate once over pods, skip tasks where ignoreReallocatedTasks && pod_status.IsActiveAllocatedStatus(task.Status) and also skip when task.AcceptedResourceVector is nil, and accumulate directly into a single resource_info.ResourceVector named total by setting total = task.AcceptedResourceVector.Clone() for the first non-nil vector and total.Add(vec) for subsequent ones; if no non-nil vectors are seen return nil. Use the existing function name getResources, types resource_info.ResourceVector and pod_info.PodInfo, and the AcceptedResourceVector field/Clone/Add methods referenced in the diff.pkg/scheduler/cache/cluster_info/cluster_info.go (1)
285-285: Replace the hardcoded GPU key with the shared constant.Using a string literal here is brittle when a canonical constant already exists in this file.
♻️ Proposed cleanup
- if nodeInfo.AllocatableVector.Get(nodeInfo.VectorMap.GetIndex("gpu")) > 0 { + if nodeInfo.AllocatableVector.Get(nodeInfo.VectorMap.GetIndex(constants.GpuResource)) > 0 {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/scheduler/cache/cluster_info/cluster_info.go` at line 285, The code uses a hardcoded string literal "gpu" in nodeInfo.VectorMap.GetIndex("gpu") when checking nodeInfo.AllocatableVector.Get(...); replace that literal with the package-level/shared GPU constant (use the existing canonical constant in this file, e.g., GPUResourceKey) so the check reads nodeInfo.VectorMap.GetIndex(GPUResourceKey) (keep using nodeInfo, AllocatableVector.Get and VectorMap.GetIndex as-is).pkg/scheduler/plugins/topology/job_filtering.go (1)
113-124: Extract duplicated task-vector aggregation into one helper.The same accumulation logic exists in
getTasksAllocationMetadataandsortTreeFromRoot; consolidating it will reduce drift risk.Also applies to: 431-440
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/scheduler/plugins/topology/job_filtering.go` around lines 113 - 124, The aggregation logic that builds a combined ResourceVector and task count is duplicated in getTasksAllocationMetadata and in sortTreeFromRoot; extract it into a single helper (e.g., aggregateTasksResources or computeTasksResourceSummary) that accepts []*pod_info.PodInfo and returns (resource_info.ResourceVector, int), move the loop that clones the first task.ResReqVector and adds subsequent vectors into that helper, and replace the bodies of getTasksAllocationMetadata and the code in sortTreeFromRoot (and the duplicate at lines ~431-440) to call the new helper to obtain tasksResources and tasksCount.pkg/scheduler/plugins/proportion/reclaimable/strategies/strategies.go (1)
18-24: Add a GoDoc comment for exportedFitsReclaimStrategy.The exported function signature was updated, but it still has no GoDoc comment.
As per coding guidelines "Add GoDoc-style comments for exported functions and types".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/scheduler/plugins/proportion/reclaimable/strategies/strategies.go` around lines 18 - 24, Add a GoDoc comment for the exported function FitsReclaimStrategy that briefly describes what the function does and explains its parameters and return value: mention that it determines whether a reclaimer (reclaimerResources) can reclaim resources from a reclaimee given the resource vector map (vectorMap), queue attributes for reclaimer and reclaimee (reclaimerQueue, reclaimeeQueue) and the reclaimee's remaining share (reclaimeeRemainingShare), and that it returns a bool indicating whether the reclaim fits; place the comment immediately above the FitsReclaimStrategy declaration and follow GoDoc conventions (starts with the function name and is a complete sentence).pkg/scheduler/plugins/proportion/reclaimable/reclaimable.go (1)
6-20: Reorder imports into standard / external / internal groups.
k8s.io/api/core/v1is an external dependency and should be separated from internal imports.♻️ Suggested import grouping
import ( "fmt" "maps" "math" "strings" + v1 "k8s.io/api/core/v1" + commonconstants "github.com/NVIDIA/KAI-scheduler/pkg/common/constants" "github.com/NVIDIA/KAI-scheduler/pkg/scheduler/api/common_info" "github.com/NVIDIA/KAI-scheduler/pkg/scheduler/api/resource_info" "github.com/NVIDIA/KAI-scheduler/pkg/scheduler/log" "github.com/NVIDIA/KAI-scheduler/pkg/scheduler/plugins/proportion/reclaimable/strategies" rs "github.com/NVIDIA/KAI-scheduler/pkg/scheduler/plugins/proportion/resource_share" "github.com/NVIDIA/KAI-scheduler/pkg/scheduler/plugins/proportion/utils" - v1 "k8s.io/api/core/v1" )As per coding guidelines "Organize imports in three groups separated by blank lines: (1) Standard library, (2) External dependencies, (3) Internal packages".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/scheduler/plugins/proportion/reclaimable/reclaimable.go` around lines 6 - 20, Reorder the import block in reclaimable.go into three groups separated by blank lines: (1) standard library imports (fmt, maps, math, strings), (2) external dependencies (v1 "k8s.io/api/core/v1"), and (3) internal packages (commonconstants, common_info, resource_info, log, strategies, rs, utils) while preserving existing aliases (e.g., v1 and rs) and import order within each group to satisfy the project's import grouping guideline.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/scheduler/api/common_info/pod_errors.go`:
- Around line 80-83: The code reads vectors using indices from
vectorMap.GetIndex without checking for -1; update the logic around the uses of
vectorMap.GetIndex for migProfile (and the analogous scalar/resource access
later) to first capture the index (e.g., migIdx :=
vectorMap.GetIndex(string(migProfile))) and guard with if migIdx >= 0 before
calling availableVector.Get(migIdx) or capacityVector.Get(migIdx); if the index
is < 0, return or surface an explicit error/log indicating the resource/MIG
profile is unregistered (so you don't silently treat it as zero), mirroring the
existing pattern used elsewhere in resource_vector.go.
In `@pkg/scheduler/api/node_info/node_info.go`:
- Around line 394-403: The helper lessEqualVectorsExcludingGPU mutates its input
ResourceVector values (a and b) which can race; change it to a non-mutating
comparison: get gpuIdx via ni.VectorMap.GetIndex(commonconstants.GpuResource)
and then loop over all valid indices (or use a length accessor) comparing
a.Get(i) <= b.Get(i) for each i except gpuIdx, returning false on the first
violation and true otherwise. Ensure you reference and use only
ResourceVector.Get and the VectorMap/GetIndex (no Set calls) so neither
task.ResReqVector nor node vectors are modified during the check.
In `@pkg/scheduler/plugins/proportion/capacity_policy/capacity_policy.go`:
- Around line 79-83: getRequiredQuota iterates tasksToAllocate and blindly calls
utils.QuantifyVector(pod.ResReqVector, pod.VectorMap) which will nil-deref for
pods that haven't initialized vector fields; add a defensive check in
getRequiredQuota: if pod.ResReqVector==nil || pod.VectorMap==nil then compute
quantities from the pod's scalar resource fields (e.g., CPU/milliCPU, Memory,
GPU scalar fields the pod struct exposes) and add those to quota.GPU,
quota.MilliCPU and quota.Memory, otherwise call utils.QuantifyVector as before;
update variable names (quantities, quota.GPU/MilliCPU/Memory) accordingly so
behavior is identical for vectorized pods.
In `@pkg/scheduler/plugins/proportion/proportion_test.go`:
- Around line 811-817: The node's VectorMap and AllocatableVector are being
built from a newly created vectorMap, which can diverge from the test fixtures'
testVectorMap and break vector index alignment; change the setup so
testData.node.VectorMap is assigned the existing testVectorMap (use
testVectorMap instead of a locally created vectorMap) and compute
testData.node.AllocatableVector via
testData.node.Allocatable.ToVector(testVectorMap) so getNodeResources and pod
vectors share the same ResourceVectorMap instance.
In `@pkg/scheduler/plugins/proportion/reclaimable/reclaimable.go`:
- Around line 269-289: getInvolvedResourcesNames currently only checks the
GpuResource slot; add a check for the MIG-backed GPU slot so MIG-only vectors
also mark GPU as involved: retrieve the MIG index from vectorMap (similar to
cpuIdx/memIdx/gpuIdx, e.g. migIdx :=
vectorMap.GetIndex(commonconstants.MigResource) or the project's MIG constant),
ensure the index is valid, and if vec.Get(migIdx) > 0 set
involvedResources[rs.GpuResource] = struct{}{} (keep the existing checks for
cpuIdx/memIdx/gpuIdx intact in the getInvolvedResourcesNames function).
In `@pkg/scheduler/plugins/topology/job_filtering_test.go`:
- Around line 31-32: Tests currently mix a global testVectorMap (created via
resource_info.NewResourceVectorMap()) with per-tree VectorMap instances causing
inconsistent vector/indices; replace the global testVectorMap by creating a
single VectorMap inside each test and pass that same VectorMap instance to every
topology/tree creation in that test (instead of creating new maps inside
functions), update usages referencing testVectorMap so all calls use the local
VectorMap, and remove any other per-tree NewResourceVectorMap() calls so each
test runs end-to-end with one consistent VectorMap.
In `@pkg/scheduler/plugins/topology/topology_plugin.go`:
- Around line 59-64: If nodes can be empty, ensure sharedVectorMap is
initialized to a non-nil default instead of staying nil: after the loop that
sets sharedVectorMap from nodeInfo.VectorMap, check if sharedVectorMap == nil
and, if so, create and assign a new resource_info.ResourceVectorMap (or call the
existing constructor/helper) before it is used or assigned into topology trees
(references: sharedVectorMap, nodes, resource_info.ResourceVectorMap); this
prevents nil derefs when later converting/indexing vectors.
---
Outside diff comments:
In `@pkg/scheduler/actions/common/allocate.go`:
- Around line 177-190: Update the log messages that currently print
task.ResReqVector but say "requires: ... GPUs" to accurately reflect the
payload; change the wording in the binding log (the log right before
stmt.Allocate(...) inside the Allocate/Bind function) and in
pipelineTaskToNode(...) so they mention "requires resource vector" or "requires
resources" (or similar) instead of "requires: %v GPUs", keeping the same %v
placeholder for task.ResReqVector.
---
Nitpick comments:
In
`@pkg/scheduler/actions/common/solvers/accumulated_scenario_filters/idle_gpus/idle_gpus_test.go`:
- Line 28: Replace the package-level mutable testVectorMap with per-test
instances to avoid shared mutable state: remove the var testVectorMap =
resource_info.NewResourceVectorMap() and instantiate
resource_info.NewResourceVectorMap() inside each test (or a test setup helper)
that currently references testVectorMap (search for usages of testVectorMap in
idle_gpus_test.go) so each test gets its own fresh ResourceVectorMap fixture.
In
`@pkg/scheduler/actions/common/solvers/accumulated_scenario_filters/idle_gpus/idle_gpus.go`:
- Line 154: The code hardcodes "gpu" in multiple places (e.g., in the
requiredResources append using
pod.ResReqVector.Get(pod.VectorMap.GetIndex("gpu")) and the other occurrence at
line 177); introduce a single file-level constant (e.g., const gpuVectorKey =
"gpu") and replace all hardcoded "gpu" usages with that constant (use
pod.VectorMap.GetIndex(gpuVectorKey) wherever needed) so the key is centralized
and avoids drift during vector migration.
In `@pkg/scheduler/cache/cluster_info/cluster_info.go`:
- Line 285: The code uses a hardcoded string literal "gpu" in
nodeInfo.VectorMap.GetIndex("gpu") when checking
nodeInfo.AllocatableVector.Get(...); replace that literal with the
package-level/shared GPU constant (use the existing canonical constant in this
file, e.g., GPUResourceKey) so the check reads
nodeInfo.VectorMap.GetIndex(GPUResourceKey) (keep using nodeInfo,
AllocatableVector.Get and VectorMap.GetIndex as-is).
In `@pkg/scheduler/plugins/proportion/proportion_test.go`:
- Around line 623-627: The test fixtures duplicate parsing by calling
common_info.BuildResourceRequirements twice for ResReq and ResReqVector; fix by
creating a single requirement variable (e.g., req :=
common_info.BuildResourceRequirements("2", "2G")) and reuse it for both fields
so ResReq: req and ResReqVector: req.ToVector(testVectorMap); apply the same
refactor to all similar blocks that build resources (lines referenced in the
comment) to avoid fixture drift and keep testVectorMap usage consistent.
In `@pkg/scheduler/plugins/proportion/proportion.go`:
- Around line 221-239: The getResources function currently builds an
intermediate slice then sums it; instead, iterate once over pods, skip tasks
where ignoreReallocatedTasks && pod_status.IsActiveAllocatedStatus(task.Status)
and also skip when task.AcceptedResourceVector is nil, and accumulate directly
into a single resource_info.ResourceVector named total by setting total =
task.AcceptedResourceVector.Clone() for the first non-nil vector and
total.Add(vec) for subsequent ones; if no non-nil vectors are seen return nil.
Use the existing function name getResources, types resource_info.ResourceVector
and pod_info.PodInfo, and the AcceptedResourceVector field/Clone/Add methods
referenced in the diff.
In `@pkg/scheduler/plugins/proportion/reclaimable/reclaimable.go`:
- Around line 6-20: Reorder the import block in reclaimable.go into three groups
separated by blank lines: (1) standard library imports (fmt, maps, math,
strings), (2) external dependencies (v1 "k8s.io/api/core/v1"), and (3) internal
packages (commonconstants, common_info, resource_info, log, strategies, rs,
utils) while preserving existing aliases (e.g., v1 and rs) and import order
within each group to satisfy the project's import grouping guideline.
In `@pkg/scheduler/plugins/proportion/reclaimable/strategies/strategies.go`:
- Around line 18-24: Add a GoDoc comment for the exported function
FitsReclaimStrategy that briefly describes what the function does and explains
its parameters and return value: mention that it determines whether a reclaimer
(reclaimerResources) can reclaim resources from a reclaimee given the resource
vector map (vectorMap), queue attributes for reclaimer and reclaimee
(reclaimerQueue, reclaimeeQueue) and the reclaimee's remaining share
(reclaimeeRemainingShare), and that it returns a bool indicating whether the
reclaim fits; place the comment immediately above the FitsReclaimStrategy
declaration and follow GoDoc conventions (starts with the function name and is a
complete sentence).
In `@pkg/scheduler/plugins/proportion/utils/utils.go`:
- Line 24: QuantifyVector is missing a GoDoc comment; add a GoDoc-style comment
directly above the exported function QuantifyVector that begins with
"QuantifyVector ..." and briefly describes what the function does, its
parameters (vec and vectorMap) and the returned rs.ResourceQuantities to satisfy
the project's documentation guidelines.
In `@pkg/scheduler/plugins/topology/job_filtering.go`:
- Around line 113-124: The aggregation logic that builds a combined
ResourceVector and task count is duplicated in getTasksAllocationMetadata and in
sortTreeFromRoot; extract it into a single helper (e.g., aggregateTasksResources
or computeTasksResourceSummary) that accepts []*pod_info.PodInfo and returns
(resource_info.ResourceVector, int), move the loop that clones the first
task.ResReqVector and adds subsequent vectors into that helper, and replace the
bodies of getTasksAllocationMetadata and the code in sortTreeFromRoot (and the
duplicate at lines ~431-440) to call the new helper to obtain tasksResources and
tasksCount.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3560b2aa-a32c-4a45-a10b-4d963473d8f0
📒 Files selected for processing (38)
pkg/scheduler/actions/common/allocate.gopkg/scheduler/actions/common/feasible_nodes_test.gopkg/scheduler/actions/common/solvers/accumulated_scenario_filters/idle_gpus/idle_gpus.gopkg/scheduler/actions/common/solvers/accumulated_scenario_filters/idle_gpus/idle_gpus_test.gopkg/scheduler/actions/common/solvers/pod_scenario_builder_test.gopkg/scheduler/actions/utils/job_order_by_queue_test.gopkg/scheduler/api/common_info/job_errors.gopkg/scheduler/api/common_info/job_errors_test.gopkg/scheduler/api/common_info/pod_errors.gopkg/scheduler/api/common_info/pod_errors_test.gopkg/scheduler/api/node_info/gpu_sharing_node_info.gopkg/scheduler/api/node_info/node_info.gopkg/scheduler/api/node_info/node_info_test.gopkg/scheduler/api/pod_info/pod_info.gopkg/scheduler/cache/cluster_info/cluster_info.gopkg/scheduler/framework/session.gopkg/scheduler/framework/statement.gopkg/scheduler/plugins/nodeavailability/nodeavailability.gopkg/scheduler/plugins/nodeplacement/nodepack_test.gopkg/scheduler/plugins/nodeplacement/nodespread_test.gopkg/scheduler/plugins/nodeplacement/pack.gopkg/scheduler/plugins/nodeplacement/spread.gopkg/scheduler/plugins/proportion/capacity_policy/capacity_policy.gopkg/scheduler/plugins/proportion/capacity_policy/capacity_policy_test.gopkg/scheduler/plugins/proportion/proportion.gopkg/scheduler/plugins/proportion/proportion_test.gopkg/scheduler/plugins/proportion/reclaimable/reclaimable.gopkg/scheduler/plugins/proportion/reclaimable/reclaimable_test.gopkg/scheduler/plugins/proportion/reclaimable/reclaimer_info.gopkg/scheduler/plugins/proportion/reclaimable/strategies/strategies.gopkg/scheduler/plugins/proportion/reclaimable/strategies/strategies_test.gopkg/scheduler/plugins/proportion/utils/utils.gopkg/scheduler/plugins/resourcetype/resourcetype.gopkg/scheduler/plugins/topology/job_filtering.gopkg/scheduler/plugins/topology/job_filtering_test.gopkg/scheduler/plugins/topology/node_scoring_test.gopkg/scheduler/plugins/topology/topology_plugin.gopkg/scheduler/plugins/topology/topology_structs.go
4a7b05d to
fb1e847
Compare
Merging this branch changes the coverage (4 decrease, 6 increase)
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
fb1e847 to
51f1e17
Compare
Merging this branch changes the coverage (4 decrease, 6 increase)
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
Merging this branch changes the coverage (5 decrease, 4 increase)
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
…eads to vectors Convert all read-path methods from Resource to ResourceVector operations: - NodeInfo: IsTaskAllocatable, FittingError, GetSumOfIdleGPUs, IsCPUOnlyNode - External plugins: proportion, topology, nodeplacement, nodeavailability, resourcetype - Framework: session logging, statement references - Error handling: pod_errors, job_errors Add AcceptedResourceVector to PodInfo. Add QuantifyVector util to proportion plugin. Rewrite topology calcNodeAccommodation from iterative pod probing to division-based vector approach. Resource fields still maintained via dual-write for backward compatibility until removal in subsequent commits. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… pod accommodation
- Thread vectorMap through getJobRatioToFreeResources, sortTree, sortTreeFromRoot to avoid allocating a new ResourceVectorMap on every call in the hot scheduling path - Reorder gpuIdx guard before gpuRequest extraction for clarity - Improve comment explaining releasing memory handling in fractional GPU capacity calculation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sons - Fix log messages in allocate.go that said "requires: %v GPUs" but now print a full resource vector - Rewrite lessEqualVectorsExcludingGPU to iterate and skip the GPU index instead of mutating input vectors via save/zero/compare/restore, which is not safe under concurrent access Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add CPUIndex, MemoryIndex, GPUIndex, and PodsIndex constants to replace GetIndex calls for core resources, which are always at fixed positions since NewResourceVectorMap guarantees their insertion order. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
9534600 to
8f7d6b3
Compare
Merging this branch changes the coverage (5 decrease, 4 increase)
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
Change GetIndex, AddResource, ResourceAt, and internal fields to use v1.ResourceName instead of string, removing unnecessary string casts at call sites. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Erez Freiberger <enoodle@gmail.com>
…MIG GPU summation Add TotalGPUs method on ResourceVector that sums regular GPUs and MIG device portions, replacing duplicated logic in GetSumOfIdleGPUs, GetSumOfReleasingGPUs, and QuantifyVector. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Erez Freiberger <enoodle@gmail.com>
…topology filtering Extract the shared GPU fraction summation logic into NodeInfo.AvailableSharedGPUFractions() and simplify calcNodeAccommodation to use it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Merging this branch changes the coverage (5 decrease, 4 increase)
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
#1146) Signed-off-by: Erez Freiberger <enoodle@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…eads to vectors
Description
Convert all read-path methods from Resource to ResourceVector operations:
Add AcceptedResourceVector to PodInfo. Add QuantifyVector util to proportion plugin. Rewrite topology calcNodeAccommodation from iterative pod probing to division-based vector approach.
Resource fields still maintained via dual-write for backward compatibility until removal in subsequent commits.
Related Issues
Fixes #
Checklist
Summary by CodeRabbit
Release Notes