fix(scheduler): scheduler Dra simulations #966
base: main
Conversation
1- Fix scheduler integration test session build - add resourceClaims to the clusterInfo session object
2- Fix the podInfo calculation in the maxNodeResources predicate to take DRA claims into account
3- Better calculate the relationship between pods and resourceClaims in the scheduler
4- Remember the previous ResourceClaimInfo for the different statement operations. This allows us to correctly calculate the undo operations for simulations with DRA resources.
📝 Walkthrough
This update enhances DRA scheduler handling by converting ResourceClaimInfo from slice-based to map-based storage, introducing utilities for Pod-ResourceClaim mapping, and integrating claim-aware evaluation in predicate logic and allocation tracking. The changes span data structures, utilities, framework state management, and plugin enhancements.
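For orientation, a minimal sketch of the map-based shape described in the walkthrough; the type layout and import path are inferred from the diff hunks later in this review and are assumptions, not verbatim repository code:

// Assumed import:
// schedulingv1alpha2 "github.com/NVIDIA/KAI-scheduler/pkg/apis/scheduling/v1alpha2"

// ResourceClaimInfo maps a claim name to its (possibly nil) allocation result,
// replacing the earlier slice-based storage so claims can be looked up by name.
type ResourceClaimInfo map[string]*schedulingv1alpha2.ResourceClaimAllocation

// claimAllocatedInMemory reports whether the named claim already carries an allocation,
// mirroring the task.ResourceClaimInfo[podClaim.Name] lookup shown in the diffs below.
func claimAllocatedInMemory(rci ResourceClaimInfo, claimName string) bool {
	info, ok := rci[claimName]
	return ok && info != nil && info.Allocation != nil
}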
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~35 minutes
🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
📊 Performance Benchmark Results: comparing the PR branch against the main branch.
Merging this branch changes the coverage (3 decrease, 3 increase)
Coverage by file
Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Changed unit test files
@coderabbitai full review
✅ Actions performed: full review triggered.
Actionable comments posted: 7
🤖 Fix all issues with AI agents
In `@pkg/scheduler/api/pod_info/pod_info.go`:
- Around line 178-182: NewTaskInfoWithBindRequest currently swallows errors from
calcResourceClaimInfo and returns nil, causing callers (e.g., code that reads
PodInfo.ResReq.GPUs(), PodInfo.NodeName, or TaskInfo.Status) to dereference nil;
change NewTaskInfoWithBindRequest to propagate the error instead of returning
nil by updating its signature to return (*PodInfo, error), return the error from
calcResourceClaimInfo, and adjust all callers (those that call
NewTaskInfoWithBindRequest and then access PodInfo fields) to handle the error
path (check for non-nil error before dereferencing, return/handle failure
appropriately). Ensure references to PodInfo, NewTaskInfoWithBindRequest,
calcResourceClaimInfo, and TaskInfo.Status are updated consistently.
In `@pkg/scheduler/api/resource_info/dra_resource_utils.go`:
- Around line 59-65: Move the existence check before dereferencing the claim:
when retrieving claim := draClaimMap[*claimReference.ResourceClaimName] you must
first test found (the boolean) before accessing claim.Namespace; only after if
found { if claim.Namespace != pod.Namespace { continue }
addClaimToPodClaimMap(claim, pod.UID, podsToClaimsMap) } so that claim is never
dereferenced when nil.
- Around line 17-23: ResourceClaimSliceToMap currently keys ResourceClaims only
by Name causing collisions across namespaces; change it to use a composite key
like "<namespace>/<name>" (e.g., draClaim.Namespace + "/" + draClaim.Name) so
namespace-scoped ResourceClaims are uniquely identified, and update any callers
that expect the old key (e.g., lookups in GetDraPodClaims) to use the same
composite key; alternatively restrict callers to pass only namespace-filtered
claims, but the simplest fix is to modify ResourceClaimSliceToMap to build keys
as namespace/name.
In `@pkg/scheduler/framework/statement.go`:
- Line 137: commitEvict captures previousResourceClaimInfo by direct assignment
which risks later mutation; change the capture to clone/deep-copy
reclaimee.ResourceClaimInfo just like Evict and Pipeline do so the saved
previousResourceClaimInfo is immutable if unevict is later called. Locate the
previousResourceClaimInfo := reclaimee.ResourceClaimInfo in commitEvict and
replace it with the same cloning call or helper used in Evict/Pipeline (use the
existing clone/deep-copy function for ResourceClaimInfo) so the restored state
in unevict is correct.
In `@pkg/scheduler/plugins/dynamicresources/dynamicresources.go`:
- Around line 289-294: The code reads claim.Status.Allocation.Devices.Results[0]
into deviceAllocated without guarding against an empty slice which can panic;
update the logic around deviceAllocated (the claim.Status.Allocation != nil
check and the assignment to deviceAllocated) to first verify
claim.Status.Allocation.Devices is non-nil and
len(claim.Status.Allocation.Devices.Results) > 0 before indexing, otherwise
leave deviceAllocated as empty (or use a safe default), then log as before with
log.InfraLogger.V(3).Infof to avoid runtime panics.
- Around line 347-365: In validateAllocatedState, avoid the out-of-bounds access
and use idiomatic set creation: when iterating resourceClaims (in function
validateAllocatedState) check that resourceClaim.Status.Allocation is non-nil
and that Allocation.Devices.Results has length > 0 before referencing
Results[0].Device; replace make(sets.Set[string]) usages with sets.New[string]()
for both resourceClaimDevices and allocatedStateDevices to follow idiomatic
initialization; keep the existing mismatch logging but ensure you only insert
device IDs after the nil/length checks when populating resourceClaimDevices and
allocatedStateDevices from allocatedState.AllocatedDevices.
- Around line 318-321: The code reads
claim.Status.Allocation.Devices.Results[0].Device into deviceRemoved without
checking for nil/empty slices; update the block in dynamicresources.go (the
handling around deviceRemoved and claim.Status.Allocation) to first nil-check
claim.Status.Allocation and claim.Status.Allocation.Devices and verify
len(claim.Status.Allocation.Devices.Results) > 0 before accessing [0], and
handle the empty case (e.g., leave deviceRemoved empty or log/return) to avoid a
potential panic when Results has no entries.
🧹 Nitpick comments (8)
CHANGELOG.md (1)
45-46: Consider adding author attribution for consistency. Other entries in this changelog include author attributions (e.g., [davidLif](https://github.com/davidLif)). Consider adding the author link for consistency with the surrounding entries.
Suggested change
-- fix DRA scheduler simulations and the maxNodeResources predicate [`#966`](https://github.com/NVIDIA/KAI-Scheduler/pull/966)
+- fix DRA scheduler simulations and the maxNodeResources predicate [`#966`](https://github.com/NVIDIA/KAI-Scheduler/pull/966) [davidLif](https://github.com/davidLif)
pkg/scheduler/test_utils/test_utils_builder.go (3)
344-368: Incorrect APIVersion for ResourceClaim. The APIVersion is set to "v1" but ResourceClaim belongs to the resource.k8s.io/v1 API group. While this may not affect tests using fake clients, it's technically incorrect.
♻️ Proposed fix
resourceClaimObject := resourceapi.ResourceClaim{
	TypeMeta: metav1.TypeMeta{
		Kind:       "ResourceClaim",
-		APIVersion: "v1",
+		APIVersion: "resource.k8s.io/v1",
	},
378-416: Incorrect APIVersion for ResourceSlice. Same issue as ResourceClaim - the APIVersion should be "resource.k8s.io/v1" instead of "v1".
♻️ Proposed fix
resourceSliceObject := resourceapi.ResourceSlice{
	TypeMeta: metav1.TypeMeta{
		Kind:       "ResourceSlice",
-		APIVersion: "v1",
+		APIVersion: "resource.k8s.io/v1",
	},
418-435: Incorrect APIVersion for DeviceClass. Same issue - the APIVersion should be "resource.k8s.io/v1" instead of "v1".
♻️ Proposed fix
deviceClassObject := resourceapi.DeviceClass{
	TypeMeta: metav1.TypeMeta{
		Kind:       "DeviceClass",
-		APIVersion: "v1",
+		APIVersion: "resource.k8s.io/v1",
	},
pkg/scheduler/api/bindrequest_info/binrequest_info.go (1)
57-66: Map iteration order is non-deterministic - consider sorting for determinism. The ToSlice() method iterates over a map without sorting, which produces different slice orders on each execution. While this doesn't affect functionality (consumers process claims independently and tests don't rely on order), sorting before returning would improve test determinism and debugging consistency, as in the sketch below.
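A minimal sketch of a deterministic variant, assuming the name-keyed ResourceClaimInfo map introduced in this PR; the helper name and the sort key are illustrative:

import "sort"

// ToSliceSorted returns the allocations ordered by claim name, so repeated calls
// produce the same slice order, which helps test determinism and debugging.
func (rci ResourceClaimInfo) ToSliceSorted() []*schedulingv1alpha2.ResourceClaimAllocation {
	out := make([]*schedulingv1alpha2.ResourceClaimAllocation, 0, len(rci))
	for _, info := range rci {
		out = append(out, info)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}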
pkg/scheduler/k8s_internal/predicates/maxNodeResources_test.go (1)
50-50: Consider adding tests with non-empty resourceClaims. All test cases use empty resourceClaims slices. Consider adding test cases that verify behavior when pods reference ResourceClaims to ensure the DRA integration works correctly.
pkg/scheduler/k8s_internal/predicates/maxNodeResources.go (2)
20-20: Import should be grouped with other external dependencies. Per coding guidelines, imports should be organized in three groups: (1) Standard library, (2) External dependencies, (3) Internal packages. The resourceapi import should be moved up with the other k8s.io imports.
♻️ Suggested import organization
import (
	"context"
	"fmt"
	"strings"

	"github.com/dustin/go-humanize"
	v1 "k8s.io/api/core/v1"
+	resourceapi "k8s.io/api/resource/v1"
	"k8s.io/apimachinery/pkg/types"
	ksf "k8s.io/kube-scheduler/framework"
	k8sframework "k8s.io/kubernetes/pkg/scheduler/framework"

	"github.com/NVIDIA/KAI-scheduler/pkg/scheduler/api/node_info"
	"github.com/NVIDIA/KAI-scheduler/pkg/scheduler/api/pod_info"
	"github.com/NVIDIA/KAI-scheduler/pkg/scheduler/api/resource_info"
-	resourceapi "k8s.io/api/resource/v1"
)
93-93: TODO: Incomplete DRA resource slice validation. This TODO indicates that validation of resource slice capacity against pod claim requests is not yet implemented. Tracking for future completion.
Would you like me to open an issue to track implementing this resource slice validation?
| resourceClaimInfo, err := calcResourceClaimInfo(draPodClaims, pod) | ||
| if err != nil { | ||
| log.InfraLogger.Errorf("PodInfo ctor failure - failed to calculate resource claim info for pod %s/%s: %v", pod.Namespace, pod.Name, err) | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find all callers of NewTaskInfo and NewTaskInfoWithBindRequest
rg -n 'NewTaskInfo(WithBindRequest)?\s*\(' --type=go -g '!*_test.go' | head -50
Repository: NVIDIA/KAI-Scheduler
Length of output: 943
🏁 Script executed:
#!/bin/bash
# Examine each caller to see if they handle nil returns
echo "=== maxNodeResources.go:63 ==="
sed -n '55,75p' pkg/scheduler/k8s_internal/predicates/maxNodeResources.go
echo -e "\n=== cluster_info.go:438 ==="
sed -n '430,450p' pkg/scheduler/cache/cluster_info/cluster_info.go
echo -e "\n=== cluster_info.go:459 ==="
sed -n '450,475p' pkg/scheduler/cache/cluster_info/cluster_info.go
echo -e "\n=== jobs_fake.go:200 ==="
sed -n '190,210p' pkg/scheduler/test_utils/jobs_fake/jobs.go
Repository: NVIDIA/KAI-Scheduler
Length of output: 3813
🏁 Script executed:
#!/bin/bash
# Examine the calcResourceClaimInfo function to see when it returns error
sed -n '211,240p' pkg/scheduler/api/pod_info/pod_info.go
Repository: NVIDIA/KAI-Scheduler
Length of output: 1343
🏁 Script executed:
#!/bin/bash
# Check the NewTaskInfoWithBindRequest function to see the exact line where nil is returned on error
sed -n '170,210p' pkg/scheduler/api/pod_info/pod_info.go
Repository: NVIDIA/KAI-Scheduler
Length of output: 1869
Fix nil pointer dereference risk by handling error cases in NewTaskInfoWithBindRequest.
The function returns nil when calcResourceClaimInfo fails (e.g., when GetResourceClaimName encounters an error). All four callers directly dereference the returned *PodInfo without nil checks:
- maxNodeResources.go:63 uses podInfo.ResReq.GPUs()
- cluster_info.go:438 returns podInfo directly
- cluster_info.go:459 uses podInfo.NodeName
- jobs_fake.go:200 uses taskInfo.Status
This will cause panics if calcResourceClaimInfo fails. Either handle the error within the constructor or propagate it to callers so they can gracefully handle pod creation failures.
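A hedged sketch of the caller-side pattern once the constructor propagates errors; the argument list and the surrounding function are illustrative, not the actual repository signature:

// Hypothetical call site after changing the constructor to return (*PodInfo, error).
podInfo, err := pod_info.NewTaskInfoWithBindRequest(pod, bindRequest, draPodClaims...)
if err != nil {
	// Surface the failure instead of dereferencing a nil *PodInfo later on.
	return nil, fmt.Errorf("failed to build pod info for %s/%s: %w", pod.Namespace, pod.Name, err)
}
if podInfo.ResReq.GPUs() > 0 { // safe: podInfo is guaranteed non-nil once err == nil
	// ...continue with the existing scheduling logic...
}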
🤖 Prompt for AI Agents
In `@pkg/scheduler/api/pod_info/pod_info.go` around lines 178 - 182,
NewTaskInfoWithBindRequest currently swallows errors from calcResourceClaimInfo
and returns nil, causing callers (e.g., code that reads PodInfo.ResReq.GPUs(),
PodInfo.NodeName, or TaskInfo.Status) to dereference nil; change
NewTaskInfoWithBindRequest to propagate the error instead of returning nil by
updating its signature to return (*PodInfo, error), return the error from
calcResourceClaimInfo, and adjust all callers (those that call
NewTaskInfoWithBindRequest and then access PodInfo fields) to handle the error
path (check for non-nil error before dereferencing, return/handle failure
appropriately). Ensure references to PodInfo, NewTaskInfoWithBindRequest,
calcResourceClaimInfo, and TaskInfo.Status are updated consistently.
| claim, found := draClaimMap[*claimReference.ResourceClaimName] | ||
| if claim.Namespace != pod.Namespace { | ||
| continue | ||
| } | ||
| if found { | ||
| addClaimToPodClaimMap(claim, pod.UID, podsToClaimsMap) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nil pointer dereference: claim accessed before checking found.
Line 60 accesses claim.Namespace but claim may be nil when found is false. The found check on Line 63 should happen before accessing claim.
🐛 Proposed fix
for _, claimReference := range pod.Spec.ResourceClaims {
if claimReference.ResourceClaimName == nil {
continue
}
claim, found := draClaimMap[*claimReference.ResourceClaimName]
- if claim.Namespace != pod.Namespace {
- continue
- }
- if found {
+ if found && claim.Namespace == pod.Namespace {
addClaimToPodClaimMap(claim, pod.UID, podsToClaimsMap)
}
}
🤖 Prompt for AI Agents
In `@pkg/scheduler/api/resource_info/dra_resource_utils.go` around lines 59 - 65,
Move the existence check before dereferencing the claim: when retrieving claim
:= draClaimMap[*claimReference.ResourceClaimName] you must first test found (the
boolean) before accessing claim.Namespace; only after if found { if
claim.Namespace != pod.Namespace { continue } addClaimToPodClaimMap(claim,
pod.UID, podsToClaimsMap) } so that claim is never dereferenced when nil.
| previousStatus := reclaimee.Status | ||
| previousGpuGroup := reclaimee.GPUGroups | ||
| previousResourceClaimInfo := reclaimee.ResourceClaimInfo |
Inconsistent: commitEvict does not clone ResourceClaimInfo.
In Evict (Line 84) and Pipeline (Line 241), ResourceClaimInfo is cloned before capture. Here, it's assigned directly. If unevict is called and the task's ResourceClaimInfo was modified in between, the restoration may be incorrect.
🛡️ Suggested fix for consistency
previousStatus := reclaimee.Status
previousGpuGroup := reclaimee.GPUGroups
- previousResourceClaimInfo := reclaimee.ResourceClaimInfo
+ var previousResourceClaimInfo bindrequest_info.ResourceClaimInfo
+ if reclaimee.ResourceClaimInfo != nil {
+ previousResourceClaimInfo = reclaimee.ResourceClaimInfo.Clone()
+ }
previousIsVirtualStatus := reclaimee.IsVirtualStatus
🤖 Prompt for AI Agents
In `@pkg/scheduler/framework/statement.go` at line 137, commitEvict captures
previousResourceClaimInfo by direct assignment which risks later mutation;
change the capture to clone/deep-copy reclaimee.ResourceClaimInfo just like
Evict and Pipeline do so the saved previousResourceClaimInfo is immutable if
unevict is later called. Locate the previousResourceClaimInfo :=
reclaimee.ResourceClaimInfo in commitEvict and replace it with the same cloning
call or helper used in Evict/Pipeline (use the existing clone/deep-copy function
for ResourceClaimInfo) so the restored state in unevict is correct.
| deviceAllocated := "" | ||
| if claim.Status.Allocation != nil { | ||
| deviceAllocated = claim.Status.Allocation.Devices.Results[0].Device | ||
| } | ||
| log.InfraLogger.V(3).Infof("Allocated claim <%s/%s>, device <%s>, from memory: %t.", task.Namespace, claimName, deviceAllocated, allocatedFromMemory) |
Potential panic if Devices.Results is empty.
Accessing Results[0] without checking the slice length could cause a panic if claim.Status.Allocation.Devices.Results is empty.
🐛 Proposed fix
deviceAllocated := ""
if claim.Status.Allocation != nil {
+ if len(claim.Status.Allocation.Devices.Results) > 0 {
deviceAllocated = claim.Status.Allocation.Devices.Results[0].Device
+ }
}
🤖 Prompt for AI Agents
In `@pkg/scheduler/plugins/dynamicresources/dynamicresources.go` around lines 289
- 294, The code reads claim.Status.Allocation.Devices.Results[0] into
deviceAllocated without guarding against an empty slice which can panic; update
the logic around deviceAllocated (the claim.Status.Allocation != nil check and
the assignment to deviceAllocated) to first verify
claim.Status.Allocation.Devices is non-nil and
len(claim.Status.Allocation.Devices.Results) > 0 before indexing, otherwise
leave deviceAllocated as empty (or use a safe default), then log as before with
log.InfraLogger.V(3).Infof to avoid runtime panics.
| var deviceRemoved string | ||
| if claim.Status.Allocation != nil { | ||
| deviceRemoved = claim.Status.Allocation.Devices.Results[0].Device | ||
| } |
Same potential panic when extracting deviceRemoved.
Same issue as the allocation path - accessing Results[0] without a length check.
🐛 Proposed fix
var deviceRemoved string
if claim.Status.Allocation != nil {
+ if len(claim.Status.Allocation.Devices.Results) > 0 {
deviceRemoved = claim.Status.Allocation.Devices.Results[0].Device
+ }
}
🤖 Prompt for AI Agents
In `@pkg/scheduler/plugins/dynamicresources/dynamicresources.go` around lines 318
- 321, The code reads claim.Status.Allocation.Devices.Results[0].Device into
deviceRemoved without checking for nil/empty slices; update the block in
dynamicresources.go (the handling around deviceRemoved and
claim.Status.Allocation) to first nil-check claim.Status.Allocation and
claim.Status.Allocation.Devices and verify
len(claim.Status.Allocation.Devices.Results) > 0 before accessing [0], and
handle the empty case (e.g., leave deviceRemoved empty or log/return) to avoid a
potential panic when Results has no entries.
| func (drap *draPlugin) validateAllocatedState(allocatedState *structured.AllocatedState) { | ||
| resourceClaims, err := drap.manager.ResourceClaims().List() | ||
| if err != nil { | ||
| log.InfraLogger.Errorf("failed to list all resource claims: %v", err) | ||
| } | ||
| resourceClaimDevices := make(sets.Set[string]) | ||
| for _, resourceClaim := range resourceClaims { | ||
| if resourceClaim.Status.Allocation != nil { | ||
| resourceClaimDevices.Insert(resourceClaim.Status.Allocation.Devices.Results[0].Device) | ||
| } | ||
| } | ||
| allocatedStateDevices := make(sets.Set[string]) | ||
| for deviceID := range allocatedState.AllocatedDevices { | ||
| allocatedStateDevices.Insert(deviceID.Device.String()) | ||
| } | ||
| if !resourceClaimDevices.Equal(allocatedStateDevices) { | ||
| log.InfraLogger.Errorf("allocated state does not match resource claim devices: %v", allocatedState.AllocatedDevices) | ||
| } | ||
| } |
Validation function has same out-of-bounds risk and uses non-idiomatic set initialization.
- Line 355 accesses Results[0] without checking length.
- Lines 352 and 358 use make(sets.Set[string]) instead of the idiomatic sets.New[string]().
🐛 Proposed fix
func (drap *draPlugin) validateAllocatedState(allocatedState *structured.AllocatedState) {
resourceClaims, err := drap.manager.ResourceClaims().List()
if err != nil {
log.InfraLogger.Errorf("failed to list all resource claims: %v", err)
+ return
}
- resourceClaimDevices := make(sets.Set[string])
+ resourceClaimDevices := sets.New[string]()
for _, resourceClaim := range resourceClaims {
- if resourceClaim.Status.Allocation != nil {
+ if resourceClaim.Status.Allocation != nil && len(resourceClaim.Status.Allocation.Devices.Results) > 0 {
resourceClaimDevices.Insert(resourceClaim.Status.Allocation.Devices.Results[0].Device)
}
}
- allocatedStateDevices := make(sets.Set[string])
+ allocatedStateDevices := sets.New[string]()
for deviceID := range allocatedState.AllocatedDevices {
allocatedStateDevices.Insert(deviceID.Device.String())
}
🤖 Prompt for AI Agents
In `@pkg/scheduler/plugins/dynamicresources/dynamicresources.go` around lines 347
- 365, In validateAllocatedState, avoid the out-of-bounds access and use
idiomatic set creation: when iterating resourceClaims (in function
validateAllocatedState) check that resourceClaim.Status.Allocation is non-nil
and that Allocation.Devices.Results has length > 0 before referencing
Results[0].Device; replace make(sets.Set[string]) usages with sets.New[string]()
for both resourceClaimDevices and allocatedStateDevices to follow idiomatic
initialization; keep the existing mismatch logging but ensure you only insert
device IDs after the nil/length checks when populating resourceClaimDevices and
allocatedStateDevices from allocatedState.AllocatedDevices.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/scheduler/test_utils/test_utils_builder.go (1)
378-434: ⚠️ Potential issue | 🟡 Minor: Use resource.k8s.io/v1 apiVersion for ResourceSlice and DeviceClass.
Both ResourceSlice and DeviceClass are part of Kubernetes' Dynamic Resource Allocation and belong to the resource.k8s.io API group. The apiVersion must include the group: resource.k8s.io/v1, not just v1.
Proposed fix
resourceSliceObject := resourceapi.ResourceSlice{
	TypeMeta: metav1.TypeMeta{
		Kind:       "ResourceSlice",
-		APIVersion: "v1",
+		APIVersion: "resource.k8s.io/v1",
	},
@@
deviceClassObject := resourceapi.DeviceClass{
	TypeMeta: metav1.TypeMeta{
		Kind:       "DeviceClass",
-		APIVersion: "v1",
+		APIVersion: "resource.k8s.io/v1",
	},
🤖 Fix all issues with AI agents
In `@CHANGELOG.md`:
- Line 45: Update the changelog entry text to past tense and match surrounding
style by changing "- fix DRA scheduler simulations and the maxNodeResources
predicate [`#966`](https://github.com/NVIDIA/KAI-Scheduler/pull/966)" to use
"Fixed DRA scheduler simulations and the MaxNodeResources predicate" so it reads
past-tense and capitalizes the predicate name consistently.
In `@pkg/scheduler/api/bindrequest_info/binrequest_info.go`:
- Around line 43-55: ResourceClaimInfo.Clone() can panic when iterating
ResourceClaimAllocation entries because info.Allocation may be nil; update the
Clone implementation in ResourceClaimInfo.Clone() so that for each info you
check if info.Allocation == nil and set newrci[i].Allocation = nil in that case,
otherwise call info.Allocation.DeepCopy() and assign the result; preserve
copying of Name and ensure newrci entries are allocated as
&schedulingv1alpha2.ResourceClaimAllocation before assigning Allocation.
In `@pkg/scheduler/api/resource_info/dra_resource_utils.go`:
- Around line 17-23: Add a GoDoc comment for the exported function
ResourceClaimSliceToMap: write a comment immediately above the function that
starts with "ResourceClaimSliceToMap" and briefly describes what the function
does (converts a slice of *resourceapi.ResourceClaim into a map keyed by claim
Name), mention the input (draResourceClaims slice) and the return value
(map[string]*resourceapi.ResourceClaim), and keep it concise and formatted as a
proper GoDoc sentence.
In `@pkg/scheduler/k8s_internal/predicates/maxNodeResources.go`:
- Around line 62-64: The code calls resource_info.GetDraPodClaims(...) and then
pod_info.NewTaskInfo(...), but NewTaskInfo can return nil (e.g., when
calcResourceClaimInfo fails) and podInfo is dereferenced afterwards; add a
nil-check after podInfo is created (from pod_info.NewTaskInfo) and handle the
error path (return false / skip this pod / log an error as appropriate for the
predicate) to avoid a panic—locate references to podInfo in this function (and
any subsequent dereferences) and guard them with the nil-check.
In `@pkg/scheduler/test_utils/test_utils_builder.go`:
- Around line 341-376: getResourceClaims is creating resourceapi.ResourceClaim
objects with the wrong APIVersion and no UID, causing collisions in
podsToClaimsMap; update the TypeMeta.APIVersion to "resource.k8s.io/v1" and
populate ObjectMeta.UID from the source test claim (e.g., use resourceClaim.UID
or generate a stable non-empty UID per test claim) so each created
resourceapi.ResourceClaim has a unique metadata. Ensure you set ObjectMeta.UID
on the resourceClaimObject before appending to objects and keep TypeMeta.Kind as
"ResourceClaim".
🧹 Nitpick comments (2)
pkg/scheduler/k8s_internal/predicates/maxNodeResources_test.go (1)
307-314: Consider adding test coverage for DRA claims. The existing tests verify backward compatibility with empty resourceClaims slices, but there's no coverage for the new DRA claim filtering logic. Consider adding tests that exercise GetDraPodClaims and the claim-aware code path in a follow-up; a rough fixture sketch follows.
pkg/scheduler/k8s_internal/predicates/maxNodeResources.go (1)
17-21: Import organization does not follow guidelines. The resourceapi import on line 20 should be grouped with other external dependencies (lines 11-15), not placed after internal imports.
♻️ Proposed fix
import (
	"context"
	"fmt"
	"strings"

	"github.com/dustin/go-humanize"
	v1 "k8s.io/api/core/v1"
+	resourceapi "k8s.io/api/resource/v1"
	"k8s.io/apimachinery/pkg/types"
	ksf "k8s.io/kube-scheduler/framework"
	k8sframework "k8s.io/kubernetes/pkg/scheduler/framework"

	"github.com/NVIDIA/KAI-scheduler/pkg/scheduler/api/node_info"
	"github.com/NVIDIA/KAI-scheduler/pkg/scheduler/api/pod_info"
	"github.com/NVIDIA/KAI-scheduler/pkg/scheduler/api/resource_info"
-	resourceapi "k8s.io/api/resource/v1"
)
As per coding guidelines: "Organize imports in three groups separated by blank lines: (1) Standard library, (2) External dependencies, (3) Internal packages".
| - Fixed a bug in ray gang scheduling where not all worker groups' minMember would be respected [#924](https://github.com/NVIDIA/KAI-Scheduler/pull/924) [itsomri](https://github.com/itsomri) | ||
| - cpu-only nodes calculation in DRA enabled clusters [#944](https://github.com/NVIDIA/KAI-Scheduler/pull/944) | ||
| - enable DRA flag override fix in snapshot-tool [#955](https://github.com/NVIDIA/KAI-Scheduler/pull/955) | ||
| - fix DRA scheduler simulations and the maxNodeResources predicate [#966](https://github.com/NVIDIA/KAI-Scheduler/pull/966) |
Use past-tense "Fixed" for consistency.
The "Fixed" section uses past-tense entries; consider "Fixed DRA scheduler simulations and the MaxNodeResources predicate" to match the surrounding style.
🤖 Prompt for AI Agents
In `@CHANGELOG.md` at line 45, Update the changelog entry text to past tense and
match surrounding style by changing "- fix DRA scheduler simulations and the
maxNodeResources predicate
[`#966`](https://github.com/NVIDIA/KAI-Scheduler/pull/966)" to use "Fixed DRA
scheduler simulations and the MaxNodeResources predicate" so it reads past-tense
and capitalizes the predicate name consistently.
| func (rci ResourceClaimInfo) Clone() ResourceClaimInfo { | ||
| if rci == nil { | ||
| return nil | ||
| } | ||
| newrci := make(ResourceClaimInfo, len(rci)) | ||
| for i, info := range rci { | ||
| newrci[i] = schedulingv1alpha2.ResourceClaimAllocation{ | ||
| newrci[i] = &schedulingv1alpha2.ResourceClaimAllocation{ | ||
| Name: info.Name, | ||
| Allocation: info.Allocation.DeepCopy(), | ||
| } | ||
| } | ||
| return newrci | ||
| } |
Potential nil pointer dereference in Clone() when Allocation is nil.
ResourceClaimAllocation.Allocation is a pointer that can be nil (per the type definition in bindrequest_types.go). Calling info.Allocation.DeepCopy() on a nil Allocation will panic.
🐛 Proposed fix
func (rci ResourceClaimInfo) Clone() ResourceClaimInfo {
if rci == nil {
return nil
}
newrci := make(ResourceClaimInfo, len(rci))
for i, info := range rci {
+ var allocationCopy *v1.AllocationResult
+ if info.Allocation != nil {
+ allocationCopy = info.Allocation.DeepCopy()
+ }
newrci[i] = &schedulingv1alpha2.ResourceClaimAllocation{
Name: info.Name,
- Allocation: info.Allocation.DeepCopy(),
+ Allocation: allocationCopy,
}
}
return newrci
}
🤖 Prompt for AI Agents
In `@pkg/scheduler/api/bindrequest_info/binrequest_info.go` around lines 43 - 55,
ResourceClaimInfo.Clone() can panic when iterating ResourceClaimAllocation
entries because info.Allocation may be nil; update the Clone implementation in
ResourceClaimInfo.Clone() so that for each info you check if info.Allocation ==
nil and set newrci[i].Allocation = nil in that case, otherwise call
info.Allocation.DeepCopy() and assign the result; preserve copying of Name and
ensure newrci entries are allocated as
&schedulingv1alpha2.ResourceClaimAllocation before assigning Allocation.
| func ResourceClaimSliceToMap(draResourceClaims []*resourceapi.ResourceClaim) map[string]*resourceapi.ResourceClaim { | ||
| draClaimMap := map[string]*resourceapi.ResourceClaim{} | ||
| for _, draClaim := range draResourceClaims { | ||
| draClaimMap[draClaim.Name] = draClaim | ||
| } | ||
| return draClaimMap | ||
| } |
🧩 Analysis chain
🏁 Script executed:
find . -name "dra_resource_utils.go" -type f
Repository: NVIDIA/KAI-Scheduler
Length of output: 119
🏁 Script executed:
head -30 pkg/scheduler/api/resource_info/dra_resource_utils.go
Repository: NVIDIA/KAI-Scheduler
Length of output: 1345
Add GoDoc comment for exported function ResourceClaimSliceToMap.
This exported function is missing a GoDoc header, which is required by the coding guidelines and impacts generated documentation completeness.
Suggested GoDoc
+// ResourceClaimSliceToMap converts a slice of ResourceClaims into a lookup map.
func ResourceClaimSliceToMap(draResourceClaims []*resourceapi.ResourceClaim) map[string]*resourceapi.ResourceClaim {
🤖 Prompt for AI Agents
In `@pkg/scheduler/api/resource_info/dra_resource_utils.go` around lines 17 - 23,
Add a GoDoc comment for the exported function ResourceClaimSliceToMap: write a
comment immediately above the function that starts with
"ResourceClaimSliceToMap" and briefly describes what the function does (converts
a slice of *resourceapi.ResourceClaim into a map keyed by claim Name), mention
the input (draResourceClaims slice) and the return value
(map[string]*resourceapi.ResourceClaim), and keep it concise and formatted as a
proper GoDoc sentence.
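Putting the GoDoc suggestion together with the earlier namespace/name composite-key recommendation, a hedged sketch of the function could look like the following; the key format follows the review suggestion and is not the current repository behavior:

// ResourceClaimSliceToMap converts a slice of ResourceClaims into a lookup map
// keyed by "<namespace>/<name>", so claims with the same name in different
// namespaces do not collide.
func ResourceClaimSliceToMap(draResourceClaims []*resourceapi.ResourceClaim) map[string]*resourceapi.ResourceClaim {
	draClaimMap := map[string]*resourceapi.ResourceClaim{}
	for _, draClaim := range draResourceClaims {
		draClaimMap[draClaim.Namespace+"/"+draClaim.Name] = draClaim
	}
	return draClaimMap
}

Callers such as GetDraPodClaims would need to build the same composite key when looking claims up, or callers would have to be restricted to namespace-filtered input as noted earlier.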
| draPodClaims := resource_info.GetDraPodClaims(pod, mnr.resourceClaimsMap, mnr.podsToClaimsMap) | ||
| podInfo := pod_info.NewTaskInfo(pod, draPodClaims...) | ||
Missing nil check for podInfo can cause panic.
pod_info.NewTaskInfo can return nil if calcResourceClaimInfo fails. The subsequent code dereferences podInfo without checking for nil, which will cause a panic.
🐛 Proposed fix
draPodClaims := resource_info.GetDraPodClaims(pod, mnr.resourceClaimsMap, mnr.podsToClaimsMap)
podInfo := pod_info.NewTaskInfo(pod, draPodClaims...)
+ if podInfo == nil {
+ return nil, ksf.NewStatus(ksf.UnschedulableAndUnresolvable,
+ fmt.Sprintf("failed to create pod info for pod %s/%s", pod.Namespace, pod.Name))
+ }
if podInfo.ResReq.GPUs() > mnr.maxResources.GPUs() {
🤖 Prompt for AI Agents
In `@pkg/scheduler/k8s_internal/predicates/maxNodeResources.go` around lines 62 -
64, The code calls resource_info.GetDraPodClaims(...) and then
pod_info.NewTaskInfo(...), but NewTaskInfo can return nil (e.g., when
calcResourceClaimInfo fails) and podInfo is dereferenced afterwards; add a
nil-check after podInfo is created (from pod_info.NewTaskInfo) and handle the
error path (return false / skip this pod / log an error as appropriate for the
predicate) to avoid a panic—locate references to podInfo in this function (and
any subsequent dereferences) and guard them with the nil-check.
| func getResourceClaims(testMetadata TestTopologyBasic) []*resourceapi.ResourceClaim { | ||
| var objects []*resourceapi.ResourceClaim | ||
| for _, resourceClaim := range testMetadata.ResourceClaims { | ||
| resourceClaimObject := resourceapi.ResourceClaim{ | ||
| TypeMeta: metav1.TypeMeta{ | ||
| Kind: "DeviceClass", | ||
| Kind: "ResourceClaim", | ||
| APIVersion: "v1", | ||
| }, | ||
| ObjectMeta: metav1.ObjectMeta{ | ||
| Name: deviceClass, | ||
| Name: resourceClaim.Name, | ||
| Namespace: resourceClaim.Namespace, | ||
| ResourceVersion: "0", | ||
| Labels: resourceClaim.Labels, | ||
| }, | ||
| Spec: resourceapi.ResourceClaimSpec{ | ||
| Devices: resourceapi.DeviceClaim{ | ||
| Requests: []resourceapi.DeviceRequest{ | ||
| { | ||
| Name: "request", | ||
| Exactly: &resourceapi.ExactDeviceRequest{ | ||
| DeviceClassName: resourceClaim.DeviceClassName, | ||
| AllocationMode: resourceapi.DeviceAllocationModeExactCount, | ||
| Count: resourceClaim.Count, | ||
| }, | ||
| }, | ||
| }, | ||
| }, | ||
| }, | ||
| Spec: resourceapi.DeviceClassSpec{}, | ||
| } | ||
| objects = append(objects, &deviceClassObject) | ||
| if resourceClaim.ClaimStatus != nil { | ||
| resourceClaimObject.Status = *resourceClaim.ClaimStatus | ||
| } | ||
| objects = append(objects, &resourceClaimObject) | ||
| } | ||
| return objects | ||
| } |
🧩 Analysis chain
🏁 Script executed:
# Search for podsToClaimsMap to understand how it indexes claims
rg "podsToClaimsMap" -A 3 -B 3
Repository: NVIDIA/KAI-Scheduler
Length of output: 8623
🏁 Script executed:
# Search for ResourceClaim usage in the scheduler code
rg "ResourceClaim" -t go --max-count=20 | head -40
Repository: NVIDIA/KAI-Scheduler
Length of output: 4489
🏁 Script executed:
# Look for the test_utils_builder file and examine similar patterns
fd "test_utils_builder.go" -x cat -n {} | head -100
Repository: NVIDIA/KAI-Scheduler
Length of output: 4319
🏁 Script executed:
# Search for how other resources are created in test utilities with UIDs
rg "types\.UID" -t go -B 2 -A 2 | head -50
Repository: NVIDIA/KAI-Scheduler
Length of output: 4566
🏁 Script executed:
# Check how ResourceClaim is created in test utilities
cat -n test/e2e/modules/resources/rd/resource_claim.go
Repository: NVIDIA/KAI-Scheduler
Length of output: 2672
🏁 Script executed:
# Search for APIVersion patterns in ResourceClaim creation
rg "APIVersion.*resource" -t go -B 2 -A 2
Repository: NVIDIA/KAI-Scheduler
Length of output: 773
🏁 Script executed:
# Look at how other test utilities set ObjectMeta in test_utils_builder
rg "ObjectMeta.*metav1" -A 10 pkg/scheduler/test_utils/test_utils_builder.go | head -60
Repository: NVIDIA/KAI-Scheduler
Length of output: 1915
🏁 Script executed:
# Check the actual addClaimToPodClaimMap implementation
rg "addClaimToPodClaimMap" -A 10 pkg/scheduler/api/resource_info/dra_resource_utils.go
Repository: NVIDIA/KAI-Scheduler
Length of output: 1396
🏁 Script executed:
# Check if there's actual use of ResourceClaim APIVersion in the codebase
rg "resource\.k8s\.io" -t go --max-count=10
Repository: NVIDIA/KAI-Scheduler
Length of output: 1255
🏁 Script executed:
# Look for Kubernetes API version patterns in actual Kubernetes object creation
rg "APIVersion.*=.*\"" -t go pkg/scheduler/test_utils/test_utils_builder.go | head -20
Repository: NVIDIA/KAI-Scheduler
Length of output: 46
🏁 Script executed:
# Check if there are any ResourceClaim creation patterns in tests that show UID or APIVersion
rg "ResourceClaim" -A 8 pkg/scheduler/test_utils/test_utils_builder.go | head -50
Repository: NVIDIA/KAI-Scheduler
Length of output: 1420
🏁 Script executed:
# Look at how the test utilizes ResourceClaims - search for getResourceClaims usage
rg "getResourceClaims" -B 3 -A 3
Repository: NVIDIA/KAI-Scheduler
Length of output: 1752
🏁 Script executed:
# Check what types.UID expects and how NamespacedName.String() works
python3 << 'EOF'
# types.UID is just a string type in Kubernetes
# Let's see what NamespacedName.String() produces
namespace = "default"
name = "my-claim"
namespaced_name_string = f"{namespace}/{name}"
print(f"NamespacedName.String() would produce: {namespaced_name_string}")
print(f"This format looks correct for a test UID")
EOF
Repository: NVIDIA/KAI-Scheduler
Length of output: 160
🏁 Script executed:
# Verify if the proposed UID format is actually appropriate by checking similar patterns
rg "types\.UID\(" pkg/scheduler/test_utils/test_utils_builder.go -A 1 -B 1
Repository: NVIDIA/KAI-Scheduler
Length of output: 441
🏁 Script executed:
# Check how UIDs are handled when ResourceClaims are actually used in tests
rg "claim\.UID" -B 3 -A 3 pkg/scheduler/api/resource_info/dra_resource_utils.go
Repository: NVIDIA/KAI-Scheduler
Length of output: 223
🏁 Script executed:
# Verify the current APIVersion value in the code
sed -n '341,376p' pkg/scheduler/test_utils/test_utils_builder.go | grep -A 50 "getResourceClaims"
Repository: NVIDIA/KAI-Scheduler
Length of output: 1183
🏁 Script executed:
# Check if types.NamespacedName is available in test_utils_builder.go
rg "types\." pkg/scheduler/test_utils/test_utils_builder.go | head -10
Repository: NVIDIA/KAI-Scheduler
Length of output: 160
🏁 Script executed:
# Verify the exact line numbers and context around the function
sed -n '336,380p' pkg/scheduler/test_utils/test_utils_builder.go
Repository: NVIDIA/KAI-Scheduler
Length of output: 1396
🏁 Script executed:
# Check if there are any tests that depend on multiple ResourceClaims to verify the collapse issue
rg "ResourceClaims" pkg/scheduler/test_utils/test_utils_builder.go -B 2 -A 2
Repository: NVIDIA/KAI-Scheduler
Length of output: 693
🏁 Script executed:
# Check the imports in test_utils_builder.go
head -40 pkg/scheduler/test_utils/test_utils_builder.go | grep import -A 30
Repository: NVIDIA/KAI-Scheduler
Length of output: 1330
🏁 Script executed:
# Check if apimachinery/pkg/types provides NamespacedName
rg "NamespacedName" pkg/scheduler/test_utils/test_utils_builder.go
Repository: NVIDIA/KAI-Scheduler
Length of output: 46
🏁 Script executed:
# Look for how this pattern is used elsewhere in the codebase
rg "NamespacedName.*String" -B 2 -A 2
Repository: NVIDIA/KAI-Scheduler
Length of output: 1415
Set ResourceClaim UID and correct apiVersion to match resource.k8s.io API group.
The podsToClaimsMap indexes claims by claim.UID; leaving it unset causes all claims to collide at the empty-string key, collapsing multiple test claims into a single entry. Additionally, ResourceClaim objects belong to the resource.k8s.io/v1 API group, not v1.
🐛 Proposed fix
resourceClaimObject := resourceapi.ResourceClaim{
TypeMeta: metav1.TypeMeta{
Kind: "ResourceClaim",
- APIVersion: "v1",
+ APIVersion: "resource.k8s.io/v1",
},
ObjectMeta: metav1.ObjectMeta{
Name: resourceClaim.Name,
Namespace: resourceClaim.Namespace,
+ UID: types.UID(types.NamespacedName{Namespace: resourceClaim.Namespace, Name: resourceClaim.Name}.String()),
ResourceVersion: "0",
Labels: resourceClaim.Labels,
},
🤖 Prompt for AI Agents
In `@pkg/scheduler/test_utils/test_utils_builder.go` around lines 341 - 376,
getResourceClaims is creating resourceapi.ResourceClaim objects with the wrong
APIVersion and no UID, causing collisions in podsToClaimsMap; update the
TypeMeta.APIVersion to "resource.k8s.io/v1" and populate ObjectMeta.UID from the
source test claim (e.g., use resourceClaim.UID or generate a stable non-empty
UID per test claim) so each created resourceapi.ResourceClaim has a unique
metadata. Ensure you set ObjectMeta.UID on the resourceClaimObject before
appending to objects and keep TypeMeta.Kind as "ResourceClaim".
| // CalcClaimsToPodsBaseMap calculates a pod to claims map. The first key in the map is the pod UID, and the second key in the map is the claim UID. | ||
| // The relationships are calculated based on the resourceclaim owner references and allocation reservedFor references. This might not be enough to identify all the claims that are related to a pod. |
NIT: is there a way to simplify this explanation? Maybe an example of the map instead?
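For illustration, the nested shape described in the doc comment might look like the following; the value type and the claimA/claimB/claimC placeholders are assumptions, and only the two UID levels come from the comment itself:

// Outer key: pod UID; inner key: claim UID; value: the claim related to that pod.
podsToClaims := map[types.UID]map[types.UID]*resourceapi.ResourceClaim{
	"pod-uid-1": {
		"claim-uid-a": claimA, // owned by or reserved for pod-uid-1
		"claim-uid-b": claimB,
	},
	"pod-uid-2": {
		"claim-uid-c": claimC,
	},
}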
| mnr.buildUnschedulableMessage(podInfo, string(rName), float64(maxVal), units)) | ||
| } | ||
| } | ||
| // TODO: check if any of the resource slices good for the node can satisfy the pod's claim requests (device count for the device class) |
maybe a reference in the CHANGELOG?
| // If the claim info has already been allocated in the past (the deallocation was virtual), recover previous allocation data | ||
| allocatedFromMemory := false | ||
| if claimAllocationInfo, ok := task.ResourceClaimInfo[podClaim.Name]; ok && claimAllocationInfo.Allocation != nil { |
QUESTION: Do we want to somehow tag a previously-allocated resourceClaim?
maybe represent it in a snapshot?
Description
1- Fix scheduler integration test session build - add resourceClaims to the clusterInfo session object
2- Fix the podInfo calculation in the maxNodeResources predicate to take DRA claims into account
3- Better calculate the relationship between pods and resourceClaims in the scheduler
4- Remember the previous ResourceClaimInfo for the different statement operations. This allows us to correctly calculate the undo operations for simulations with DRA resources (see the sketch below).
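As a rough illustration of point 4, the statement operations follow a snapshot-and-restore pattern; this sketch assumes the Clone() helper shown in the review comments and is not the actual statement.go code:

// snapshotClaimInfo captures a deep copy of the task's ResourceClaimInfo and returns
// a restore function; calling restore() is the "undo" step of a simulated operation.
func snapshotClaimInfo(task *pod_info.PodInfo) (restore func()) {
	previous := task.ResourceClaimInfo.Clone() // deep copy so later mutation cannot corrupt the snapshot
	return func() { task.ResourceClaimInfo = previous }
}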
Related Issues
Fixes #
Checklist
Breaking Changes
Additional Notes
Summary by CodeRabbit
Release Notes