
Conversation

@davidLif (Collaborator) commented Feb 4, 2026

Description

1- Fix the scheduler integration test session build - add resourceClaims to the clusterInfo session object
2- Fix the podInfo calculation in the maxNodeResources predicate to take DRA claims into account
3- Improve the calculation of the relationship between pods and resourceClaims in the scheduler
4- Remember the previous ResourceClaimInfo across the different statement operations. This allows us to correctly compute the undo operations for simulations with DRA resources.

Related Issues

Fixes #

Checklist

Note: Ensure your PR title follows the Conventional Commits format (e.g., feat(scheduler): add new feature)

  • Self-reviewed
  • Added/updated tests (if needed)
  • Updated documentation (if needed)

Breaking Changes

Additional Notes

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Fixed DRA scheduler simulations and the maxNodeResources predicate
    • Improved resource claim allocation tracking during pod lifecycle transitions (eviction, pipelining, commitment)
    • Enhanced validation to ensure device allocations remain consistent with recorded claims
    • Better handling of allocation recovery and propagation from scheduler memory

@coderabbitai bot commented Feb 4, 2026

📝 Walkthrough


This update enhances DRA scheduler handling by converting ResourceClaimInfo from slice-based to map-based storage, introducing utilities for Pod-ResourceClaim mapping, and integrating claim-aware evaluation in predicate logic and allocation tracking. The changes span data structures, utilities, framework state management, and plugin enhancements.
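
For orientation, here is a minimal sketch of what the slice-to-map conversion with a ToSlice() helper might look like. The package name, element type, and fields below are assumptions drawn from this walkthrough, not the repository's exact definitions.

// Hypothetical sketch only: a map-based ResourceClaimInfo keyed by claim name,
// with a ToSlice() conversion for callers that still expect a slice.
package bindrequestinfosketch

import (
	resourceapi "k8s.io/api/resource/v1"
)

// ResourceClaimAllocation pairs a claim name with its allocation result (assumed shape).
type ResourceClaimAllocation struct {
	Name       string
	Allocation *resourceapi.AllocationResult
}

// ResourceClaimInfo maps claim name -> allocation entry.
type ResourceClaimInfo map[string]*ResourceClaimAllocation

// ToSlice flattens the map into a slice; iteration order is not guaranteed.
func (rci ResourceClaimInfo) ToSlice() []*ResourceClaimAllocation {
	out := make([]*ResourceClaimAllocation, 0, len(rci))
	for _, entry := range rci {
		out = append(out, entry)
	}
	return out
}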

Changes

Cohort / File(s) Summary
Data Structure Refactoring
pkg/scheduler/api/bindrequest_info/binrequest_info.go, pkg/scheduler/api/pod_info/pod_info.go
Converted ResourceClaimInfo from slice to map representation; added ToSlice() conversion method and calcResourceClaimInfo helper to populate claims during task construction.
DRA Utility Functions
pkg/scheduler/api/resource_info/dra_resource_utils.go
New utility file introducing ResourceClaimSliceToMap, CalcClaimsToPodsBaseMap, and GetDraPodClaims functions for mapping ResourceClaims to Pods via owner references and ReservedFor relationships.
Cache and Cluster Integration
pkg/scheduler/cache/cache.go, pkg/scheduler/cache/cluster_info/cluster_info.go
Updated BindRequest creation to use ToSlice() conversion; refactored cluster_info to delegate claim-mapping logic to resource_info utilities, removing legacy helper implementations.
Framework State Management
pkg/scheduler/framework/operations.go, pkg/scheduler/framework/statement.go
Added previousResourceClaimInfo field to evict and pipeline operations; enhanced eviction/pipeline/unpipeline/unevict flows to preserve and restore ResourceClaimInfo during rollback scenarios.
Predicate Enhancements
pkg/scheduler/k8s_internal/predicates/maxNodeResources.go, pkg/scheduler/k8s_internal/predicates/maxNodeResources_test.go, pkg/scheduler/k8s_internal/predicates/predicates.go
Extended MaxNodeResourcesPredicate to accept resourceClaims parameter and maintain claim-to-pod mappings; updated PreFilter to derive pod claims via GetDraPodClaims; added placeholder for claim-based matching logic.
Plugin and Testing Updates
pkg/scheduler/plugins/dynamicresources/dynamicresources.go, pkg/scheduler/test_utils/test_utils_builder.go, CHANGELOG.md
Enhanced dynamicresources plugin with memory-aware allocation tracking, claim entry removal on deallocation, and allocated-state validation; updated test infrastructure to support ResourceClaims in cluster state; added changelog entry.
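
To make the DRA utility cohort listed above more concrete, the following is a hedged, self-contained sketch of a claims-to-pods base map built from each claim's status.reservedFor. The function name mirrors CalcClaimsToPodsBaseMap from the summary, but the signature and the namespace/name key format are assumptions for illustration only.

// Hypothetical sketch only: build a claims -> pod-UIDs map from each claim's
// status.reservedFor list. The real CalcClaimsToPodsBaseMap may differ.
package resourceinfosketch

import (
	resourceapi "k8s.io/api/resource/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/sets"
)

func calcClaimsToPodsBaseMap(claims []*resourceapi.ResourceClaim) map[string]sets.Set[types.UID] {
	claimsToPods := map[string]sets.Set[types.UID]{}
	for _, claim := range claims {
		key := claim.Namespace + "/" + claim.Name // namespace-qualified key avoids cross-namespace collisions
		pods := sets.New[types.UID]()
		for _, consumer := range claim.Status.ReservedFor {
			// ReservedFor records the consumers (typically Pods) currently holding the claim.
			if consumer.Resource == "pods" {
				pods.Insert(consumer.UID)
			}
		}
		claimsToPods[key] = pods
	}
	return claimsToPods
}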

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~35 minutes

Poem

🐰 From slices to maps, our claims now dance,
With utilities binding each pod's romance,
Through eviction's deep weave, we remember the past,
As predicates measure resources—at last!
The scheduler hops forward, more graceful, more bright,
DRA claims aligned in dynamic delight! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 13.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix(scheduler): scheduler Dra simulations' is related to the PR's main objective but is somewhat generic. It describes the general fix category (DRA simulations) but doesn't specify the particular improvements like resourceClaims handling, maxNodeResources fixes, or previous ResourceClaimInfo persistence.
Description check ✅ Passed The description provides clear bullet points explaining the four main fixes: resourceClaims integration, maxNodeResources fix, pod-resourceClaim relationship calculation, and previous ResourceClaimInfo persistence. The description follows the template structure with Description, Related Issues, and Checklist sections completed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@github-actions bot commented Feb 4, 2026

📊 Performance Benchmark Results

Comparing PR (DavidL/32885/fixDraSimulations) vs main branch:

goos: linux
goarch: amd64
pkg: github.com/NVIDIA/KAI-scheduler/pkg/scheduler/actions
cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
                                    │ main-bench.txt │            pr-bench.txt            │
                                    │     sec/op     │    sec/op     vs base              │
AllocateAction_SmallCluster-4           108.2m ±  1%   108.2m ± 11%       ~ (p=0.818 n=6)
AllocateAction_MediumCluster-4          135.2m ±  2%   136.8m ±  2%       ~ (p=0.485 n=6)
AllocateAction_LargeCluster-4           224.4m ± 20%   221.5m ± 28%       ~ (p=0.699 n=6)
ReclaimAction_SmallCluster-4            102.9m ±  0%   102.9m ±  0%       ~ (p=0.065 n=6)
ReclaimAction_MediumCluster-4           105.6m ±  0%   105.5m ±  0%       ~ (p=0.394 n=6)
PreemptAction_SmallCluster-4            103.7m ±  0%   103.7m ±  0%       ~ (p=0.310 n=6)
PreemptAction_MediumCluster-4           112.3m ±  0%   112.1m ±  0%       ~ (p=0.310 n=6)
ConsolidationAction_SmallCluster-4      113.5m ±  0%   113.3m ±  0%       ~ (p=0.093 n=6)
ConsolidationAction_MediumCluster-4     202.6m ±  2%   205.3m ±  2%       ~ (p=0.240 n=6)
FullSchedulingCycle_SmallCluster-4      105.3m ±  0%   105.3m ±  0%       ~ (p=0.699 n=6)
FullSchedulingCycle_MediumCluster-4     118.7m ±  0%   118.5m ±  0%       ~ (p=0.240 n=6)
FullSchedulingCycle_LargeCluster-4      159.0m ±  1%   159.4m ±  1%       ~ (p=0.310 n=6)
ManyQueues_MediumCluster-4              139.7m ±  0%   139.5m ±  0%       ~ (p=0.240 n=6)
GangScheduling_MediumCluster-4          154.3m ±  2%   155.1m ±  3%       ~ (p=0.589 n=6)
geomean                                 130.4m         130.5m        +0.10%

                                    │ main-bench.txt │            pr-bench.txt            │
                                    │      B/op      │     B/op      vs base              │
AllocateAction_SmallCluster-4           2.154Mi ± 0%   2.158Mi ± 1%  +0.20% (p=0.002 n=6)
AllocateAction_MediumCluster-4          11.84Mi ± 0%   11.87Mi ± 0%  +0.21% (p=0.002 n=6)
AllocateAction_LargeCluster-4           41.54Mi ± 0%   41.60Mi ± 0%  +0.15% (p=0.002 n=6)
ReclaimAction_SmallCluster-4            888.9Ki ± 1%   892.7Ki ± 1%       ~ (p=0.180 n=6)
ReclaimAction_MediumCluster-4           2.828Mi ± 0%   2.842Mi ± 0%  +0.50% (p=0.002 n=6)
PreemptAction_SmallCluster-4            1.003Mi ± 0%   1.008Mi ± 1%       ~ (p=0.093 n=6)
PreemptAction_MediumCluster-4           4.018Mi ± 0%   4.032Mi ± 0%  +0.36% (p=0.002 n=6)
ConsolidationAction_SmallCluster-4      5.604Mi ± 0%   5.613Mi ± 0%  +0.16% (p=0.004 n=6)
ConsolidationAction_MediumCluster-4     46.89Mi ± 0%   46.93Mi ± 0%  +0.09% (p=0.002 n=6)
FullSchedulingCycle_SmallCluster-4      1.371Mi ± 0%   1.377Mi ± 0%  +0.42% (p=0.041 n=6)
FullSchedulingCycle_MediumCluster-4     6.836Mi ± 0%   6.855Mi ± 0%  +0.27% (p=0.002 n=6)
FullSchedulingCycle_LargeCluster-4      22.83Mi ± 0%   22.88Mi ± 0%  +0.20% (p=0.002 n=6)
ManyQueues_MediumCluster-4              16.31Mi ± 0%   16.33Mi ± 0%  +0.14% (p=0.002 n=6)
GangScheduling_MediumCluster-4          17.17Mi ± 0%   17.22Mi ± 0%  +0.28% (p=0.002 n=6)
geomean                                 6.328Mi        6.346Mi       +0.28%

                                    │ main-bench.txt │           pr-bench.txt            │
                                    │   allocs/op    │  allocs/op   vs base              │
AllocateAction_SmallCluster-4            36.21k ± 0%   36.41k ± 0%  +0.55% (p=0.002 n=6)
AllocateAction_MediumCluster-4           325.2k ± 0%   326.0k ± 0%  +0.25% (p=0.002 n=6)
AllocateAction_LargeCluster-4            1.394M ± 0%   1.396M ± 0%  +0.14% (p=0.002 n=6)
ReclaimAction_SmallCluster-4             8.396k ± 0%   8.498k ± 0%  +1.22% (p=0.002 n=6)
ReclaimAction_MediumCluster-4            26.54k ± 0%   26.94k ± 0%  +1.51% (p=0.002 n=6)
PreemptAction_SmallCluster-4             11.19k ± 0%   11.29k ± 0%  +0.92% (p=0.002 n=6)
PreemptAction_MediumCluster-4            38.77k ± 0%   39.17k ± 0%  +1.04% (p=0.002 n=6)
ConsolidationAction_SmallCluster-4       73.57k ± 0%   73.85k ± 0%  +0.38% (p=0.002 n=6)
ConsolidationAction_MediumCluster-4      685.9k ± 0%   687.1k ± 0%  +0.18% (p=0.002 n=6)
FullSchedulingCycle_SmallCluster-4       21.36k ± 0%   21.52k ± 0%  +0.72% (p=0.002 n=6)
FullSchedulingCycle_MediumCluster-4      174.7k ± 0%   175.3k ± 0%  +0.35% (p=0.002 n=6)
FullSchedulingCycle_LargeCluster-4       727.3k ± 0%   728.8k ± 0%  +0.21% (p=0.002 n=6)
ManyQueues_MediumCluster-4               363.3k ± 0%   364.1k ± 0%  +0.22% (p=0.002 n=6)
GangScheduling_MediumCluster-4           597.0k ± 0%   598.6k ± 0%  +0.27% (p=0.002 n=6)
geomean                                  111.7k        112.3k       +0.57%

Legend

  • 📉 Negative delta = Performance improvement (faster)
  • 📈 Positive delta = Performance regression (slower)
  • p-value < 0.05 indicates statistically significant change
Raw benchmark data

PR branch:

goos: linux
goarch: amd64
pkg: github.com/NVIDIA/KAI-scheduler/pkg/scheduler/actions
cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
BenchmarkAllocateAction_SmallCluster-4         	       9	 119971376 ns/op	 2289164 B/op	   36426 allocs/op
BenchmarkAllocateAction_SmallCluster-4         	      10	 108289884 ns/op	 2262615 B/op	   36404 allocs/op
BenchmarkAllocateAction_SmallCluster-4         	      10	 108210905 ns/op	 2261983 B/op	   36404 allocs/op
BenchmarkAllocateAction_SmallCluster-4         	      10	 107546557 ns/op	 2262960 B/op	   36408 allocs/op
BenchmarkAllocateAction_SmallCluster-4         	      10	 107558076 ns/op	 2261331 B/op	   36404 allocs/op
BenchmarkAllocateAction_SmallCluster-4         	      10	 108282800 ns/op	 2263927 B/op	   36411 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 134737599 ns/op	12446391 B/op	  326004 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 139491130 ns/op	12442751 B/op	  325991 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 138953352 ns/op	12442533 B/op	  325995 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 139268280 ns/op	12443429 B/op	  325999 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 133992099 ns/op	12442964 B/op	  325993 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 134477537 ns/op	12444709 B/op	  325985 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 218560931 ns/op	43623384 B/op	 1396308 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 226418829 ns/op	43642505 B/op	 1396289 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 283786219 ns/op	43621297 B/op	 1396292 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 224488273 ns/op	43626158 B/op	 1396283 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 213881738 ns/op	43619992 B/op	 1396285 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 216535159 ns/op	43620264 B/op	 1396289 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102875704 ns/op	  908541 B/op	    8468 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102827020 ns/op	  914476 B/op	    8490 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102906860 ns/op	  913504 B/op	    8498 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102802311 ns/op	  913719 B/op	    8498 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102892956 ns/op	  922522 B/op	    8499 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102824034 ns/op	  917536 B/op	    8500 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105329845 ns/op	 2982308 B/op	   26941 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105488675 ns/op	 2978389 B/op	   26939 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105702494 ns/op	 2982437 B/op	   26941 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105720760 ns/op	 2974421 B/op	   26938 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105490829 ns/op	 2974533 B/op	   26938 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105434493 ns/op	 2982564 B/op	   26943 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103765394 ns/op	 1051338 B/op	   11287 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103654571 ns/op	 1055405 B/op	   11289 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103768443 ns/op	 1059016 B/op	   11289 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103678400 ns/op	 1061922 B/op	   11289 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103671944 ns/op	 1059629 B/op	   11292 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103663227 ns/op	 1055236 B/op	   11288 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 112321203 ns/op	 4228318 B/op	   39174 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 112217842 ns/op	 4223840 B/op	   39171 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 112158565 ns/op	 4223806 B/op	   39171 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 112139040 ns/op	 4228400 B/op	   39174 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 112066722 ns/op	 4228142 B/op	   39173 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 112118075 ns/op	 4228531 B/op	   39174 allocs/op
BenchmarkConsolidationAction_SmallCluster-4    	       9	 113337147 ns/op	 5886368 B/op	   73844 allocs/op
BenchmarkConsolidationAction_SmallCluster-4    	       9	 113442694 ns/op	 5881027 B/op	   73827 allocs/op
BenchmarkConsolidationAction_SmallCluster-4    	       9	 113294517 ns/op	 5895984 B/op	   73853 allocs/op
BenchmarkConsolidationAction_SmallCluster-4    	       9	 113154643 ns/op	 5887139 B/op	   73869 allocs/op

Main branch:

goos: linux
goarch: amd64
pkg: github.com/NVIDIA/KAI-scheduler/pkg/scheduler/actions
cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
BenchmarkAllocateAction_SmallCluster-4         	      10	 109006532 ns/op	 2259839 B/op	   36214 allocs/op
BenchmarkAllocateAction_SmallCluster-4         	      10	 108575052 ns/op	 2260101 B/op	   36207 allocs/op
BenchmarkAllocateAction_SmallCluster-4         	      10	 108289279 ns/op	 2259322 B/op	   36211 allocs/op
BenchmarkAllocateAction_SmallCluster-4         	      10	 108206988 ns/op	 2255631 B/op	   36204 allocs/op
BenchmarkAllocateAction_SmallCluster-4         	      10	 107959962 ns/op	 2257428 B/op	   36204 allocs/op
BenchmarkAllocateAction_SmallCluster-4         	      10	 107994069 ns/op	 2254784 B/op	   36201 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 135644454 ns/op	12417930 B/op	  325195 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 137373267 ns/op	12418747 B/op	  325187 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 134709406 ns/op	12416247 B/op	  325190 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 134021311 ns/op	12417247 B/op	  325195 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 137006917 ns/op	12416827 B/op	  325190 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 134112756 ns/op	12418114 B/op	  325197 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 218571073 ns/op	43557035 B/op	 1394294 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 228361967 ns/op	43557556 B/op	 1394298 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 229834305 ns/op	43558878 B/op	 1394304 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       4	 268769722 ns/op	43554976 B/op	 1394283 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 220528580 ns/op	43564595 B/op	 1394302 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 212497340 ns/op	43556329 B/op	 1394285 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 103029652 ns/op	  905532 B/op	    8369 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102835455 ns/op	  907077 B/op	    8385 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102943084 ns/op	  914124 B/op	    8398 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102914704 ns/op	  910165 B/op	    8395 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102963855 ns/op	  910276 B/op	    8396 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102851584 ns/op	  915226 B/op	    8396 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105690433 ns/op	 2961788 B/op	   26537 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105686582 ns/op	 2969468 B/op	   26540 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105644726 ns/op	 2965567 B/op	   26538 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105606985 ns/op	 2965615 B/op	   26538 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105634644 ns/op	 2969386 B/op	   26539 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105533160 ns/op	 2961642 B/op	   26536 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103620016 ns/op	 1048140 B/op	   11185 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103694326 ns/op	 1056028 B/op	   11189 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103620157 ns/op	 1051917 B/op	   11186 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103657392 ns/op	 1051986 B/op	   11187 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103699628 ns/op	 1054871 B/op	   11186 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103667977 ns/op	 1051928 B/op	   11187 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 112319235 ns/op	 4211023 B/op	   38770 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 112230665 ns/op	 4215323 B/op	   38771 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 112463048 ns/op	 4215373 B/op	   38773 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 112007983 ns/op	 4215188 B/op	   38771 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 112403909 ns/op	 4210955 B/op	   38769 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 112145171 ns/op	 4210939 B/op	   38770 allocs/op
BenchmarkConsolidationAction_SmallCluster-4    	       9	 113313938 ns/op	 5877696 B/op	   73572 allocs/op
BenchmarkConsolidationAction_SmallCluster-4    	       9	 113586268 ns/op	 5875883 B/op	   73559 allocs/op
BenchmarkConsolidationAction_SmallCluster-4    	       9	 113373952 ns/op	 5877410 B/op	   73597 allocs/op
BenchmarkConsolidationAction_SmallCluster-4    	       9	 113257059 ns/op	 5874836 B/op	   73556 allocs/op

@github-actions bot commented Feb 4, 2026

Merging this branch changes the coverage (3 decrease, 3 increase)

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/KAI-scheduler/pkg/scheduler/api/bindrequest_info 0.00% (ø)
github.com/NVIDIA/KAI-scheduler/pkg/scheduler/api/pod_info 58.57% (-1.43%) 👎
github.com/NVIDIA/KAI-scheduler/pkg/scheduler/api/resource_info 36.48% (-4.37%) 👎
github.com/NVIDIA/KAI-scheduler/pkg/scheduler/cache 34.25% (ø)
github.com/NVIDIA/KAI-scheduler/pkg/scheduler/cache/cluster_info 83.64% (+3.30%) 👍
github.com/NVIDIA/KAI-scheduler/pkg/scheduler/framework 33.50% (+0.62%) 👍
github.com/NVIDIA/KAI-scheduler/pkg/scheduler/k8s_internal/predicates 59.46% (+0.84%) 👍
github.com/NVIDIA/KAI-scheduler/pkg/scheduler/plugins/dynamicresources 28.21% (-5.64%) 👎
github.com/NVIDIA/KAI-scheduler/pkg/scheduler/test_utils 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/KAI-scheduler/pkg/scheduler/api/bindrequest_info/binrequest_info.go 0.00% (ø) 28 (+6) 0 28 (+6)
github.com/NVIDIA/KAI-scheduler/pkg/scheduler/api/pod_info/pod_info.go 50.00% (-1.18%) 142 (+15) 71 (+6) 71 (+9) 👎
github.com/NVIDIA/KAI-scheduler/pkg/scheduler/api/resource_info/dra_resource_utils.go 0.00% (ø) 34 (+34) 0 34 (+34)
github.com/NVIDIA/KAI-scheduler/pkg/scheduler/cache/cache.go 52.71% (ø) 129 68 61
github.com/NVIDIA/KAI-scheduler/pkg/scheduler/cache/cluster_info/cluster_info.go 85.66% (+5.32%) 265 (-30) 227 (-10) 38 (-20) 👍
github.com/NVIDIA/KAI-scheduler/pkg/scheduler/framework/operations.go 100.00% (ø) 12 12 0
github.com/NVIDIA/KAI-scheduler/pkg/scheduler/framework/statement.go 50.35% (+1.24%) 288 (+9) 145 (+8) 143 (+1) 👍
github.com/NVIDIA/KAI-scheduler/pkg/scheduler/k8s_internal/predicates/maxNodeResources.go 94.59% (+0.48%) 37 (+3) 35 (+3) 2 👍
github.com/NVIDIA/KAI-scheduler/pkg/scheduler/k8s_internal/predicates/predicates.go 0.00% (ø) 48 0 48
github.com/NVIDIA/KAI-scheduler/pkg/scheduler/plugins/dynamicresources/dynamicresources.go 28.21% (-5.64%) 156 (+26) 44 112 (+26) 👎
github.com/NVIDIA/KAI-scheduler/pkg/scheduler/test_utils/test_utils_builder.go 0.00% (ø) 129 (+15) 0 129 (+15)

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/NVIDIA/KAI-scheduler/pkg/scheduler/k8s_internal/predicates/maxNodeResources_test.go

@davidLif (Collaborator, Author) commented Feb 5, 2026

@coderabbitai full review

@coderabbitai bot commented Feb 5, 2026

✅ Actions performed

Full review triggered.

@coderabbitai bot left a comment

Actionable comments posted: 7

🤖 Fix all issues with AI agents
In `@pkg/scheduler/api/pod_info/pod_info.go`:
- Around line 178-182: NewTaskInfoWithBindRequest currently swallows errors from
calcResourceClaimInfo and returns nil, causing callers (e.g., code that reads
PodInfo.ResReq.GPUs(), PodInfo.NodeName, or TaskInfo.Status) to dereference nil;
change NewTaskInfoWithBindRequest to propagate the error instead of returning
nil by updating its signature to return (*PodInfo, error), return the error from
calcResourceClaimInfo, and adjust all callers (those that call
NewTaskInfoWithBindRequest and then access PodInfo fields) to handle the error
path (check for non-nil error before dereferencing, return/handle failure
appropriately). Ensure references to PodInfo, NewTaskInfoWithBindRequest,
calcResourceClaimInfo, and TaskInfo.Status are updated consistently.

In `@pkg/scheduler/api/resource_info/dra_resource_utils.go`:
- Around line 59-65: Move the existence check before dereferencing the claim:
when retrieving claim := draClaimMap[*claimReference.ResourceClaimName] you must
first test found (the boolean) before accessing claim.Namespace; only after if
found { if claim.Namespace != pod.Namespace { continue }
addClaimToPodClaimMap(claim, pod.UID, podsToClaimsMap) } so that claim is never
dereferenced when nil.
- Around line 17-23: ResourceClaimSliceToMap currently keys ResourceClaims only
by Name causing collisions across namespaces; change it to use a composite key
like "<namespace>/<name>" (e.g., draClaim.Namespace + "/" + draClaim.Name) so
namespace-scoped ResourceClaims are uniquely identified, and update any callers
that expect the old key (e.g., lookups in GetDraPodClaims) to use the same
composite key; alternatively restrict callers to pass only namespace-filtered
claims, but the simplest fix is to modify ResourceClaimSliceToMap to build keys
as namespace/name.

In `@pkg/scheduler/framework/statement.go`:
- Line 137: commitEvict captures previousResourceClaimInfo by direct assignment
which risks later mutation; change the capture to clone/deep-copy
reclaimee.ResourceClaimInfo just like Evict and Pipeline do so the saved
previousResourceClaimInfo is immutable if unevict is later called. Locate the
previousResourceClaimInfo := reclaimee.ResourceClaimInfo in commitEvict and
replace it with the same cloning call or helper used in Evict/Pipeline (use the
existing clone/deep-copy function for ResourceClaimInfo) so the restored state
in unevict is correct.

In `@pkg/scheduler/plugins/dynamicresources/dynamicresources.go`:
- Around line 289-294: The code reads claim.Status.Allocation.Devices.Results[0]
into deviceAllocated without guarding against an empty slice which can panic;
update the logic around deviceAllocated (the claim.Status.Allocation != nil
check and the assignment to deviceAllocated) to first verify
claim.Status.Allocation.Devices is non-nil and
len(claim.Status.Allocation.Devices.Results) > 0 before indexing, otherwise
leave deviceAllocated as empty (or use a safe default), then log as before with
log.InfraLogger.V(3).Infof to avoid runtime panics.
- Around line 347-365: In validateAllocatedState, avoid the out-of-bounds access
and use idiomatic set creation: when iterating resourceClaims (in function
validateAllocatedState) check that resourceClaim.Status.Allocation is non-nil
and that Allocation.Devices.Results has length > 0 before referencing
Results[0].Device; replace make(sets.Set[string]) usages with sets.New[string]()
for both resourceClaimDevices and allocatedStateDevices to follow idiomatic
initialization; keep the existing mismatch logging but ensure you only insert
device IDs after the nil/length checks when populating resourceClaimDevices and
allocatedStateDevices from allocatedState.AllocatedDevices.
- Around line 318-321: The code reads
claim.Status.Allocation.Devices.Results[0].Device into deviceRemoved without
checking for nil/empty slices; update the block in dynamicresources.go (the
handling around deviceRemoved and claim.Status.Allocation) to first nil-check
claim.Status.Allocation and claim.Status.Allocation.Devices and verify
len(claim.Status.Allocation.Devices.Results) > 0 before accessing [0], and
handle the empty case (e.g., leave deviceRemoved empty or log/return) to avoid a
potential panic when Results has no entries.
🧹 Nitpick comments (8)
CHANGELOG.md (1)

45-46: Consider adding author attribution for consistency.

Other entries in this changelog include author attributions (e.g., [davidLif](https://github.com/davidLif)). Consider adding the author link for consistency with the surrounding entries.

Suggested change
-- fix DRA scheduler simulations and the maxNodeResources predicate [`#966`](https://github.com/NVIDIA/KAI-Scheduler/pull/966) 
+- fix DRA scheduler simulations and the maxNodeResources predicate [`#966`](https://github.com/NVIDIA/KAI-Scheduler/pull/966) [davidLif](https://github.com/davidLif)
pkg/scheduler/test_utils/test_utils_builder.go (3)

344-368: Incorrect APIVersion for ResourceClaim.

The APIVersion is set to "v1" but ResourceClaim belongs to the resource.k8s.io/v1 API group. While this may not affect tests using fake clients, it's technically incorrect.

♻️ Proposed fix
 		resourceClaimObject := resourceapi.ResourceClaim{
 			TypeMeta: metav1.TypeMeta{
 				Kind:       "ResourceClaim",
-				APIVersion: "v1",
+				APIVersion: "resource.k8s.io/v1",
 			},

378-416: Incorrect APIVersion for ResourceSlice.

Same issue as ResourceClaim - the APIVersion should be "resource.k8s.io/v1" instead of "v1".

♻️ Proposed fix
 		resourceSliceObject := resourceapi.ResourceSlice{
 			TypeMeta: metav1.TypeMeta{
 				Kind:       "ResourceSlice",
-				APIVersion: "v1",
+				APIVersion: "resource.k8s.io/v1",
 			},

418-435: Incorrect APIVersion for DeviceClass.

Same issue - the APIVersion should be "resource.k8s.io/v1" instead of "v1".

♻️ Proposed fix
 		deviceClassObject := resourceapi.DeviceClass{
 			TypeMeta: metav1.TypeMeta{
 				Kind:       "DeviceClass",
-				APIVersion: "v1",
+				APIVersion: "resource.k8s.io/v1",
 			},
pkg/scheduler/api/bindrequest_info/binrequest_info.go (1)

57-66: Map iteration order is non-deterministic - consider sorting for determinism.

The ToSlice() method iterates over a map without sorting, which produces different slice orders on each execution. While this doesn't affect functionality (consumers process claims independently and tests don't rely on order), sorting before returning would improve test determinism and debugging consistency.
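
If determinism is desired, one option is to sort the claim names before flattening. A minimal sketch follows, assuming the map-based ResourceClaimInfo and ResourceClaimAllocation types sketched earlier in this thread plus the standard library "sort" package; names are illustrative, not the repository's code.

// Hypothetical sketch only: deterministic variant of ToSlice that sorts claim
// names so the output order is stable across runs.
func (rci ResourceClaimInfo) ToSlice() []*ResourceClaimAllocation {
	names := make([]string, 0, len(rci))
	for name := range rci {
		names = append(names, name)
	}
	sort.Strings(names) // stable ordering aids test determinism and log comparison
	out := make([]*ResourceClaimAllocation, 0, len(rci))
	for _, name := range names {
		out = append(out, rci[name])
	}
	return out
}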

pkg/scheduler/k8s_internal/predicates/maxNodeResources_test.go (1)

50-50: Consider adding tests with non-empty resourceClaims.

All test cases use empty resourceClaims slices. Consider adding test cases that verify behavior when pods reference ResourceClaims to ensure the DRA integration works correctly.

pkg/scheduler/k8s_internal/predicates/maxNodeResources.go (2)

20-20: Import should be grouped with other external dependencies.

Per coding guidelines, imports should be organized in three groups: (1) Standard library, (2) External dependencies, (3) Internal packages. The resourceapi import should be moved up with other k8s.io imports.

♻️ Suggested import organization
 import (
 	"context"
 	"fmt"
 	"strings"

 	"github.com/dustin/go-humanize"
 	v1 "k8s.io/api/core/v1"
+	resourceapi "k8s.io/api/resource/v1"
 	"k8s.io/apimachinery/pkg/types"
 	ksf "k8s.io/kube-scheduler/framework"
 	k8sframework "k8s.io/kubernetes/pkg/scheduler/framework"

 	"github.com/NVIDIA/KAI-scheduler/pkg/scheduler/api/node_info"
 	"github.com/NVIDIA/KAI-scheduler/pkg/scheduler/api/pod_info"
 	"github.com/NVIDIA/KAI-scheduler/pkg/scheduler/api/resource_info"
-	resourceapi "k8s.io/api/resource/v1"
 )

93-93: TODO: Incomplete DRA resource slice validation.

This TODO indicates that validation of resource slice capacity against pod claim requests is not yet implemented. Tracking for future completion.

Would you like me to open an issue to track implementing this resource slice validation?

Comment on lines +178 to +182
	resourceClaimInfo, err := calcResourceClaimInfo(draPodClaims, pod)
	if err != nil {
		log.InfraLogger.Errorf("PodInfo ctor failure - failed to calculate resource claim info for pod %s/%s: %v", pod.Namespace, pod.Name, err)
		return nil
	}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find all callers of NewTaskInfo and NewTaskInfoWithBindRequest
rg -n 'NewTaskInfo(WithBindRequest)?\s*\(' --type=go -g '!*_test.go' | head -50

Repository: NVIDIA/KAI-Scheduler

Length of output: 943


🏁 Script executed:

#!/bin/bash
# Examine each caller to see if they handle nil returns
echo "=== maxNodeResources.go:63 ==="
sed -n '55,75p' pkg/scheduler/k8s_internal/predicates/maxNodeResources.go

echo -e "\n=== cluster_info.go:438 ==="
sed -n '430,450p' pkg/scheduler/cache/cluster_info/cluster_info.go

echo -e "\n=== cluster_info.go:459 ==="
sed -n '450,475p' pkg/scheduler/cache/cluster_info/cluster_info.go

echo -e "\n=== jobs_fake.go:200 ==="
sed -n '190,210p' pkg/scheduler/test_utils/jobs_fake/jobs.go

Repository: NVIDIA/KAI-Scheduler

Length of output: 3813


🏁 Script executed:

#!/bin/bash
# Examine the calcResourceClaimInfo function to see when it returns error
sed -n '211,240p' pkg/scheduler/api/pod_info/pod_info.go

Repository: NVIDIA/KAI-Scheduler

Length of output: 1343


🏁 Script executed:

#!/bin/bash
# Check the NewTaskInfoWithBindRequest function to see the exact line where nil is returned on error
sed -n '170,210p' pkg/scheduler/api/pod_info/pod_info.go

Repository: NVIDIA/KAI-Scheduler

Length of output: 1869


Fix nil pointer dereference risk by handling error cases in NewTaskInfoWithBindRequest.

The function returns nil when calcResourceClaimInfo fails (e.g., when GetResourceClaimName encounters an error). All four callers directly dereference the returned *PodInfo without nil checks:

  • maxNodeResources.go:63 uses podInfo.ResReq.GPUs()
  • cluster_info.go:438 returns podInfo directly
  • cluster_info.go:459 uses podInfo.NodeName
  • jobs_fake.go:200 uses taskInfo.Status

This will cause panics if calcResourceClaimInfo fails. Either handle the error within the constructor or propagate it to callers so they can gracefully handle pod creation failures.

🤖 Prompt for AI Agents
In `@pkg/scheduler/api/pod_info/pod_info.go` around lines 178 - 182,
NewTaskInfoWithBindRequest currently swallows errors from calcResourceClaimInfo
and returns nil, causing callers (e.g., code that reads PodInfo.ResReq.GPUs(),
PodInfo.NodeName, or TaskInfo.Status) to dereference nil; change
NewTaskInfoWithBindRequest to propagate the error instead of returning nil by
updating its signature to return (*PodInfo, error), return the error from
calcResourceClaimInfo, and adjust all callers (those that call
NewTaskInfoWithBindRequest and then access PodInfo fields) to handle the error
path (check for non-nil error before dereferencing, return/handle failure
appropriately). Ensure references to PodInfo, NewTaskInfoWithBindRequest,
calcResourceClaimInfo, and TaskInfo.Status are updated consistently.
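
For reference, below is a minimal, self-contained sketch of the error-propagation pattern this comment asks for. PodInfo and the helper here are stand-ins for illustration, not the repository's actual types or signatures.

// Hypothetical sketch only: the constructor returns (*PodInfo, error) so callers
// must handle failures instead of receiving a silent nil.
package podinfosketch

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1"
)

type PodInfo struct {
	Namespace, Name string
	Claims          map[string]*resourceapi.ResourceClaim
}

// calcResourceClaimInfo is a stand-in: it fails when a claim referenced in the
// pod spec is missing from the provided slice.
func calcResourceClaimInfo(claims []*resourceapi.ResourceClaim, pod *v1.Pod) (map[string]*resourceapi.ResourceClaim, error) {
	byName := map[string]*resourceapi.ResourceClaim{}
	for _, c := range claims {
		byName[c.Name] = c
	}
	result := map[string]*resourceapi.ResourceClaim{}
	for _, ref := range pod.Spec.ResourceClaims {
		if ref.ResourceClaimName == nil {
			continue
		}
		c, ok := byName[*ref.ResourceClaimName]
		if !ok {
			return nil, fmt.Errorf("claim %q referenced by pod %s/%s not found", *ref.ResourceClaimName, pod.Namespace, pod.Name)
		}
		result[c.Name] = c
	}
	return result, nil
}

// NewTaskInfoWithBindRequest propagates the inner error; callers check err before use.
func NewTaskInfoWithBindRequest(pod *v1.Pod, claims ...*resourceapi.ResourceClaim) (*PodInfo, error) {
	claimInfo, err := calcResourceClaimInfo(claims, pod)
	if err != nil {
		return nil, err
	}
	return &PodInfo{Namespace: pod.Namespace, Name: pod.Name, Claims: claimInfo}, nil
}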

Comment on lines +59 to +65
		claim, found := draClaimMap[*claimReference.ResourceClaimName]
		if claim.Namespace != pod.Namespace {
			continue
		}
		if found {
			addClaimToPodClaimMap(claim, pod.UID, podsToClaimsMap)
		}

⚠️ Potential issue | 🔴 Critical

Nil pointer dereference: claim accessed before checking found.

Line 60 accesses claim.Namespace but claim may be nil when found is false. The found check on Line 63 should happen before accessing claim.

🐛 Proposed fix
 	for _, claimReference := range pod.Spec.ResourceClaims {
 		if claimReference.ResourceClaimName == nil {
 			continue
 		}
 		claim, found := draClaimMap[*claimReference.ResourceClaimName]
-		if claim.Namespace != pod.Namespace {
-			continue
-		}
-		if found {
+		if found && claim.Namespace == pod.Namespace {
 			addClaimToPodClaimMap(claim, pod.UID, podsToClaimsMap)
 		}
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
claim, found := draClaimMap[*claimReference.ResourceClaimName]
if claim.Namespace != pod.Namespace {
continue
}
if found {
addClaimToPodClaimMap(claim, pod.UID, podsToClaimsMap)
}
claim, found := draClaimMap[*claimReference.ResourceClaimName]
if found && claim.Namespace == pod.Namespace {
addClaimToPodClaimMap(claim, pod.UID, podsToClaimsMap)
}
🤖 Prompt for AI Agents
In `@pkg/scheduler/api/resource_info/dra_resource_utils.go` around lines 59 - 65,
Move the existence check before dereferencing the claim: when retrieving claim
:= draClaimMap[*claimReference.ResourceClaimName] you must first test found (the
boolean) before accessing claim.Namespace; only after if found { if
claim.Namespace != pod.Namespace { continue } addClaimToPodClaimMap(claim,
pod.UID, podsToClaimsMap) } so that claim is never dereferenced when nil.


	previousStatus := reclaimee.Status
	previousGpuGroup := reclaimee.GPUGroups
	previousResourceClaimInfo := reclaimee.ResourceClaimInfo

⚠️ Potential issue | 🟡 Minor

Inconsistent: commitEvict does not clone ResourceClaimInfo.

In Evict (Line 84) and Pipeline (Line 241), ResourceClaimInfo is cloned before capture. Here, it's assigned directly. If unevict is called and the task's ResourceClaimInfo was modified in between, the restoration may be incorrect.

🛡️ Suggested fix for consistency
 	previousStatus := reclaimee.Status
 	previousGpuGroup := reclaimee.GPUGroups
-	previousResourceClaimInfo := reclaimee.ResourceClaimInfo
+	var previousResourceClaimInfo bindrequest_info.ResourceClaimInfo
+	if reclaimee.ResourceClaimInfo != nil {
+		previousResourceClaimInfo = reclaimee.ResourceClaimInfo.Clone()
+	}
 	previousIsVirtualStatus := reclaimee.IsVirtualStatus
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
previousResourceClaimInfo := reclaimee.ResourceClaimInfo
previousStatus := reclaimee.Status
previousGpuGroup := reclaimee.GPUGroups
var previousResourceClaimInfo bindrequest_info.ResourceClaimInfo
if reclaimee.ResourceClaimInfo != nil {
previousResourceClaimInfo = reclaimee.ResourceClaimInfo.Clone()
}
previousIsVirtualStatus := reclaimee.IsVirtualStatus
🤖 Prompt for AI Agents
In `@pkg/scheduler/framework/statement.go` at line 137, commitEvict captures
previousResourceClaimInfo by direct assignment which risks later mutation;
change the capture to clone/deep-copy reclaimee.ResourceClaimInfo just like
Evict and Pipeline do so the saved previousResourceClaimInfo is immutable if
unevict is later called. Locate the previousResourceClaimInfo :=
reclaimee.ResourceClaimInfo in commitEvict and replace it with the same cloning
call or helper used in Evict/Pipeline (use the existing clone/deep-copy function
for ResourceClaimInfo) so the restored state in unevict is correct.

Comment on lines +289 to +294
	deviceAllocated := ""
	if claim.Status.Allocation != nil {
		deviceAllocated = claim.Status.Allocation.Devices.Results[0].Device
	}

	log.InfraLogger.V(3).Infof("Allocated claim <%s/%s>, device <%s>, from memory: %t.", task.Namespace, claimName, deviceAllocated, allocatedFromMemory)

⚠️ Potential issue | 🟠 Major

Potential panic if Devices.Results is empty.

Accessing Results[0] without checking the slice length could cause a panic if claim.Status.Allocation.Devices.Results is empty.

🐛 Proposed fix
 	deviceAllocated := ""
 	if claim.Status.Allocation != nil {
+		if len(claim.Status.Allocation.Devices.Results) > 0 {
 		deviceAllocated = claim.Status.Allocation.Devices.Results[0].Device
+		}
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
deviceAllocated := ""
if claim.Status.Allocation != nil {
deviceAllocated = claim.Status.Allocation.Devices.Results[0].Device
}
log.InfraLogger.V(3).Infof("Allocated claim <%s/%s>, device <%s>, from memory: %t.", task.Namespace, claimName, deviceAllocated, allocatedFromMemory)
deviceAllocated := ""
if claim.Status.Allocation != nil {
if len(claim.Status.Allocation.Devices.Results) > 0 {
deviceAllocated = claim.Status.Allocation.Devices.Results[0].Device
}
}
log.InfraLogger.V(3).Infof("Allocated claim <%s/%s>, device <%s>, from memory: %t.", task.Namespace, claimName, deviceAllocated, allocatedFromMemory)
🤖 Prompt for AI Agents
In `@pkg/scheduler/plugins/dynamicresources/dynamicresources.go` around lines 289
- 294, The code reads claim.Status.Allocation.Devices.Results[0] into
deviceAllocated without guarding against an empty slice which can panic; update
the logic around deviceAllocated (the claim.Status.Allocation != nil check and
the assignment to deviceAllocated) to first verify
claim.Status.Allocation.Devices is non-nil and
len(claim.Status.Allocation.Devices.Results) > 0 before indexing, otherwise
leave deviceAllocated as empty (or use a safe default), then log as before with
log.InfraLogger.V(3).Infof to avoid runtime panics.

Comment on lines +318 to +321
	var deviceRemoved string
	if claim.Status.Allocation != nil {
		deviceRemoved = claim.Status.Allocation.Devices.Results[0].Device
	}

⚠️ Potential issue | 🟠 Major

Same potential panic when extracting deviceRemoved.

Same issue as the allocation path - accessing Results[0] without a length check.

🐛 Proposed fix
 	var deviceRemoved string
 	if claim.Status.Allocation != nil {
+		if len(claim.Status.Allocation.Devices.Results) > 0 {
 		deviceRemoved = claim.Status.Allocation.Devices.Results[0].Device
+		}
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
var deviceRemoved string
if claim.Status.Allocation != nil {
deviceRemoved = claim.Status.Allocation.Devices.Results[0].Device
}
var deviceRemoved string
if claim.Status.Allocation != nil {
if len(claim.Status.Allocation.Devices.Results) > 0 {
deviceRemoved = claim.Status.Allocation.Devices.Results[0].Device
}
}
🤖 Prompt for AI Agents
In `@pkg/scheduler/plugins/dynamicresources/dynamicresources.go` around lines 318
- 321, The code reads claim.Status.Allocation.Devices.Results[0].Device into
deviceRemoved without checking for nil/empty slices; update the block in
dynamicresources.go (the handling around deviceRemoved and
claim.Status.Allocation) to first nil-check claim.Status.Allocation and
claim.Status.Allocation.Devices and verify
len(claim.Status.Allocation.Devices.Results) > 0 before accessing [0], and
handle the empty case (e.g., leave deviceRemoved empty or log/return) to avoid a
potential panic when Results has no entries.

Comment on lines +347 to +365
func (drap *draPlugin) validateAllocatedState(allocatedState *structured.AllocatedState) {
	resourceClaims, err := drap.manager.ResourceClaims().List()
	if err != nil {
		log.InfraLogger.Errorf("failed to list all resource claims: %v", err)
	}
	resourceClaimDevices := make(sets.Set[string])
	for _, resourceClaim := range resourceClaims {
		if resourceClaim.Status.Allocation != nil {
			resourceClaimDevices.Insert(resourceClaim.Status.Allocation.Devices.Results[0].Device)
		}
	}
	allocatedStateDevices := make(sets.Set[string])
	for deviceID := range allocatedState.AllocatedDevices {
		allocatedStateDevices.Insert(deviceID.Device.String())
	}
	if !resourceClaimDevices.Equal(allocatedStateDevices) {
		log.InfraLogger.Errorf("allocated state does not match resource claim devices: %v", allocatedState.AllocatedDevices)
	}
}

⚠️ Potential issue | 🟠 Major

Validation function has same out-of-bounds risk and uses non-idiomatic set initialization.

  1. Line 355 accesses Results[0] without checking length.
  2. Lines 352 and 358 use make(sets.Set[string]) instead of the idiomatic sets.New[string]().
🐛 Proposed fix
 func (drap *draPlugin) validateAllocatedState(allocatedState *structured.AllocatedState) {
 	resourceClaims, err := drap.manager.ResourceClaims().List()
 	if err != nil {
 		log.InfraLogger.Errorf("failed to list all resource claims: %v", err)
+		return
 	}
-	resourceClaimDevices := make(sets.Set[string])
+	resourceClaimDevices := sets.New[string]()
 	for _, resourceClaim := range resourceClaims {
-		if resourceClaim.Status.Allocation != nil {
+		if resourceClaim.Status.Allocation != nil && len(resourceClaim.Status.Allocation.Devices.Results) > 0 {
 			resourceClaimDevices.Insert(resourceClaim.Status.Allocation.Devices.Results[0].Device)
 		}
 	}
-	allocatedStateDevices := make(sets.Set[string])
+	allocatedStateDevices := sets.New[string]()
 	for deviceID := range allocatedState.AllocatedDevices {
 		allocatedStateDevices.Insert(deviceID.Device.String())
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (drap *draPlugin) validateAllocatedState(allocatedState *structured.AllocatedState) {
resourceClaims, err := drap.manager.ResourceClaims().List()
if err != nil {
log.InfraLogger.Errorf("failed to list all resource claims: %v", err)
}
resourceClaimDevices := make(sets.Set[string])
for _, resourceClaim := range resourceClaims {
if resourceClaim.Status.Allocation != nil {
resourceClaimDevices.Insert(resourceClaim.Status.Allocation.Devices.Results[0].Device)
}
}
allocatedStateDevices := make(sets.Set[string])
for deviceID := range allocatedState.AllocatedDevices {
allocatedStateDevices.Insert(deviceID.Device.String())
}
if !resourceClaimDevices.Equal(allocatedStateDevices) {
log.InfraLogger.Errorf("allocated state does not match resource claim devices: %v", allocatedState.AllocatedDevices)
}
}
func (drap *draPlugin) validateAllocatedState(allocatedState *structured.AllocatedState) {
resourceClaims, err := drap.manager.ResourceClaims().List()
if err != nil {
log.InfraLogger.Errorf("failed to list all resource claims: %v", err)
return
}
resourceClaimDevices := sets.New[string]()
for _, resourceClaim := range resourceClaims {
if resourceClaim.Status.Allocation != nil && len(resourceClaim.Status.Allocation.Devices.Results) > 0 {
resourceClaimDevices.Insert(resourceClaim.Status.Allocation.Devices.Results[0].Device)
}
}
allocatedStateDevices := sets.New[string]()
for deviceID := range allocatedState.AllocatedDevices {
allocatedStateDevices.Insert(deviceID.Device.String())
}
if !resourceClaimDevices.Equal(allocatedStateDevices) {
log.InfraLogger.Errorf("allocated state does not match resource claim devices: %v", allocatedState.AllocatedDevices)
}
}
🤖 Prompt for AI Agents
In `@pkg/scheduler/plugins/dynamicresources/dynamicresources.go` around lines 347
- 365, In validateAllocatedState, avoid the out-of-bounds access and use
idiomatic set creation: when iterating resourceClaims (in function
validateAllocatedState) check that resourceClaim.Status.Allocation is non-nil
and that Allocation.Devices.Results has length > 0 before referencing
Results[0].Device; replace make(sets.Set[string]) usages with sets.New[string]()
for both resourceClaimDevices and allocatedStateDevices to follow idiomatic
initialization; keep the existing mismatch logging but ensure you only insert
device IDs after the nil/length checks when populating resourceClaimDevices and
allocatedStateDevices from allocatedState.AllocatedDevices.

@coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/scheduler/test_utils/test_utils_builder.go (1)

378-434: ⚠️ Potential issue | 🟡 Minor

Use resource.k8s.io/v1 apiVersion for ResourceSlice and DeviceClass.

Both ResourceSlice and DeviceClass are part of Kubernetes' Dynamic Resource Allocation and belong to the resource.k8s.io API group. The apiVersion must include the group: resource.k8s.io/v1, not just v1.

Proposed fix
 		resourceSliceObject := resourceapi.ResourceSlice{
 			TypeMeta: metav1.TypeMeta{
 				Kind:       "ResourceSlice",
-				APIVersion: "v1",
+				APIVersion: "resource.k8s.io/v1",
 			},
@@
 		deviceClassObject := resourceapi.DeviceClass{
 			TypeMeta: metav1.TypeMeta{
 				Kind:       "DeviceClass",
-				APIVersion: "v1",
+				APIVersion: "resource.k8s.io/v1",
 			},
🤖 Fix all issues with AI agents
In `@CHANGELOG.md`:
- Line 45: Update the changelog entry text to past tense and match surrounding
style by changing "- fix DRA scheduler simulations and the maxNodeResources
predicate [`#966`](https://github.com/NVIDIA/KAI-Scheduler/pull/966)" to use
"Fixed DRA scheduler simulations and the MaxNodeResources predicate" so it reads
past-tense and capitalizes the predicate name consistently.

In `@pkg/scheduler/api/bindrequest_info/binrequest_info.go`:
- Around line 43-55: ResourceClaimInfo.Clone() can panic when iterating
ResourceClaimAllocation entries because info.Allocation may be nil; update the
Clone implementation in ResourceClaimInfo.Clone() so that for each info you
check if info.Allocation == nil and set newrci[i].Allocation = nil in that case,
otherwise call info.Allocation.DeepCopy() and assign the result; preserve
copying of Name and ensure newrci entries are allocated as
&schedulingv1alpha2.ResourceClaimAllocation before assigning Allocation.

In `@pkg/scheduler/api/resource_info/dra_resource_utils.go`:
- Around line 17-23: Add a GoDoc comment for the exported function
ResourceClaimSliceToMap: write a comment immediately above the function that
starts with "ResourceClaimSliceToMap" and briefly describes what the function
does (converts a slice of *resourceapi.ResourceClaim into a map keyed by claim
Name), mention the input (draResourceClaims slice) and the return value
(map[string]*resourceapi.ResourceClaim), and keep it concise and formatted as a
proper GoDoc sentence.

In `@pkg/scheduler/k8s_internal/predicates/maxNodeResources.go`:
- Around line 62-64: The code calls resource_info.GetDraPodClaims(...) and then
pod_info.NewTaskInfo(...), but NewTaskInfo can return nil (e.g., when
calcResourceClaimInfo fails) and podInfo is dereferenced afterwards; add a
nil-check after podInfo is created (from pod_info.NewTaskInfo) and handle the
error path (return false / skip this pod / log an error as appropriate for the
predicate) to avoid a panic—locate references to podInfo in this function (and
any subsequent dereferences) and guard them with the nil-check.

In `@pkg/scheduler/test_utils/test_utils_builder.go`:
- Around line 341-376: getResourceClaims is creating resourceapi.ResourceClaim
objects with the wrong APIVersion and no UID, causing collisions in
podsToClaimsMap; update the TypeMeta.APIVersion to "resource.k8s.io/v1" and
populate ObjectMeta.UID from the source test claim (e.g., use resourceClaim.UID
or generate a stable non-empty UID per test claim) so each created
resourceapi.ResourceClaim has a unique metadata. Ensure you set ObjectMeta.UID
on the resourceClaimObject before appending to objects and keep TypeMeta.Kind as
"ResourceClaim".
🧹 Nitpick comments (2)
pkg/scheduler/k8s_internal/predicates/maxNodeResources_test.go (1)

307-314: Consider adding test coverage for DRA claims.

The existing tests verify backward compatibility with empty resourceClaims, but there's no coverage for the new DRA claim filtering logic. Consider adding tests that exercise GetDraPodClaims and the claim-aware code path in a follow-up.

pkg/scheduler/k8s_internal/predicates/maxNodeResources.go (1)

17-21: Import organization does not follow guidelines.

The resourceapi import on line 20 should be grouped with other external dependencies (lines 11-15), not placed after internal imports.

♻️ Proposed fix
 import (
 	"context"
 	"fmt"
 	"strings"

 	"github.com/dustin/go-humanize"
 	v1 "k8s.io/api/core/v1"
+	resourceapi "k8s.io/api/resource/v1"
 	"k8s.io/apimachinery/pkg/types"
 	ksf "k8s.io/kube-scheduler/framework"
 	k8sframework "k8s.io/kubernetes/pkg/scheduler/framework"

 	"github.com/NVIDIA/KAI-scheduler/pkg/scheduler/api/node_info"
 	"github.com/NVIDIA/KAI-scheduler/pkg/scheduler/api/pod_info"
 	"github.com/NVIDIA/KAI-scheduler/pkg/scheduler/api/resource_info"
-	resourceapi "k8s.io/api/resource/v1"
 )

As per coding guidelines: "Organize imports in three groups separated by blank lines: (1) Standard library, (2) External dependencies, (3) Internal packages".

- Fixed a bug in ray gang scheduling where not all worker groups' minMember would be respected [#924](https://github.com/NVIDIA/KAI-Scheduler/pull/924) [itsomri](https://github.com/itsomri)
- cpu-only nodes calculation in DRA enabled clusters [#944](https://github.com/NVIDIA/KAI-Scheduler/pull/944)
- enable DRA flag override fix in snapshot-tool [#955](https://github.com/NVIDIA/KAI-Scheduler/pull/955)
- fix DRA scheduler simulations and the maxNodeResources predicate [#966](https://github.com/NVIDIA/KAI-Scheduler/pull/966)

⚠️ Potential issue | 🟡 Minor

Use past-tense "Fixed" for consistency.

The "Fixed" section uses past-tense entries; consider "Fixed DRA scheduler simulations and the MaxNodeResources predicate" to match the surrounding style.

🤖 Prompt for AI Agents
In `@CHANGELOG.md` at line 45, Update the changelog entry text to past tense and
match surrounding style by changing "- fix DRA scheduler simulations and the
maxNodeResources predicate
[`#966`](https://github.com/NVIDIA/KAI-Scheduler/pull/966)" to use "Fixed DRA
scheduler simulations and the MaxNodeResources predicate" so it reads past-tense
and capitalizes the predicate name consistently.

Comment on lines 43 to 55
func (rci ResourceClaimInfo) Clone() ResourceClaimInfo {
	if rci == nil {
		return nil
	}
	newrci := make(ResourceClaimInfo, len(rci))
	for i, info := range rci {
-		newrci[i] = schedulingv1alpha2.ResourceClaimAllocation{
+		newrci[i] = &schedulingv1alpha2.ResourceClaimAllocation{
			Name:       info.Name,
			Allocation: info.Allocation.DeepCopy(),
		}
	}
	return newrci
}

⚠️ Potential issue | 🟡 Minor

Potential nil pointer dereference in Clone() when Allocation is nil.

ResourceClaimAllocation.Allocation is a pointer that can be nil (per the type definition in bindrequest_types.go). Calling info.Allocation.DeepCopy() on a nil Allocation will panic.

🐛 Proposed fix
 func (rci ResourceClaimInfo) Clone() ResourceClaimInfo {
 	if rci == nil {
 		return nil
 	}
 	newrci := make(ResourceClaimInfo, len(rci))
 	for i, info := range rci {
+		var allocationCopy *v1.AllocationResult
+		if info.Allocation != nil {
+			allocationCopy = info.Allocation.DeepCopy()
+		}
 		newrci[i] = &schedulingv1alpha2.ResourceClaimAllocation{
 			Name:       info.Name,
-			Allocation: info.Allocation.DeepCopy(),
+			Allocation: allocationCopy,
 		}
 	}
 	return newrci
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 func (rci ResourceClaimInfo) Clone() ResourceClaimInfo {
 	if rci == nil {
 		return nil
 	}
 	newrci := make(ResourceClaimInfo, len(rci))
 	for i, info := range rci {
-		newrci[i] = &schedulingv1alpha2.ResourceClaimAllocation{
-			Name:       info.Name,
-			Allocation: info.Allocation.DeepCopy(),
-		}
+		var allocationCopy *v1.AllocationResult
+		if info.Allocation != nil {
+			allocationCopy = info.Allocation.DeepCopy()
+		}
+		newrci[i] = &schedulingv1alpha2.ResourceClaimAllocation{
+			Name:       info.Name,
+			Allocation: allocationCopy,
+		}
 	}
 	return newrci
 }
🤖 Prompt for AI Agents
In `@pkg/scheduler/api/bindrequest_info/binrequest_info.go` around lines 43 - 55,
ResourceClaimInfo.Clone() can panic when iterating ResourceClaimAllocation
entries because info.Allocation may be nil; update the Clone implementation in
ResourceClaimInfo.Clone() so that for each info you check if info.Allocation ==
nil and set newrci[i].Allocation = nil in that case, otherwise call
info.Allocation.DeepCopy() and assign the result; preserve copying of Name and
ensure newrci entries are allocated as
&schedulingv1alpha2.ResourceClaimAllocation before assigning Allocation.

Comment on lines +17 to +23
func ResourceClaimSliceToMap(draResourceClaims []*resourceapi.ResourceClaim) map[string]*resourceapi.ResourceClaim {
	draClaimMap := map[string]*resourceapi.ResourceClaim{}
	for _, draClaim := range draResourceClaims {
		draClaimMap[draClaim.Name] = draClaim
	}
	return draClaimMap
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

Verification scripts executed against NVIDIA/KAI-Scheduler (collapsed outputs omitted):

find . -name "dra_resource_utils.go" -type f
head -30 pkg/scheduler/api/resource_info/dra_resource_utils.go

Add GoDoc comment for exported function ResourceClaimSliceToMap.

This exported function is missing a GoDoc header, which is required by the coding guidelines and impacts generated documentation completeness.

Suggested GoDoc
+// ResourceClaimSliceToMap converts a slice of ResourceClaims into a lookup map.
 func ResourceClaimSliceToMap(draResourceClaims []*resourceapi.ResourceClaim) map[string]*resourceapi.ResourceClaim {
🤖 Prompt for AI Agents
In `@pkg/scheduler/api/resource_info/dra_resource_utils.go` around lines 17 - 23,
add a GoDoc comment for the exported function ResourceClaimSliceToMap: write a
comment immediately above the function that starts with
"ResourceClaimSliceToMap" and briefly describes what the function does (converts
a slice of *resourceapi.ResourceClaim into a map keyed by claim Name), mention
the input (the draResourceClaims slice) and the return value
(map[string]*resourceapi.ResourceClaim), and keep it concise and formatted as a
proper GoDoc sentence.

Comment on lines +62 to 64
	draPodClaims := resource_info.GetDraPodClaims(pod, mnr.resourceClaimsMap, mnr.podsToClaimsMap)
	podInfo := pod_info.NewTaskInfo(pod, draPodClaims...)


⚠️ Potential issue | 🟠 Major

Missing nil check for podInfo can cause panic.

pod_info.NewTaskInfo can return nil if calcResourceClaimInfo fails. The subsequent code dereferences podInfo without checking for nil, which will cause a panic.

🐛 Proposed fix
 	draPodClaims := resource_info.GetDraPodClaims(pod, mnr.resourceClaimsMap, mnr.podsToClaimsMap)
 	podInfo := pod_info.NewTaskInfo(pod, draPodClaims...)
+	if podInfo == nil {
+		return nil, ksf.NewStatus(ksf.UnschedulableAndUnresolvable,
+			fmt.Sprintf("failed to create pod info for pod %s/%s", pod.Namespace, pod.Name))
+	}

 	if podInfo.ResReq.GPUs() > mnr.maxResources.GPUs() {
🤖 Prompt for AI Agents
In `@pkg/scheduler/k8s_internal/predicates/maxNodeResources.go` around lines 62 -
64, the code calls resource_info.GetDraPodClaims(...) and then
pod_info.NewTaskInfo(...), but NewTaskInfo can return nil (e.g., when
calcResourceClaimInfo fails) and podInfo is dereferenced afterwards; add a
nil check after podInfo is created and handle the error path (return an
unschedulable status, skip this pod, or log an error, as appropriate for the
predicate) to avoid a panic. Locate every subsequent dereference of podInfo in
this function and guard it with the nil check.

Comment on lines +341 to +376
 func getResourceClaims(testMetadata TestTopologyBasic) []*resourceapi.ResourceClaim {
 	var objects []*resourceapi.ResourceClaim
 	for _, resourceClaim := range testMetadata.ResourceClaims {
 		resourceClaimObject := resourceapi.ResourceClaim{
 			TypeMeta: metav1.TypeMeta{
-				Kind:       "DeviceClass",
+				Kind:       "ResourceClaim",
 				APIVersion: "v1",
 			},
 			ObjectMeta: metav1.ObjectMeta{
-				Name:            deviceClass,
+				Name:            resourceClaim.Name,
+				Namespace:       resourceClaim.Namespace,
 				ResourceVersion: "0",
+				Labels:          resourceClaim.Labels,
 			},
+			Spec: resourceapi.ResourceClaimSpec{
+				Devices: resourceapi.DeviceClaim{
+					Requests: []resourceapi.DeviceRequest{
+						{
+							Name: "request",
+							Exactly: &resourceapi.ExactDeviceRequest{
+								DeviceClassName: resourceClaim.DeviceClassName,
+								AllocationMode:  resourceapi.DeviceAllocationModeExactCount,
+								Count:           resourceClaim.Count,
+							},
+						},
+					},
+				},
+			},
-			Spec: resourceapi.DeviceClassSpec{},
 		}
-		objects = append(objects, &deviceClassObject)
+		if resourceClaim.ClaimStatus != nil {
+			resourceClaimObject.Status = *resourceClaim.ClaimStatus
+		}
+		objects = append(objects, &resourceClaimObject)
 	}
 	return objects
 }

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Verification scripts executed against NVIDIA/KAI-Scheduler (collapsed outputs omitted):

# Search for podsToClaimsMap to understand how it indexes claims
rg "podsToClaimsMap" -A 3 -B 3
# Search for ResourceClaim usage in the scheduler code
rg "ResourceClaim" -t go --max-count=20 | head -40
# Look for the test_utils_builder file and examine similar patterns
fd "test_utils_builder.go" -x cat -n {} | head -100
# Search for how other resources are created in test utilities with UIDs
rg "types\.UID" -t go -B 2 -A 2 | head -50
# Check how ResourceClaim is created in test utilities
cat -n test/e2e/modules/resources/rd/resource_claim.go
# Search for APIVersion patterns in ResourceClaim creation
rg "APIVersion.*resource" -t go -B 2 -A 2
# Look at how other test utilities set ObjectMeta in test_utils_builder
rg "ObjectMeta.*metav1" -A 10 pkg/scheduler/test_utils/test_utils_builder.go | head -60
# Check the actual addClaimToPodClaimMap implementation
rg "addClaimToPodClaimMap" -A 10 pkg/scheduler/api/resource_info/dra_resource_utils.go
# Check if there's actual use of ResourceClaim APIVersion in the codebase
rg "resource\.k8s\.io" -t go --max-count=10
# Look for Kubernetes API version patterns in actual Kubernetes object creation
rg "APIVersion.*=.*\"" -t go pkg/scheduler/test_utils/test_utils_builder.go | head -20
# Check if there are any ResourceClaim creation patterns in tests that show UID or APIVersion
rg "ResourceClaim" -A 8 pkg/scheduler/test_utils/test_utils_builder.go | head -50
# Look at how the test utilizes ResourceClaims - search for getResourceClaims usage
rg "getResourceClaims" -B 3 -A 3
# Check what types.UID expects and how NamespacedName.String() works
python3 << 'EOF'
# types.UID is just a string type in Kubernetes
# Let's see what NamespacedName.String() produces
namespace = "default"
name = "my-claim"
namespaced_name_string = f"{namespace}/{name}"
print(f"NamespacedName.String() would produce: {namespaced_name_string}")
print(f"This format looks correct for a test UID")
EOF
# Verify if the proposed UID format is actually appropriate by checking similar patterns
rg "types\.UID\(" pkg/scheduler/test_utils/test_utils_builder.go -A 1 -B 1
# Check how UIDs are handled when ResourceClaims are actually used in tests
rg "claim\.UID" -B 3 -A 3 pkg/scheduler/api/resource_info/dra_resource_utils.go
# Verify the current APIVersion value in the code
sed -n '341,376p' pkg/scheduler/test_utils/test_utils_builder.go | grep -A 50 "getResourceClaims"
# Check if types.NamespacedName is available in test_utils_builder.go
rg "types\." pkg/scheduler/test_utils/test_utils_builder.go | head -10
# Verify the exact line numbers and context around the function
sed -n '336,380p' pkg/scheduler/test_utils/test_utils_builder.go
# Check if there are any tests that depend on multiple ResourceClaims to verify the collapse issue
rg "ResourceClaims" pkg/scheduler/test_utils/test_utils_builder.go -B 2 -A 2
# Check the imports in test_utils_builder.go
head -40 pkg/scheduler/test_utils/test_utils_builder.go | grep import -A 30
# Check if apimachinery/pkg/types provides NamespacedName
rg "NamespacedName" pkg/scheduler/test_utils/test_utils_builder.go
# Look for how this pattern is used elsewhere in the codebase
rg "NamespacedName.*String" -B 2 -A 2


Set ResourceClaim UID and correct apiVersion to match resource.k8s.io API group.

The podsToClaimsMap indexes claims by claim.UID; leaving it unset causes all claims to collide at the empty-string key, collapsing multiple test claims into a single entry. Additionally, ResourceClaim objects belong to the resource.k8s.io/v1 API group, not v1.

🐛 Proposed fix
 		resourceClaimObject := resourceapi.ResourceClaim{
 			TypeMeta: metav1.TypeMeta{
 				Kind:       "ResourceClaim",
-				APIVersion: "v1",
+				APIVersion: "resource.k8s.io/v1",
 			},
 			ObjectMeta: metav1.ObjectMeta{
 				Name:            resourceClaim.Name,
 				Namespace:       resourceClaim.Namespace,
+				UID:             types.UID(types.NamespacedName{Namespace: resourceClaim.Namespace, Name: resourceClaim.Name}.String()),
 				ResourceVersion: "0",
 				Labels:          resourceClaim.Labels,
 			},
🤖 Prompt for AI Agents
In `@pkg/scheduler/test_utils/test_utils_builder.go` around lines 341 - 376,
getResourceClaims creates resourceapi.ResourceClaim objects with the wrong
APIVersion and no UID, causing collisions in podsToClaimsMap; update
TypeMeta.APIVersion to "resource.k8s.io/v1" and populate ObjectMeta.UID from the
source test claim (e.g., use resourceClaim.UID or generate a stable, non-empty
UID per test claim) so each created resourceapi.ResourceClaim has unique
metadata. Ensure ObjectMeta.UID is set on resourceClaimObject before appending
to objects and keep TypeMeta.Kind as "ResourceClaim".

Comment on lines +25 to +26
// CalcClaimsToPodsBaseMap calculates a pod to claims map. The first key in the map is the pod UID, and the second key in the map is the claim UID.
// The relationships are calculated based on the resourceclaim owner references and allocation reservedFor references. This might not be enough to identify all the claims that are related to a pod.
Collaborator

NIT: is there a way to simplify this explanation? Maybe an example of the map instead?
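
For illustration, the map shape being described might look like the sketch below; the value type of the inner map is an assumption here, and the claim objects are placeholders:

	// Hypothetical contents: the outer key is the pod UID, the inner key is the claim UID.
	claimA := &resourceapi.ResourceClaim{ObjectMeta: metav1.ObjectMeta{Name: "claim-a", UID: "claim-uid-a"}}
	claimB := &resourceapi.ResourceClaim{ObjectMeta: metav1.ObjectMeta{Name: "claim-b", UID: "claim-uid-b"}}
	podsToClaims := map[types.UID]map[types.UID]*resourceapi.ResourceClaim{
		"pod-uid-1": {
			"claim-uid-a": claimA, // related via the claim's owner reference to the pod
			"claim-uid-b": claimB, // related via the allocation's ReservedFor entry
		},
	}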

			mnr.buildUnschedulableMessage(podInfo, string(rName), float64(maxVal), units))
		}
	}
	// TODO: check if any of the resource slices good for the node can satisfy the pod's claim requests (device count for the device class)
Collaborator

Maybe a reference in the CHANGELOG?
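
As a rough sketch of what the TODO above might eventually check, assuming the slices passed in are already the ones published for the node in question; deviceMatchesClass is a hypothetical helper (matching a device against a DeviceClass in DRA generally means evaluating the class's selectors), and none of this is part of the PR:

	// nodeSlicesCanSatisfy reports whether the node's ResourceSlices expose at least
	// `requested` devices accepted for the given device class. Illustrative only.
	func nodeSlicesCanSatisfy(nodeSlices []*resourceapi.ResourceSlice, deviceClassName string, requested int64) bool {
		var available int64
		for _, slice := range nodeSlices {
			for _, device := range slice.Spec.Devices {
				if deviceMatchesClass(device, deviceClassName) { // hypothetical helper
					available++
				}
			}
		}
		return available >= requested
	}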


	// If the claim info has already been allocated in the past (the deallocation was virtual), recover previous allocation data
	allocatedFromMemory := false
	if claimAllocationInfo, ok := task.ResourceClaimInfo[podClaim.Name]; ok && claimAllocationInfo.Allocation != nil {
Collaborator

QUESTION: Do we want to somehow tag a previously-allocated resourceClaim?
Maybe represent it in a snapshot?
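
Purely as an illustration of what such tagging could look like (names are hypothetical and not part of this PR):

	// recoveredClaimAllocation is a hypothetical wrapper showing one way a snapshot
	// could record that an allocation was restored from scheduler memory rather than
	// freshly computed.
	type recoveredClaimAllocation struct {
		Claim               *schedulingv1alpha2.ResourceClaimAllocation
		RecoveredFromMemory bool // true when the allocation was recovered after a virtual deallocation
	}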
