feat: Track pod count as a scalar resource #969
Conversation
Alternative approach to fixing the max pods predicate with releasing pods. Instead of tracking the allocated pod count separately, treat pods as a scalar resource, similar to CPU and memory.

Changes:
- Add pods to node allocatable resources (v1.ResourcePods)
- Add pods to task resource requirements (1 pod per task, 2 for shared GPU)
- Remove the explicit max pods check from predicates (resource accounting handles it)
- Update test helpers to include pods in node resources
- Add a test demonstrating that releasing pods don't block preemption

This approach naturally handles releasing pods because they are tracked in the Releasing resources, making their pod count available for new allocations.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
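A minimal Go sketch of the idea, assuming a simplified internal resource struct. The `nodeResources` fields and the `taskPodCount` helper below are illustrative placeholders, not the scheduler's actual API; only the `v1.ResourcePods` accessors come from the Kubernetes client libraries.

```go
package example

import (
	v1 "k8s.io/api/core/v1"
)

// nodeResources is a hypothetical stand-in for the scheduler's internal
// node resource struct; the field names are illustrative only.
type nodeResources struct {
	MilliCPU float64
	Memory   float64
	Pods     float64
}

// fromAllocatable folds the node's pod capacity (v1.ResourcePods) into the
// same scalar accounting used for CPU and memory, so max-pods is enforced
// by ordinary resource arithmetic instead of a separate predicate check.
func fromAllocatable(node *v1.Node) nodeResources {
	alloc := node.Status.Allocatable
	return nodeResources{
		MilliCPU: float64(alloc.Cpu().MilliValue()),
		Memory:   float64(alloc.Memory().Value()),
		Pods:     float64(alloc.Pods().Value()),
	}
}

// taskPodCount mirrors the rule from this commit: every task consumes one
// pod slot, and a shared-GPU task also accounts for the GPU group
// reservation pod (two in total).
func taskPodCount(isSharedGPU bool) float64 {
	if isSharedGPU {
		return 2
	}
	return 1
}
```

With pod count folded into the same arithmetic as CPU and memory, releasing pods return their slot through the Releasing resources automatically, so no separate max-pods bookkeeping is required.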
Implement predicate-based pod counting that accurately accounts for GPU group reservation pods. Previously, all shared GPU tasks added +1 to the pod count, but the correct behavior is:
- First fractional pod on a GPU: +2 pods (task + reservation)
- Additional pods on the same GPU: +1 pod (task only)

Changes:
- Remove the blanket +1 pod count from pod_info ResReq
- Add a checkMaxPodsWithGpuGroupReservation predicate that:
  * Determines whether the task creates a new GPU group, using the allocation logic
  * Counts reservation pods being freed when GPU groups fully release
  * Only adds the reservation pod count when a new group is created
- Export GPU sharing functions for predicate access
- Store the session in the predicates plugin for access to the GPU allocation logic

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
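A sketch of how such a predicate could be structured. The `taskInfo` and `nodeInfo` structs and their fields are assumptions made for illustration; only the predicate name comes from this PR.

```go
package example

import "fmt"

// taskInfo and nodeInfo are illustrative placeholders for the scheduler's
// internal structs; their fields are assumptions, not the project's API.
type taskInfo struct {
	IsSharedGPU        bool
	CreatesNewGpuGroup bool // would allocating this task open a new GPU group?
}

type nodeInfo struct {
	Name            string
	UsedPods        int64
	AllocatablePods int64
	// Reservation pods freed by GPU groups that fully release on this node.
	ReleasingReservationPods int64
}

// checkMaxPodsWithGpuGroupReservation sketches the predicate described in
// the commit message: +1 pod for the task itself, +1 more only when a new
// GPU group (and its reservation pod) would be created, minus reservation
// pods freed by fully releasing groups.
func checkMaxPodsWithGpuGroupReservation(task taskInfo, node nodeInfo) error {
	required := int64(1)
	if task.IsSharedGPU && task.CreatesNewGpuGroup {
		required++
	}
	projected := node.UsedPods + required - node.ReleasingReservationPods
	if projected > node.AllocatablePods {
		return fmt.Errorf("node %s: projected pod count %d exceeds allocatable %d",
			node.Name, projected, node.AllocatablePods)
	}
	return nil
}
```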
Add comprehensive e2e tests to validate accurate pod counting with GPU sharing and max pods limits:

1. Simple case: verify preemption works on a node at max pods
2. Fractions allocation: verify a fraction pod cannot allocate on a node at maxPods-1 (it would need maxPods+1 with the reservation pod)
3. Proper reservation calculation: verify a fraction can preempt another fraction and reuse the same GPU group reservation pod

These tests dynamically read the node's max pod capacity instead of hardcoding values, making them portable across different cluster configurations.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
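For the dynamic max-pods lookup, a minimal sketch using the standard client-go API; the `nodeMaxPods` helper name is hypothetical and not necessarily how the test suite is organized.

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nodeMaxPods reads the pod capacity from the node's allocatable resources
// instead of hardcoding a value, so the test adapts to whatever --max-pods
// the cluster's kubelets were started with.
func nodeMaxPods(ctx context.Context, client kubernetes.Interface, nodeName string) (int64, error) {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return 0, err
	}
	return node.Status.Allocatable.Pods().Value(), nil
}
```

A test can then fill the node to maxPods-1 or maxPods filler pods and assert the expected allocation or preemption behavior without depending on a cluster-specific limit.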
Description
This PR fixes max pod tracking in scheduling simulations by treating the pod count as a scalar resource, similar to CPU and memory.
Changes:
Checklist
Breaking Changes
Additional Notes