feat(slurm) add support for dynamic nodes #202
base: main
9b2e19a to 3af1432
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #202      +/-   ##
==========================================
+ Coverage   65.55%   65.83%   +0.27%
==========================================
  Files          81       81
  Lines        4448     4484      +36
==========================================
+ Hits         2916     2952      +36
  Misses       1419     1419
  Partials      113      113
3 files reviewed, 3 comments
3af1432 to 7c96aa7
2 files reviewed, 2 comments
Additional Comments (1)
Example: If a block has 0 static and 5 dynamic nodes, minDomainSize would be 5 but should be 0 for the static allocation.
Signed-off-by: Dmitry Shmulevich <dshmulevich@nvidia.com>
7c96aa7 to d3424c9
Greptile Overview
Greptile Summary
This PR adds support for dynamic nodes in the Slurm topology configuration, allowing nodes to be marked as dynamically provisioned and excluded from the static topology output.
Key Changes:
Critical Issue:
Missing Test Coverage:
Confidence Score: 2/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant SlurmEngine
participant GetTranslateConfig
participant NetworkTopology
participant toBlockTopology
participant splitNodes
User->>SlurmEngine: GenerateOutput(tree, params)
Note over SlurmEngine: params include DynamicNodes & MinBlocks
SlurmEngine->>GetTranslateConfig: GetTranslateConfig(BaseParams)
GetTranslateConfig->>GetTranslateConfig: Create translate.Config
Note over GetTranslateConfig: Copy DynamicNodes, MinBlocks to Config
GetTranslateConfig-->>SlurmEngine: Config with DynamicNodes & MinBlocks
SlurmEngine->>NetworkTopology: NewNetworkTopology(root, cfg)
NetworkTopology-->>SlurmEngine: NetworkTopology instance
SlurmEngine->>NetworkTopology: Generate(writer)
NetworkTopology->>toBlockTopology: toBlockTopology(writer)
Note over toBlockTopology: Convert DynamicNodes list to map
toBlockTopology->>toBlockTopology: nodeList2map(config.DynamicNodes)
loop For each block
toBlockTopology->>splitNodes: splitNodes(block.nodes, dynamicNodeMap)
Note over splitNodes: Separate nodes into static & dynamic
splitNodes-->>toBlockTopology: static string, dynamic string
alt Fake nodes enabled & block too small
Note over toBlockTopology: BUG: Uses len(bInfo.nodes) - total count!
toBlockTopology->>toBlockTopology: Calculate fake nodes needed
Note over toBlockTopology: Should use static node count only
end
toBlockTopology->>toBlockTopology: Write block line
Note over toBlockTopology: Nodes=<static> # dynamic=<dynamic>
end
alt MinBlocks > block count
loop Generate extra empty blocks
toBlockTopology->>toBlockTopology: Write extraBlock with Nodes=
end
end
toBlockTopology-->>NetworkTopology: Block topology output
NetworkTopology-->>SlurmEngine: Generated config
SlurmEngine-->>User: Topology configuration
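The split step in the diagram can be illustrated with a minimal Go sketch. Everything here is an assumption for illustration: `nodeListToSet`, `splitStaticDynamic`, and `blockLine` are hypothetical stand-ins for the PR's `nodeList2map`, `splitNodes`, and block-writing code, and their signatures are guessed. The sketch joins node names with plain commas instead of compacting them with `cluset.Compact`, and it assumes the `# dynamic=` comment is omitted when a block has no dynamic nodes.

```go
package main

import (
	"fmt"
	"strings"
)

// nodeListToSet turns a list of node names into a set for O(1) lookups.
// Hypothetical stand-in for the PR's nodeList2map.
func nodeListToSet(nodes []string) map[string]struct{} {
	set := make(map[string]struct{}, len(nodes))
	for _, n := range nodes {
		set[n] = struct{}{}
	}
	return set
}

// splitStaticDynamic partitions nodes into static and dynamic slices,
// preserving the input order. Hypothetical stand-in for splitNodes.
func splitStaticDynamic(nodes []string, dynamic map[string]struct{}) (static, dyn []string) {
	for _, n := range nodes {
		if _, ok := dynamic[n]; ok {
			dyn = append(dyn, n)
		} else {
			static = append(static, n)
		}
	}
	return static, dyn
}

// blockLine renders one block entry: static nodes go into Nodes=,
// dynamic nodes are kept only in a trailing comment (assumed format).
func blockLine(name string, static, dyn []string) string {
	line := fmt.Sprintf("BlockName=%s Nodes=%s", name, strings.Join(static, ","))
	if len(dyn) > 0 {
		line += " # dynamic=" + strings.Join(dyn, ",")
	}
	return line
}

func main() {
	nodes := []string{"n1", "n2", "n3"}
	dynamic := nodeListToSet([]string{"n2", "n3"})
	static, dyn := splitStaticDynamic(nodes, dynamic)
	fmt.Println(blockLine("block1", static, dyn))
}
```

Returning slices rather than pre-joined strings keeps the static node count available to later steps, such as the fake-node padding shown in the diagram.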
1 file reviewed, 1 comment
    outputNodeNames := strings.Join(cluset.Compact(bInfo.nodes), ",")
    static, dynamic := splitNodes(bInfo.nodes, dynamicNodeMap)
    if fnc != nil && len(bInfo.nodes) < fnc.baseBlockSize {
Uses total node count len(bInfo.nodes) which includes both static and dynamic nodes. When dynamic nodes are present, this incorrectly calculates the number of fake nodes needed.
The bug: After splitting on line 107, bInfo.nodes still contains all nodes (static + dynamic), but only static nodes are output to the Nodes= field. This causes the fake node calculation to be wrong.
Example scenario:
- Block has 1 static node + 2 dynamic nodes (total=3)
- baseBlockSize = 4
- Current code: adds 4-3=1 fake node → static section will have 2 nodes (wrong!)
- Expected: should add 4-1=3 fake nodes → static section needs 4 nodes
The fake nodes should pad only the static nodes, not the total. Need to count static nodes after splitting, before calculating fake nodes needed.
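The arithmetic in this comment can be checked with a small Go sketch. It is only an illustration under stated assumptions: `fakeNodesNeeded` is a hypothetical helper, not a function from the PR, and it takes plain counts instead of the PR's `bInfo` and `fnc` values.

```go
package main

import "fmt"

// fakeNodesNeeded returns how many placeholder nodes are required to pad
// a block up to baseBlockSize, given the number of nodes already counted.
// Hypothetical helper; the PR computes this inline.
func fakeNodesNeeded(count, baseBlockSize int) int {
	if count >= baseBlockSize {
		return 0
	}
	return baseBlockSize - count
}

func main() {
	// Scenario from the review comment: 1 static + 2 dynamic nodes.
	staticCount, dynamicCount, baseBlockSize := 1, 2, 4

	// Counting all nodes under-pads the static section.
	fmt.Println("padding from total count: ", fakeNodesNeeded(staticCount+dynamicCount, baseBlockSize))

	// Counting only static nodes fills the static section to baseBlockSize.
	fmt.Println("padding from static count:", fakeNodesNeeded(staticCount, baseBlockSize))
}
```

With the values above, padding computed from the total count leaves the static section at 1+1=2 nodes, while padding computed from the static count brings it to 1+3=4, matching `baseBlockSize`.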
No description provided.