-
Notifications
You must be signed in to change notification settings - Fork 16
Add cbatch gpus-per-node #292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. WalkthroughAdds Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant U as User
participant CLI as cbatch/calloc/crun
participant G as ParseGres (parser)
participant Util as util.ParseGpusPerNodeStr
participant Task as task.ReqResources
U->>CLI: invoke CLI with flags (--gres?, --gpus-per-node?)
CLI->>G: parse --gres (if present) → gresMap
G-->>CLI: gresMap
CLI->>CLI: setGresGpusFlag = has(gresMap,"gpu")
alt --gpus-per-node provided
CLI->>CLI: if setGresGpusFlag true → conflict
alt conflict
CLI-->>U: Error — Cannot specify both --gres gpus and --gpus-per-node
else no conflict
CLI->>Util: ParseGpusPerNodeStr(flag)
Util-->>CLI: DeviceMap or parse error
alt parse ok
CLI->>Task: task.ReqResources.DeviceMap = parsed DeviceMap
else parse error
CLI-->>U: Invalid --gpus-per-node (detailed error)
end
end
else only --gres or none
CLI->>Task: task.ReqResources.DeviceMap = gresMap (if gpu present) or unchanged
end
note right of CLI: cbatch: task.Exclusive assigned later in shared path
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
internal/util/string.go (1)
758-758: Improve error message clarity.The error message "mixed type: number and number format is not allowed" is confusing and doesn't clearly explain what formats cannot be mixed.
- return nil, fmt.Errorf("mixed type: number and number format is not allowed") + return nil, fmt.Errorf("cannot mix simple number format with type:count format")- return nil, fmt.Errorf("mixed type: number and number format is not allowed") + return nil, fmt.Errorf("cannot mix type:count format with simple number format")Also applies to: 772-772
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
internal/cbatch/CmdArgParser.go(2 hunks)internal/cbatch/cbatch.go(2 hunks)internal/util/string.go(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
internal/cbatch/cbatch.go (2)
internal/util/string.go (1)
ParseGpusPerNodeStr(739-788)internal/cbatch/CmdArgParser.go (1)
FlagGpusPerNode(52-52)
🔇 Additional comments (3)
internal/util/string.go (1)
739-788: Well-implemented GPU parsing function.The function correctly handles both simple numeric and type-specific GPU formats with proper validation and error handling. The logic for enforcing mutual exclusion between the two formats is sound.
internal/cbatch/CmdArgParser.go (1)
52-52: Clean flag implementation following established patterns.The new
--gpus-per-nodeflag is properly declared and registered with clear usage documentation and examples.Also applies to: 167-167
internal/cbatch/cbatch.go (1)
86-92: Consistent argument processing implementation.The
--gpus-per-nodeargument processing follows the established patterns and correctly integrates with the existing resource management system. Error handling and logging are consistent with other arguments.Note that both
--gresand--gpus-per-nodesettask.ReqResources.DeviceMap, meaning the last one processed will override the other. Verify this is the intended behavior:#!/bin/bash # Description: Check if there are any validation rules preventing conflicting GPU specifications # Search for validation logic that might prevent using both --gres and --gpus-per-node rg -A 10 -B 5 "DeviceMap.*gres|gres.*DeviceMap" # Look for any documentation about the interaction between these flags fd -e md -e txt | xargs rg -l "gres|gpus-per-node" | head -5 | xargs catAlso applies to: 200-207
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
internal/calloc/CmdArgParser.go(2 hunks)internal/calloc/calloc.go(1 hunks)internal/cbatch/cbatch.go(2 hunks)internal/crun/CmdArgParser.go(2 hunks)internal/crun/crun.go(1 hunks)internal/util/string.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/cbatch/cbatch.go
- internal/util/string.go
🧰 Additional context used
🧬 Code Graph Analysis (2)
internal/crun/crun.go (3)
internal/crun/CmdArgParser.go (1)
FlagGpusPerNode(44-44)internal/util/string.go (1)
ParseGpusPerNodeStr(739-788)internal/util/err.go (2)
CraneError(44-47)ErrorCmdArg(37-37)
internal/calloc/calloc.go (3)
internal/calloc/CmdArgParser.go (1)
FlagGpusPerNode(43-43)internal/util/string.go (1)
ParseGpusPerNodeStr(739-788)internal/util/err.go (2)
CraneError(44-47)ErrorCmdArg(37-37)
🔇 Additional comments (2)
internal/crun/CmdArgParser.go (1)
44-44: LGTM: Clean flag declaration and registrationThe flag variable declaration and registration follow the established patterns in the codebase. The help text is descriptive and includes useful examples for users.
Also applies to: 111-111
internal/calloc/CmdArgParser.go (1)
43-43: LGTM: Consistent flag implementation across commandsThe flag declaration and registration are consistent with the
cruncommand, providing a uniform user experience. The implementation follows established patterns and includes clear help text.Also applies to: 100-100
c84eda8 to
e6f6a74
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds support for a new --gpus-per-node command-line flag to the batch, calloc, and crun commands, enabling users to specify GPU requirements per node with flexible formats. The implementation includes validation to prevent simultaneous use of conflicting GPU specification flags.
- Adds a new parsing function
ParseGpusPerNodeStrto handle the--gpus-per-nodeformat - Implements mutual exclusion validation between
--gres(when specifying GPUs) and--gpus-per-nodeflags - Integrates the new flag across all three command interfaces (batch, calloc, crun)
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/util/string.go | Adds ParseGpusPerNodeStr function to parse GPU specifications in the new format |
| internal/crun/crun.go | Integrates --gpus-per-node flag with validation against conflicting --gres usage |
| internal/crun/CmdArgParser.go | Adds command-line flag definition for --gpus-per-node in crun |
| internal/cbatch/cbatch.go | Implements --gpus-per-node support in batch processing with conflict detection |
| internal/cbatch/CmdArgParser.go | Adds command-line flag definition for --gpus-per-node in cbatch |
| internal/calloc/calloc.go | Adds --gpus-per-node flag support with validation in calloc |
| internal/calloc/CmdArgParser.go | Adds command-line flag definition for --gpus-per-node in calloc |
internal/util/string.go
Outdated
| if mode == "" { | ||
| mode = "type" | ||
| } else if mode != "type" { | ||
| return nil, fmt.Errorf("mixed type: number and number format is not allowed") |
Copilot
AI
Jul 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message "mixed type: number and number format is not allowed" is confusing. It should be clearer, such as "cannot mix simple numeric format with type:count format"
| return nil, fmt.Errorf("mixed type: number and number format is not allowed") | |
| return nil, fmt.Errorf("cannot mix simple numeric format with type:count format") |
| Code: util.ErrorCmdArg, | ||
| Message: fmt.Sprintf("Invalid argument: invalid --gpus-per-node: %s", err), | ||
| } | ||
|
|
Copilot
AI
Jul 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] There's an extra blank line before the closing brace. This inconsistent spacing affects code readability.
| Code: util.ErrorCmdArg, | ||
| Message: fmt.Sprintf("Invalid argument: invalid --gpus-per-node: %s", err), | ||
| } | ||
|
|
Copilot
AI
Jul 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] There's an extra blank line before the closing brace. This inconsistent spacing affects code readability and should match the formatting in other similar blocks.
| if FlagExclusive { | ||
| task.Exclusive = true | ||
| } | ||
| if FlagOpenMode != "" { |
Copilot
AI
Jul 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The removal of the FlagExclusive handling code appears to be unintentional. The exclusive flag functionality has been deleted without apparent justification, which could break existing functionality.
| if FlagComment != "" { | ||
| structExtraFromCli.Comment = FlagComment | ||
| } | ||
| if FlagExclusive { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这为什么删了
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
之前rebase导致多了一个if FlagExclusive判断,修改这个需求发现的,所以直接删除了
internal/util/string.go
Outdated
| } | ||
| gpusPerNodeStrList := strings.Split(gpuPerNodeStr, ",") | ||
|
|
||
| mode := "" // "digit" or "type" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个mode有什么用?好像完全没必要限制用户只能用一种方式?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
那也没必要用个string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
已修改,改成enum
e6f6a74 to
076ead0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (4)
internal/crun/cmd.go (1)
117-117: Fix help text: clarify semantics and casing.The flag name is “gpus-per-node” but the help says “per task” and uses inconsistent casing.
Apply this diff:
-RootCmd.Flags().StringVar(&FlagGpusPerNode, "gpus-per-node", "", "Gpus-per-node required per task, format: [type:]<number>[,[type:]<number>...]. eg: \"4\" or \"a100:1,volta:1\"") +RootCmd.Flags().StringVar(&FlagGpusPerNode, "gpus-per-node", "", "GPUs per node; format: [type:]<number>[,[type:]<number>...], e.g., \"4\" or \"a100:1,volta:1\"")internal/calloc/cmd.go (1)
103-103: Fix help text: clarify semantics and casing.Match crun/cbatch wording.
Apply this diff:
-RootCmd.Flags().StringVar(&FlagGpusPerNode, "gpus-per-node", "", "Gpus-per-node required per task, format: [type:]<number>[,[type:]<number>...]. eg: \"4\" or \"a100:1,volta:1\"") +RootCmd.Flags().StringVar(&FlagGpusPerNode, "gpus-per-node", "", "GPUs per node; format: [type:]<number>[,[type:]<number>...], e.g., \"4\" or \"a100:1,volta:1\"")internal/cbatch/cmd.go (1)
175-175: Fix help text: clarify semantics and casing.Keep wording consistent across commands.
Apply this diff:
-RootCmd.Flags().StringVar(&FlagGpusPerNode, "gpus-per-node", "", "gpus-per-node required per task, format: [type:]<number>[,[type:]<number>...]. eg: \"4\" or \"a100:1,volta:1\"") +RootCmd.Flags().StringVar(&FlagGpusPerNode, "gpus-per-node", "", "GPUs per node; format: [type:]<number>[,[type:]<number>...], e.g., \"4\" or \"a100:1,volta:1\"")internal/cbatch/cbatch.go (1)
96-108: Optional: centralize map merge to avoid duplication.Consider a small helper (e.g., util.MergeDeviceMaps(dst, src) with overwrite semantics per resource name) and reuse in both blocks.
Also applies to: 228-240
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
internal/calloc/calloc.go(2 hunks)internal/calloc/cmd.go(2 hunks)internal/cbatch/cbatch.go(3 hunks)internal/cbatch/cmd.go(2 hunks)internal/crun/cmd.go(2 hunks)internal/crun/crun.go(2 hunks)internal/util/string.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- internal/calloc/calloc.go
- internal/util/string.go
- internal/crun/crun.go
🧰 Additional context used
🧬 Code graph analysis (4)
internal/calloc/cmd.go (2)
internal/cbatch/cmd.go (1)
FlagGpusPerNode(52-52)internal/crun/cmd.go (1)
FlagGpusPerNode(46-46)
internal/cbatch/cmd.go (2)
internal/calloc/cmd.go (1)
FlagGpusPerNode(43-43)internal/crun/cmd.go (1)
FlagGpusPerNode(46-46)
internal/crun/cmd.go (2)
internal/calloc/cmd.go (1)
FlagGpusPerNode(43-43)internal/cbatch/cmd.go (1)
FlagGpusPerNode(52-52)
internal/cbatch/cbatch.go (4)
internal/util/string.go (2)
ParseGpusPerNodeStr(761-810)ParseGres(714-759)internal/calloc/cmd.go (2)
FlagGres(37-37)FlagGpusPerNode(43-43)internal/cbatch/cmd.go (2)
FlagGres(45-45)FlagGpusPerNode(52-52)internal/crun/cmd.go (2)
FlagGres(41-41)FlagGpusPerNode(46-46)
🪛 GitHub Actions: Go Code Quality Check
internal/crun/cmd.go
[error] 1-1: gofmt formatting check failed. The file 'internal/crun/cmd.go' is not properly formatted. Please run 'gofmt -w .' to fix.
🔇 Additional comments (7)
internal/crun/cmd.go (2)
46-46: LGTM on flag var exposure.New FlagGpusPerNode aligns with existing flag vars.
1-119: gofmt failure — verification unavailable in sandboxgofmt is not installed in the verification environment ("gofmt: command not found"). Run locally: gofmt -s -w internal/crun/cmd.go, commit the formatted file, and push.
internal/calloc/cmd.go (1)
43-43: LGTM on flag var exposure.internal/cbatch/cmd.go (1)
52-52: LGTM on flag var exposure.internal/cbatch/cbatch.go (3)
67-69: LGTM on mutual‑exclusion guards introduction.setGresGpusFlag and setGpusPerNodeFlag make the constraint explicit.
88-95: LGTM on detecting GPU in --gres for exclusion logic.
218-227: LGTM on CLI --gres handling and guard.
| case "--gpus-per-node": | ||
| if setGresGpusFlag { | ||
| log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously") | ||
| return false, nil | ||
| } | ||
| setGpusPerNodeFlag = true | ||
| gpuDeviceMap, err := util.ParseGpusPerNodeStr(arg.val) | ||
| if err != nil { | ||
| log.Errorf("Invalid argument: %v in script: %v", arg.name, err) | ||
| return false, nil | ||
| } | ||
| task.ReqResources.DeviceMap = gpuDeviceMap | ||
| case "--ntasks-per-node": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don’t clobber non‑GPU devices when --gpus-per-node is used after --gres.
Assigning gpuDeviceMap overwrites any non‑GPU entries previously set via --gres (e.g., nic:1). Merge instead, overwriting only the GPU entry.
Apply this diff:
- setGpusPerNodeFlag = true
- gpuDeviceMap, err := util.ParseGpusPerNodeStr(arg.val)
- if err != nil {
- log.Errorf("Invalid argument: %v in script: %v", arg.name, err)
- return false, nil
- }
- task.ReqResources.DeviceMap = gpuDeviceMap
+ setGpusPerNodeFlag = true
+ gpuDeviceMap, err := util.ParseGpusPerNodeStr(arg.val)
+ if err != nil {
+ log.Errorf("Invalid argument: %v in script: %v", arg.name, err)
+ return false, nil
+ }
+ if task.ReqResources.DeviceMap == nil || task.ReqResources.DeviceMap.NameTypeMap == nil || len(task.ReqResources.DeviceMap.NameTypeMap) == 0 {
+ task.ReqResources.DeviceMap = gpuDeviceMap
+ } else {
+ // Overwrite/insert GPU mapping while preserving non-GPU devices from --gres.
+ for name, tcm := range gpuDeviceMap.NameTypeMap {
+ task.ReqResources.DeviceMap.NameTypeMap[name] = tcm
+ }
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| case "--gpus-per-node": | |
| if setGresGpusFlag { | |
| log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously") | |
| return false, nil | |
| } | |
| setGpusPerNodeFlag = true | |
| gpuDeviceMap, err := util.ParseGpusPerNodeStr(arg.val) | |
| if err != nil { | |
| log.Errorf("Invalid argument: %v in script: %v", arg.name, err) | |
| return false, nil | |
| } | |
| task.ReqResources.DeviceMap = gpuDeviceMap | |
| case "--ntasks-per-node": | |
| case "--gpus-per-node": | |
| if setGresGpusFlag { | |
| log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously") | |
| return false, nil | |
| } | |
| setGpusPerNodeFlag = true | |
| gpuDeviceMap, err := util.ParseGpusPerNodeStr(arg.val) | |
| if err != nil { | |
| log.Errorf("Invalid argument: %v in script: %v", arg.name, err) | |
| return false, nil | |
| } | |
| if task.ReqResources.DeviceMap == nil || task.ReqResources.DeviceMap.NameTypeMap == nil || len(task.ReqResources.DeviceMap.NameTypeMap) == 0 { | |
| task.ReqResources.DeviceMap = gpuDeviceMap | |
| } else { | |
| // Overwrite/insert GPU mapping while preserving non-GPU devices from --gres. | |
| for name, tcm := range gpuDeviceMap.NameTypeMap { | |
| task.ReqResources.DeviceMap.NameTypeMap[name] = tcm | |
| } | |
| } | |
| case "--ntasks-per-node": |
🤖 Prompt for AI Agents
In internal/cbatch/cbatch.go around lines 96 to 108, assigning gpuDeviceMap
directly to task.ReqResources.DeviceMap clobbers any non-GPU device entries
previously set via --gres; instead merge gpuDeviceMap into the existing
DeviceMap so only the GPU entry is replaced. Implement: if
task.ReqResources.DeviceMap is nil, set it to a new map; otherwise iterate
gpuDeviceMap and set each key/value into task.ReqResources.DeviceMap (this will
overwrite the "gpu" entry but preserve other device keys like "nic"). Keep error
handling as-is and only replace the assignment with this merge logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
internal/util/string.go (3)
807-812: Duplicate type handling: overwrite vs. accumulate.Currently the last occurrence overwrites prior counts for the same GPU type. If users may repeat a type, consider summing (with overflow guard) or explicitly error on duplicates to avoid silent surprises.
Option A (accumulate counts):
- typeCountMap.TypeCountMap[gpuType] = val + if prev, ok := typeCountMap.TypeCountMap[gpuType]; ok { + if prev > math.MaxUint64-val { + return nil, fmt.Errorf("count overflow for type %q", gpuType) + } + val += prev + } + typeCountMap.TypeCountMap[gpuType] = valOption B (reject duplicates):
- typeCountMap.TypeCountMap[gpuType] = val + if _, exists := typeCountMap.TypeCountMap[gpuType]; exists { + return nil, fmt.Errorf("duplicate gpu type %q in --gpus-per-node", gpuType) + } + typeCountMap.TypeCountMap[gpuType] = val
761-763: Add GoDoc to lock in the accepted formats.A brief comment helps future readers and keeps CLI/docs consistent.
+// ParseGpusPerNodeStr parses the --gpus-per-node flag. +// Accepted forms (mutually exclusive): +// - "N" (simple count, must be the only part) +// - "type:count[,type:count...]" +// Returns a DeviceMap with NameTypeMap["gpu"] populated. func ParseGpusPerNodeStr(gpuPerNodeStr string) (*protos.DeviceMap, error) {
761-819: Add table-driven tests for edge cases.Recommend tests for: "", "2", "a100:2", "a100:2,l40:1", "2,a100:1" (mixing), "a100:" (empty count), ":" (empty type and count), "a100:0", " , ", "a100:2,a100:3" (duplicate type), "a100:1:2" (too many colons).
I can draft a _test.go with these cases if helpful.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
internal/crun/cmd.go(2 hunks)internal/util/string.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/crun/cmd.go
🔇 Additional comments (1)
internal/util/string.go (1)
761-819: Solid parser; aligns with Slurm-style rules.Covers empty input, format exclusivity, index constraints, and numeric validation. Looks good.
Please confirm downstream expects key "gpu" in NameTypeMap and that typed mode (TypeCountMap set, Total=0) matches existing ParseGres semantics.
L-Xiafeng
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
等CI恢复测试后合入
| if FlagGres != "" { | ||
| gresMap := util.ParseGres(FlagGres) | ||
| task.ReqResources.DeviceMap = gresMap | ||
| if _, exist := task.ReqResources.DeviceMap.NameTypeMap["gpu"]; exist { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"gpu"也单独写个常量吧
| } | ||
| if FlagGpusPerNode != "" { | ||
| if setGresGpusFlag { | ||
| return &util.CraneError{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
有个这样的函数NewCraneErr,统一下不手动构造这个 err
4d3d65f to
089c061
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/crun/crun.go (1)
1181-1199: Critical: Syntax error and undefined variable cause build failure.The code has multiple issues causing the pipeline to fail:
- Line 1186 references
task.ReqResources.DeviceMapbut the variable is namedjobin this context- The
ifblock starting at line 1186 is missing its closing brace and body- The logic structure is incomplete - the
setGresGpusFlag = trueassignment at line 1187 appears orphaned🔎 Proposed fix
setGresGpusFlag := false if FlagGres != "" { gresMap := util.ParseGres(FlagGres) if jobMode { job.ReqResources.DeviceMap = gresMap - if _, exist := task.ReqResources.DeviceMap.NameTypeMap["gpu"]; exist { - setGresGpusFlag = true + if _, exist := gresMap.NameTypeMap["gpu"]; exist { + setGresGpusFlag = true + } } else { if len(gresMap.NameTypeMap) != 0 {
♻️ Duplicate comments (1)
internal/cbatch/cbatch.go (1)
127-127: Consider merging DeviceMap to preserve non-GPU devices.When
--gpus-per-nodeis used after--greswith non-GPU devices (e.g.,--gres nic:1), the current implementation overwrites the entireDeviceMap, losing the non-GPU device entries.🔎 Proposed merge logic
- task.ReqResources.DeviceMap = gpuDeviceMap + if task.ReqResources.DeviceMap == nil || len(task.ReqResources.DeviceMap.NameTypeMap) == 0 { + task.ReqResources.DeviceMap = gpuDeviceMap + } else { + // Merge GPU mapping while preserving non-GPU devices from --gres + for name, tcm := range gpuDeviceMap.NameTypeMap { + task.ReqResources.DeviceMap.NameTypeMap[name] = tcm + } + }This follows the suggestion from a past review comment to merge instead of overwrite.
Also applies to: 262-262
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
internal/calloc/calloc.gointernal/calloc/cmd.gointernal/cbatch/cbatch.gointernal/cbatch/cmd.gointernal/crun/cmd.gointernal/crun/crun.gointernal/util/string.go
🚧 Files skipped from review as they are similar to previous changes (3)
- internal/crun/cmd.go
- internal/calloc/cmd.go
- internal/calloc/calloc.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-04-24T16:05:47.156Z
Learnt from: NamelessOIer
Repo: PKUHPC/CraneSched-FrontEnd PR: 256
File: internal/cqueue/cqueue.go:350-356
Timestamp: 2025-04-24T16:05:47.156Z
Learning: In cqueue.go, the function ProcessAllocMemPerNode should use task.AllocatedResView.AllocatableRes.MemoryLimitBytes (allocated resources) rather than task.ReqResView.AllocatableRes.MemoryLimitBytes (requested resources) to correctly display allocated memory per node.
Applied to files:
internal/crun/crun.gointernal/cbatch/cbatch.go
🧬 Code graph analysis (2)
internal/crun/crun.go (3)
internal/crun/cmd.go (2)
FlagGres(45-45)FlagGpusPerNode(50-50)internal/util/string.go (2)
ParseGres(1058-1103)ParseGpusPerNodeStr(1105-1163)internal/util/err.go (2)
CraneError(43-47)ErrorCmdArg(36-36)
internal/cbatch/cmd.go (2)
internal/crun/cmd.go (1)
FlagGpusPerNode(50-50)internal/calloc/cmd.go (1)
FlagGpusPerNode(44-44)
🪛 GitHub Actions: Go Code Quality Check
internal/crun/crun.go
[error] 1426-1426: Build failed: expected '}', found 'EOF'.
internal/cbatch/cbatch.go
[error] 111-111: go vet ./...: cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement
🪛 GitHub Check: go-check
internal/cbatch/cbatch.go
[failure] 125-125:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement
[failure] 119-119:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement
[failure] 111-111:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement
[failure] 260-260:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement
[failure] 254-254:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement
[failure] 245-245:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement
🔇 Additional comments (3)
internal/cbatch/cmd.go (1)
53-53: LGTM!The new
FlagGpusPerNodevariable and its CLI flag registration are correctly implemented, following the established pattern used for other flags in this file and consistent with similar declarations ininternal/crun/cmd.goandinternal/calloc/cmd.go.Also applies to: 159-159
internal/util/string.go (1)
1105-1163: LGTM!The
ParseGpusPerNodeStrfunction is well-implemented with:
- Clear mode-based parsing to enforce format consistency (simple number vs type:count)
- Comprehensive validation with descriptive error messages
- Proper handling of edge cases (empty input, empty parts, mixed formats, invalid numbers)
- Consistent behavior with
ParseGresby treating zero values as invalidThe implementation addresses all previous review feedback regarding variable naming and error message clarity.
internal/cbatch/cbatch.go (1)
89-90: LGTM on mutual exclusion guards.The
setGresGpusFlagandsetGpusPerNodeFlagguards correctly implement mutual exclusion between--greswith GPU specifications and--gpus-per-node. The logic properly handles both script arguments and CLI flags.
| if _, exist := gresMap.NameTypeMap["gpu"]; exist { | ||
| if setGpusPerNodeFlag { | ||
| log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously") | ||
| return false, nil | ||
| } | ||
| setGresGpusFlag = true | ||
| } | ||
| task.ReqResources.DeviceMap = gresMap | ||
| case "--gpus-per-node": | ||
| if setGresGpusFlag { | ||
| log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously") | ||
| return false, nil | ||
| } | ||
| setGpusPerNodeFlag = true | ||
| gpuDeviceMap, err := util.ParseGpusPerNodeStr(arg.val) | ||
| if err != nil { | ||
| log.Errorf("Invalid argument: %v in script: %v", arg.name, err) | ||
| return false, nil | ||
| } | ||
| task.ReqResources.DeviceMap = gpuDeviceMap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: Incorrect return type causes build failure.
The function BuildCbatchJob returns (*protos.TaskToCtld, error), but lines 111, 119, and 125 use return false, nil which is a type mismatch. This causes the pipeline build failure.
🔎 Proposed fix
case "--gres":
gresMap := util.ParseGres(arg.val)
if _, exist := gresMap.NameTypeMap["gpu"]; exist {
if setGpusPerNodeFlag {
- log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
- return false, nil
+ return nil, fmt.Errorf("cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
}
setGresGpusFlag = true
}
task.ReqResources.DeviceMap = gresMap
case "--gpus-per-node":
if setGresGpusFlag {
- log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
- return false, nil
+ return nil, fmt.Errorf("cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
}
setGpusPerNodeFlag = true
gpuDeviceMap, err := util.ParseGpusPerNodeStr(arg.val)
if err != nil {
- log.Errorf("Invalid argument: %v in script: %v", arg.name, err)
- return false, nil
+ return nil, fmt.Errorf("invalid argument: %s value in script: %w", arg.name, err)
}
task.ReqResources.DeviceMap = gpuDeviceMap🧰 Tools
🪛 GitHub Actions: Go Code Quality Check
[error] 111-111: go vet ./...: cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement
🪛 GitHub Check: go-check
[failure] 125-125:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement
[failure] 119-119:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement
[failure] 111-111:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement
🤖 Prompt for AI Agents
In internal/cbatch/cbatch.go around lines 108 to 127, the error branches
currently return the wrong types (they use "return false, nil" while
BuildCbatchJob's signature is (*protos.TaskToCtld, error)); replace each "return
false, nil" with "return nil, fmt.Errorf(...)" (or errors.New(...)) using the
same error message text as logged (e.g., "cannot specify both --gres gpus and
--gpus-per-node flags simultaneously" and the invalid-argument formatting), and
add the required import for fmt or errors if not present.
| if cmd.Flags().Changed("gres") { | ||
| task.ReqResources.DeviceMap = util.ParseGres(FlagGres) | ||
| gresMap := util.ParseGres(FlagGres) | ||
| if _, exist := gresMap.NameTypeMap["gpu"]; exist { | ||
| if setGpusPerNodeFlag { | ||
| log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously") | ||
| return false, nil | ||
| } | ||
| setGresGpusFlag = true | ||
| } | ||
| task.ReqResources.DeviceMap = gresMap | ||
| } | ||
| if cmd.Flags().Changed("gpus-per-node") { | ||
| if setGresGpusFlag { | ||
| log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously") | ||
| return false, nil | ||
| } | ||
| setGpusPerNodeFlag = true | ||
| gpuDeviceMap, err := util.ParseGpusPerNodeStr(FlagGpusPerNode) | ||
| if err != nil { | ||
| log.Errorf("Invalid argument: %v", err) | ||
| return false, nil | ||
| } | ||
| task.ReqResources.DeviceMap = gpuDeviceMap | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: Incorrect return type in CLI flag handling section.
Same issue as the script parsing section - lines 245, 254, and 260 use return false, nil instead of the correct return type (*protos.TaskToCtld, error).
🔎 Proposed fix
if cmd.Flags().Changed("gres") {
gresMap := util.ParseGres(FlagGres)
if _, exist := gresMap.NameTypeMap["gpu"]; exist {
if setGpusPerNodeFlag {
- log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
- return false, nil
+ return nil, fmt.Errorf("cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
}
setGresGpusFlag = true
}
task.ReqResources.DeviceMap = gresMap
}
if cmd.Flags().Changed("gpus-per-node") {
if setGresGpusFlag {
- log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
- return false, nil
+ return nil, fmt.Errorf("cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
}
setGpusPerNodeFlag = true
gpuDeviceMap, err := util.ParseGpusPerNodeStr(FlagGpusPerNode)
if err != nil {
- log.Errorf("Invalid argument: %v", err)
- return false, nil
+ return nil, fmt.Errorf("invalid argument: invalid --gpus-per-node: %w", err)
}
task.ReqResources.DeviceMap = gpuDeviceMap
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if cmd.Flags().Changed("gres") { | |
| task.ReqResources.DeviceMap = util.ParseGres(FlagGres) | |
| gresMap := util.ParseGres(FlagGres) | |
| if _, exist := gresMap.NameTypeMap["gpu"]; exist { | |
| if setGpusPerNodeFlag { | |
| log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously") | |
| return false, nil | |
| } | |
| setGresGpusFlag = true | |
| } | |
| task.ReqResources.DeviceMap = gresMap | |
| } | |
| if cmd.Flags().Changed("gpus-per-node") { | |
| if setGresGpusFlag { | |
| log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously") | |
| return false, nil | |
| } | |
| setGpusPerNodeFlag = true | |
| gpuDeviceMap, err := util.ParseGpusPerNodeStr(FlagGpusPerNode) | |
| if err != nil { | |
| log.Errorf("Invalid argument: %v", err) | |
| return false, nil | |
| } | |
| task.ReqResources.DeviceMap = gpuDeviceMap | |
| } | |
| if cmd.Flags().Changed("gres") { | |
| gresMap := util.ParseGres(FlagGres) | |
| if _, exist := gresMap.NameTypeMap["gpu"]; exist { | |
| if setGpusPerNodeFlag { | |
| return nil, fmt.Errorf("cannot specify both --gres gpus and --gpus-per-node flags simultaneously") | |
| } | |
| setGresGpusFlag = true | |
| } | |
| task.ReqResources.DeviceMap = gresMap | |
| } | |
| if cmd.Flags().Changed("gpus-per-node") { | |
| if setGresGpusFlag { | |
| return nil, fmt.Errorf("cannot specify both --gres gpus and --gpus-per-node flags simultaneously") | |
| } | |
| setGpusPerNodeFlag = true | |
| gpuDeviceMap, err := util.ParseGpusPerNodeStr(FlagGpusPerNode) | |
| if err != nil { | |
| return nil, fmt.Errorf("invalid argument: invalid --gpus-per-node: %w", err) | |
| } | |
| task.ReqResources.DeviceMap = gpuDeviceMap | |
| } |
🧰 Tools
🪛 GitHub Check: go-check
[failure] 260-260:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement
[failure] 254-254:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement
[failure] 245-245:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement
🤖 Prompt for AI Agents
In internal/cbatch/cbatch.go around lines 240 to 263, the CLI flag handling
returns the wrong types (uses "return false, nil"); change those to the
function's actual return types: for the mutual-exclusion errors replace "return
false, nil" with "return nil, fmt.Errorf(\"Cannot specify both --gres gpus and
--gpus-per-node flags simultaneously\")" (or an errors.New with the same
message) and for the ParseGpusPerNodeStr error return "return nil, err" (or wrap
with fmt.Errorf) so the function returns (*protos.TaskToCtld, error); also add
the fmt import if you use fmt.Errorf.
| if FlagGpusPerNode != "" { | ||
| if setGresGpusFlag { | ||
| return &util.CraneError{ | ||
| Code: util.ErrorCmdArg, | ||
| Message: "Cannot specify both --gres gpus and --gpus-per-node flags simultaneously", | ||
| } | ||
| } | ||
| gpuDeviceMap, err := util.ParseGpusPerNodeStr(FlagGpusPerNode) | ||
| if err != nil { | ||
| return &util.CraneError{ | ||
| Code: util.ErrorCmdArg, | ||
| Message: fmt.Sprintf("Invalid argument: invalid --gpus-per-node: %s", err), | ||
| } | ||
|
|
||
| } | ||
| task.ReqResources.DeviceMap = gpuDeviceMap | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: Undefined variable task and missing step mode handling.
Line 1330 references task.ReqResources.DeviceMap but task is not defined in this scope. The variable should be job for job mode or step for step mode. Additionally, step mode handling is completely missing.
🔎 Proposed fix
if FlagGpusPerNode != "" {
if setGresGpusFlag {
- return &util.CraneError{
- Code: util.ErrorCmdArg,
- Message: "Cannot specify both --gres gpus and --gpus-per-node flags simultaneously",
- }
+ return util.NewCraneErr(util.ErrorCmdArg, "Cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
}
gpuDeviceMap, err := util.ParseGpusPerNodeStr(FlagGpusPerNode)
if err != nil {
- return &util.CraneError{
- Code: util.ErrorCmdArg,
- Message: fmt.Sprintf("Invalid argument: invalid --gpus-per-node: %s", err),
- }
-
+ return util.NewCraneErr(util.ErrorCmdArg, fmt.Sprintf("Invalid argument: invalid --gpus-per-node: %s", err))
+ }
+ if jobMode {
+ job.ReqResources.DeviceMap = gpuDeviceMap
+ } else {
+ if step.ReqResourcesPerTask == nil {
+ step.ReqResourcesPerTask = &protos.ResourceView{
+ DeviceMap: gpuDeviceMap,
+ }
+ } else {
+ step.ReqResourcesPerTask.DeviceMap = gpuDeviceMap
+ }
}
- task.ReqResources.DeviceMap = gpuDeviceMap
}Based on past review comments, the error construction should use util.NewCraneErr for consistency.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if FlagGpusPerNode != "" { | |
| if setGresGpusFlag { | |
| return &util.CraneError{ | |
| Code: util.ErrorCmdArg, | |
| Message: "Cannot specify both --gres gpus and --gpus-per-node flags simultaneously", | |
| } | |
| } | |
| gpuDeviceMap, err := util.ParseGpusPerNodeStr(FlagGpusPerNode) | |
| if err != nil { | |
| return &util.CraneError{ | |
| Code: util.ErrorCmdArg, | |
| Message: fmt.Sprintf("Invalid argument: invalid --gpus-per-node: %s", err), | |
| } | |
| } | |
| task.ReqResources.DeviceMap = gpuDeviceMap | |
| } | |
| if FlagGpusPerNode != "" { | |
| if setGresGpusFlag { | |
| return util.NewCraneErr(util.ErrorCmdArg, "Cannot specify both --gres gpus and --gpus-per-node flags simultaneously") | |
| } | |
| gpuDeviceMap, err := util.ParseGpusPerNodeStr(FlagGpusPerNode) | |
| if err != nil { | |
| return util.NewCraneErr(util.ErrorCmdArg, fmt.Sprintf("Invalid argument: invalid --gpus-per-node: %s", err)) | |
| } | |
| if jobMode { | |
| job.ReqResources.DeviceMap = gpuDeviceMap | |
| } else { | |
| if step.ReqResourcesPerTask == nil { | |
| step.ReqResourcesPerTask = &protos.ResourceView{ | |
| DeviceMap: gpuDeviceMap, | |
| } | |
| } else { | |
| step.ReqResourcesPerTask.DeviceMap = gpuDeviceMap | |
| } | |
| } | |
| } |
🤖 Prompt for AI Agents
In internal/crun/crun.go around lines 1315 to 1331, the code incorrectly
references an undefined variable `task` and omits step-mode handling; change the
assignment to set the DeviceMap on the correct object depending on current mode
(use `job.ReqResources.DeviceMap = gpuDeviceMap` when in job mode and
`step.ReqResources.DeviceMap = gpuDeviceMap` when in step mode), add the missing
branch to handle step mode, and replace the returned CraneError constructions
with util.NewCraneErr(...) to match project conventions (i.e., return
util.NewCraneErr(util.ErrorCmdArg, fmt.Sprintf(...)) for the parse error and the
conflicting-flag error).
Summary by CodeRabbit
New Features
Bug Fixes
✏️ Tip: You can customize this high-level summary in your review settings.