Skip to content

Conversation

@1daidai1
Copy link
Collaborator

@1daidai1 1daidai1 commented Jun 26, 2025

image
image
image

Summary by CodeRabbit

  • New Features

    • Added a --gpus-per-node flag to alloc, batch, and run commands to specify GPUs per node (single number or comma-separated type:count); values apply to per-task GPU requirements.
  • Bug Fixes

    • Enforced mutual exclusivity between --gres (GPU) and --gpus-per-node across all input sources.
    • Improved validation and clearer errors for malformed --gpus-per-node inputs (empty parts, mixed formats, invalid counts).

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jun 26, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds --gpus-per-node CLI flag across cbatch/calloc/crun, a new parser util.ParseGpusPerNodeStr, and enforces mutual exclusivity between --gres gpus and --gpus-per-node; parsed GPU DeviceMap is applied to task.ReqResources.DeviceMap. cbatch's Exclusive assignment moved to a later common point.

Changes

Cohort / File(s) Summary
CLI flag registration
internal/cbatch/cmd.go, internal/calloc/cmd.go, internal/crun/cmd.go
Add exported FlagGpusPerNode string and register the --gpus-per-node CLI flag with usage text.
GPU-per-node parser
internal/util/string.go
Add ParseGpusPerNodeStr(string) (*protos.DeviceMap, error) to parse either a single integer total or comma-separated type:count entries, validate input, and return a DeviceMap or detailed errors.
Command parsing & enforcement
internal/cbatch/cbatch.go, internal/calloc/calloc.go, internal/crun/crun.go
Introduce setGresGpusFlag / setGpusPerNodeFlag guards; detect GPU usage in parsed --gres; enforce mutual exclusion with --gpus-per-node (error on conflict); parse --gpus-per-node via util and assign resulting DeviceMap to task.ReqResources.DeviceMap; move task.Exclusive assignment to a later common location in cbatch. Update error messages for conflicts and parse failures.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User
  participant CLI as cbatch/calloc/crun
  participant G as ParseGres (parser)
  participant Util as util.ParseGpusPerNodeStr
  participant Task as task.ReqResources

  U->>CLI: invoke CLI with flags (--gres?, --gpus-per-node?)
  CLI->>G: parse --gres (if present) → gresMap
  G-->>CLI: gresMap
  CLI->>CLI: setGresGpusFlag = has(gresMap,"gpu")
  alt --gpus-per-node provided
    CLI->>CLI: if setGresGpusFlag true → conflict
    alt conflict
      CLI-->>U: Error — Cannot specify both --gres gpus and --gpus-per-node
    else no conflict
      CLI->>Util: ParseGpusPerNodeStr(flag)
      Util-->>CLI: DeviceMap or parse error
      alt parse ok
        CLI->>Task: task.ReqResources.DeviceMap = parsed DeviceMap
      else parse error
        CLI-->>U: Invalid --gpus-per-node (detailed error)
      end
    end
  else only --gres or none
    CLI->>Task: task.ReqResources.DeviceMap = gresMap (if gpu present) or unchanged
  end
  note right of CLI: cbatch: task.Exclusive assigned later in shared path
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • L-Xiafeng
  • NamelessOIer
  • Nativu5

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Add cbatch gpus-per-node' accurately describes the main change: introducing a new --gpus-per-node flag to the cbatch command across multiple files and enforcing mutual exclusivity with --gres.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch dev/cbatchgpus-pernode

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@1daidai1
Copy link
Collaborator Author

PKUHPC/CraneSched#524

@1daidai1 1daidai1 changed the title Dev/cbatchgpus pernode Add cbatch gpus-per-node Jun 26, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
internal/util/string.go (1)

758-758: Improve error message clarity.

The error message "mixed type: number and number format is not allowed" is confusing and doesn't clearly explain what formats cannot be mixed.

-				return nil, fmt.Errorf("mixed type: number and number format is not allowed")
+				return nil, fmt.Errorf("cannot mix simple number format with type:count format")
-				return nil, fmt.Errorf("mixed type: number and number format is not allowed")
+				return nil, fmt.Errorf("cannot mix type:count format with simple number format")

Also applies to: 772-772

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d5b4c3a and 4de275f.

📒 Files selected for processing (3)
  • internal/cbatch/CmdArgParser.go (2 hunks)
  • internal/cbatch/cbatch.go (2 hunks)
  • internal/util/string.go (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
internal/cbatch/cbatch.go (2)
internal/util/string.go (1)
  • ParseGpusPerNodeStr (739-788)
internal/cbatch/CmdArgParser.go (1)
  • FlagGpusPerNode (52-52)
🔇 Additional comments (3)
internal/util/string.go (1)

739-788: Well-implemented GPU parsing function.

The function correctly handles both simple numeric and type-specific GPU formats with proper validation and error handling. The logic for enforcing mutual exclusion between the two formats is sound.

internal/cbatch/CmdArgParser.go (1)

52-52: Clean flag implementation following established patterns.

The new --gpus-per-node flag is properly declared and registered with clear usage documentation and examples.

Also applies to: 167-167

internal/cbatch/cbatch.go (1)

86-92: Consistent argument processing implementation.

The --gpus-per-node argument processing follows the established patterns and correctly integrates with the existing resource management system. Error handling and logging are consistent with other arguments.

Note that both --gres and --gpus-per-node set task.ReqResources.DeviceMap, meaning the last one processed will override the other. Verify this is the intended behavior:

#!/bin/bash
# Description: Check if there are any validation rules preventing conflicting GPU specifications

# Search for validation logic that might prevent using both --gres and --gpus-per-node
rg -A 10 -B 5 "DeviceMap.*gres|gres.*DeviceMap" 

# Look for any documentation about the interaction between these flags
fd -e md -e txt | xargs rg -l "gres|gpus-per-node" | head -5 | xargs cat

Also applies to: 200-207

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4de275f and 4d3b5b5.

📒 Files selected for processing (6)
  • internal/calloc/CmdArgParser.go (2 hunks)
  • internal/calloc/calloc.go (1 hunks)
  • internal/cbatch/cbatch.go (2 hunks)
  • internal/crun/CmdArgParser.go (2 hunks)
  • internal/crun/crun.go (1 hunks)
  • internal/util/string.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/cbatch/cbatch.go
  • internal/util/string.go
🧰 Additional context used
🧬 Code Graph Analysis (2)
internal/crun/crun.go (3)
internal/crun/CmdArgParser.go (1)
  • FlagGpusPerNode (44-44)
internal/util/string.go (1)
  • ParseGpusPerNodeStr (739-788)
internal/util/err.go (2)
  • CraneError (44-47)
  • ErrorCmdArg (37-37)
internal/calloc/calloc.go (3)
internal/calloc/CmdArgParser.go (1)
  • FlagGpusPerNode (43-43)
internal/util/string.go (1)
  • ParseGpusPerNodeStr (739-788)
internal/util/err.go (2)
  • CraneError (44-47)
  • ErrorCmdArg (37-37)
🔇 Additional comments (2)
internal/crun/CmdArgParser.go (1)

44-44: LGTM: Clean flag declaration and registration

The flag variable declaration and registration follow the established patterns in the codebase. The help text is descriptive and includes useful examples for users.

Also applies to: 111-111

internal/calloc/CmdArgParser.go (1)

43-43: LGTM: Consistent flag implementation across commands

The flag declaration and registration are consistent with the crun command, providing a uniform user experience. The implementation follows established patterns and includes clear help text.

Also applies to: 100-100

@1daidai1 1daidai1 force-pushed the dev/cbatchgpus-pernode branch 2 times, most recently from c84eda8 to e6f6a74 Compare July 1, 2025 06:13
@L-Xiafeng L-Xiafeng requested a review from Copilot July 29, 2025 01:52
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds support for a new --gpus-per-node command-line flag to the batch, calloc, and crun commands, enabling users to specify GPU requirements per node with flexible formats. The implementation includes validation to prevent simultaneous use of conflicting GPU specification flags.

  • Adds a new parsing function ParseGpusPerNodeStr to handle the --gpus-per-node format
  • Implements mutual exclusion validation between --gres (when specifying GPUs) and --gpus-per-node flags
  • Integrates the new flag across all three command interfaces (batch, calloc, crun)

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
internal/util/string.go Adds ParseGpusPerNodeStr function to parse GPU specifications in the new format
internal/crun/crun.go Integrates --gpus-per-node flag with validation against conflicting --gres usage
internal/crun/CmdArgParser.go Adds command-line flag definition for --gpus-per-node in crun
internal/cbatch/cbatch.go Implements --gpus-per-node support in batch processing with conflict detection
internal/cbatch/CmdArgParser.go Adds command-line flag definition for --gpus-per-node in cbatch
internal/calloc/calloc.go Adds --gpus-per-node flag support with validation in calloc
internal/calloc/CmdArgParser.go Adds command-line flag definition for --gpus-per-node in calloc

if mode == "" {
mode = "type"
} else if mode != "type" {
return nil, fmt.Errorf("mixed type: number and number format is not allowed")
Copy link

Copilot AI Jul 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message "mixed type: number and number format is not allowed" is confusing. It should be clearer, such as "cannot mix simple numeric format with type:count format"

Suggested change
return nil, fmt.Errorf("mixed type: number and number format is not allowed")
return nil, fmt.Errorf("cannot mix simple numeric format with type:count format")

Copilot uses AI. Check for mistakes.
Code: util.ErrorCmdArg,
Message: fmt.Sprintf("Invalid argument: invalid --gpus-per-node: %s", err),
}

Copy link

Copilot AI Jul 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] There's an extra blank line before the closing brace. This inconsistent spacing affects code readability.

Suggested change

Copilot uses AI. Check for mistakes.
Code: util.ErrorCmdArg,
Message: fmt.Sprintf("Invalid argument: invalid --gpus-per-node: %s", err),
}

Copy link

Copilot AI Jul 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] There's an extra blank line before the closing brace. This inconsistent spacing affects code readability and should match the formatting in other similar blocks.

Suggested change

Copilot uses AI. Check for mistakes.
if FlagExclusive {
task.Exclusive = true
}
if FlagOpenMode != "" {
Copy link

Copilot AI Jul 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removal of the FlagExclusive handling code appears to be unintentional. The exclusive flag functionality has been deleted without apparent justification, which could break existing functionality.

Copilot uses AI. Check for mistakes.
if FlagComment != "" {
structExtraFromCli.Comment = FlagComment
}
if FlagExclusive {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这为什么删了

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

之前rebase导致多了一个if FlagExclusive判断,修改这个需求发现的,所以直接删除了

}
gpusPerNodeStrList := strings.Split(gpuPerNodeStr, ",")

mode := "" // "digit" or "type"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个mode有什么用?好像完全没必要限制用户只能用一种方式?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image slurm不支持混用

Copy link
Collaborator

@L-Xiafeng L-Xiafeng Sep 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

那也没必要用个string

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已修改,改成enum

@1daidai1 1daidai1 force-pushed the dev/cbatchgpus-pernode branch from e6f6a74 to 076ead0 Compare September 16, 2025 09:08
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (4)
internal/crun/cmd.go (1)

117-117: Fix help text: clarify semantics and casing.

The flag name is “gpus-per-node” but the help says “per task” and uses inconsistent casing.

Apply this diff:

-RootCmd.Flags().StringVar(&FlagGpusPerNode, "gpus-per-node", "", "Gpus-per-node required per task, format: [type:]<number>[,[type:]<number>...]. eg: \"4\" or \"a100:1,volta:1\"")
+RootCmd.Flags().StringVar(&FlagGpusPerNode, "gpus-per-node", "", "GPUs per node; format: [type:]<number>[,[type:]<number>...], e.g., \"4\" or \"a100:1,volta:1\"")
internal/calloc/cmd.go (1)

103-103: Fix help text: clarify semantics and casing.

Match crun/cbatch wording.

Apply this diff:

-RootCmd.Flags().StringVar(&FlagGpusPerNode, "gpus-per-node", "", "Gpus-per-node required per task, format: [type:]<number>[,[type:]<number>...]. eg: \"4\" or \"a100:1,volta:1\"")
+RootCmd.Flags().StringVar(&FlagGpusPerNode, "gpus-per-node", "", "GPUs per node; format: [type:]<number>[,[type:]<number>...], e.g., \"4\" or \"a100:1,volta:1\"")
internal/cbatch/cmd.go (1)

175-175: Fix help text: clarify semantics and casing.

Keep wording consistent across commands.

Apply this diff:

-RootCmd.Flags().StringVar(&FlagGpusPerNode, "gpus-per-node", "", "gpus-per-node required per task, format: [type:]<number>[,[type:]<number>...]. eg: \"4\" or \"a100:1,volta:1\"")
+RootCmd.Flags().StringVar(&FlagGpusPerNode, "gpus-per-node", "", "GPUs per node; format: [type:]<number>[,[type:]<number>...], e.g., \"4\" or \"a100:1,volta:1\"")
internal/cbatch/cbatch.go (1)

96-108: Optional: centralize map merge to avoid duplication.

Consider a small helper (e.g., util.MergeDeviceMaps(dst, src) with overwrite semantics per resource name) and reuse in both blocks.

Also applies to: 228-240

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e6f6a74 and 076ead0.

📒 Files selected for processing (7)
  • internal/calloc/calloc.go (2 hunks)
  • internal/calloc/cmd.go (2 hunks)
  • internal/cbatch/cbatch.go (3 hunks)
  • internal/cbatch/cmd.go (2 hunks)
  • internal/crun/cmd.go (2 hunks)
  • internal/crun/crun.go (2 hunks)
  • internal/util/string.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • internal/calloc/calloc.go
  • internal/util/string.go
  • internal/crun/crun.go
🧰 Additional context used
🧬 Code graph analysis (4)
internal/calloc/cmd.go (2)
internal/cbatch/cmd.go (1)
  • FlagGpusPerNode (52-52)
internal/crun/cmd.go (1)
  • FlagGpusPerNode (46-46)
internal/cbatch/cmd.go (2)
internal/calloc/cmd.go (1)
  • FlagGpusPerNode (43-43)
internal/crun/cmd.go (1)
  • FlagGpusPerNode (46-46)
internal/crun/cmd.go (2)
internal/calloc/cmd.go (1)
  • FlagGpusPerNode (43-43)
internal/cbatch/cmd.go (1)
  • FlagGpusPerNode (52-52)
internal/cbatch/cbatch.go (4)
internal/util/string.go (2)
  • ParseGpusPerNodeStr (761-810)
  • ParseGres (714-759)
internal/calloc/cmd.go (2)
  • FlagGres (37-37)
  • FlagGpusPerNode (43-43)
internal/cbatch/cmd.go (2)
  • FlagGres (45-45)
  • FlagGpusPerNode (52-52)
internal/crun/cmd.go (2)
  • FlagGres (41-41)
  • FlagGpusPerNode (46-46)
🪛 GitHub Actions: Go Code Quality Check
internal/crun/cmd.go

[error] 1-1: gofmt formatting check failed. The file 'internal/crun/cmd.go' is not properly formatted. Please run 'gofmt -w .' to fix.

🔇 Additional comments (7)
internal/crun/cmd.go (2)

46-46: LGTM on flag var exposure.

New FlagGpusPerNode aligns with existing flag vars.


1-119: gofmt failure — verification unavailable in sandbox

gofmt is not installed in the verification environment ("gofmt: command not found"). Run locally: gofmt -s -w internal/crun/cmd.go, commit the formatted file, and push.

internal/calloc/cmd.go (1)

43-43: LGTM on flag var exposure.

internal/cbatch/cmd.go (1)

52-52: LGTM on flag var exposure.

internal/cbatch/cbatch.go (3)

67-69: LGTM on mutual‑exclusion guards introduction.

setGresGpusFlag and setGpusPerNodeFlag make the constraint explicit.


88-95: LGTM on detecting GPU in --gres for exclusion logic.


218-227: LGTM on CLI --gres handling and guard.

Comment on lines +96 to 128
case "--gpus-per-node":
if setGresGpusFlag {
log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
return false, nil
}
setGpusPerNodeFlag = true
gpuDeviceMap, err := util.ParseGpusPerNodeStr(arg.val)
if err != nil {
log.Errorf("Invalid argument: %v in script: %v", arg.name, err)
return false, nil
}
task.ReqResources.DeviceMap = gpuDeviceMap
case "--ntasks-per-node":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Don’t clobber non‑GPU devices when --gpus-per-node is used after --gres.

Assigning gpuDeviceMap overwrites any non‑GPU entries previously set via --gres (e.g., nic:1). Merge instead, overwriting only the GPU entry.

Apply this diff:

-        setGpusPerNodeFlag = true
-        gpuDeviceMap, err := util.ParseGpusPerNodeStr(arg.val)
-        if err != nil {
-                log.Errorf("Invalid argument: %v in script: %v", arg.name, err)
-                return false, nil
-        }
-        task.ReqResources.DeviceMap = gpuDeviceMap
+        setGpusPerNodeFlag = true
+        gpuDeviceMap, err := util.ParseGpusPerNodeStr(arg.val)
+        if err != nil {
+                log.Errorf("Invalid argument: %v in script: %v", arg.name, err)
+                return false, nil
+        }
+        if task.ReqResources.DeviceMap == nil || task.ReqResources.DeviceMap.NameTypeMap == nil || len(task.ReqResources.DeviceMap.NameTypeMap) == 0 {
+                task.ReqResources.DeviceMap = gpuDeviceMap
+        } else {
+                // Overwrite/insert GPU mapping while preserving non-GPU devices from --gres.
+                for name, tcm := range gpuDeviceMap.NameTypeMap {
+                        task.ReqResources.DeviceMap.NameTypeMap[name] = tcm
+                }
+        }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
case "--gpus-per-node":
if setGresGpusFlag {
log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
return false, nil
}
setGpusPerNodeFlag = true
gpuDeviceMap, err := util.ParseGpusPerNodeStr(arg.val)
if err != nil {
log.Errorf("Invalid argument: %v in script: %v", arg.name, err)
return false, nil
}
task.ReqResources.DeviceMap = gpuDeviceMap
case "--ntasks-per-node":
case "--gpus-per-node":
if setGresGpusFlag {
log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
return false, nil
}
setGpusPerNodeFlag = true
gpuDeviceMap, err := util.ParseGpusPerNodeStr(arg.val)
if err != nil {
log.Errorf("Invalid argument: %v in script: %v", arg.name, err)
return false, nil
}
if task.ReqResources.DeviceMap == nil || task.ReqResources.DeviceMap.NameTypeMap == nil || len(task.ReqResources.DeviceMap.NameTypeMap) == 0 {
task.ReqResources.DeviceMap = gpuDeviceMap
} else {
// Overwrite/insert GPU mapping while preserving non-GPU devices from --gres.
for name, tcm := range gpuDeviceMap.NameTypeMap {
task.ReqResources.DeviceMap.NameTypeMap[name] = tcm
}
}
case "--ntasks-per-node":
🤖 Prompt for AI Agents
In internal/cbatch/cbatch.go around lines 96 to 108, assigning gpuDeviceMap
directly to task.ReqResources.DeviceMap clobbers any non-GPU device entries
previously set via --gres; instead merge gpuDeviceMap into the existing
DeviceMap so only the GPU entry is replaced. Implement: if
task.ReqResources.DeviceMap is nil, set it to a new map; otherwise iterate
gpuDeviceMap and set each key/value into task.ReqResources.DeviceMap (this will
overwrite the "gpu" entry but preserve other device keys like "nic"). Keep error
handling as-is and only replace the assignment with this merge logic.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
internal/util/string.go (3)

807-812: Duplicate type handling: overwrite vs. accumulate.

Currently the last occurrence overwrites prior counts for the same GPU type. If users may repeat a type, consider summing (with overflow guard) or explicitly error on duplicates to avoid silent surprises.

Option A (accumulate counts):

-			typeCountMap.TypeCountMap[gpuType] = val
+			if prev, ok := typeCountMap.TypeCountMap[gpuType]; ok {
+				if prev > math.MaxUint64-val {
+					return nil, fmt.Errorf("count overflow for type %q", gpuType)
+				}
+				val += prev
+			}
+			typeCountMap.TypeCountMap[gpuType] = val

Option B (reject duplicates):

-			typeCountMap.TypeCountMap[gpuType] = val
+			if _, exists := typeCountMap.TypeCountMap[gpuType]; exists {
+				return nil, fmt.Errorf("duplicate gpu type %q in --gpus-per-node", gpuType)
+			}
+			typeCountMap.TypeCountMap[gpuType] = val

761-763: Add GoDoc to lock in the accepted formats.

A brief comment helps future readers and keeps CLI/docs consistent.

+// ParseGpusPerNodeStr parses the --gpus-per-node flag.
+// Accepted forms (mutually exclusive):
+//   - "N" (simple count, must be the only part)
+//   - "type:count[,type:count...]"
+// Returns a DeviceMap with NameTypeMap["gpu"] populated.
 func ParseGpusPerNodeStr(gpuPerNodeStr string) (*protos.DeviceMap, error) {

761-819: Add table-driven tests for edge cases.

Recommend tests for: "", "2", "a100:2", "a100:2,l40:1", "2,a100:1" (mixing), "a100:" (empty count), ":" (empty type and count), "a100:0", " , ", "a100:2,a100:3" (duplicate type), "a100:1:2" (too many colons).

I can draft a _test.go with these cases if helpful.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 71bd294 and 4d3d65f.

📒 Files selected for processing (2)
  • internal/crun/cmd.go (2 hunks)
  • internal/util/string.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/crun/cmd.go
🔇 Additional comments (1)
internal/util/string.go (1)

761-819: Solid parser; aligns with Slurm-style rules.

Covers empty input, format exclusivity, index constraints, and numeric validation. Looks good.

Please confirm downstream expects key "gpu" in NameTypeMap and that typed mode (TypeCountMap set, Total=0) matches existing ParseGres semantics.

Copy link
Collaborator

@L-Xiafeng L-Xiafeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

等CI恢复测试后合入

if FlagGres != "" {
gresMap := util.ParseGres(FlagGres)
task.ReqResources.DeviceMap = gresMap
if _, exist := task.ReqResources.DeviceMap.NameTypeMap["gpu"]; exist {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"gpu"也单独写个常量吧

}
if FlagGpusPerNode != "" {
if setGresGpusFlag {
return &util.CraneError{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

有个这样的函数NewCraneErr,统一下不手动构造这个 err

@1daidai1 1daidai1 force-pushed the dev/cbatchgpus-pernode branch from 4d3d65f to 089c061 Compare December 23, 2025 03:40
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/crun/crun.go (1)

1181-1199: Critical: Syntax error and undefined variable cause build failure.

The code has multiple issues causing the pipeline to fail:

  1. Line 1186 references task.ReqResources.DeviceMap but the variable is named job in this context
  2. The if block starting at line 1186 is missing its closing brace and body
  3. The logic structure is incomplete - the setGresGpusFlag = true assignment at line 1187 appears orphaned
🔎 Proposed fix
 	setGresGpusFlag := false
 	if FlagGres != "" {
 		gresMap := util.ParseGres(FlagGres)
 		if jobMode {
 			job.ReqResources.DeviceMap = gresMap
-			if _, exist := task.ReqResources.DeviceMap.NameTypeMap["gpu"]; exist {
-			setGresGpusFlag = true
+			if _, exist := gresMap.NameTypeMap["gpu"]; exist {
+				setGresGpusFlag = true
+			}
 		} else {
 			if len(gresMap.NameTypeMap) != 0 {
♻️ Duplicate comments (1)
internal/cbatch/cbatch.go (1)

127-127: Consider merging DeviceMap to preserve non-GPU devices.

When --gpus-per-node is used after --gres with non-GPU devices (e.g., --gres nic:1), the current implementation overwrites the entire DeviceMap, losing the non-GPU device entries.

🔎 Proposed merge logic
-			task.ReqResources.DeviceMap = gpuDeviceMap
+			if task.ReqResources.DeviceMap == nil || len(task.ReqResources.DeviceMap.NameTypeMap) == 0 {
+				task.ReqResources.DeviceMap = gpuDeviceMap
+			} else {
+				// Merge GPU mapping while preserving non-GPU devices from --gres
+				for name, tcm := range gpuDeviceMap.NameTypeMap {
+					task.ReqResources.DeviceMap.NameTypeMap[name] = tcm
+				}
+			}

This follows the suggestion from a past review comment to merge instead of overwrite.

Also applies to: 262-262

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4d3d65f and 089c061.

📒 Files selected for processing (7)
  • internal/calloc/calloc.go
  • internal/calloc/cmd.go
  • internal/cbatch/cbatch.go
  • internal/cbatch/cmd.go
  • internal/crun/cmd.go
  • internal/crun/crun.go
  • internal/util/string.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • internal/crun/cmd.go
  • internal/calloc/cmd.go
  • internal/calloc/calloc.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-04-24T16:05:47.156Z
Learnt from: NamelessOIer
Repo: PKUHPC/CraneSched-FrontEnd PR: 256
File: internal/cqueue/cqueue.go:350-356
Timestamp: 2025-04-24T16:05:47.156Z
Learning: In cqueue.go, the function ProcessAllocMemPerNode should use task.AllocatedResView.AllocatableRes.MemoryLimitBytes (allocated resources) rather than task.ReqResView.AllocatableRes.MemoryLimitBytes (requested resources) to correctly display allocated memory per node.

Applied to files:

  • internal/crun/crun.go
  • internal/cbatch/cbatch.go
🧬 Code graph analysis (2)
internal/crun/crun.go (3)
internal/crun/cmd.go (2)
  • FlagGres (45-45)
  • FlagGpusPerNode (50-50)
internal/util/string.go (2)
  • ParseGres (1058-1103)
  • ParseGpusPerNodeStr (1105-1163)
internal/util/err.go (2)
  • CraneError (43-47)
  • ErrorCmdArg (36-36)
internal/cbatch/cmd.go (2)
internal/crun/cmd.go (1)
  • FlagGpusPerNode (50-50)
internal/calloc/cmd.go (1)
  • FlagGpusPerNode (44-44)
🪛 GitHub Actions: Go Code Quality Check
internal/crun/crun.go

[error] 1426-1426: Build failed: expected '}', found 'EOF'.

internal/cbatch/cbatch.go

[error] 111-111: go vet ./...: cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement

🪛 GitHub Check: go-check
internal/cbatch/cbatch.go

[failure] 125-125:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement


[failure] 119-119:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement


[failure] 111-111:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement


[failure] 260-260:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement


[failure] 254-254:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement


[failure] 245-245:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement

🔇 Additional comments (3)
internal/cbatch/cmd.go (1)

53-53: LGTM!

The new FlagGpusPerNode variable and its CLI flag registration are correctly implemented, following the established pattern used for other flags in this file and consistent with similar declarations in internal/crun/cmd.go and internal/calloc/cmd.go.

Also applies to: 159-159

internal/util/string.go (1)

1105-1163: LGTM!

The ParseGpusPerNodeStr function is well-implemented with:

  • Clear mode-based parsing to enforce format consistency (simple number vs type:count)
  • Comprehensive validation with descriptive error messages
  • Proper handling of edge cases (empty input, empty parts, mixed formats, invalid numbers)
  • Consistent behavior with ParseGres by treating zero values as invalid

The implementation addresses all previous review feedback regarding variable naming and error message clarity.

internal/cbatch/cbatch.go (1)

89-90: LGTM on mutual exclusion guards.

The setGresGpusFlag and setGpusPerNodeFlag guards correctly implement mutual exclusion between --gres with GPU specifications and --gpus-per-node. The logic properly handles both script arguments and CLI flags.

Comment on lines +108 to +127
if _, exist := gresMap.NameTypeMap["gpu"]; exist {
if setGpusPerNodeFlag {
log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
return false, nil
}
setGresGpusFlag = true
}
task.ReqResources.DeviceMap = gresMap
case "--gpus-per-node":
if setGresGpusFlag {
log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
return false, nil
}
setGpusPerNodeFlag = true
gpuDeviceMap, err := util.ParseGpusPerNodeStr(arg.val)
if err != nil {
log.Errorf("Invalid argument: %v in script: %v", arg.name, err)
return false, nil
}
task.ReqResources.DeviceMap = gpuDeviceMap
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: Incorrect return type causes build failure.

The function BuildCbatchJob returns (*protos.TaskToCtld, error), but lines 111, 119, and 125 use return false, nil which is a type mismatch. This causes the pipeline build failure.

🔎 Proposed fix
 		case "--gres":
 			gresMap := util.ParseGres(arg.val)
 			if _, exist := gresMap.NameTypeMap["gpu"]; exist {
 				if setGpusPerNodeFlag {
-					log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
-					return false, nil
+					return nil, fmt.Errorf("cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
 				}
 				setGresGpusFlag = true
 			}
 			task.ReqResources.DeviceMap = gresMap
 		case "--gpus-per-node":
 			if setGresGpusFlag {
-				log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
-				return false, nil
+				return nil, fmt.Errorf("cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
 			}
 			setGpusPerNodeFlag = true
 			gpuDeviceMap, err := util.ParseGpusPerNodeStr(arg.val)
 			if err != nil {
-				log.Errorf("Invalid argument: %v in script: %v", arg.name, err)
-				return false, nil
+				return nil, fmt.Errorf("invalid argument: %s value in script: %w", arg.name, err)
 			}
 			task.ReqResources.DeviceMap = gpuDeviceMap
🧰 Tools
🪛 GitHub Actions: Go Code Quality Check

[error] 111-111: go vet ./...: cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement

🪛 GitHub Check: go-check

[failure] 125-125:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement


[failure] 119-119:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement


[failure] 111-111:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement

🤖 Prompt for AI Agents
In internal/cbatch/cbatch.go around lines 108 to 127, the error branches
currently return the wrong types (they use "return false, nil" while
BuildCbatchJob's signature is (*protos.TaskToCtld, error)); replace each "return
false, nil" with "return nil, fmt.Errorf(...)" (or errors.New(...)) using the
same error message text as logged (e.g., "cannot specify both --gres gpus and
--gpus-per-node flags simultaneously" and the invalid-argument formatting), and
add the required import for fmt or errors if not present.

Comment on lines 240 to 263
if cmd.Flags().Changed("gres") {
task.ReqResources.DeviceMap = util.ParseGres(FlagGres)
gresMap := util.ParseGres(FlagGres)
if _, exist := gresMap.NameTypeMap["gpu"]; exist {
if setGpusPerNodeFlag {
log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
return false, nil
}
setGresGpusFlag = true
}
task.ReqResources.DeviceMap = gresMap
}
if cmd.Flags().Changed("gpus-per-node") {
if setGresGpusFlag {
log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
return false, nil
}
setGpusPerNodeFlag = true
gpuDeviceMap, err := util.ParseGpusPerNodeStr(FlagGpusPerNode)
if err != nil {
log.Errorf("Invalid argument: %v", err)
return false, nil
}
task.ReqResources.DeviceMap = gpuDeviceMap
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: Incorrect return type in CLI flag handling section.

Same issue as the script parsing section - lines 245, 254, and 260 use return false, nil instead of the correct return type (*protos.TaskToCtld, error).

🔎 Proposed fix
 	if cmd.Flags().Changed("gres") {
 		gresMap := util.ParseGres(FlagGres)
 		if _, exist := gresMap.NameTypeMap["gpu"]; exist {
 			if setGpusPerNodeFlag {
-				log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
-				return false, nil
+				return nil, fmt.Errorf("cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
 			}
 			setGresGpusFlag = true
 		}
 		task.ReqResources.DeviceMap = gresMap
 	}
 	if cmd.Flags().Changed("gpus-per-node") {
 		if setGresGpusFlag {
-			log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
-			return false, nil
+			return nil, fmt.Errorf("cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
 		}
 		setGpusPerNodeFlag = true
 		gpuDeviceMap, err := util.ParseGpusPerNodeStr(FlagGpusPerNode)
 		if err != nil {
-			log.Errorf("Invalid argument: %v", err)
-			return false, nil
+			return nil, fmt.Errorf("invalid argument: invalid --gpus-per-node: %w", err)
 		}
 		task.ReqResources.DeviceMap = gpuDeviceMap
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if cmd.Flags().Changed("gres") {
task.ReqResources.DeviceMap = util.ParseGres(FlagGres)
gresMap := util.ParseGres(FlagGres)
if _, exist := gresMap.NameTypeMap["gpu"]; exist {
if setGpusPerNodeFlag {
log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
return false, nil
}
setGresGpusFlag = true
}
task.ReqResources.DeviceMap = gresMap
}
if cmd.Flags().Changed("gpus-per-node") {
if setGresGpusFlag {
log.Errorf("Cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
return false, nil
}
setGpusPerNodeFlag = true
gpuDeviceMap, err := util.ParseGpusPerNodeStr(FlagGpusPerNode)
if err != nil {
log.Errorf("Invalid argument: %v", err)
return false, nil
}
task.ReqResources.DeviceMap = gpuDeviceMap
}
if cmd.Flags().Changed("gres") {
gresMap := util.ParseGres(FlagGres)
if _, exist := gresMap.NameTypeMap["gpu"]; exist {
if setGpusPerNodeFlag {
return nil, fmt.Errorf("cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
}
setGresGpusFlag = true
}
task.ReqResources.DeviceMap = gresMap
}
if cmd.Flags().Changed("gpus-per-node") {
if setGresGpusFlag {
return nil, fmt.Errorf("cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
}
setGpusPerNodeFlag = true
gpuDeviceMap, err := util.ParseGpusPerNodeStr(FlagGpusPerNode)
if err != nil {
return nil, fmt.Errorf("invalid argument: invalid --gpus-per-node: %w", err)
}
task.ReqResources.DeviceMap = gpuDeviceMap
}
🧰 Tools
🪛 GitHub Check: go-check

[failure] 260-260:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement


[failure] 254-254:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement


[failure] 245-245:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement

🤖 Prompt for AI Agents
In internal/cbatch/cbatch.go around lines 240 to 263, the CLI flag handling
returns the wrong types (uses "return false, nil"); change those to the
function's actual return types: for the mutual-exclusion errors replace "return
false, nil" with "return nil, fmt.Errorf(\"Cannot specify both --gres gpus and
--gpus-per-node flags simultaneously\")" (or an errors.New with the same
message) and for the ParseGpusPerNodeStr error return "return nil, err" (or wrap
with fmt.Errorf) so the function returns (*protos.TaskToCtld, error); also add
the fmt import if you use fmt.Errorf.

Comment on lines +1315 to +1331
if FlagGpusPerNode != "" {
if setGresGpusFlag {
return &util.CraneError{
Code: util.ErrorCmdArg,
Message: "Cannot specify both --gres gpus and --gpus-per-node flags simultaneously",
}
}
gpuDeviceMap, err := util.ParseGpusPerNodeStr(FlagGpusPerNode)
if err != nil {
return &util.CraneError{
Code: util.ErrorCmdArg,
Message: fmt.Sprintf("Invalid argument: invalid --gpus-per-node: %s", err),
}

}
task.ReqResources.DeviceMap = gpuDeviceMap
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: Undefined variable task and missing step mode handling.

Line 1330 references task.ReqResources.DeviceMap but task is not defined in this scope. The variable should be job for job mode or step for step mode. Additionally, step mode handling is completely missing.

🔎 Proposed fix
 	if FlagGpusPerNode != "" {
 		if setGresGpusFlag {
-			return &util.CraneError{
-				Code:    util.ErrorCmdArg,
-				Message: "Cannot specify both --gres gpus and --gpus-per-node flags simultaneously",
-			}
+			return util.NewCraneErr(util.ErrorCmdArg, "Cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
 		}
 		gpuDeviceMap, err := util.ParseGpusPerNodeStr(FlagGpusPerNode)
 		if err != nil {
-			return &util.CraneError{
-				Code:    util.ErrorCmdArg,
-				Message: fmt.Sprintf("Invalid argument: invalid --gpus-per-node: %s", err),
-			}
-
+			return util.NewCraneErr(util.ErrorCmdArg, fmt.Sprintf("Invalid argument: invalid --gpus-per-node: %s", err))
+		}
+		if jobMode {
+			job.ReqResources.DeviceMap = gpuDeviceMap
+		} else {
+			if step.ReqResourcesPerTask == nil {
+				step.ReqResourcesPerTask = &protos.ResourceView{
+					DeviceMap: gpuDeviceMap,
+				}
+			} else {
+				step.ReqResourcesPerTask.DeviceMap = gpuDeviceMap
+			}
 		}
-		task.ReqResources.DeviceMap = gpuDeviceMap
 	}

Based on past review comments, the error construction should use util.NewCraneErr for consistency.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if FlagGpusPerNode != "" {
if setGresGpusFlag {
return &util.CraneError{
Code: util.ErrorCmdArg,
Message: "Cannot specify both --gres gpus and --gpus-per-node flags simultaneously",
}
}
gpuDeviceMap, err := util.ParseGpusPerNodeStr(FlagGpusPerNode)
if err != nil {
return &util.CraneError{
Code: util.ErrorCmdArg,
Message: fmt.Sprintf("Invalid argument: invalid --gpus-per-node: %s", err),
}
}
task.ReqResources.DeviceMap = gpuDeviceMap
}
if FlagGpusPerNode != "" {
if setGresGpusFlag {
return util.NewCraneErr(util.ErrorCmdArg, "Cannot specify both --gres gpus and --gpus-per-node flags simultaneously")
}
gpuDeviceMap, err := util.ParseGpusPerNodeStr(FlagGpusPerNode)
if err != nil {
return util.NewCraneErr(util.ErrorCmdArg, fmt.Sprintf("Invalid argument: invalid --gpus-per-node: %s", err))
}
if jobMode {
job.ReqResources.DeviceMap = gpuDeviceMap
} else {
if step.ReqResourcesPerTask == nil {
step.ReqResourcesPerTask = &protos.ResourceView{
DeviceMap: gpuDeviceMap,
}
} else {
step.ReqResourcesPerTask.DeviceMap = gpuDeviceMap
}
}
}
🤖 Prompt for AI Agents
In internal/crun/crun.go around lines 1315 to 1331, the code incorrectly
references an undefined variable `task` and omits step-mode handling; change the
assignment to set the DeviceMap on the correct object depending on current mode
(use `job.ReqResources.DeviceMap = gpuDeviceMap` when in job mode and
`step.ReqResources.DeviceMap = gpuDeviceMap` when in step mode), add the missing
branch to handle step mode, and replace the returned CraneError constructions
with util.NewCraneErr(...) to match project conventions (i.e., return
util.NewCraneErr(util.ErrorCmdArg, fmt.Sprintf(...)) for the parse error and the
conflicting-flag error).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants