-
Notifications
You must be signed in to change notification settings - Fork 16
Add job env #318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Add job env #318
Conversation
📝 WalkthroughWalkthroughPopulates task submit metadata (hostname and submit directory) across allocator, runner, and batch submission code paths and adds protobuf fields ( Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
protos/PublicDefs.proto (2)
170-172: Submit hostname addition looks good; add a brief comment for consumers.This is wire‑compatible and useful for auditing/routing. Add a short comment so downstream readers understand its purpose.
- string submit_hostname = 39; + // Hostname of the node where the job was submitted (login/head node). + string submit_hostname = 39;
246-248: Add documentation fortotal_gpusand reserve field tag 28Search for any existing use of tag 28 returned no matches, so it’s safe to reserve it to prevent future reuse. Please consider the following updates:
• Clarify the meaning of
submit_hostnameandtotal_gpusin comments.
• Insert areserved 28;line immediately aftertotal_gpusto block accidental reuse.Suggested diff in
protos/PublicDefs.proto(insidemessage TaskToD):message TaskToD { … // Hostname where the job was submitted. string submit_hostname = 26; // Total number of GPUs requested/allocated for this task // across all nodes and device types. uint32 total_gpus = 27; + // Reserve tag 28 to prevent accidental reuse. + reserved 28; // Not used now. string extra_attr = 29; }internal/cbatch/cbatch.go (1)
304-310: Don’t hard‑fail on hostname lookup; fall back and proceed.os.Hostname() can fail in minimal containers or restricted envs. This is metadata; it shouldn’t block submission. Fall back to $HOSTNAME/$HOST and proceed (possibly empty) with a warning.
- submitHostname, err := os.Hostname() - if err != nil { - log.Errorf("Failed to get hostname of the submitting host: %v", err) - return false, nil - } - task.SubmitHostname = submitHostname + submitHostname, err := os.Hostname() + if err != nil { + // Try env fallbacks; do not block submission on metadata failure. + if h := os.Getenv("HOSTNAME"); h != "" { + submitHostname = h + } else if h := os.Getenv("HOST"); h != "" { + submitHostname = h + } else { + log.Warnf("Failed to get hostname of the submitting host: %v; proceeding without it", err) + submitHostname = "" + } + } + task.SubmitHostname = submitHostname
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
internal/cbatch/cbatch.go(1 hunks)protos/PublicDefs.proto(2 hunks)protos/Supervisor.proto(1 hunks)
🔇 Additional comments (1)
internal/cbatch/cbatch.go (1)
301-306: No downstream usage of SubmitHostname foundA repository-wide search for
SubmitHostnameonly located its assignment ininternal/cbatch/cbatch.goand no references in other Go or proto files. Since no consumers exist, an emptySubmitHostnamecannot cause downstream errors and requires no additional handling.• internal/cbatch/cbatch.go: line 309 –
task.SubmitHostname = submitHostname
protos/Supervisor.proto
Outdated
| string log_dir = 11; | ||
| string crane_cluster_name = 12; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Add a brief field doc and confirm backward compatibility assumptions for tag 12.
The new field is fine and backward compatible as an optional string. Please add a short comment describing semantics (e.g., which components consume it and whether it’s informational vs. authoritative). Also, double‑check no existing field previously used tag 12.
Run to spot generated usages and ensure codegen picked it up:
Apply this doc comment near the new line:
- string crane_cluster_name = 12;
+ // Logical name of the Crane cluster the supervisor belongs to (informational; optional).
+ string crane_cluster_name = 12;🏁 Script executed:
#!/bin/bash
# Look for the new field in generated code and call sites.
rg -n --glob '!**/node_modules/**' -C2 'crane_cluster_name|CraneClusterName|craneClusterName'Length of output: 318
Add documentation comment for crane_cluster_name (mandatory refactor required)
I’ve verified that no other messages or generated code reference tag 12 aside from Supervisor.proto itself, confirming there’s no collision with existing fields. Please add a brief doc comment to clarify its semantics and preserve backward compatibility:
• File: protos/Supervisor.proto, around line 62
– Insert the comment immediately above the new field.
Suggested diff:
- string crane_cluster_name = 12;
+ // Logical name of the Crane cluster the supervisor belongs to (informational; optional).
+ string crane_cluster_name = 12;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| string log_dir = 11; | |
| string crane_cluster_name = 12; | |
| string log_dir = 11; | |
| // Logical name of the Crane cluster the supervisor belongs to (informational; optional). | |
| string crane_cluster_name = 12; |
🤖 Prompt for AI Agents
In protos/Supervisor.proto around lines 61 to 63, the new field
crane_cluster_name lacks a documentation comment; add a brief doc comment
immediately above the string crane_cluster_name = 12; line describing its
semantics (e.g., that it is an optional identifier for the Crane cluster used by
the supervisor, how it is populated/used, and that it is backward-compatible and
optional so older clients will ignore it) to clarify intent and preserve
backward compatibility.
4fc8edd to
5876a37
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
internal/cbatch/cbatch.go(1 hunks)protos/PublicDefs.proto(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- protos/PublicDefs.proto
🧰 Additional context used
🪛 GitHub Actions: Go Code Quality Check
internal/cbatch/cbatch.go
[error] 324-324: go vet: cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement.
🪛 GitHub Check: go-check
internal/cbatch/cbatch.go
[failure] 324-324:
cannot use false (untyped bool constant) as *protos.TaskToCtld value in return statement
66193c5 to
97be0e6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
internal/cbatch/cbatch.go (2)
321-325: Compilation error fixed, but consider making hostname retrieval non-fatal.The critical return type mismatch has been correctly resolved. However, as noted in the previous review, verification confirmed that
SubmitHostnameis metadata-only (stored but not read by application logic). Failing the entire job submission whenos.Hostname()fails may be overly strict.Consider logging the error and continuing with a fallback value:
- submitHostname, err := os.Hostname() - if err != nil { - return nil, fmt.Errorf("Failed to get hostname of the submitting host: %v", err) - } - task.SubmitHostname = submitHostname + submitHostname, err := os.Hostname() + if err != nil { + log.Warnf("Failed to get hostname of the submitting host: %v", err) + task.SubmitHostname = "unknown" + } else { + task.SubmitHostname = submitHostname + }Based on learnings
323-323: Error message should start with lowercase.Per Go conventions, error messages that will be wrapped should start with a lowercase letter.
Apply this diff:
- return nil, fmt.Errorf("Failed to get hostname of the submitting host: %v", err) + return nil, fmt.Errorf("failed to get hostname of the submitting host: %w", err)Note: Also changed
%vto%wto properly wrap the error.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
internal/cbatch/cbatch.go(1 hunks)protos/PublicDefs.proto(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- protos/PublicDefs.proto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
internal/cbatch/cbatch.go(1 hunks)protos/PublicDefs.proto(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/cbatch/cbatch.go
🔇 Additional comments (1)
protos/PublicDefs.proto (1)
1-40: Clarify the PR objective and validate alignment with changes.The PR title is "Add job env," but the changes in this proto file show field removals (commenting out), not additions. No new
env-related fields are visible in the provided code. The AI summary mentions commented-out hostname code incbatch.go, which may provide context, but it is not included in this review.Please clarify:
- Does the PR objective ("Add job env") align with these field removals in
protos/PublicDefs.proto?- Are there related changes in other files (e.g.,
internal/cbatch/cbatch.go) that add new fields or implement the "job env" feature?- Is this a refactoring to remove obsolete fields, or is the PR title misleading?
protos/PublicDefs.proto
Outdated
| // string submit_hostname = 26; | ||
| // uint32 total_gpus = 27; | ||
| // repeated string exclude_nodelist = 28; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Use protobuf's reserved keyword to properly deprecate removed fields.
The three commented-out fields (submit_hostname, total_gpus, exclude_nodelist) at field numbers 26, 27, and 28 present a backwards compatibility risk. Protobuf requires explicit reservation of field numbers to prevent accidental reuse, which would silently corrupt deserialization for older clients.
Apply this diff to reserve the field numbers:
bool get_user_env = 25;
- // string submit_hostname = 26;
- // uint32 total_gpus = 27;
- // repeated string exclude_nodelist = 28;
+ reserved 26, 27, 28;
}Verify backward compatibility expectations for clients and servers that may have been relying on these fields. Confirm this is an intentional breaking change in a major version release if required.
🏁 Script executed:
#!/bin/bash
# Verify the current state of the protobuf file
cat protos/PublicDefs.proto | sed -n '325,340p'Length of output: 322
🌐 Web query:
protobuf reserved keyword backwards compatibility field numbers
💡 Result:
Short answer:
- Use the reserved keyword to block field numbers (and/or names) you remove so they cannot be reused later — this prevents breakage when old serialized data still exists. [1][2]
- Field numbers identify fields on the wire and must never be changed or reused; changing a number is equivalent to deleting the old field and adding a different field. Reserve deleted numbers with e.g. reserved 2, 15, 9 to 11; ranges are inclusive. [1][2]
- Don’t use 19000–19999 (reserved for the protobuf implementation). Valid field numbers: 1..536,870,911 (with 19000–19999 forbidden). [1][3]
- Reserved names affect only compiler checks (except some TextProto behaviour in some implementations); runtime binary encoding still carries unknown fields. [1]
References:
[1] Protocol Buffers language guide / Deleting Fields & Reserved (protobuf.dev).
[2] Proto best practices — “Do Reserve Tag Numbers” (protobuf.dev).
[3] Field number limits and special ranges (protobuf language spec / docs).
🏁 Script executed:
#!/bin/bash
# Search for any references to the removed field names in the codebase
echo "=== Searching for submit_hostname ==="
rg -n "submit_hostname" --type-list
echo "=== Searching for total_gpus ==="
rg -n "total_gpus" --type-list
echo "=== Searching for exclude_nodelist ==="
rg -n "exclude_nodelist" --type-list
echo "=== Find the message containing these fields ==="
sed -n '310,340p' protos/PublicDefs.protoLength of output: 1395
🏁 Script executed:
#!/bin/bash
# Search for references to the removed field names
echo "=== Searching for submit_hostname ==="
rg -n "submit_hostname"
echo "=== Searching for total_gpus ==="
rg -n "total_gpus"
echo "=== Searching for exclude_nodelist ==="
rg -n "exclude_nodelist"
echo "=== Message context (lines 300-345) ==="
sed -n '300,345p' protos/PublicDefs.protoLength of output: 1652
Use protobuf's reserved keyword to properly deprecate removed fields.
The three commented-out fields (submit_hostname, total_gpus, exclude_nodelist) at field numbers 26, 27, and 28 present a backwards compatibility risk. Protobuf best practice requires explicit reservation of field numbers to prevent accidental reuse, which would silently corrupt deserialization for older clients.
Apply this diff to reserve the field numbers:
bool get_user_env = 25;
- // string submit_hostname = 26;
- // uint32 total_gpus = 27;
- // repeated string exclude_nodelist = 28;
+ reserved 26, 27, 28;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // string submit_hostname = 26; | |
| // uint32 total_gpus = 27; | |
| // repeated string exclude_nodelist = 28; | |
| reserved 26, 27, 28; |
🤖 Prompt for AI Agents
In protos/PublicDefs.proto around lines 332 to 334, the three commented-out
fields (submit_hostname at 26, total_gpus at 27, exclude_nodelist at 28) must be
explicitly reserved to avoid accidental reuse; add a reserved declaration inside
the same message scope that lists the numeric field numbers 26, 27, 28 and also
reserve the original field names ("submit_hostname", "total_gpus",
"exclude_nodelist") so both numbers and names cannot be reused by future fields.
97be0e6 to
2de1e21
Compare
8219e9e to
78b233d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
internal/calloc/calloc.go (1)
476-479: Consider extracting duplicated SubmitDir logic into a helper function.This 4-line block is duplicated identically in
internal/cbatch/cmd.go(lines 108-111) andinternal/crun/crun.go(lines 1117-1120). Extract it into a helper function to reduce duplication and improve maintainability.🔎 Proposed refactor
Add a helper function to
internal/util:// GetSubmitDir retrieves the current working directory for task submission. func GetSubmitDir() (string, error) { dir, err := os.Getwd() if err != nil { return "", WrapCraneErr(ErrorSystem, "Failed to get current working directory: %s", err) } return dir, nil }Then use it:
- task.SubmitDir, err = os.Getwd() - if err != nil { - return util.WrapCraneErr(util.ErrorSystem, "Get submit dir err: %s.", err) - } + task.SubmitDir, err = util.GetSubmitDir() + if err != nil { + return err + }internal/crun/crun.go (1)
1117-1120: Consider extracting duplicated SubmitDir logic into a helper function.This 4-line block is duplicated identically in
internal/calloc/calloc.go(lines 476-479) andinternal/cbatch/cmd.go(lines 108-111). Extract it into a helper function to reduce duplication and improve maintainability.🔎 Proposed refactor
Add a helper function to
internal/util:// GetSubmitDir retrieves the current working directory for task submission. func GetSubmitDir() (string, error) { dir, err := os.Getwd() if err != nil { return "", WrapCraneErr(ErrorSystem, "Failed to get current working directory: %s", err) } return dir, nil }Then use it:
- task.SubmitDir, err = os.Getwd() - if err != nil { - return util.WrapCraneErr(util.ErrorSystem, "Get submit dir err: %s.", err) - } + task.SubmitDir, err = util.GetSubmitDir() + if err != nil { + return err + }
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
internal/calloc/calloc.gointernal/cbatch/cmd.gointernal/crun/crun.goprotos/PublicDefs.proto
🧰 Additional context used
🧬 Code graph analysis (3)
internal/crun/crun.go (1)
internal/util/err.go (2)
WrapCraneErr(67-73)ErrorSystem(40-40)
internal/calloc/calloc.go (1)
internal/util/err.go (2)
WrapCraneErr(67-73)ErrorSystem(40-40)
internal/cbatch/cmd.go (1)
internal/util/err.go (2)
WrapCraneErr(67-73)ErrorSystem(40-40)
🪛 GitHub Actions: Go Code Quality Check
internal/crun/crun.go
[error] 1312-1312: go vet ./... reported: undefined: task
🔇 Additional comments (2)
protos/PublicDefs.proto (2)
183-184: LGTM! New submission metadata fields added to TaskToCtld.The additions of
submit_hostnameandsubmit_dirfields properly capture task submission metadata, using appropriate field numbers (42, 43) that don't conflict with existing fields.
339-341: LGTM! New submission metadata fields added to StepToD.The additions of
submit_hostname,total_gpus, andsubmit_dirfields properly capture step submission metadata, using appropriate field numbers (26, 27, 28) that don't conflict with existing fields.
| if task.Cwd == "" { | ||
| task.Cwd, _ = os.Getwd() | ||
| } | ||
| task.SubmitDir, err = os.Getwd() | ||
| if err != nil { | ||
| return util.WrapCraneErr(util.ErrorSystem, "Get submit dir err: %s.", err) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Redundant os.Getwd() call.
os.Getwd() is called twice in succession (lines 106 and 108). The first call assigns to task.Cwd and ignores the error, while the second call assigns to task.SubmitDir with proper error handling. Consolidate these calls to avoid the redundant system call.
🔎 Proposed fix
- if task.Cwd == "" {
- task.Cwd, _ = os.Getwd()
- }
- task.SubmitDir, err = os.Getwd()
- if err != nil {
- return util.WrapCraneErr(util.ErrorSystem, "Get submit dir err: %s.", err)
- }
+ currentDir, err := os.Getwd()
+ if err != nil {
+ return util.WrapCraneErr(util.ErrorSystem, "Failed to get current working directory: %s", err)
+ }
+ if task.Cwd == "" {
+ task.Cwd = currentDir
+ }
+ task.SubmitDir = currentDirAdditionally, the same duplication concern mentioned in internal/calloc/calloc.go applies here.
🤖 Prompt for AI Agents
In internal/cbatch/cmd.go around lines 105-111 you call os.Getwd() twice;
consolidate into a single call that captures (cwd, err) once, handle err and
return the wrapped error if non-nil, then set task.SubmitDir = cwd and only set
task.Cwd = cwd if task.Cwd == ""; apply the same single-call pattern used here
to the similar duplication in internal/calloc/calloc.go.
78b233d to
5d6f287
Compare
5d6f287 to
bdbca68
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
internal/crun/crun.go (1)
1132-1136: ReusegVars.cwdto avoid redundant system call.The working directory was already obtained at line 1041 and stored in
gVars.cwd. Callingos.Getwd()again is redundant and adds unnecessary overhead.Additionally, the local variable name
SubmitDirshould follow Go's camelCase convention.🔎 Proposed simplification
- SubmitDir, err := os.Getwd() - if err != nil { - return util.WrapCraneErr(util.ErrorSystem, "Get submit dir err: %s.", err) - } - job.SubmitDir = SubmitDir + job.SubmitDir = gVars.cwd
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
internal/calloc/calloc.gointernal/cbatch/cbatch.gointernal/cbatch/cmd.gointernal/crun/crun.goprotos/PublicDefs.proto
🚧 Files skipped from review as they are similar to previous changes (4)
- internal/cbatch/cbatch.go
- protos/PublicDefs.proto
- internal/calloc/calloc.go
- internal/cbatch/cmd.go
🧰 Additional context used
🧬 Code graph analysis (1)
internal/crun/crun.go (1)
internal/util/err.go (2)
WrapCraneErr(67-73)ErrorSystem(40-40)
🔇 Additional comments (2)
internal/crun/crun.go (2)
30-30: LGTM - Cobra import is used.The Cobra import is properly utilized in the
MainCrunfunction signature at line 1035.
1137-1141: LGTM - Hostname capture is correctly implemented.The hostname is properly captured with appropriate error handling. The variable naming follows Go conventions and the error wrapping uses the correct error code.
Summary by CodeRabbit
New Features
Chores
✏️ Tip: You can customize this high-level summary in your review settings.