-
Notifications
You must be signed in to change notification settings - Fork 32
feat: Batch input/output/error option with file pattern #747
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
📝 WalkthroughWalkthroughAdds IoMeta and LicenseInfo to protobufs; renames TaskStatus value Configured→Starting; propagates io_meta/sh_script and per-task resource maps through Ctld/Craned flows; replaces SupervisorKeeper with SupervisorStub and introduces StepInstance and per-task supervisor/cgroup/I/O lifecycles; refactors cgroup ID parsing and adds util GetHostname. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Client as Client/API
participant Scheduler as TaskScheduler
participant Ctld as Ctld
participant Craned as Craned
participant SupervisorStub as SupervisorStub
participant Supervisor as Supervisor/TaskManager
Note over Client,Scheduler: Submit step/task (includes io_meta, sh_script)
Client->>Scheduler: SubmitStep(StepToCtld with io_meta, sh_script)
Scheduler->>Scheduler: HandleUnsetOptionalInStepToCtld (default IO fields)
Scheduler->>Ctld: AcquireStepAttributes / Queue step
Ctld->>Craned: StepToD (includes io_meta, sh_script, task_res_map)
Craned->>SupervisorStub: InitAndGetRecoveredMap / Create SupervisorStub
Craned->>SupervisorStub: Spawn/ExecuteStep (per-step)
SupervisorStub->>Supervisor: ExecuteStep → spawn per-task instances
Supervisor->>CforedClient: Setup I/O/X11 forwarding (local IDs, FwdRequest queue)
Supervisor->>CgroupManager: Allocate per-step / per-task cgroups
Supervisor->>Ctld: StepStatusChangeAsync (final status/exit when ready)
Estimated code review effort🎯 5 (Critical) | ⏱️ ~180 minutes Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/Utilities/PublicHeader/OS.cpp (1)
277-285: Consider refactoring GetNodeInfo to use this function.The
GetNodeInfofunction (lines 36-42) contains duplicategethostname()logic. You could refactor it to callGetHostname()instead:bool GetNodeInfo(NodeSpecInfo* info) { if (!info) return false; - char hostname[HOST_NAME_MAX + 1]; - if (gethostname(hostname, sizeof(hostname)) != 0) { - int err = errno; - fmt::print(stderr, "gethostname failed: errno={} ({})\n", err, - strerror(err)); + auto hostname_result = GetHostname(); + if (!hostname_result) { return false; } + std::string hostname = std::move(*hostname_result);This would centralize the hostname retrieval logic and improve consistency. Based on learnings, you may prefer to defer this until more usage patterns emerge.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
protos/PublicDefs.proto(4 hunks)src/CraneCtld/CtldPublicDefs.cpp(4 hunks)src/CraneCtld/CtldPublicDefs.h(2 hunks)src/CraneCtld/DbClient.cpp(4 hunks)src/CraneCtld/TaskScheduler.cpp(3 hunks)src/CraneCtld/TaskScheduler.h(1 hunks)src/Craned/Supervisor/TaskManager.cpp(4 hunks)src/Craned/Supervisor/TaskManager.h(2 hunks)src/Utilities/PublicHeader/OS.cpp(1 hunks)src/Utilities/PublicHeader/include/crane/OS.h(1 hunks)
🧰 Additional context used
🧠 Learnings (13)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/SupervisorPublicDefs.h:32-37
Timestamp: 2025-05-26T11:06:28.796Z
Learning: The user L-Xiafeng prefers to defer refactoring duplicate definitions until they become a larger pattern in the codebase, rather than addressing individual instances immediately.
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/CtldPublicDefs.hsrc/CraneCtld/TaskScheduler.cppsrc/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-12-08T08:11:40.332Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.332Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cppsrc/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.
Applied to files:
src/CraneCtld/TaskScheduler.h
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/Utilities/PublicHeader/OS.cpp
📚 Learning: 2025-05-25T04:11:27.740Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:396-396
Timestamp: 2025-05-25T04:11:27.740Z
Learning: In TaskManager.cpp, GetCrunMeta() calls don't need null checks because they're only called in contexts where the task is guaranteed to be a CRUN task (e.g., SetupChildProcessCrunX11_ is only called when step->IsCrun() && x11() conditions are met), ensuring the metadata will always be valid.
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/CtldPublicDefs.hsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/CraneCtld/TaskScheduler.h
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/CraneCtld/TaskScheduler.h
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-25T04:08:03.273Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/Supervisor.cpp:68-69
Timestamp: 2025-05-25T04:08:03.273Z
Learning: In the Crane supervisor component (src/Craned/Supervisor/), the supervisor process communicates with Craned through STDOUT using protobuf messages. The supervisor must not send any information to STDOUT before sending the "ready" message to Craned, as this would interfere with the inter-process communication protocol. Therefore, adding logging statements that might write to STDOUT before the ready message is sent could break the communication.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-04-15T02:17:20.669Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 477
File: src/Craned/TaskManager.cpp:781-798
Timestamp: 2025-04-15T02:17:20.669Z
Learning: The user prefers to implement separate stderr pattern handling for crun tasks only when there's a specific requirement for it, rather than implementing it proactively. The current implementation only supports redirecting stdout to a file with the `--output` option.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
🧬 Code graph analysis (5)
src/Utilities/PublicHeader/include/crane/OS.h (1)
src/Utilities/PublicHeader/OS.cpp (2)
GetHostname(277-285)GetHostname(277-277)
src/CraneCtld/TaskScheduler.h (2)
src/CraneCtld/TaskScheduler.cpp (2)
HandleUnsetOptionalInStepToCtld(4276-4284)HandleUnsetOptionalInStepToCtld(4276-4277)src/CraneCtld/CtldPublicDefs.h (2)
StepInCtld(647-671)StepInCtld(673-1098)
src/CraneCtld/CtldPublicDefs.h (1)
src/Craned/Supervisor/TaskManager.cpp (6)
IsBatch(74-76)IsBatch(74-74)IsCalloc(87-90)IsCalloc(87-87)IsCrun(82-85)IsCrun(82-82)
src/CraneCtld/TaskScheduler.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(160-160)
src/Craned/Supervisor/TaskManager.cpp (2)
src/Utilities/PublicHeader/OS.cpp (2)
GetHostname(277-285)GetHostname(277-277)src/CraneCtld/CtldPublicDefs.cpp (1)
result(66-119)
🪛 Cppcheck (2.18.0)
src/Craned/Supervisor/TaskManager.cpp
[warning] 330-330: Iterators to containers from different expressions 'input' and 'groups[0]' are used together.
(mismatchingContainerExpression)
🔇 Additional comments (31)
src/Utilities/PublicHeader/include/crane/OS.h (1)
84-85: LGTM!The function declaration is well-placed and uses the modern
CraneExpectedpattern for error handling.src/Utilities/PublicHeader/OS.cpp (1)
277-285: Implementation is correct and well-structured.The function properly handles errors using
CraneExpected, uses an appropriately sized buffer, and follows the logging conventions of the codebase.src/CraneCtld/DbClient.cpp (4)
1883-1883: LGTM!The script field now correctly sources from the unified
task_to_ctld.sh_script()location, consistent with the PR's goal to centralize IO-related metadata.
1911-1912: LGTM!The type check now correctly extracts container metadata only for Container tasks. The script is sourced from the unified location on line 1908, and the Batch-specific path has been appropriately removed as part of the consolidation.
2006-2006: LGTM!The script field now sources from the unified
step->StepToCtld().sh_script()location, eliminating the conditional Batch-specific path and aligning with the centralized IO metadata design.
2076-2076: LGTM!The script field now sources from the unified
step_to_ctld.sh_script()location, consistent with the refactoring to centralize script content and IO metadata.src/CraneCtld/TaskScheduler.h (1)
899-899: LGTM! Consistent addition for step-level IO metadata handling.The new
HandleUnsetOptionalInStepToCtldmethod mirrors the existingHandleUnsetOptionalInTaskToCtldpattern and properly handles default values for optional IO metadata fields in steps.src/CraneCtld/CtldPublicDefs.cpp (3)
399-400: LGTM! Clear documentation of daemon step behavior.The comments correctly document that daemon steps don't require batch/interactive/IO metadata or script content, which helps maintain clarity in the step initialization logic.
682-685: LGTM! Correct IO metadata and script propagation.The code properly checks for
has_io_meta()before copying, and unconditionally copiessh_script(which is safe for string fields). This ensures IO configuration and script content are correctly propagated from job to primary step.
806-810: LGTM! Consistent propagation to StepToD.The IO metadata and script handling mirrors the pattern in
InitPrimaryStepFromJob(lines 682-685), maintaining consistency in how these fields are transferred from StepToCtld to StepToD.src/CraneCtld/CtldPublicDefs.h (2)
550-560: LGTM! Useful helper methods following established patterns.The new
IsBatch(),IsCalloc(), andIsCrun()methods provide convenient type checking for steps and follow the same pattern as the existingTaskInCtldhelpers (lines 828-834). The logic correctly identifies step types by checking both thetypefield and theinteractive_typesubfield where applicable.Note: Similar helpers exist in
StepInstance(src/Craned/Supervisor/TaskManager.cpp lines 73-89), creating some code duplication. Based on learnings, you prefer to defer refactoring duplicates until they become a larger pattern.
835-839: LGTM! Completes the helper method set for TaskInCtld.The
IsCrun()addition toTaskInCtldcomplements the existingIsCalloc()method and follows the same implementation pattern. This provides symmetry with the step-level helpers added above.src/CraneCtld/TaskScheduler.cpp (3)
2828-2836: LGTM! Proper integration of the new handler with error propagation.The chained validation flow correctly:
- Initializes optional IO fields via
HandleUnsetOptionalInStepToCtld- Acquires step attributes
- Checks step validity
- Propagates the first error encountered via
std::unexpected
4078-4087: LGTM! Migration from batch_meta to io_meta for batch tasks.The refactoring correctly moves the
open_mode_appenddefault initialization frombatch_metato the newio_metastructure, aligning with the centralized IO metadata pattern.
4276-4284: Verify the conditional logic difference between task and step handlers.The task handler checks
task->IsBatch()before setting IO defaults, but the step handler checksstep->StepToCtld().has_io_meta(). This means:
- Tasks: Defaults are set for all batch tasks (even if io_meta is empty)
- Steps: Defaults are only set if io_meta is explicitly provided
Is this intentional? If a batch step doesn't explicitly provide
io_meta, it won't get theopen_mode_appenddefault applied.Consider whether the step handler should also check the step type (e.g.,
step->IsBatch()if such a method exists) for consistency:CraneExpected<void> TaskScheduler::HandleUnsetOptionalInStepToCtld( StepInCtld* step) { - if (step->StepToCtld().has_io_meta()) { + if (step->IsBatch() && step->StepToCtld().has_io_meta()) { auto* io_meta = step->MutableStepToCtld()->mutable_io_meta(); if (!io_meta->has_open_mode_append()) io_meta->set_open_mode_append(g_config.JobFileOpenModeAppend); } return {}; }protos/PublicDefs.proto (5)
162-163: LGTM! IO metadata centralization at the task level.Moving
io_metaandsh_scripttoTaskToCtldallows unified access to IO configuration regardless of task type, while still maintaining the type-specific payload via theoneoffield.
242-244: LGTM! Consistent structure with TaskToCtld.The step message mirrors the task message structure for IO metadata, enabling consistent handling across task and step submission paths.
324-325: LGTM! IO metadata propagated to daemon communication.Adding
io_metaandsh_scripttoStepToDensures the IO configuration reaches the craned for execution.
360-366: LGTM! Well-structured IO metadata message.The
TaskIoMetamessage appropriately:
- Uses
optionalforopen_mode_appendto enable presence tracking- Provides fields for input/output/error file patterns as indicated by the PR title
368-370: Verify deployment compatibility for this breaking change.
BatchTaskAdditionalMetahas been significantly restructured:
- Removed:
sh_script,open_mode_append(moved toTaskIoMeta/parent)- Changed:
interpreteris now field 1Ensure all components (CraneCtld, Craned, clients) are deployed atomically. Rolling deployments with mixed versions could cause data loss or parsing failures for batch task metadata.
src/Craned/Supervisor/TaskManager.h (3)
173-175: LGTM: IO pattern fields unified in TaskInstanceMeta.Moving these fields from BatchInstanceMeta to TaskInstanceMeta aligns with the PR's goal of centralizing IO-related metadata. This enables IO pattern handling for both batch and interactive tasks.
178-180: LGTM: BatchInstanceMeta correctly inherits IO pattern fields.The inheritance structure ensures BatchInstanceMeta retains access to the IO pattern fields now defined in the parent TaskInstanceMeta.
377-379: LGTM: ParseFilePathPattern_ signature enhanced with is_stdout flag.The additional
is_stdoutparameter enables differentiated handling for stdout patterns (which get defaults) versus stderr/stdin patterns (which remain empty when unspecified).src/Craned/Supervisor/TaskManager.cpp (8)
53-53: LGTM: Unified script access simplifies code.Using the unified
sh_script()field removes the need for type-specific branching, aligning with the PR's metadata reorganization.
256-269: LGTM: Default path logic correctly differentiates stdout from other IO patterns.When the pattern is empty, stdout receives a default path (
Crane-<JobID>.<StepID>.outin cwd), while stdin/stderr remain empty. This aligns with common batch system behavior.
271-275: Clarify the backslash escape handling.The code removes all backslashes from the pattern and returns immediately. This appears to be an escape mechanism, but it's unclear if this is the intended behavior:
- Should
\\become an empty string?- Should this support escaping specific characters like
\%to produce a literal%?Can you clarify the intended behavior of backslash handling in file patterns? The current implementation removes all backslashes, which may not provide the flexibility users expect for escaping special characters.
318-367: Address iterator mismatch warning from static analysis.The static analysis tool flags line 330 for using iterators from different expressions (
inputandgroups[0]). While this code likely works correctly with RE2's API, the warning suggests potential undefined behavior.Verify that the RE2 library guarantees
groups[0]iterators are valid sub-ranges ofinput. The code assumesgroups[0].begin()points into the same string asinput.begin(), which should be documented in RE2's API contract.As per static analysis hints from Cppcheck.
571-588: LGTM: Stdin handling properly implemented.The code correctly handles two cases:
- Empty stdin pattern: Closes STDIN_FILENO for batch tasks to prevent blocking on input
- Non-empty pattern: Opens the specified input file and redirects it to STDIN_FILENO
The logging provides clear visibility into stdin configuration.
593-621: LGTM: Enhanced stdout/stderr logging improves debuggability.The task-specific logging messages clearly indicate which file is being used for stdout and stderr redirection, making troubleshooting easier.
1264-1279: LGTM: IO pattern parsing correctly uses unified io_meta() and is_stdout flag.The code properly:
- Uses
io_meta()for all three patterns (output, error, input) instead of type-specific metadata- Sets
is_stdout=trueonly for output_file_pattern to enable default path generation- Sets
is_stdout=falsefor error and input patterns, which remain empty when unspecified- Conditionally parses error and input patterns only when non-empty
277-290: GetHostname() implementation has adequate error handling.The
util::os::GetHostname()function properly handles edge cases: it uses a correctly-sized buffer (HOST_NAME_MAX + 1), checks the gethostname() return value, logs errors with errno details, and returns a CraneExpected type that signals failures. When hostname resolution fails in TaskManager.cpp, the code correctly checks for the error condition withif (!hostname_expt)and falls back to returning the pattern unmodified. This is intentional design—the function does not leave the system in an undefined state. While placeholders like%Nmay remain unreplaced on hostname resolution failure, this is a known limitation of the current approach rather than a defect in the GetHostname() implementation itself.
| auto node_it = std::ranges::find(grpc_nodelist, g_config.CranedIdOfThisNode); | ||
| if (node_it == grpc_nodelist.end()) { | ||
| CRANE_ERROR("Failed to find current node {} in nodelist", | ||
| g_config.CranedIdOfThisNode); | ||
| return path_to_parse; | ||
| } | ||
| int node_index = std::distance(grpc_nodelist.begin(), node_it); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handle missing node in nodelist gracefully.
If the current node is not found in the nodelist (line 285), an error is logged but the function returns path_to_parse with unreplaced %n placeholders. This could lead to file paths with literal %n in them.
Consider handling this case more gracefully:
auto node_it = std::ranges::find(grpc_nodelist, g_config.CranedIdOfThisNode);
if (node_it == grpc_nodelist.end()) {
CRANE_ERROR("Failed to find current node {} in nodelist",
g_config.CranedIdOfThisNode);
- return path_to_parse;
+ // Continue with node_index = 0 as fallback
+ node_index = 0;
+ } else {
+ node_index = std::distance(grpc_nodelist.begin(), node_it);
}
- int node_index = std::distance(grpc_nodelist.begin(), node_it);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| auto node_it = std::ranges::find(grpc_nodelist, g_config.CranedIdOfThisNode); | |
| if (node_it == grpc_nodelist.end()) { | |
| CRANE_ERROR("Failed to find current node {} in nodelist", | |
| g_config.CranedIdOfThisNode); | |
| return path_to_parse; | |
| } | |
| int node_index = std::distance(grpc_nodelist.begin(), node_it); | |
| auto node_it = std::ranges::find(grpc_nodelist, g_config.CranedIdOfThisNode); | |
| int node_index; | |
| if (node_it == grpc_nodelist.end()) { | |
| CRANE_ERROR("Failed to find current node {} in nodelist", | |
| g_config.CranedIdOfThisNode); | |
| // Continue with node_index = 0 as fallback | |
| node_index = 0; | |
| } else { | |
| node_index = std::distance(grpc_nodelist.begin(), node_it); | |
| } |
🤖 Prompt for AI Agents
In src/Craned/Supervisor/TaskManager.cpp around lines 284 to 290, when the
current node is not found the function logs an error and returns path_to_parse
with unreplaced %n placeholders; instead, log a warning about the missing node,
set a safe fallback node_index (e.g. 0) to use for placeholder replacement, and
proceed with the normal %n replacement so the returned path contains no literal
"%n" (ensure the warning includes the missing node id and that we continue
rather than returning early).
Signed-off-by: Li Junlin <[email protected]> fix: compile Signed-off-by: Li Junlin <[email protected]> fix: AI comments Signed-off-by: Li Junlin <[email protected]> fix Signed-off-by: Li Junlin <[email protected]> fix: Comments Signed-off-by: Li Junlin <[email protected]> style: auto format with clang-format. bugfix X11 multi conn Signed-off-by: Li Junlin <[email protected]> fix: X11 multi conn Signed-off-by: Li Junlin <[email protected]> WIP: X11 Signed-off-by: Li Junlin <[email protected]> fix: X11 forwarding Signed-off-by: Li Junlin <[email protected]> fix: Subprocess cg Signed-off-by: Li Junlin <[email protected]> fix: Only init cfored for ia task Signed-off-by: Li Junlin <[email protected]> fix: Exit reason Signed-off-by: Li Junlin <[email protected]> fix: Deadlock Signed-off-by: Li Junlin <[email protected]> style: auto format with clang-format. fix Signed-off-by: Li Junlin <[email protected]> fix: tasks step exit code Signed-off-by: Li Junlin <[email protected]> fix: Share mem for tasks in a same step on node Signed-off-by: Li Junlin <[email protected]> feat: Support task execution Signed-off-by: Li Junlin <[email protected]> fix: Cgroup recovery Signed-off-by: Li Junlin <[email protected]> refactor: Use std::chrono_literals Signed-off-by: Li Junlin <[email protected]> refactor Signed-off-by: Li Junlin <[email protected]> refactor: PAM Signed-off-by: Li Junlin <[email protected]> move user cgroup of daemon step to Supervisor Signed-off-by: Li Junlin <[email protected]> Update TaskScheduler.cpp feat: Step cgroup recovery Signed-off-by: Li Junlin <[email protected]> refactor: Supervisor step status. Signed-off-by: Li Junlin <[email protected]> refactor: Step instance Signed-off-by: Li Junlin <[email protected]> feat: Add CPU binding functionality for cgroup v1 and v2 Signed-off-by: Li Junlin <[email protected]>
Signed-off-by: junlinli <[email protected]>
Signed-off-by: junlinli <[email protected]>
Signed-off-by: Li Junlin <[email protected]> fix: submit time,start time, supervisor coredump Signed-off-by: Li Junlin <[email protected]> Signed-off-by: junlinli <[email protected]>
Signed-off-by: junlinli <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (7)
src/CraneCtld/TaskScheduler.cpp (1)
1070-1092: Fix capture-by-reference bug in AllocJobs lambdaIn the
craned_alloc_job_maploop, the lambda passed tom_rpc_worker_pool_->detach_taskcaptures by reference ([&]) and usescraned_idandjobs, which are loop-local variables:for (auto&& iter : craned_alloc_job_map) { CranedId const& craned_id = iter.first; std::vector<crane::grpc::JobToD>& jobs = iter.second; m_rpc_worker_pool_->detach_task([&] { auto stub = g_craned_keeper->GetCranedStub(craned_id); // uses jobs... }); }All detached tasks will see the last values of
craned_idandjobs, so all RPCs and failure bookkeeping target the same node and job set, which is a correctness and scheduling bug.Capture
craned_idandjobsby value in the lambda instead:Proposed fix
- std::latch alloc_job_latch(craned_alloc_job_map.size()); + std::latch alloc_job_latch(craned_alloc_job_map.size()); for (auto&& iter : craned_alloc_job_map) { CranedId const& craned_id = iter.first; std::vector<crane::grpc::JobToD>& jobs = iter.second; - m_rpc_worker_pool_->detach_task([&] { + m_rpc_worker_pool_->detach_task([&, craned_id, jobs] { auto stub = g_craned_keeper->GetCranedStub(craned_id); CRANE_TRACE("Send AllocJobs for {} tasks to {}", jobs.size(), craned_id); if (stub == nullptr || stub->Invalid()) { CRANE_TRACE( "AllocJobs for jobs [{}] to {} failed: Craned down.", absl::StrJoin( jobs | std::views::transform( [](const crane::grpc::JobToD& job_to_d) { return std::to_string(job_to_d.job_id()); }), ","), craned_id); absl::MutexLock lk(&thread_pool_mtx); for (const auto& job_to_d : jobs) failed_job_id_set.emplace(job_to_d.job_id()); alloc_job_latch.count_down(); return; } // ... rest unchanged ... }); }Also applies to: 1094-1152, 1153-1157, 1159-1210
src/Craned/Supervisor/TaskManager.cpp (1)
725-752: Closestdin_fdafterdup2in batch FD setupIn
ProcInstance::SetChildProcBatchFd_, whenstdin_file_pathis non-empty you open the input file,dup2it ontoSTDIN_FILENO, but never closestdin_fd. Since this happens in the child beforeexecv, that extra FD will remain open acrossexec, which is unnecessary and slightly increases per-process FD usage.After a successful
dup2(stdin_fd, STDIN_FILENO), it’s safe toclose(stdin_fd).Proposed minimal change
int stdin_fd = open(stdin_file_path.c_str(), O_RDONLY); if (stdin_fd == -1) { fmt::println(stderr, "[Task #{}] Subprocess error: open stdin: {}. {}\n", task_id, stdin_file_path, strerror(errno)); return CraneErrCode::ERR_SYSTEM_ERR; } fmt::println(stderr, "[Task #{}] Subprocess stdin: {}", task_id, stdin_file_path); // STDIN -> input file - if (dup2(stdin_fd, STDIN_FILENO) == -1) std::exit(1); + if (dup2(stdin_fd, STDIN_FILENO) == -1) std::exit(1); + close(stdin_fd);src/Craned/Core/CranedForPamServer.cpp (1)
242-249: Ensure PAM Unix socket permissions are actually tightened
chmod(listen_conf.UnixSocketForPamListenAddr.c_str(), 0600);is currently called beforebuilder.BuildAndStart(), i.e., likely before gRPC creates the Unix socket file. In that case,chmodwill fail and the socket keeps its default permissions, but the error is ignored.To guarantee restrictive permissions:
- Move the
chmodcall afterm_server_ = builder.BuildAndStart();, and- Check the return value and log on failure.
This keeps the PAM socket properly locked down to root as intended by the comment.
Illustrative adjustment
builder.RegisterService(m_service_impl_.get()); - // Pam run as root - chmod(listen_conf.UnixSocketForPamListenAddr.c_str(), 0600); - - m_server_ = builder.BuildAndStart(); + m_server_ = builder.BuildAndStart(); + // Pam runs as root; restrict Unix socket permissions. + if (chmod(listen_conf.UnixSocketForPamListenAddr.c_str(), 0600) != 0) { + CRANE_WARN("Failed to chmod PAM Unix socket {}: {}", + listen_conf.UnixSocketForPamListenAddr, strerror(errno)); + }src/CraneCtld/CtldPublicDefs.cpp (1)
467-523: Daemon step status change: interactive-meta handling and primary-step fallbackThe reworked
DaemonStepInCtld::StepStatusChangenicely covers:
- Configuring→Running transition with creation of a primary step and population of
craned_step_alloc_map.- Configure-failed path that frees daemon resources and now returns a
(status, exit_code)pair even when the primary step was never created.- Interactive jobs: invoking
cb_step_cancelandcb_step_completedwhen daemon configuration fails.Two small points to consider:
- In the interactive configure-failed branch you copy
InteractiveMetaout ofjob->meta(auto meta = std::get<InteractiveMeta>(job->meta);). If any later code relies onhas_been_cancelled_on_front_endinjob->meta, those flag updates won’t persist. If you need the flag later, switch this to a reference (auto& meta = std::get<InteractiveMeta>(job->meta);).- When the daemon finishes successfully but
PrimaryStepStatus()is stillInvalid, you now return the daemon’s own status/exit code, which is a sensible fallback for “daemon-only” jobs; just ensure callers tolerate both primary and daemon-originated pairs.Also applies to: 547-600
src/Craned/Supervisor/CforedClient.cpp (1)
117-197: X11 connection bookkeeping: consider attaching X11FdInfo to the handle and avoid double-closingTwo small X11-related issues:
- In
SetupX11forwarding_, you constructx11_fd_infoand store it inm_x11_fd_info_map_, butsock->data<X11FdInfo>()is never initialized. Theend_event/error_eventcallbacks therefore see a nulldata<X11FdInfo>()and never actually setsock_stopped. Either dropsock_stoppedentirely or callsock->data(x11_fd_info);when accepting the connection so the callbacks operate on real state.- In the TASK_X11_INPUT handling path, you pass
eofasclose_fdtoWriteStringToFd_and also callx11_fd_info->sock->close()on EOF. That results in the underlying socket fd being closed twice (once byclose(fd), once by libuv/uvw). It would be safer to passclose_fd = falsefor X11 and rely solely onsock->close()for lifetime management.Also applies to: 724-783
src/Craned/Common/CgroupManager.cpp (2)
494-552: Fix file descriptor leak and complete BPF recovery logic in RecoverFromResInNodeRecoverFromResInNode has two issues:
File descriptor leak: Opens
cgroup_fdat line 1468 but never closes it before returning at line 1502. Compare withSetDeviceAccess(line 1376) which properly callsclose(cgroup_fd)before return.Incomplete recovery: Only populates
m_cgroup_bpf_devicesand setsm_bpf_attached_ = true, but doesn't actually restore device filters to the eBPF map or attach the program. UnlikeSetDeviceAccess, it skips thebpf_map__update_elemandbpf_prog_attachcalls, meaning the recovered cgroup has no active device restrictions.Add
close(cgroup_fd);before the finalreturn true;, and consider mirroring the fullSetDeviceAccesslogic (updating the BPF map and conditionally attaching the program) to ensure recovery actually restores device access controls.Point 2 about
GetJobBpfMapCgroupsV2_usingCgroupStrParsedIdskeys is correct and theCRANE_ASSERTis appropriate.
279-308: Fix RE2 pattern matching to use plain types instead of std::optional pointers
RE2::FullMatchdoes not supportstd::optional<T>*arguments—it expects plain types or pointers to plain types. InParseIdsFromCgroupStr_, capture into plain temporary variables (std::string,uint32_t) instead, then assign into the optionals after a successful match:- std::optional<job_id_t> job_id; - std::optional<step_id_t> step_id; - std::optional<task_id_t> task_id; - std::optional<std::string> system_or_user; + std::string job_id_str, step_id_str, system_or_user_str, task_id_str; CgroupStrParsedIds parsed_ids{}; - if (RE2::FullMatch(cgroup_str, *cg_pattern, &job_id, &step_id, - &system_or_user, &task_id)) { - std::get<CgConstant::KParsedJobIdIdx>(parsed_ids) = job_id; - std::get<CgConstant::KParsedStepIdIdx>(parsed_ids) = step_id; - std::get<CgConstant::KParsedSystemFlagIdx>(parsed_ids) = - system_or_user.has_value() && system_or_user.value() == "system"; - std::get<CgConstant::KParsedTaskIdIdx>(parsed_ids) = task_id; + if (RE2::FullMatch(cgroup_str, *cg_pattern, &job_id_str, &step_id_str, + &system_or_user_str, &task_id_str)) { + if (!job_id_str.empty()) { + auto& job_id_opt = std::get<CgConstant::KParsedJobIdIdx>(parsed_ids); + absl::SimpleAtoi(job_id_str, &job_id_opt.emplace()); + } + if (!step_id_str.empty()) { + auto& step_id_opt = std::get<CgConstant::KParsedStepIdIdx>(parsed_ids); + absl::SimpleAtoi(step_id_str, &step_id_opt.emplace()); + } + if (!task_id_str.empty()) { + auto& task_id_opt = std::get<CgConstant::KParsedTaskIdIdx>(parsed_ids); + absl::SimpleAtoi(task_id_str, &task_id_opt.emplace()); + } + if (!system_or_user_str.empty()) + std::get<CgConstant::KParsedSystemFlagIdx>(parsed_ids) = + (system_or_user_str == "system");This ensures parsed IDs are correctly populated before being used in
GetIdsFromCgroupV1_,GetIdsFromCgroupV2_,GetCgInoJobIdMapCgroupV2_,GetIdsByPid, andGetJobBpfMapCgroupsV2_.
♻️ Duplicate comments (1)
src/Craned/Supervisor/TaskManager.cpp (1)
457-571: Pattern parsing behavior for missing node index and cppcheck warningTwo points in
ProcInstance::ParseFilePathPattern_:
When the current node is not found in
nodelist, the function logs an error and returnspath_to_parseunchanged. This leaves%n(and potentially other%specifiers) in the final path. If this is not intended, you could fall back to a sane default index (e.g. 0) and still perform replacement so the returned path is always fully expanded. This matches the prior suggestion in earlier review feedback.Cppcheck’s mismatchingContainerExpression warning on the
std::distance(input.begin() + cursor, groups[0].begin())line is a false positive in this RE2/StringPiece context, but if you want to silence it you could rewrite the prefix-length computation using pointer arithmetic oninput.data()/groups[0].data()instead of iterators, which keeps the intent but avoids mixing iterator sources.
🧹 Nitpick comments (19)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)
543-549: VariantVisitor utility is appropriate for centralizing std::variant visitationThe
VariantVisitorhelper and its deduction guide are idiomatic and let call sites (e.g., inTaskScheduler.cpp) compose visitors without local boilerplate. No issues spotted with this addition.src/CraneCtld/TaskScheduler.h (1)
928-928: Signature change to CommonStepInCtld matches implementation usage*Updating
TerminateRunningStepNoLock_to acceptCommonStepInCtld*(rather thanStepInCtld*) is consistent with the implementation, which relies onCommonStepInCtld-specific fields for interactive steps, and matches call sites that only pass common steps.src/CraneCtld/TaskScheduler.cpp (3)
1668-1693: Interactive-step termination guard and queueing logic look correct
TerminateRunningStepNoLock_correctly avoids enqueueing duplicate cancel requests for interactive steps by togglingia_meta.has_been_terminated_on_craned, while always schedulingCancelRunningTaskQueueElemfor non-interactive steps across all execution nodes. The function assumes callers hold the appropriate locks (perNoLock_naming), which is satisfied in the current call sites.
2552-2563: VariantVisitor-based std::visit cleans up cancellation dispatchUsing the shared
VariantVisitorhelper instd::visitforCancelTaskQueueElemnicely centralizes handling of pending tasks, running tasks, and pending steps, and keeps ownership transfer for each case clear (pending_task_ptr_vec,running_task_craned_id_map,pending_step_ptr_vec). No functional issues spotted here.
2945-2951: Use of StepStatusChange return value and added trace improves observabilityHandling the new
StepStatusChangereturn type as an optional{TaskStatus, exit_code}pair and loggingCRANE_TRACE("[Job #{}] Completed with status {}.", task_id, job_finished_status.value());before applying status/exit code to the job improves traceability of job completion. The subsequent end-time adjustment logic remains unchanged and still guards against large drifts beyond the time limit.
src/Utilities/PublicHeader/include/crane/String.h (1)
149-167: Template constraints for StepToD and JobStepsToString are reasonableConfining
StepToDRangeIdStringtostd::ranges::range<R>whererange_value_t<R> == crane::grpc::StepToD, and introducingMapRangeplus additional constraints forJobStepsToString, makes these helpers harder to misuse and yields clearer compile-time errors when called with incompatible containers, while preserving existing call patterns in the codebase.src/Craned/Core/SupervisorStub.cpp (1)
35-92: Supervisor recovery via socket directory scan is well-structured
InitAndGetRecoveredMap():
- Filters
step_*.sockentries underkDefaultSupervisorUnixSockDirviaLazyRE2.- Spawns one task per socket on
g_thread_pool, each creating aSupervisorStub, callingCheckStatus(), and either removing dead sockets or inserting{(job_id, step_id) → {pid, status, stub}}into the map under a mutex.- Uses a
std::latchsized tofiles.size()so the function only returns once all probes have completed.This gives a clear, bounded recovery pass with appropriate logging and cleanup; the function’s
CraneExpectederror path only triggers on unexpected filesystem exceptions.src/Craned/Common/CgroupManager.h (1)
53-62: Cpuset controller/file additions and SetCpuBind hooks are coherentAdding
CPUSET_CONTROLLER/CPUSET_CONTROLLER_V2and their correspondingCPUSET_CPUS/CPUSET_CPUS_V2entries toController/ControllerFileand the string tables keeps all enum→string mappings in sync. The new pure-virtualSetCpuBindinCgroupInterfacewith overrides in bothCgroupV1andCgroupV2provides a clear extension point for cpuset binding logic without impacting existing memory/CPU/IO methods.Also applies to: 78-79, 90-91, 142-149, 167-180, 397-399, 431-432, 459-460
src/Craned/Core/CranedServer.cpp (1)
192-193: Unusedsystem_flagvariable.The
system_flagis extracted from the tuple but never used in this function. If it's intentionally unused, consider marking it with[[maybe_unused]]or using a discard pattern.🔎 Proposed fix
- auto [job_id_opt, step_id_opt, system_flag, task_id_opt] = + auto [job_id_opt, step_id_opt, [[maybe_unused]] system_flag, task_id_opt] = pid_to_ids_expt.value();src/Craned/Core/StepInstance.cpp (1)
251-260: Child process: Consider using_exit()instead ofabort()for cgroup migration failure.Using
abort()generates a core dump and SIGABRT, which may be noisy for a controlled failure. Consider using_exit(EXIT_FAILURE)for cleaner shutdown, thoughabort()aligns with the error handling pattern at line 332.src/Craned/Core/StepInstance.h (1)
76-78: TerminateStep is declared but marked as not implemented.Consider adding a TODO comment or removing this declaration if it's not needed for this PR. Having an unimplemented method in the public interface may cause confusion.
Would you like me to open an issue to track the implementation of
TerminateStep, or should this declaration be removed for now?src/CraneCtld/CtldPublicDefs.cpp (2)
155-159: Vector-based CranedIds and node-set reconstruction are coherent with runtime_attr
StepInCtld::SetCranedIdsand the updatedRecoverFromDb/InitFromJobpaths correctly switch tostd::vector<CranedId>for ordering while deriving execution/configuring/running node sets from that vector. The runtime attributes are kept in sync, so downstream queries viaRuntimeAttr().craned_ids()andExecutionNodes()stay consistent.If you ever need deterministic ordering on the node sets for presentation/logging, consider building them as
std::set<CranedId>and then materializing tounordered_setonly where necessary. Based on learnings, refactoring can be deferred until this pattern grows further.Also applies to: 233-288, 364-369
678-739: Common step per-task resource mapping and final status propagationThe new logic in
CommonStepInCtldlooks solid overall:
InitPrimaryStepFromJob:
- Initializes
craned_task_map/task_res_mapfor batch/calloc (single task per node) and crun (per-node ×ntasks_per_node), and zeros per-task memory to avoid per-task memory limiting while keeping step-level memory limits. This aligns with the later change inAllocatableResourceAllocatorto skip 0-valued memory limits.GetStepToD:
- Exposes
task_res_mapvia the newtask_res_mapfield and copies the appropriatebatch_meta/interactive_meta/container_meta, plusio_metaandsh_script, matching the updated proto surface.StepStatusChange:
- Correctly treats
Configuringas “Starting/Failed/Cancelled until all nodes configured” and then transitions toRunning.- Properly tears down orphaned steps and daemon steps on configure failure or completion.
- Uses
AllExecutionStepsFinished()to decide when to return a final(PrimaryStepStatus, PrimaryStepExitCode)pair for the whole task.One behavioral nuance: when a primary step finishes with mixed statuses across nodes, you ultimately set the step’s status/exit code from
PrevErrorStatus()/PrevErrorExitCode(), but you still propagateexit_code(from the last node event) intoSetPrimaryStepExitCode(exit_code). If you intend the task-level exit code to reflect the aggregated step status, it may be safer to usethis->ExitCode()instead. Please double-check which semantics you want at the task level.Also applies to: 794-861, 913-915, 995-1027, 1055-1058
protos/PublicDefs.proto (1)
109-122: IoMeta and new status/metadata fields align well; confirm TaskStatus numbering historyThe introduction of
IoMetaand its use acrossTaskToCtld,StepToCtld, andStepToD(plus moving script content intosh_script) is a clean factoring that matches the Ctld/Supervisor changes and the new per-tasktask_res_mapinStepToD. The newStarting = 8TaskStatus andLicenseInfomessage also look reasonable.One thing to double-check: ensure enum value
8was not previously used for a differentTaskStatusin older releases, so that persisted data or mixed-version components won’t misinterpret existing values asStarting.Also applies to: 135-186, 218-255, 311-350, 362-373, 382-395, 823-829
src/CraneCtld/CtldPublicDefs.h (2)
521-560: Step type predicates and vector-based CranedIds are consistent with Supervisor side
StepInCtld’s newIsBatch/IsCalloc/IsCrunhelpers mirror the Supervisor’sStepInstancelogic and correctly guardinteractive_meta()access behind atype == Interactivecheck. Movingm_craned_ids_to astd::vector<CranedId>and exposing it viaCranedIds()while keeping execution/configuring/running nodes asunordered_setstrikes a good balance between ordered lists (for display/regex) and fast membership checks.If you find yourself often needing just a single “primary” craned id, consider adding a small helper (e.g.
PrimaryCranedId()) instead of directly indexingCranedIds().front()in callers, to encode the invariant once. Based on learnings, this can wait until the pattern appears in more places.Also applies to: 583-597
671-687: Per-task resource bookkeeping in CommonStepInCtld and schedule pathThe new members
task_res_mapandcraned_task_mapinCommonStepInCtld, plus theSchedulePendingStepschanges, correctly:
- Track which local task IDs belong to which craned IDs.
- Build per-task
ResourceInNodeentries (with memory zeroed so that per-task cgroups don’t enforce their own memory limit).- Export this mapping via
StepToD::task_res_map.
AllExecutionStepsFinished()matchingm_steps_.empty() && !m_primary_step_is a simple and correct criterion for deciding when all execution steps are done at the task level.Down the line, you might consider wrapping the
task_id_tallocation (thecur_task_idloops) in a small helper to avoid duplicating the same pattern in bothInitPrimaryStepFromJobandSchedulePendingSteps.Also applies to: 829-845, 913-915, 953-1025
src/Craned/Supervisor/CforedClient.cpp (1)
406-469: Unified forwarding queue and AsyncSendRecvThread_ state machine look correctThe new
FwdRequest+m_task_fwd_req_queue_design inCleanOutputQueueAndWriteToStreamThread_cleanly centralizes all outgoingStreamTaskIORequestpayloads (stdout, X11 connect/output/EOF, exit status) and respects thewrite_pendingflag to avoid overlapping writes. TheAsyncSendRecvThread_changes (thread name, TIMEOUT handling, unregister flow, and use of the new X11 input fields) preserve the existing state machine while enabling the new message types.You could replace the manual
m_mtx_.Lock()/Unlock()in the forwarding state withabsl::MutexLockfor consistency with the rest of the file, but it’s not strictly necessary.Also applies to: 471-607
src/Craned/Supervisor/CforedClient.h (1)
28-37: Forwarding API and FwdRequest variant are well-structuredThe header cleanly exposes the new unified forwarding API (
TaskOutPutForward,TaskX11ConnectForward,TaskX11OutPutForward,TaskX11OutputFinish,TaskProcessStop(exit_code, signaled)) and backs it with a singleFwdRequestvariant/queue, which matches the implementation and proto types.The
x11_id_talias is currently unused within this class; if it’s not used elsewhere, you could remove it or add a brief comment about its intended external use.Also applies to: 56-58, 84-103, 106-114, 123-154, 160-167
src/Craned/Supervisor/TaskManager.h (1)
467-467: Minor grammar fix in comment.🔎 Proposed fix
- // Should called in uvw thread, otherwise data race may happen. + // Should be called in uvw thread, otherwise data race may happen.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (35)
protos/Crane.protoprotos/PublicDefs.protoprotos/Supervisor.protosrc/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/CtldPublicDefs.hsrc/CraneCtld/DbClient.cppsrc/CraneCtld/EmbeddedDbClient.cppsrc/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/TaskScheduler.cppsrc/CraneCtld/TaskScheduler.hsrc/Craned/Common/CgroupManager.cppsrc/Craned/Common/CgroupManager.hsrc/Craned/Core/CMakeLists.txtsrc/Craned/Core/Craned.cppsrc/Craned/Core/CranedForPamServer.cppsrc/Craned/Core/CranedServer.cppsrc/Craned/Core/CtldClient.cppsrc/Craned/Core/JobManager.cppsrc/Craned/Core/JobManager.hsrc/Craned/Core/StepInstance.cppsrc/Craned/Core/StepInstance.hsrc/Craned/Core/SupervisorStub.cppsrc/Craned/Core/SupervisorStub.hsrc/Craned/Supervisor/CforedClient.cppsrc/Craned/Supervisor/CforedClient.hsrc/Craned/Supervisor/Supervisor.cppsrc/Craned/Supervisor/SupervisorPublicDefs.hsrc/Craned/Supervisor/SupervisorServer.cppsrc/Craned/Supervisor/SupervisorServer.hsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.hsrc/Utilities/PublicHeader/OS.cppsrc/Utilities/PublicHeader/include/crane/OS.hsrc/Utilities/PublicHeader/include/crane/PublicHeader.hsrc/Utilities/PublicHeader/include/crane/String.h
💤 Files with no reviewable changes (1)
- src/Craned/Supervisor/SupervisorPublicDefs.h
🚧 Files skipped from review as they are similar to previous changes (2)
- src/Utilities/PublicHeader/include/crane/OS.h
- src/CraneCtld/DbClient.cpp
🧰 Additional context used
🧠 Learnings (41)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/SupervisorPublicDefs.h:32-37
Timestamp: 2025-05-26T11:06:28.796Z
Learning: The user L-Xiafeng prefers to defer refactoring duplicate definitions until they become a larger pattern in the codebase, rather than addressing individual instances immediately.
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/Utilities/PublicHeader/OS.cppsrc/Craned/Core/CMakeLists.txtsrc/Craned/Core/CtldClient.cppsrc/Craned/Core/SupervisorStub.hsrc/Craned/Core/JobManager.hsrc/CraneCtld/CtldPublicDefs.hsrc/CraneCtld/TaskScheduler.hsrc/Craned/Supervisor/Supervisor.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/Craned/Supervisor/CforedClient.hsrc/Craned/Core/CranedServer.cppsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Core/Craned.cppsrc/Craned/Core/JobManager.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-12-08T08:11:40.332Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.332Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.
Applied to files:
src/Craned/Core/StepInstance.hsrc/Craned/Core/CMakeLists.txtsrc/CraneCtld/TaskScheduler.cppsrc/Craned/Core/CranedForPamServer.cppsrc/Craned/Core/SupervisorStub.hsrc/Craned/Core/JobManager.hsrc/CraneCtld/CtldPublicDefs.hsrc/CraneCtld/TaskScheduler.hsrc/Craned/Supervisor/Supervisor.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/Craned/Core/StepInstance.cppsrc/Craned/Core/SupervisorStub.cppsrc/Craned/Core/CranedServer.cppsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Core/Craned.cppsrc/Craned/Core/JobManager.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/Craned/Core/StepInstance.hsrc/Craned/Supervisor/SupervisorServer.hsrc/Craned/Core/CtldClient.cppsrc/Craned/Core/SupervisorStub.hsrc/Craned/Core/JobManager.hsrc/Craned/Supervisor/SupervisorServer.cppsrc/CraneCtld/TaskScheduler.hsrc/Craned/Supervisor/Supervisor.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/Craned/Core/StepInstance.cppprotos/Crane.protosrc/Craned/Core/SupervisorStub.cppsrc/Craned/Core/CranedServer.cppsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Core/Craned.cppsrc/Craned/Core/JobManager.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-25T04:08:03.273Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/Supervisor.cpp:68-69
Timestamp: 2025-05-25T04:08:03.273Z
Learning: In the Crane supervisor component (src/Craned/Supervisor/), the supervisor process communicates with Craned through STDOUT using protobuf messages. The supervisor must not send any information to STDOUT before sending the "ready" message to Craned, as this would interfere with the inter-process communication protocol. Therefore, adding logging statements that might write to STDOUT before the ready message is sent could break the communication.
Applied to files:
src/Craned/Core/StepInstance.hsrc/Craned/Core/CMakeLists.txtsrc/Craned/Core/CtldClient.cppsrc/Craned/Core/SupervisorStub.hsrc/Craned/Core/JobManager.hsrc/Craned/Supervisor/SupervisorServer.cppsrc/Craned/Supervisor/Supervisor.cppsrc/Craned/Core/StepInstance.cppsrc/Craned/Core/SupervisorStub.cppsrc/Craned/Core/CranedServer.cppsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Core/JobManager.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/CraneCtld/EmbeddedDbClient.cppsrc/Craned/Core/CMakeLists.txtsrc/CraneCtld/TaskScheduler.cppsrc/Craned/Core/CranedForPamServer.cppsrc/Craned/Core/CtldClient.cppsrc/Craned/Core/SupervisorStub.hsrc/Craned/Core/JobManager.hsrc/Craned/Supervisor/SupervisorServer.cppsrc/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/CtldPublicDefs.hsrc/CraneCtld/TaskScheduler.hsrc/Craned/Supervisor/Supervisor.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/Craned/Core/StepInstance.cppsrc/Craned/Supervisor/CforedClient.hsrc/Craned/Core/SupervisorStub.cppsrc/Craned/Core/CranedServer.cppsrc/Craned/Supervisor/CforedClient.cppsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Core/Craned.cppsrc/Craned/Core/JobManager.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/CraneCtld/CtldPublicDefs.hsrc/CraneCtld/TaskScheduler.hsrc/Craned/Supervisor/Supervisor.cppsrc/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/Craned/Core/JobManager.hsrc/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/CtldPublicDefs.hsrc/CraneCtld/TaskScheduler.hsrc/Craned/Supervisor/Supervisor.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/Craned/Core/StepInstance.cppsrc/Craned/Supervisor/CforedClient.cppsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Core/Craned.cppsrc/Craned/Core/JobManager.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/Craned/Core/JobManager.hsrc/CraneCtld/CtldPublicDefs.hsrc/CraneCtld/TaskScheduler.hsrc/Craned/Supervisor/Supervisor.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/Craned/Core/StepInstance.cppsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Core/Craned.cppsrc/Craned/Core/JobManager.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-09-22T07:13:20.635Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 625
File: src/CraneCtld/TaskScheduler.cpp:2811-2874
Timestamp: 2025-09-22T07:13:20.635Z
Learning: In the CraneSched codebase, there are latch mechanisms used to ensure variable lifetime safety when capturing references in detached thread pool tasks, allowing reference captures that might otherwise be unsafe.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/Craned/Supervisor/SupervisorServer.hsrc/Craned/Core/CranedForPamServer.cppsrc/Craned/Core/CtldClient.cppsrc/Craned/Core/JobManager.hsrc/Craned/Supervisor/SupervisorServer.cppsrc/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/CtldPublicDefs.hsrc/Craned/Supervisor/Supervisor.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/Craned/Core/StepInstance.cppsrc/Craned/Core/SupervisorStub.cppsrc/Craned/Core/CranedServer.cppsrc/Craned/Supervisor/CforedClient.cppsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-04-02T09:30:13.014Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/JobManager.cpp:141-149
Timestamp: 2025-04-02T09:30:13.014Z
Learning: In JobManager, if a uid exists in m_uid_to_job_ids_map_, its corresponding task_ids set is guaranteed to be non-empty due to the invariant maintained in AllocJobs and FreeJobs methods.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/Craned/Core/CranedServer.cppsrc/Craned/Common/CgroupManager.cpp
📚 Learning: 2025-05-25T04:11:27.740Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:396-396
Timestamp: 2025-05-25T04:11:27.740Z
Learning: In TaskManager.cpp, GetCrunMeta() calls don't need null checks because they're only called in contexts where the task is guaranteed to be a CRUN task (e.g., SetupChildProcessCrunX11_ is only called when step->IsCrun() && x11() conditions are met), ensuring the metadata will always be valid.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/CraneCtld/CtldPublicDefs.hsrc/CraneCtld/TaskScheduler.hsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.
Applied to files:
src/Craned/Core/CtldClient.cppsrc/Craned/Core/SupervisorStub.hsrc/CraneCtld/CtldPublicDefs.hsrc/Craned/Supervisor/CforedClient.hsrc/Craned/Core/SupervisorStub.cppsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-02T07:05:26.012Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:601-602
Timestamp: 2025-05-02T07:05:26.012Z
Learning: In the CraneCtld codebase, the variables m_disconnected_ and m_registered_ in CranedStub class are already defined as std::atomic_bool, making them thread-safe for concurrent access without additional synchronization.
Applied to files:
src/Craned/Core/CtldClient.cppsrc/CraneCtld/CtldPublicDefs.hsrc/Craned/Supervisor/CforedClient.hsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-04-27T06:47:51.751Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.
Applied to files:
src/Craned/Core/CtldClient.cppsrc/Craned/Supervisor/CforedClient.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.
Applied to files:
src/Craned/Core/CtldClient.cppsrc/Craned/Core/SupervisorStub.hsrc/Craned/Supervisor/Supervisor.cppsrc/Craned/Core/SupervisorStub.cppsrc/Craned/Core/Craned.cpp
📚 Learning: 2025-05-26T11:04:56.055Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:72-72
Timestamp: 2025-05-26T11:04:56.055Z
Learning: The CraneSched project prefers using global variables (like `g_supervisor_keeper`) over dependency injection patterns. The team does not follow dependency injection approaches for managing singleton instances.
Applied to files:
src/Craned/Core/CtldClient.cppsrc/Craned/Core/SupervisorStub.hsrc/Craned/Core/SupervisorStub.cppsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Core/Craned.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
Applied to files:
src/Craned/Core/CtldClient.cppsrc/Utilities/PublicHeader/include/crane/String.hsrc/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-09-21T11:26:40.935Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 625
File: src/Craned/Core/SupervisorKeeper.cpp:145-147
Timestamp: 2025-09-21T11:26:40.935Z
Learning: In C++17 and later, Class Template Argument Deduction (CTAD) allows `std::shared_ptr stub = std::make_shared<SupervisorStub>();` syntax to be valid, as the compiler can deduce the template parameter from make_shared. Projects using C++17+ don't need explicit template parameters in this context.
Applied to files:
src/Craned/Core/SupervisorStub.h
📚 Learning: 2025-04-02T10:11:33.562Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/CgroupManager.cpp:685-685
Timestamp: 2025-04-02T10:11:33.562Z
Learning: In the CgroupManager's GetJobBpfMapCgroupsV2 method, the developer has confirmed that cg_ino_job_id_map will always contain the key->cgroup_id element, making the CRANE_ASSERT check appropriate rather than requiring additional error handling.
Applied to files:
src/Craned/Core/JobManager.hsrc/Craned/Common/CgroupManager.hsrc/Craned/Core/CranedServer.cppsrc/Craned/Core/Craned.cppsrc/Craned/Common/CgroupManager.cppsrc/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-09T01:54:39.465Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1509-1517
Timestamp: 2025-05-09T01:54:39.465Z
Learning: The CraneSched project uses C++23, which supports Class Template Argument Deduction (CTAD) for standard containers like std::set and includes ranges support, making std::ranges::views::keys valid without additional headers.
Applied to files:
src/CraneCtld/CtldPublicDefs.hsrc/CraneCtld/CtldPublicDefs.cppsrc/Utilities/PublicHeader/include/crane/String.hsrc/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/CraneCtld/CtldPublicDefs.hsrc/CraneCtld/CtldPublicDefs.cppsrc/Craned/Core/Craned.cppsrc/Craned/Core/JobManager.cpp
📚 Learning: 2025-04-18T02:26:16.113Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched includes assertions in its operator overloads (particularly in operator-=) that verify resources being subtracted are less than or equal to available resources, ensuring no negative values can occur during resource allocation or deallocation operations.
Applied to files:
src/CraneCtld/CtldPublicDefs.hsrc/Craned/Common/CgroupManager.cpp
📚 Learning: 2025-03-31T09:29:40.388Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In CraneSched, `phmap::parallel_flat_hash_set` from the Parallel Hashmap library (`parallel_hashmap/phmap.h`) is used for thread-safe containers. This container implements internal sharding with separate locks for different parts of the hash table, making it inherently thread-safe for concurrent operations without requiring external synchronization.
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-04-18T02:26:16.113Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched contains assertions in its overloaded operators (like -=) that prevent negative values from occurring when resources are allocated or deallocated.
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-03-31T09:29:40.388Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In the CraneSched project, the `m_allowed_certs_` in `VaultClientWrapper` is implemented as a `phmap::parallel_flat_hash_set<std::string>`, which is a thread-safe container designed for concurrent access, making it safe to use in multithreaded contexts without additional synchronization.
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-23T02:32:43.952Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is handled/updated separately before calling SetAllocatedRes, so it does not need to be updated again within the method itself.
Applied to files:
src/CraneCtld/CtldPublicDefs.hsrc/CraneCtld/CtldPublicDefs.cpp
📚 Learning: 2025-05-23T02:32:43.952Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is updated before calling SetAllocatedRes, so it doesn't need to be updated again within the method itself.
Applied to files:
src/CraneCtld/CtldPublicDefs.hsrc/CraneCtld/CtldPublicDefs.cpp
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.
Applied to files:
src/CraneCtld/CtldPublicDefs.hsrc/CraneCtld/TaskScheduler.h
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-04-27T11:52:31.017Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.
Applied to files:
src/CraneCtld/CtldPublicDefs.cppsrc/Craned/Core/Craned.cpp
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.
Applied to files:
src/Craned/Supervisor/CforedClient.hsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.
Applied to files:
src/Craned/Common/CgroupManager.hsrc/Craned/Core/CranedServer.cppsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-08-14T02:56:35.503Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 587
File: src/Craned/Supervisor/CforedClient.cpp:449-454
Timestamp: 2025-08-14T02:56:35.503Z
Learning: In CforedClient::AsyncSendRecvThread_(), the guard `if (state <= State::Registering) { continue; }` in the TIMEOUT branch only prevents premature cleanup when stopping before registration completes, but it doesn't block normal gRPC event processing. The completion queue will still deliver Prepare/Write/Read events that advance the state machine normally.
Applied to files:
src/Craned/Supervisor/CforedClient.cpp
📚 Learning: 2025-05-08T07:38:42.362Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:0-0
Timestamp: 2025-05-08T07:38:42.362Z
Learning: In CraneSched's PMIx integration, the `g_pmix_server->SetupFork()` function must be called in the child process after fork() and before exec() to properly set up the PMIx environment variables.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-04-15T02:17:20.669Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 477
File: src/Craned/TaskManager.cpp:781-798
Timestamp: 2025-04-15T02:17:20.669Z
Learning: The user prefers to implement separate stderr pattern handling for crun tasks only when there's a specific requirement for it, rather than implementing it proactively. The current implementation only supports redirecting stdout to a file with the `--output` option.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-04-02T09:52:59.318Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/CgroupManager.cpp:697-701
Timestamp: 2025-04-02T09:52:59.318Z
Learning: When using bpf_map__get_next_key function, memory must be properly allocated (e.g., with std::make_unique<BpfKey>()) before passing the pointer to the function, as it writes the key to the provided memory address.
Applied to files:
src/Craned/Core/Craned.cppsrc/Craned/Common/CgroupManager.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.
Applied to files:
src/Craned/Core/Craned.cpp
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
Applied to files:
src/Craned/Core/JobManager.cpp
🧬 Code graph analysis (12)
src/CraneCtld/TaskScheduler.cpp (4)
src/Craned/Core/StepInstance.h (1)
exit_code(37-37)src/Craned/Supervisor/CforedClient.h (1)
exit_code(143-143)src/Craned/Supervisor/SupervisorPublicDefs.h (1)
exit_code(37-37)src/Craned/Core/CranedPublicDefs.h (1)
g_config(160-160)
src/Craned/Supervisor/SupervisorServer.h (2)
src/Craned/Core/CranedForPamServer.cpp (2)
MigrateSshProcToCgroup(181-205)MigrateSshProcToCgroup(181-184)src/Craned/Supervisor/SupervisorServer.cpp (2)
MigrateSshProcToCgroup(95-103)MigrateSshProcToCgroup(95-98)
src/Craned/Core/SupervisorStub.h (1)
src/Craned/Core/SupervisorStub.cpp (20)
SupervisorStub(25-29)SupervisorStub(31-33)InitAndGetRecoveredMap(35-92)InitAndGetRecoveredMap(37-37)ExecuteStep(94-106)ExecuteStep(94-94)QueryStepEnv(108-121)QueryStepEnv(108-108)CheckStatus(123-137)CheckStatus(124-124)TerminateTask(139-158)TerminateTask(139-140)ChangeTaskTimeLimit(160-172)ChangeTaskTimeLimit(160-160)MigrateSshProcToCg(174-186)MigrateSshProcToCg(174-174)ShutdownSupervisor(188-198)ShutdownSupervisor(188-188)InitChannelAndStub_(200-204)InitChannelAndStub_(200-200)
src/Craned/Core/JobManager.h (2)
src/Craned/Core/JobManager.cpp (4)
ChangeStepTimelimit(486-511)ChangeStepTimelimit(486-488)QuerySshStepEnvVariables(513-529)QuerySshStepEnvVariables(513-514)src/Craned/Core/CranedServer.cpp (2)
QuerySshStepEnvVariables(295-316)QuerySshStepEnvVariables(295-298)
src/Craned/Supervisor/SupervisorServer.cpp (2)
src/Craned/Core/CranedForPamServer.cpp (2)
MigrateSshProcToCgroup(181-205)MigrateSshProcToCgroup(181-184)src/Craned/Core/SupervisorStub.cpp (2)
ShutdownSupervisor(188-198)ShutdownSupervisor(188-188)
src/CraneCtld/CtldPublicDefs.h (2)
src/Craned/Supervisor/TaskManager.cpp (6)
IsBatch(156-158)IsBatch(156-156)IsCalloc(169-172)IsCalloc(169-169)IsCrun(164-167)IsCrun(164-164)src/CraneCtld/CtldPublicDefs.cpp (8)
SetCranedIds(155-159)SetCranedIds(155-155)SetCranedIds(1128-1131)SetCranedIds(1128-1128)StepStatusChange(467-603)StepStatusChange(468-473)StepStatusChange(863-1059)StepStatusChange(864-869)
src/Craned/Core/StepInstance.cpp (1)
src/Craned/Common/CgroupManager.cpp (4)
AllocateAndGetCgroup(494-552)AllocateAndGetCgroup(495-497)CgroupStrByStepId(283-288)CgroupStrByStepId(283-284)
src/Craned/Supervisor/CforedClient.h (1)
src/Craned/Supervisor/CforedClient.cpp (12)
InitUvX11FwdHandler(109-115)InitUvX11FwdHandler(109-109)TaskProcessStop(725-738)TaskProcessStop(725-726)TaskOutPutForward(749-756)TaskOutPutForward(749-750)TaskX11ConnectForward(757-762)TaskX11ConnectForward(757-757)TaskX11OutPutForward(764-774)TaskX11OutPutForward(764-766)TaskX11OutputFinish(776-783)TaskX11OutputFinish(776-776)
src/Craned/Core/SupervisorStub.cpp (2)
src/Craned/Core/SupervisorStub.h (1)
SupervisorStub(28-61)src/Craned/Supervisor/TaskManager.h (2)
job_id(51-51)step_id(52-52)
src/Craned/Supervisor/TaskManager.cpp (2)
src/Utilities/PublicHeader/OS.cpp (2)
GetHostname(277-285)GetHostname(277-277)src/Craned/Common/CgroupManager.cpp (2)
AllocateAndGetCgroup(494-552)AllocateAndGetCgroup(495-497)
src/Craned/Core/Craned.cpp (4)
src/Craned/Core/StepInstance.cpp (2)
StepInstance(28-33)StepInstance(35-43)src/Craned/Core/SupervisorStub.h (2)
Craned(26-62)SupervisorStub(28-61)src/Craned/Common/CgroupManager.cpp (6)
CgroupStrByParsedIds(296-308)CgroupStrByParsedIds(296-296)CgroupStrByStepId(283-288)CgroupStrByStepId(283-284)CgroupStrByJobId(279-281)CgroupStrByJobId(279-279)src/Craned/Core/SupervisorStub.cpp (4)
SupervisorStub(25-29)SupervisorStub(31-33)InitAndGetRecoveredMap(35-92)InitAndGetRecoveredMap(37-37)
src/Craned/Core/JobManager.cpp (3)
src/Craned/Supervisor/TaskManager.h (2)
job_id(51-51)step_id(52-52)src/CraneCtld/RpcService/CranedKeeper.cpp (3)
lock(410-410)lock(480-480)lock(839-839)src/Craned/Core/CranedServer.cpp (2)
QuerySshStepEnvVariables(295-316)QuerySshStepEnvVariables(295-298)
🪛 Cppcheck (2.19.0)
src/Craned/Core/StepInstance.cpp
[error] 142-142: Found an exit path from function with non-void return type that has missing return statement
(missingReturn)
src/Craned/Supervisor/TaskManager.cpp
[warning] 534-534: Iterators to containers from different expressions 'input' and 'groups[0]' are used together.
(mismatchingContainerExpression)
src/Craned/Core/Craned.cpp
[error] 142-142: Found an exit path from function with non-void return type that has missing return statement
(missingReturn)
| if (!primary_step) { | ||
| CRANE_ERROR( | ||
| "[Job #{}] Daemon step not found when cancelling running job", | ||
| task_id); | ||
| return; | ||
| } | ||
| TerminateRunningStepNoLock_(daemon_step); | ||
| TerminateRunningStepNoLock_(primary_step); | ||
| auto& cancelled_job_steps = *reply.mutable_cancelled_steps(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log message still mentions daemon step while now operating on primary step
In the “cancel jobs with filters” branch, the code now uses task->PrimaryStep() as the representative running step:
auto primary_step = task->PrimaryStep();
if (!primary_step) {
CRANE_ERROR(
"[Job #{}] Daemon step not found when cancelling running job",
task_id);
return;
}
TerminateRunningStepNoLock_(primary_step);The error message still says “Daemon step not found”, which is misleading if a job intentionally has only a daemon or other common steps. Consider updating the log text to reflect the actual expectation (e.g., “primary step not found” or “no representative step found for running job”), but the functional change to use PrimaryStep() is otherwise sound.
🤖 Prompt for AI Agents
In src/CraneCtld/TaskScheduler.cpp around lines 1862 to 1869, the error log
still says "Daemon step not found" but the code now uses task->PrimaryStep();
update the log text to accurately reflect the expectation (for example, "Primary
step not found when cancelling running job" or "No representative step found for
running job") keeping the task_id in the message; make this single-string change
where CRANE_ERROR is invoked so the log matches the actual check and early
return.
| constexpr int KParsedJobIdIdx = 0; | ||
| constexpr int KParsedStepIdIdx = 1; | ||
| constexpr int KParsedSystemFlagIdx = 2; | ||
| constexpr int KParsedTaskIdIdx = 3; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parsed ID tuple refactor is a solid foundation; update nearby docstring
Switching to CgroupStrParsedIds as
using CgroupStrParsedIds =
std::tuple<std::optional<job_id_t>, std::optional<step_id_t>,
bool /*true if system step*/, std::optional<task_id_t>>;plus the KParsed*Idx constants, and updating:
GetIdsFromCgroupV1_/GetIdsFromCgroupV2_to returnstd::set<CgroupStrParsedIds>,GetCgInoJobIdMapCgroupV2_andGetJobBpfMapCgroupsV2_to key onCgroupStrParsedIds,CgroupStrByParsedIdshelper,
gives a much richer and type-safe way to carry job/step/system/task identity through cgroup/BPF plumbing.
One minor nit: the AllocateAndGetCgroup comment still mentions an alloc_mem parameter that no longer exists in the signature; consider updating that doc line to avoid confusion.
Also applies to: 528-553, 610-620, 634-636
🤖 Prompt for AI Agents
In src/Craned/Common/CgroupManager.h around lines 112-116 (also check similar
comments at 528-553, 610-620, 634-636), the AllocateAndGetCgroup docstring still
references an alloc_mem parameter that no longer exists; remove that reference
and update the comment to reflect the current function signature and behavior
(briefly describe returned/allocated resources and ownership semantics if
applicable), and ensure any mention of parsed ID handling aligns with the new
CgroupStrParsedIds tuple usage elsewhere.
| for (auto ids : rn_job_ids_with_cg) { | ||
| auto [job_id_opt, step_id_opt, system_flag, task_id_opt] = ids; | ||
| job_id_t job_id = job_id_opt.value(); | ||
| // For craned, we don't care task cgroup | ||
| if (task_id_opt.has_value()) continue; | ||
| if (step_id_opt.has_value()) { | ||
| step_id_t step_id = step_id_opt.value(); | ||
| // Step cgroup recovery | ||
| if (rn_step_from_ctld.contains({job_id, step_id})) { | ||
| // Step is found in both ctld and cgroup | ||
| auto& step_instance = rn_step_from_ctld.at({job_id, step_id}); | ||
|
|
||
| // For common step cgroup, craned only cares about system cgroup | ||
| // For daemon step cgroup, craned cares about both system and user | ||
| // cgroup | ||
| if (!step_instance->IsDaemonStep() && !system_flag) { | ||
| continue; | ||
| } | ||
| CRANE_DEBUG("Recover existing cgroup for step {} of job #{}", step_id, | ||
| job_id); | ||
| auto cg_expt = CgroupManager::AllocateAndGetCgroup( | ||
| CgroupManager::CgroupStrByStepId(job_id, step_id, system_flag), | ||
| step_instance->step_to_d.res(), true); | ||
| if (cg_expt.has_value()) { | ||
| if (system_flag) | ||
| step_instance->crane_cgroup = std::move(cg_expt.value()); | ||
| continue; | ||
| } else { | ||
| // If the cgroup is found but is unrecoverable, just logging out. | ||
| CRANE_ERROR( | ||
| "Cgroup for step {} of job #{} is found but not " | ||
| "recoverable.", | ||
| step_id, job_id); | ||
| } | ||
| } else { | ||
| // Job is found in cgroup but not in ctld. Cleaning up. | ||
| CRANE_DEBUG("Removing cgroup for step #{}.{} not in Ctld.", job_id, | ||
| step_id); | ||
| clean_invalid_cg(ids); | ||
| } | ||
| } | ||
|
|
||
| // Job cgroup recovery | ||
| if (rn_jobs_from_ctld.contains(job_id)) { | ||
| // Job is found in both ctld and cgroup | ||
| JobInD& job = rn_jobs_from_ctld.at(job_id); | ||
|
|
||
| CRANE_DEBUG("Recover existing cgroup for job #{}", job_id); | ||
| auto cg_expt = CgroupManager::AllocateAndGetCgroup( | ||
| CgroupManager::CgroupStrByJobId(job_id), job.job_to_d.res(), true); | ||
| if (cg_expt.has_value()) { | ||
| // Job cgroup is recovered. | ||
| job.cgroup = std::move(cg_expt.value()); | ||
| } else { | ||
| // If the cgroup is found but is unrecoverable, just logging out. | ||
| CRANE_ERROR("Cgroup for job #{} is found but not recoverable.", job_id); | ||
| } | ||
| continue; | ||
| } | ||
| clean_invalid_cg(ids); | ||
| // Job is found in cgroup but not in ctld. Cleaning up. | ||
| CRANE_DEBUG("Removing cgroup for job #{} not in Ctld.", job_id); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic issue: step entries may trigger incorrect cleanup in job recovery block.
When processing step entries, the code falls through to the job recovery section (lines 181-200). This causes two problems:
-
Double cleanup: For step entries where the step is NOT in ctld (lines 173-178),
clean_invalid_cg(ids)is called for the step, then potentially called again at line 198 with the same stepidsif the parent job is also not in ctld. -
Misleading log: Line 200 logs "Removing cgroup for job" but
idsmay be a step entry.
Consider adding a continue after the step cleanup at line 177, or restructure the loop to handle job entries and step entries separately.
🔎 Proposed fix
} else {
// Job is found in cgroup but not in ctld. Cleaning up.
CRANE_DEBUG("Removing cgroup for step #{}.{} not in Ctld.", job_id,
step_id);
clean_invalid_cg(ids);
+ continue;
}
}
// Job cgroup recovery📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for (auto ids : rn_job_ids_with_cg) { | |
| auto [job_id_opt, step_id_opt, system_flag, task_id_opt] = ids; | |
| job_id_t job_id = job_id_opt.value(); | |
| // For craned, we don't care task cgroup | |
| if (task_id_opt.has_value()) continue; | |
| if (step_id_opt.has_value()) { | |
| step_id_t step_id = step_id_opt.value(); | |
| // Step cgroup recovery | |
| if (rn_step_from_ctld.contains({job_id, step_id})) { | |
| // Step is found in both ctld and cgroup | |
| auto& step_instance = rn_step_from_ctld.at({job_id, step_id}); | |
| // For common step cgroup, craned only cares about system cgroup | |
| // For daemon step cgroup, craned cares about both system and user | |
| // cgroup | |
| if (!step_instance->IsDaemonStep() && !system_flag) { | |
| continue; | |
| } | |
| CRANE_DEBUG("Recover existing cgroup for step {} of job #{}", step_id, | |
| job_id); | |
| auto cg_expt = CgroupManager::AllocateAndGetCgroup( | |
| CgroupManager::CgroupStrByStepId(job_id, step_id, system_flag), | |
| step_instance->step_to_d.res(), true); | |
| if (cg_expt.has_value()) { | |
| if (system_flag) | |
| step_instance->crane_cgroup = std::move(cg_expt.value()); | |
| continue; | |
| } else { | |
| // If the cgroup is found but is unrecoverable, just logging out. | |
| CRANE_ERROR( | |
| "Cgroup for step {} of job #{} is found but not " | |
| "recoverable.", | |
| step_id, job_id); | |
| } | |
| } else { | |
| // Job is found in cgroup but not in ctld. Cleaning up. | |
| CRANE_DEBUG("Removing cgroup for step #{}.{} not in Ctld.", job_id, | |
| step_id); | |
| clean_invalid_cg(ids); | |
| } | |
| } | |
| // Job cgroup recovery | |
| if (rn_jobs_from_ctld.contains(job_id)) { | |
| // Job is found in both ctld and cgroup | |
| JobInD& job = rn_jobs_from_ctld.at(job_id); | |
| CRANE_DEBUG("Recover existing cgroup for job #{}", job_id); | |
| auto cg_expt = CgroupManager::AllocateAndGetCgroup( | |
| CgroupManager::CgroupStrByJobId(job_id), job.job_to_d.res(), true); | |
| if (cg_expt.has_value()) { | |
| // Job cgroup is recovered. | |
| job.cgroup = std::move(cg_expt.value()); | |
| } else { | |
| // If the cgroup is found but is unrecoverable, just logging out. | |
| CRANE_ERROR("Cgroup for job #{} is found but not recoverable.", job_id); | |
| } | |
| continue; | |
| } | |
| clean_invalid_cg(ids); | |
| // Job is found in cgroup but not in ctld. Cleaning up. | |
| CRANE_DEBUG("Removing cgroup for job #{} not in Ctld.", job_id); | |
| } | |
| for (auto ids : rn_job_ids_with_cg) { | |
| auto [job_id_opt, step_id_opt, system_flag, task_id_opt] = ids; | |
| job_id_t job_id = job_id_opt.value(); | |
| // For craned, we don't care task cgroup | |
| if (task_id_opt.has_value()) continue; | |
| if (step_id_opt.has_value()) { | |
| step_id_t step_id = step_id_opt.value(); | |
| // Step cgroup recovery | |
| if (rn_step_from_ctld.contains({job_id, step_id})) { | |
| // Step is found in both ctld and cgroup | |
| auto& step_instance = rn_step_from_ctld.at({job_id, step_id}); | |
| // For common step cgroup, craned only cares about system cgroup | |
| // For daemon step cgroup, craned cares about both system and user | |
| // cgroup | |
| if (!step_instance->IsDaemonStep() && !system_flag) { | |
| continue; | |
| } | |
| CRANE_DEBUG("Recover existing cgroup for step {} of job #{}", step_id, | |
| job_id); | |
| auto cg_expt = CgroupManager::AllocateAndGetCgroup( | |
| CgroupManager::CgroupStrByStepId(job_id, step_id, system_flag), | |
| step_instance->step_to_d.res(), true); | |
| if (cg_expt.has_value()) { | |
| if (system_flag) | |
| step_instance->crane_cgroup = std::move(cg_expt.value()); | |
| continue; | |
| } else { | |
| // If the cgroup is found but is unrecoverable, just logging out. | |
| CRANE_ERROR( | |
| "Cgroup for step {} of job #{} is found but not " | |
| "recoverable.", | |
| step_id, job_id); | |
| } | |
| } else { | |
| // Job is found in cgroup but not in ctld. Cleaning up. | |
| CRANE_DEBUG("Removing cgroup for step #{}.{} not in Ctld.", job_id, | |
| step_id); | |
| clean_invalid_cg(ids); | |
| continue; | |
| } | |
| } | |
| // Job cgroup recovery | |
| if (rn_jobs_from_ctld.contains(job_id)) { | |
| // Job is found in both ctld and cgroup | |
| JobInD& job = rn_jobs_from_ctld.at(job_id); | |
| CRANE_DEBUG("Recover existing cgroup for job #{}", job_id); | |
| auto cg_expt = CgroupManager::AllocateAndGetCgroup( | |
| CgroupManager::CgroupStrByJobId(job_id), job.job_to_d.res(), true); | |
| if (cg_expt.has_value()) { | |
| // Job cgroup is recovered. | |
| job.cgroup = std::move(cg_expt.value()); | |
| } else { | |
| // If the cgroup is found but is unrecoverable, just logging out. | |
| CRANE_ERROR("Cgroup for job #{} is found but not recoverable.", job_id); | |
| } | |
| continue; | |
| } | |
| clean_invalid_cg(ids); | |
| // Job is found in cgroup but not in ctld. Cleaning up. | |
| CRANE_DEBUG("Removing cgroup for job #{} not in Ctld.", job_id); | |
| } |
🧰 Tools
🪛 Cppcheck (2.19.0)
[error] 142-142: Found an exit path from function with non-void return type that has missing return statement
(missingReturn)
🤖 Prompt for AI Agents
In src/Craned/Core/Craned.cpp around lines 139-201 the loop handling
rn_job_ids_with_cg processes step entries then falls through into the
job-recovery block causing double clean_invalid_cg calls and a misleading
"Removing cgroup for job" log for step entries; after the branch that handles
"Job is found in cgroup but not in ctld" for steps (where clean_invalid_cg(ids)
is called around line 177) add a continue to skip the subsequent job-recovery
logic, and ensure the "Removing cgroup for job #{} not in Ctld." debug log is
only emitted for pure job entries (i.e., place or guard that log so it does not
run for step entries).
| auto task_env_map = | ||
| g_job_mgr->QuerySshStepEnvVariables(request->task_id(), kDaemonStepId); | ||
| if (task_env_map.error()) { | ||
| CRANE_ERROR("Failed to get step env of job #{}", request->task_id()); | ||
| return Status::OK; | ||
| } | ||
|
|
||
| auto task_env_map = stub->QueryStepEnv(); | ||
| if (task_env_map.has_value()) { | ||
| for (const auto &[name, value] : task_env_map.value()) | ||
| response->mutable_env_map()->emplace(name, value); | ||
| response->set_ok(true); | ||
| } | ||
| for (const auto &[name, value] : task_env_map.value()) | ||
| response->mutable_env_map()->emplace(name, value); | ||
| response->set_ok(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect use of .error() for checking CraneExpected failure.
CraneExpected<EnvMap>::error() returns the error value, not a boolean indicating failure. Use !task_env_map.has_value() or check the negation operator instead.
🔎 Proposed fix
auto task_env_map =
g_job_mgr->QuerySshStepEnvVariables(request->task_id(), kDaemonStepId);
- if (task_env_map.error()) {
+ if (!task_env_map.has_value()) {
CRANE_ERROR("Failed to get step env of job #{}", request->task_id());
return Status::OK;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| auto task_env_map = | |
| g_job_mgr->QuerySshStepEnvVariables(request->task_id(), kDaemonStepId); | |
| if (task_env_map.error()) { | |
| CRANE_ERROR("Failed to get step env of job #{}", request->task_id()); | |
| return Status::OK; | |
| } | |
| auto task_env_map = stub->QueryStepEnv(); | |
| if (task_env_map.has_value()) { | |
| for (const auto &[name, value] : task_env_map.value()) | |
| response->mutable_env_map()->emplace(name, value); | |
| response->set_ok(true); | |
| } | |
| for (const auto &[name, value] : task_env_map.value()) | |
| response->mutable_env_map()->emplace(name, value); | |
| response->set_ok(true); | |
| auto task_env_map = | |
| g_job_mgr->QuerySshStepEnvVariables(request->task_id(), kDaemonStepId); | |
| if (!task_env_map.has_value()) { | |
| CRANE_ERROR("Failed to get step env of job #{}", request->task_id()); | |
| return Status::OK; | |
| } | |
| for (const auto &[name, value] : task_env_map.value()) | |
| response->mutable_env_map()->emplace(name, value); | |
| response->set_ok(true); |
🤖 Prompt for AI Agents
In src/Craned/Core/CranedServer.cpp around lines 305-313, the code incorrectly
uses task_env_map.error() as a boolean; change the check to if
(!task_env_map.has_value()) (or if (!task_env_map)) to detect failure, and
include the returned error details in the log (e.g. log task_env_map.error()
alongside the message) before returning; do not attempt to access
task_env_map.value() unless it has a value.
| auto err = g_job_mgr->ChangeStepTimelimit(request->task_id(), kPrimaryStepId, | ||
| request->time_limit_seconds()); | ||
|
|
||
| if (err.error()) { | ||
| CRANE_ERROR("[Step #{}.{}] Failed to change task time limit", | ||
| request->task_id(), kPrimaryStepId); | ||
| return Status::OK; | ||
| } | ||
| auto err = | ||
| stub->ChangeTaskTimeLimit(absl::Seconds(request->time_limit_seconds())); | ||
| if (err != CraneErrCode::SUCCESS) { | ||
| CRANE_ERROR("[Step #{}.{}] Failed to change time limit: {}.", job_id, | ||
| step_id, CraneErrStr(err)); | ||
| } else { | ||
| response->set_ok(true); | ||
| } | ||
| response->set_ok(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue: Incorrect use of .error() for checking CraneExpected failure.
Similar to the previous issue, err.error() does not return a boolean but the error value. This will not behave as intended.
🔎 Proposed fix
auto err = g_job_mgr->ChangeStepTimelimit(request->task_id(), kPrimaryStepId,
request->time_limit_seconds());
- if (err.error()) {
+ if (!err.has_value()) {
CRANE_ERROR("[Step #{}.{}] Failed to change task time limit",
request->task_id(), kPrimaryStepId);
return Status::OK;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| auto err = g_job_mgr->ChangeStepTimelimit(request->task_id(), kPrimaryStepId, | |
| request->time_limit_seconds()); | |
| if (err.error()) { | |
| CRANE_ERROR("[Step #{}.{}] Failed to change task time limit", | |
| request->task_id(), kPrimaryStepId); | |
| return Status::OK; | |
| } | |
| auto err = | |
| stub->ChangeTaskTimeLimit(absl::Seconds(request->time_limit_seconds())); | |
| if (err != CraneErrCode::SUCCESS) { | |
| CRANE_ERROR("[Step #{}.{}] Failed to change time limit: {}.", job_id, | |
| step_id, CraneErrStr(err)); | |
| } else { | |
| response->set_ok(true); | |
| } | |
| response->set_ok(true); | |
| auto err = g_job_mgr->ChangeStepTimelimit(request->task_id(), kPrimaryStepId, | |
| request->time_limit_seconds()); | |
| if (!err.has_value()) { | |
| CRANE_ERROR("[Step #{}.{}] Failed to change task time limit", | |
| request->task_id(), kPrimaryStepId); | |
| return Status::OK; | |
| } | |
| response->set_ok(true); |
🤖 Prompt for AI Agents
In src/Craned/Core/CranedServer.cpp around lines 328 to 336, the code
incorrectly uses err.error() as a boolean; change the check to detect failure
with if (!err) and then use err.error() (or the appropriate accessor) only when
logging the failure; update the CRANE_ERROR call to include the actual error
details from err.error() and keep returning Status::OK and setting
response->set_ok(true) on success.
| chmod(g_config.CranedUnixSockPath.c_str(), 0600); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chmod before server start may fail if socket file doesn't exist yet.
The socket file is typically created by builder.BuildAndStart(). Calling chmod before the socket exists will silently fail (returning -1). Consider moving this after BuildAndStart() or checking the return value.
🔎 Proposed fix
builder.RegisterService(m_service_impl_.get());
- chmod(g_config.CranedUnixSockPath.c_str(), 0600);
-
m_server_ = builder.BuildAndStart();
+
+ if (chmod(g_config.CranedUnixSockPath.c_str(), 0600) != 0) {
+ CRANE_WARN("Failed to chmod Craned Unix socket: {}", strerror(errno));
+ }
+
CRANE_INFO("Craned is listening on [{}, {}:{}]",📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| chmod(g_config.CranedUnixSockPath.c_str(), 0600); | |
| builder.RegisterService(m_service_impl_.get()); | |
| m_server_ = builder.BuildAndStart(); | |
| if (chmod(g_config.CranedUnixSockPath.c_str(), 0600) != 0) { | |
| CRANE_WARN("Failed to chmod Craned Unix socket: {}", strerror(errno)); | |
| } | |
| CRANE_INFO("Craned is listening on [{}, {}:{}]", |
🤖 Prompt for AI Agents
In src/Craned/Core/CranedServer.cpp around lines 537-538, the call to
chmod(g_config.CranedUnixSockPath.c_str(), 0600) runs before the server creates
the socket and can fail; move the chmod to immediately after
builder.BuildAndStart() (or after the code path that creates the socket) so the
file exists, and check the return value of chmod — on failure log or return the
errno (and include errno/strerror in the log) so failures are visible instead of
silently ignored.
| } else { | ||
| // For now, the action never fail | ||
| std::unreachable(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::unreachable() may cause undefined behavior if failure path becomes reachable.
Using std::unreachable() here is dangerous. If QueryStepEnvAsync ever fails in the future (due to code changes or edge cases), this will invoke undefined behavior instead of returning a proper error. Consider returning an error status or logging.
🔎 Proposed fix
} else {
- // For now, the action never fail
- std::unreachable();
+ // For now, the action should never fail, but handle gracefully
+ CRANE_ERROR("[Step] QueryStepEnvAsync failed unexpectedly.");
+ return Status::OK;
}🤖 Prompt for AI Agents
In src/Craned/Supervisor/SupervisorServer.cpp around lines 53 to 56, replace the
unsafe std::unreachable() in the else branch with proper error handling: detect
the unexpected failure path, log a descriptive error (including any available
error details), and return an appropriate error status or error response
consistent with the function's return type (or throw a well-documented exception
if that matches project conventions) so the failure is handled safely instead of
invoking undefined behavior.
| auto cfored_client = GetCforedClient(); | ||
| if (x11) { | ||
| X11Meta x11_meta{}; | ||
| if (x11_fwd) | ||
| x11_meta.x11_port = cfored_client->InitUvX11FwdHandler(); | ||
| else | ||
| x11_meta.x11_port = GetStep().interactive_meta().x11_meta().port(); | ||
|
|
||
| CRANE_TRACE("Crun task x11 enabled. Forwarding: {}, X11 Port: {}", | ||
| x11_fwd, x11_meta.x11_port); | ||
|
|
||
| auto x11_auth_path = | ||
| fmt::sprintf("%s/.crane/xauth/.Xauthority-XXXXXX", pwd.HomeDir()); | ||
|
|
||
| bool ok = util::os::CreateFoldersForFileEx(x11_auth_path, pwd.Uid(), | ||
| pwd.Gid(), 0700); | ||
| if (!ok) { | ||
| CRANE_ERROR("Failed to create xauth source file for"); | ||
| return CraneErrCode::ERR_SYSTEM_ERR; | ||
| } | ||
|
|
||
| // Default file permission is 0600. | ||
| int xauth_fd = mkstemp(x11_auth_path.data()); | ||
| if (xauth_fd == -1) { | ||
| CRANE_ERROR("mkstemp() for xauth file failed: {}\n", strerror(errno)); | ||
| return CraneErrCode::ERR_SYSTEM_ERR; | ||
| } | ||
|
|
||
| int ret = fchown(xauth_fd, pwd.Uid(), pwd.Gid()); | ||
| if (ret == -1) { | ||
| CRANE_ERROR("fchown() for xauth file failed: {}\n", strerror(errno)); | ||
| return CraneErrCode::ERR_SYSTEM_ERR; | ||
| } | ||
| x11_meta.x11_auth_path = x11_auth_path; | ||
| this->x11_meta = std::move(x11_meta); | ||
| close(xauth_fd); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clean up xauth file on early failures in X11 setup
When mkstemp or fchown fail, the temporary xauth file has already been created but is not recorded in x11_meta and thus will never be removed by StepInstance::CleanUp(). This can leave stray .Xauthority-XXXXXX files in the user's home under repeated failures.
Consider explicitly unlinking x11_auth_path (and closing xauth_fd if open) before returning an error in these branches, so the filesystem state is consistent even on partial failures.
Signed-off-by: junlinli <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/CraneCtld/CtldPublicDefs.cpp (1)
844-869: Remove the unconditionalio_metacopy that defeats the presence guard.In
CommonStepInCtld::GetStepToD(), the code conditionally copiesio_metawhen present, but then unconditionally copies it again:if (StepToCtld().has_io_meta()) { auto* mutable_meta = step_to_d.mutable_io_meta(); mutable_meta->CopyFrom(StepToCtld().io_meta()); } step_to_d.set_sh_script(StepToCtld().sh_script()); step_to_d.mutable_io_meta()->CopyFrom(StepToCtld().io_meta());The final
mutable_io_meta()->CopyFrom(...)forcesstep_to_d.io_metato be present even whenStepToCtld().has_io_meta()is false, defeating the guard and breaking semantics for callers that rely on the presence of this field.Delete the final unconditional copy line.
♻️ Duplicate comments (3)
src/Craned/Supervisor/TaskManager.cpp (2)
71-107: Clean up xauth file on early failures in X11 setupWhen
mkstempsucceeds butfchownfails (lines 99-103), the temporary xauth file has already been created but won't be recorded inx11_meta. This leaves orphaned.Xauthority-XXXXXXfiles in the user's home directory on repeated failures.🔎 Proposed fix
int xauth_fd = mkstemp(x11_auth_path.data()); if (xauth_fd == -1) { CRANE_ERROR("mkstemp() for xauth file failed: {}\n", strerror(errno)); return CraneErrCode::ERR_SYSTEM_ERR; } int ret = fchown(xauth_fd, pwd.Uid(), pwd.Gid()); if (ret == -1) { CRANE_ERROR("fchown() for xauth file failed: {}\n", strerror(errno)); + close(xauth_fd); + unlink(x11_auth_path.c_str()); return CraneErrCode::ERR_SYSTEM_ERR; }
489-495: Handle missing node in nodelist gracefully.If the current node is not found in the nodelist (line 490), the function logs an error and returns
path_to_parsewith unreplaced%nplaceholders. This could result in file paths containing literal%n.🔎 Proposed fix to use fallback node_index
auto node_it = std::ranges::find(grpc_nodelist, g_config.CranedIdOfThisNode); + int node_index = 0; if (node_it == grpc_nodelist.end()) { - CRANE_ERROR("Failed to find current node {} in nodelist", - g_config.CranedIdOfThisNode); - return path_to_parse; + CRANE_WARN("Failed to find current node {} in nodelist, using node_index=0", + g_config.CranedIdOfThisNode); + } else { + node_index = std::distance(grpc_nodelist.begin(), node_it); } - int node_index = std::distance(grpc_nodelist.begin(), node_it);src/CraneCtld/DbClient.cpp (1)
1830-1905: DB serialization now consistently usessh_script; duplicate assignment fixed.Mapping the
scriptBSON field toTaskToCtld::sh_script()/StepToCtld::sh_script()in all four document builders is consistent with the proto changes, and the previous duplicate initialization issue is resolved.Also applies to: 1907-1913, 2005-2007, 2071-2077
🧹 Nitpick comments (3)
src/Craned/Supervisor/CforedClient.h (2)
172-174: Potential integer overflow onnext_x11_id_.
next_x11_id_is auint32_tthat increments without bounds checking. While unlikely in practice (would require ~4 billion X11 connections), for long-running systems this could eventually wrap around and cause ID collisions. Consider whether this is acceptable for your use case.
60-61: Remove unused type aliasx11_id_t.This type alias is defined but never used in the codebase. Removing it eliminates dead code.
src/Craned/Supervisor/CforedClient.cpp (1)
725-767: Consider holding mutex duringWriteStringToFd_call or document thread safety.The mutex is held while accessing
m_fwd_meta_mapandm_x11_fd_info_map_, butWriteStringToFd_(which writes to the fd) is also called within the locked region. This is fine for correctness but could cause lock contention if writes block. The current approach is acceptable given that the fds should be non-blocking, but consider adding a comment noting this assumption.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
protos/Crane.protoprotos/PublicDefs.protosrc/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/CtldPublicDefs.hsrc/CraneCtld/DbClient.cppsrc/CraneCtld/RpcService/CtldGrpcServer.hsrc/Craned/Supervisor/CforedClient.cppsrc/Craned/Supervisor/CforedClient.hsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.h
🧰 Additional context used
🧠 Learnings (32)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.332Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/SupervisorPublicDefs.h:32-37
Timestamp: 2025-05-26T11:06:28.796Z
Learning: The user L-Xiafeng prefers to defer refactoring duplicate definitions until they become a larger pattern in the codebase, rather than addressing individual instances immediately.
📚 Learning: 2025-05-23T02:32:43.952Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is handled/updated separately before calling SetAllocatedRes, so it does not need to be updated again within the method itself.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.hsrc/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-23T02:32:43.952Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is updated before calling SetAllocatedRes, so it doesn't need to be updated again within the method itself.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.hsrc/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.hsrc/Craned/Supervisor/CforedClient.cppsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/CforedClient.hsrc/CraneCtld/CtldPublicDefs.cppsrc/Craned/Supervisor/TaskManager.hsrc/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/Craned/Supervisor/CforedClient.cppsrc/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/Craned/Supervisor/TaskManager.hsrc/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/Craned/Supervisor/CforedClient.cppsrc/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/Craned/Supervisor/TaskManager.hsrc/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-25T04:11:27.740Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:396-396
Timestamp: 2025-05-25T04:11:27.740Z
Learning: In TaskManager.cpp, GetCrunMeta() calls don't need null checks because they're only called in contexts where the task is guaranteed to be a CRUN task (e.g., SetupChildProcessCrunX11_ is only called when step->IsCrun() && x11() conditions are met), ensuring the metadata will always be valid.
Applied to files:
src/Craned/Supervisor/CforedClient.cppsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.hsrc/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/Craned/Supervisor/CforedClient.cppsrc/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-08-14T02:56:35.503Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 587
File: src/Craned/Supervisor/CforedClient.cpp:449-454
Timestamp: 2025-08-14T02:56:35.503Z
Learning: In CforedClient::AsyncSendRecvThread_(), the guard `if (state <= State::Registering) { continue; }` in the TIMEOUT branch only prevents premature cleanup when stopping before registration completes, but it doesn't block normal gRPC event processing. The completion queue will still deliver Prepare/Write/Read events that advance the state machine normally.
Applied to files:
src/Craned/Supervisor/CforedClient.cpp
📚 Learning: 2025-04-27T06:47:51.751Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.
Applied to files:
src/Craned/Supervisor/CforedClient.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Applied to files:
src/Craned/Supervisor/CforedClient.cppsrc/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/Craned/Supervisor/TaskManager.hsrc/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-25T04:08:03.273Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/Supervisor.cpp:68-69
Timestamp: 2025-05-25T04:08:03.273Z
Learning: In the Crane supervisor component (src/Craned/Supervisor/), the supervisor process communicates with Craned through STDOUT using protobuf messages. The supervisor must not send any information to STDOUT before sending the "ready" message to Craned, as this would interfere with the inter-process communication protocol. Therefore, adding logging statements that might write to STDOUT before the ready message is sent could break the communication.
Applied to files:
src/Craned/Supervisor/CforedClient.cppsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/CforedClient.hsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-12-08T08:11:40.332Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.332Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/Craned/Supervisor/TaskManager.hsrc/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/Craned/Supervisor/TaskManager.cppprotos/Crane.protosrc/CraneCtld/CtldPublicDefs.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/CforedClient.hsrc/CraneCtld/CtldPublicDefs.cppsrc/Craned/Supervisor/TaskManager.hsrc/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-26T11:04:56.055Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:72-72
Timestamp: 2025-05-26T11:04:56.055Z
Learning: The CraneSched project prefers using global variables (like `g_supervisor_keeper`) over dependency injection patterns. The team does not follow dependency injection approaches for managing singleton instances.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/CforedClient.hsrc/Craned/Supervisor/TaskManager.hsrc/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-09T01:54:39.465Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1509-1517
Timestamp: 2025-05-09T01:54:39.465Z
Learning: The CraneSched project uses C++23, which supports Class Template Argument Deduction (CTAD) for standard containers like std::set and includes ranges support, making std::ranges::views::keys valid without additional headers.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/CtldPublicDefs.cpp
📚 Learning: 2025-05-08T07:38:42.362Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:0-0
Timestamp: 2025-05-08T07:38:42.362Z
Learning: In CraneSched's PMIx integration, the `g_pmix_server->SetupFork()` function must be called in the child process after fork() and before exec() to properly set up the PMIx environment variables.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-02T07:05:26.012Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:601-602
Timestamp: 2025-05-02T07:05:26.012Z
Learning: In the CraneCtld codebase, the variables m_disconnected_ and m_registered_ in CranedStub class are already defined as std::atomic_bool, making them thread-safe for concurrent access without additional synchronization.
Applied to files:
src/Craned/Supervisor/CforedClient.hsrc/Craned/Supervisor/TaskManager.hsrc/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.
Applied to files:
src/Craned/Supervisor/CforedClient.hsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-04-27T11:52:31.017Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.
Applied to files:
src/CraneCtld/CtldPublicDefs.cpp
📚 Learning: 2025-04-18T02:26:16.113Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched includes assertions in its operator overloads (particularly in operator-=) that verify resources being subtracted are less than or equal to available resources, ensuring no negative values can occur during resource allocation or deallocation operations.
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-03-31T09:29:40.388Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In CraneSched, `phmap::parallel_flat_hash_set` from the Parallel Hashmap library (`parallel_hashmap/phmap.h`) is used for thread-safe containers. This container implements internal sharding with separate locks for different parts of the hash table, making it inherently thread-safe for concurrent operations without requiring external synchronization.
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-04-18T02:26:16.113Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched contains assertions in its overloaded operators (like -=) that prevent negative values from occurring when resources are allocated or deallocated.
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-03-31T09:29:40.388Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In the CraneSched project, the `m_allowed_certs_` in `VaultClientWrapper` is implemented as a `phmap::parallel_flat_hash_set<std::string>`, which is a thread-safe container designed for concurrent access, making it safe to use in multithreaded contexts without additional synchronization.
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.
Applied to files:
src/CraneCtld/CtldPublicDefs.h
🧬 Code graph analysis (3)
src/Craned/Supervisor/CforedClient.cpp (5)
src/Utilities/PublicHeader/OS.cpp (2)
SetFdNonBlocking(153-160)SetFdNonBlocking(153-153)src/Craned/Supervisor/CforedClient.h (3)
next_x11_id_(172-172)exit_code(151-151)signaled(152-153)src/Utilities/PublicHeader/String.cpp (2)
SetCurrentThreadName(371-379)SetCurrentThreadName(371-371)src/Craned/Core/StepInstance.h (1)
exit_code(37-37)src/Craned/Supervisor/SupervisorPublicDefs.h (1)
exit_code(37-37)
src/Craned/Supervisor/CforedClient.h (1)
src/Craned/Supervisor/CforedClient.cpp (12)
TaskProcessStop(825-838)TaskProcessStop(825-826)TaskOutPutForward(849-857)TaskOutPutForward(849-850)TaskErrOutPutForward(858-866)TaskErrOutPutForward(858-859)TaskX11ConnectForward(868-873)TaskX11ConnectForward(868-868)TaskX11OutPutForward(875-885)TaskX11OutPutForward(875-877)TaskX11OutputFinish(887-894)TaskX11OutputFinish(887-887)
src/CraneCtld/CtldPublicDefs.h (2)
src/Craned/Supervisor/TaskManager.cpp (6)
IsBatch(156-158)IsBatch(156-156)IsCalloc(169-172)IsCalloc(169-169)IsCrun(164-167)IsCrun(164-164)src/CraneCtld/CtldPublicDefs.cpp (4)
SetCranedIds(155-159)SetCranedIds(155-155)SetCranedIds(1138-1141)SetCranedIds(1138-1138)
🪛 Cppcheck (2.19.0)
src/Craned/Supervisor/TaskManager.cpp
[warning] 536-536: Iterators to containers from different expressions 'input' and 'groups[0]' are used together.
(mismatchingContainerExpression)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (39)
src/Craned/Supervisor/CforedClient.h (3)
29-36: LGTM!The
X11FdInfostruct cleanly captures per-X11-connection state with the newx11_input_stoppedflag for tracking input flow control.
38-58: LGTM!The
TaskFwdMetastruct is well-extended to support stderr handling with the newerr_handle,stderr_read, anderr_stoppedfields. The structure maintains symmetry between stdout and stderr handling.
129-161: LGTM - Well-designed unified forwarding queue model.The
FwdRequeststructure withstd::variantcleanly consolidates all forwarding payload types (IOFwdRequest,X11FwdConnectReq,X11FwdReq,X11FwdEofReq,TaskFinishStatus) into a single queue interface. This design simplifies the forwarding pipeline and ensures type safety.src/Craned/Supervisor/CforedClient.cpp (9)
83-84: LGTM!Good defensive assertion to ensure all task forwarding metadata is cleaned up before destruction.
88-113: LGTM!The function signature and implementation are properly updated to handle the new
stderr_readparameter. The early marking ofinput_stoppedwhenstdin_write == -1is a sensible optimization.
157-203: LGTM - Clean per-connection X11 lifecycle management.The X11 forwarding setup properly:
- Assigns unique local IDs via
next_x11_id_++- Registers the connection with
TaskX11ConnectForward- Sets up event handlers for data, EOF, error, and close events
- Stores connection state in
m_x11_fd_info_map_under mutex protection
340-359: LGTM - Stderr pipe handlers correctly implemented.The stderr pipe handlers properly set
err_stopped = trueand conditionally callon_finish()only whenoutput_stoppedis also true. This correctly implements the "both streams must finish" logic.
369-407: LGTM - PTY mode correctly marks stderr as stopped.Since PTY mode combines stdout/stderr into a single stream, marking
err_stopped = trueat line 370 is the correct approach to ensure the finish logic works properly.
430-464: LGTM - Cleanup logic properly extended for stderr.The
CleanStopTaskIOQueueCb_function correctly handles the new stderr resources by closing and resettingerr_handleand closingstderr_readfd.
490-558: LGTM - Clean variant-based message serialization.The
VariantVisitorpattern cleanly handles allFwdRequesttypes with appropriate protobuf payload construction. The code properly setscraned_idfor X11 messages and handles both stdout/stderr through theis_stdoutflag.
840-847: LGTM!The
TaskEndfunction properly removes the task fromm_fwd_meta_mapunder mutex protection and triggers the status change callback.
849-894: LGTM - Forward functions correctly implemented.All the new task forwarding functions (
TaskOutPutForward,TaskErrOutPutForward,TaskX11ConnectForward,TaskX11OutPutForward,TaskX11OutputFinish) properly enqueue requests to the unifiedm_task_fwd_req_queue_with appropriate payload types. TheTaskX11OutputFinishcorrectly erases the X11 fd info under mutex protection.src/Craned/Supervisor/TaskManager.cpp (9)
112-150: LGTM on cleanup logic.The cleanup properly handles X11 auth file removal with a guard against uninitialized template paths, and the cgroup cleanup retry loop is appropriately bounded.
223-298: LGTM on status transition validation.The state machine properly validates status transitions with helpful warning logs for unexpected states while avoiding blocking errors. The use of
std::unreachable()is appropriate for C++23.
371-412: LGTM on per-task cgroup lifecycle.The per-task cgroup allocation in
Prepare()and cleanup inCleanup()properly support the multi-task execution model with appropriate retry bounds for process termination.
588-700: LGTM on fd management in ForkCrunAndInitMeta_.The use of
opened_fdset to track file descriptors for cleanup on error paths is a good pattern. The function properly handles the complexity of stdin/stdout/stderr pipe setup for both forwarding and file-based IO.
1810-1820: LGTM on child process setup.The child process correctly migrates to the task cgroup early in the fork path before setting other properties. Using
std::abort()on cgroup migration failure is appropriate for the child process context.
2457-2520: LGTM on terminate task queue handling.The termination logic properly handles the different cases: daemon steps with orphaned flag, empty task maps, and the re-queue pattern for not-ready elements is consistent with other handlers.
2564-2664: LGTM on task execution callback.The execution flow properly:
- Creates task instances before step preparation
- Fails all tasks on step preparation error
- Allocates step cgroup before launching tasks
- Tracks execution IDs for task-to-pid mapping
2680-2709: LGTM on SSH process migration to cgroup.The implementation correctly:
- Validates that only daemon steps can perform SSH process migration
- Lazily allocates the step cgroup if not already present
- Returns appropriate error codes for different failure modes
2318-2330: LGTM on exit code calculation for crun tasks.The exit code calculation properly adds
KCrunExitCodeStatusNumfor signaled processes, maintaining consistency with shell conventions (128 + signal number).src/Craned/Supervisor/TaskManager.h (6)
65-86: LGTM on new struct definitions.The
X11MetaandTerminationStatusstructs are well-defined with clear purpose:
X11Metaencapsulates X11 forwarding stateTerminationStatustracks final step termination info for multi-task scenarios
90-106: LGTM on StepInstance constructor.The initialization of
task_idsfromstep.task_res_map()keys using C++23 ranges is clean and correct.
212-225: LGTM on CrunInstanceMeta extension.The new fields (
stderr_write,stderr_read,fwd_stdint,fwd_stdout,fwd_stderr) properly support the batch input/output/error option feature from the PR objectives.
235-278: LGTM on ITaskInstance changes.The per-task cgroup member (
m_task_cg) and helper methods (IsBatch(),IsCrun()) properly support the multi-task execution model. The constructor correctly acceptstask_idas a parameter.
433-501: LGTM on TaskManager async method declarations.The new methods follow the established async pattern with queue/handle pairs. The default parameters for
ShutdownSupervisorAsyncprovide sensible defaults for the common success case.
549-591: LGTM on async infrastructure additions.The new async handles and concurrent queues follow the established patterns in the class. The queue types correctly correspond to their respective async method signatures.
protos/PublicDefs.proto (3)
109-121: Status enum rename toStartinglooks consistent but check all status-group logic.
TaskStatusgainedStarting = 8(replacing prior “Configured” semantics). Make sure all status-group checks (e.g. “running-like” states in schedulers, DB queries, and recovery code) were updated to treatStartingappropriately (e.g., included withRunningwhere intended) to avoid misclassification.
135-186: IO meta + script threading is coherent; confirm all producers fill them consistently.
IoMetaandsh_scriptare now carried inTaskToCtld,StepToCtld, andStepToD, withStepToDalso addingtask_res_mapandntasks_total. The design matches the new per-task IO model, but you should double‑check that:
- All submission paths populate
io_meta/sh_scriptas expected (batch, crun, container).ntasks_totalmatchestask_res_map.size()for execution steps, and that daemon steps’ approximation doesn’t confuse consumers.Also applies to: 218-255, 311-352
364-379:IoMetaandLicenseInfomessage definitions look sound.The new
IoMeta(with append mode, file patterns, and crun task routing) andLicenseInfomessages are well-structured and align with how they’re consumed in other files; no structural issues spotted.Also applies to: 829-834
src/CraneCtld/RpcService/CtldGrpcServer.h (1)
73-107: NewResAllocInfohandling inWriteTaskResAllocReplyis correct.The method correctly consumes
StepResAllocArgs’sResAllocInfo, populatingallocated_craned_regex,craned_ids,craned_task_map, andntasks_total, and propagates errors viafailure_reason. Provided the producer keepsntasks_totalandcraned_task_mapconsistent, this looks good.src/CraneCtld/CtldPublicDefs.cpp (3)
149-159: Craned‑ID propagation and daemon step initialization look consistent.
StepInCtld::SetCranedIdsnow mirrors the vector‑based storage and keepsRuntimeAttr.craned_idsin sync.DaemonStepInCtld::InitFromJobderives a unifiednode_setfromjob.CranedIds()for execution/configuring/running nodes, andGetStepToDsetsntasks_totalfromjob->ntasks_per_node * job->node_num(with a FIXME). This is internally consistent; just ensure consumers ofntasks_totaltolerate the approximation for daemon steps.Also applies to: 264-275, 333-415
469-608: NewStepStatusChangesemantics andAllExecutionStepsFinished()look reasonable.Both daemon and common steps now return
std::optional<std::pair<TaskStatus,uint32_t>>, with non‑nullopt indicating the job has a final status/exit code:
- Daemon step returns either its own status (if primary step never created) or the primary step’s status.
- Common step returns primary step status once all execution steps (primary + common) are gone, as detected by
AllExecutionStepsFinished().This matches the intended “job‑level” completion semantics; just verify the RPC handler uses the returned pair (when present) as the single source of truth for updating
TaskInCtldand persistence.Also applies to: 873-911, 918-920
632-747: Per‑task resource mapping is coherent; ensure consistency between maps and counts.Primary and dynamically scheduled common steps now populate:
CommonStepInCtld::task_res_mapwith per‑taskResourceInNode(zeroing per‑task mem when step‑level mem is used), andCommonStepInCtld::craned_task_mapmapping eachCranedIdto its local task IDs.
CommonStepInCtld::GetStepToDexposestask_res_mapandntasks_total = task_res_map.size()intoStepToD, and the scheduler’sSchedulePendingStepspopulates the same structures for crun common steps. This is consistent with the new proto schema; just keep an eye on:
- Batch/Calloc vs crun semantics (single aggregated task vs per‑task mapping), and
ntasks_totalalways matchingtask_res_map.size()for execution steps.Also applies to: 802-871, 958-1037
protos/Crane.proto (2)
476-479: New IO/X11/exit‑status payloads are structurally consistent; verify end‑to‑end wiring.You’ve extended:
StreamCtldReply.TaskResAllocatedReplyandStreamCrunReply.TaskResAllocatedReplywith per‑cranedTasksmaps andntasks_total.StreamCrun{Request,Reply}andStreamTaskIO{Request,Reply}with task‑scoped IO, X11 connection/EOF, error‑output, and exit‑status messages.The schemas look coherent and use fresh tags; please confirm that:
- All senders populate
craned_task_mapandntasks_totalconsistently with the server‑sideResAllocInfo, and- Supervisors/frontends handle the new X11/IO/exit‑status variants without breaking older behaviors.
Also applies to: 544-563, 704-734, 814-901, 904-965, 968-1003
544-547:QueryNodeEvents*andPluginQueryServiceadditions look fine; check client compatibility.The
QueryNodeEvents{Request,Reply}/NodeEventInfomessages and the newPluginQueryService(withQueryTaskEfficiencyandQueryNodeEvents) are cleanly defined. Just ensure:
- All plugin clients are updated to use
PluginQueryService(if this replaces an older service name), and- Authorization via the
uidfields is enforced on the server side.Also applies to: 1170-1173
src/CraneCtld/CtldPublicDefs.h (3)
315-353: RicherStepResAllocArgsAPI is well‑designed; confirm all callbacks updated.
InteractiveMeta::StepResAllocArgsnow wraps aResAllocInfoinsidestd::expected, carrying regex, orderedallocated_craned_ids,craned_task_map, andntasks_total. This is a clear improvement over the previous pair‑based API; just ensure allcb_step_res_allocatedimplementations (Cfored/Crun) have been updated to the new structure.
526-590: Switching steps’m_craned_ids_tostd::vectorand adding type helpers looks good.Using
std::vector<CranedId>form_craned_ids_and exposing it viaSetCranedIds/CranedIds()aligns step storage with job‑sidecraned_ids. The inlineIsBatch(),IsCalloc(), andIsCrun()helpers mirror the semantics inStepInstanceon the Craned side, which helps keep Ctld/Supervisor logic consistent.
676-709: New per‑step/task state inCommonStepInCtldandTaskInCtldis consistent with usage in the .cpp.
CommonStepInCtldnow trackstask_res_mapandcraned_task_map, which are exposed inGetStepToDand populated from job allocations and the scheduler.TaskInCtld::IsCrun()andAllExecutionStepsFinished()complement the existing helpers and are used by the updated status‑change logic and scheduler.SchedulePendingSteps’ declaration matches the richer logic implemented in the .cpp.These additions look coherent and keep the public API compact.
Also applies to: 832-921, 958-1037
| ph->on<uvw::end_event>([this, tid = task_id, on_finish]( | ||
| uvw::end_event&, uvw::pipe_handle& h) { | ||
| // EOF Received | ||
| h.close(); | ||
| CRANE_INFO("[Task #{}] Output pipe received EOF.", tid); | ||
| auto& meta = this->m_fwd_meta_map.at(tid); | ||
| meta.output_stopped = true; | ||
| if (meta.err_stopped) on_finish(); | ||
| on_finish(); | ||
| }); | ||
|
|
||
| ph->on<uvw::error_event>([this, tid = meta.task_id, on_finish]( | ||
| uvw::error_event& e, uvw::pipe_handle& h) { | ||
| CRANE_WARN("[Task #{}] output pipe error: {}. Closing.", tid, | ||
| e.what()); | ||
| h.close(); | ||
| auto& meta = this->m_fwd_meta_map.at(tid); | ||
| meta.output_stopped = true; | ||
| if (meta.err_stopped) on_finish(); | ||
| on_finish(); | ||
| }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Double call to on_finish() in stdout pipe handlers.
In both the end_event and error_event handlers for the stdout pipe, on_finish() is called twice - once conditionally (if (meta.err_stopped) on_finish();) and once unconditionally on the next line. This will cause duplicate task cleanup attempts.
🔎 Proposed fix
ph->on<uvw::end_event>([this, tid = task_id, on_finish](
uvw::end_event&, uvw::pipe_handle& h) {
// EOF Received
h.close();
CRANE_INFO("[Task #{}] Output pipe received EOF.", tid);
auto& meta = this->m_fwd_meta_map.at(tid);
meta.output_stopped = true;
- if (meta.err_stopped) on_finish();
- on_finish();
+ if (meta.err_stopped) on_finish();
});
ph->on<uvw::error_event>([this, tid = meta.task_id, on_finish](
uvw::error_event& e, uvw::pipe_handle& h) {
CRANE_WARN("[Task #{}] output pipe error: {}. Closing.", tid,
e.what());
h.close();
auto& meta = this->m_fwd_meta_map.at(tid);
meta.output_stopped = true;
- if (meta.err_stopped) on_finish();
- on_finish();
+ if (meta.err_stopped) on_finish();
});🤖 Prompt for AI Agents
In src/Craned/Supervisor/CforedClient.cpp around lines 282-302, both the
end_event and error_event stdout pipe handlers call on_finish() twice (once
conditionally and once unconditionally), causing duplicate cleanup; fix by
removing the unconditional on_finish() call in each handler so you only set
meta.output_stopped = true and then call on_finish() only if meta.err_stopped is
true (i.e., call on_finish() exactly once when both stdout and stderr are
stopped).
| bool CforedClient::TaskProcessStop(task_id_t task_id, uint32_t exit_code, | ||
| bool signaled) { | ||
| CRANE_DEBUG("[Task #{}] Process stopped with exit_code: {}, signaled: {}.", | ||
| task_id, exit_code, signaled); | ||
| m_task_fwd_req_queue_.enqueue(FwdRequest{ | ||
| .type = StreamTaskIORequest::TASK_EXIT_STATUS, | ||
| .data = TaskFinishStatus{.task_id = task_id, | ||
| .exit_code = exit_code, | ||
| .signaled = signaled}, | ||
| }); | ||
| absl::MutexLock lock(&m_mtx_); | ||
| m_fwd_meta_map[task_id].proc_stopped = true; | ||
| return m_fwd_meta_map[task_id].output_stopped; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing check for task existence in m_fwd_meta_map.
TaskProcessStop directly accesses m_fwd_meta_map[task_id] without checking if the task exists. If called with an unknown task_id, this will create an empty entry and return uninitialized values.
🔎 Proposed fix
bool CforedClient::TaskProcessStop(task_id_t task_id, uint32_t exit_code,
bool signaled) {
CRANE_DEBUG("[Task #{}] Process stopped with exit_code: {}, signaled: {}.",
task_id, exit_code, signaled);
m_task_fwd_req_queue_.enqueue(FwdRequest{
.type = StreamTaskIORequest::TASK_EXIT_STATUS,
.data = TaskFinishStatus{.task_id = task_id,
.exit_code = exit_code,
.signaled = signaled},
});
absl::MutexLock lock(&m_mtx_);
+ auto it = m_fwd_meta_map.find(task_id);
+ if (it == m_fwd_meta_map.end()) {
+ CRANE_ERROR("[Task #{}] Cannot find fwd meta in TaskProcessStop.", task_id);
+ return true; // Assume output stopped if task not found
+ }
- m_fwd_meta_map[task_id].proc_stopped = true;
- return m_fwd_meta_map[task_id].output_stopped;
+ it->second.proc_stopped = true;
+ return it->second.output_stopped && it->second.err_stopped;
}Note: The return value should likely also check err_stopped in addition to output_stopped to be consistent with the new stderr handling model.
| void TaskManager::EvShutdownSupervisorCb_() { | ||
| std::tuple<crane::grpc::TaskStatus, uint32_t, std::string> final_status; | ||
| bool got_final_status = false; | ||
| do { | ||
| got_final_status = m_shutdown_status_queue_.try_dequeue(final_status); | ||
| if (!got_final_status) continue; | ||
| auto& [status, exit_code, reason] = final_status; | ||
| if (m_step_.IsDaemonStep()) { | ||
| CRANE_DEBUG("Sending a {} status as daemon step.", status); | ||
| m_step_.GotNewStatus(status); | ||
| if (!m_step_.orphaned) | ||
| g_craned_client->StepStatusChangeAsync(status, exit_code, reason); | ||
| } | ||
|
|
||
| m_step_.CleanUp(); | ||
|
|
||
| g_craned_client->Shutdown(); | ||
| g_server->Shutdown(); | ||
|
|
||
| this->Shutdown(); | ||
| } while (!got_final_status); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential busy loop in shutdown callback.
The do-while(!got_final_status) loop combined with if (!got_final_status) continue; creates a busy loop that spins until an element is available in the queue. Since this is an async callback triggered by m_shutdown_supervisor_handle_->send(), it should process available elements rather than blocking.
Consider either:
- Changing to a simple
while (m_shutdown_status_queue_.try_dequeue(...))pattern - Or returning early if nothing is in the queue
🔎 Proposed fix
void TaskManager::EvShutdownSupervisorCb_() {
std::tuple<crane::grpc::TaskStatus, uint32_t, std::string> final_status;
- bool got_final_status = false;
- do {
- got_final_status = m_shutdown_status_queue_.try_dequeue(final_status);
- if (!got_final_status) continue;
+ while (m_shutdown_status_queue_.try_dequeue(final_status)) {
auto& [status, exit_code, reason] = final_status;
if (m_step_.IsDaemonStep()) {
CRANE_DEBUG("Sending a {} status as daemon step.", status);
m_step_.GotNewStatus(status);
if (!m_step_.orphaned)
g_craned_client->StepStatusChangeAsync(status, exit_code, reason);
}
m_step_.CleanUp();
g_craned_client->Shutdown();
g_server->Shutdown();
this->Shutdown();
- } while (!got_final_status);
+ }
}🤖 Prompt for AI Agents
In src/Craned/Supervisor/TaskManager.cpp around lines 2246-2267 the shutdown
callback currently uses a do/while with a continue on !got_final_status which
creates a busy spin waiting for queue items; replace this with a non-blocking
drain: first try to dequeue once and return immediately if none are available,
otherwise use a while(m_shutdown_status_queue_.try_dequeue(final_status)) loop
to process each available final_status exactly once (remove the continue and the
do/while), and after draining the queue perform the existing cleanup,
client/server shutdown and this->Shutdown() calls.
fix: step lost Signed-off-by: junlinli <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (2)
src/Craned/Supervisor/TaskManager.cpp (2)
65-107: Clean up X11 xauth file on early failures inStepInstance::Prepare().In the X11 branch, if
mkstemporfchownfails, the temporary xauth file has already been created butx11_metais never set, and the function returns immediately:int xauth_fd = mkstemp(x11_auth_path.data()); if (xauth_fd == -1) { CRANE_ERROR("mkstemp() for xauth file failed: {}\n", strerror(errno)); return CraneErrCode::ERR_SYSTEM_ERR; } int ret = fchown(xauth_fd, pwd.Uid(), pwd.Gid()); if (ret == -1) { CRANE_ERROR("fchown() for xauth file failed: {}\n", strerror(errno)); return CraneErrCode::ERR_SYSTEM_ERR; }Since
x11_metais only populated after these succeed,CleanUp()cannot remove the stray.Xauthority-XXXXXXfile on these error paths. This matches a previously raised concern.Consider explicitly closing
xauth_fdand unlinkingx11_auth_pathbefore returning in these failure branches so repeated X11 setup failures do not leave garbage files in the user’s home.Also applies to: 112-123
2199-2230: Eliminate potential busy-wait inEvShutdownSupervisorCb_().
EvShutdownSupervisorCb_()uses ado { ... } while (!got_final_status);loop with:got_final_status = m_shutdown_status_queue_.try_dequeue(final_status); if (!got_final_status) continue;If
m_shutdown_supervisor_handle_->send()is ever called without a corresponding enqueued element (or there is a race), this becomes a tight busy loop in the uvw thread, spinning until something shows up in the queue. A non-blocking async callback should just process whatever is available and return.A safer pattern is:
- Try to dequeue once; if nothing is available, return immediately.
- Otherwise, process the dequeued element (or drain all available ones) and then perform shutdown.
For example:
Proposed fix
void TaskManager::EvShutdownSupervisorCb_() { std::tuple<crane::grpc::TaskStatus, uint32_t, std::string> final_status; - bool got_final_status = false; - do { - got_final_status = m_shutdown_status_queue_.try_dequeue(final_status); - if (!got_final_status) continue; + if (!m_shutdown_status_queue_.try_dequeue(final_status)) { + return; + } + // Optionally: while (m_shutdown_status_queue_.try_dequeue(final_status)) { ... } auto& [status, exit_code, reason] = final_status; if (m_step_.IsDaemonStep()) { CRANE_DEBUG("Sending a {} status as daemon step.", status); m_step_.GotNewStatus(status); if (!m_step_.orphaned) g_craned_client->StepStatusChangeAsync(status, exit_code, reason); } m_step_.CleanUp(); g_craned_client->Shutdown(); g_server->Shutdown(); this->Shutdown(); - } while (!got_final_status); }
🧹 Nitpick comments (1)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)
121-121: Naming inconsistency: use lowercasekprefix.The constant
KCrunExitCodeStatusNumuses an uppercaseKprefix, while all other constants in theExitCodenamespace use a lowercasekprefix (e.g.,kExitStatusNum,kTerminationSignalBase). Consider renaming tokCrunExitCodeStatusNumfor consistency.🔎 Proposed fix
-constexpr size_t KCrunExitCodeStatusNum = 128; +constexpr size_t kCrunExitCodeStatusNum = 128;
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
src/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/CtldPublicDefs.hsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.hsrc/Utilities/PublicHeader/include/crane/PublicHeader.h
🧰 Additional context used
🧠 Learnings (30)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.332Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/SupervisorPublicDefs.h:32-37
Timestamp: 2025-05-26T11:06:28.796Z
Learning: The user L-Xiafeng prefers to defer refactoring duplicate definitions until they become a larger pattern in the codebase, rather than addressing individual instances immediately.
📚 Learning: 2025-12-08T08:11:40.332Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.332Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/CtldPublicDefs.hsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/CtldPublicDefs.hsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/CtldPublicDefs.hsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/CtldPublicDefs.hsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/CtldPublicDefs.hsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-25T04:11:27.740Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:396-396
Timestamp: 2025-05-25T04:11:27.740Z
Learning: In TaskManager.cpp, GetCrunMeta() calls don't need null checks because they're only called in contexts where the task is guaranteed to be a CRUN task (e.g., SetupChildProcessCrunX11_ is only called when step->IsCrun() && x11() conditions are met), ensuring the metadata will always be valid.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/CtldPublicDefs.hsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:08:03.273Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/Supervisor.cpp:68-69
Timestamp: 2025-05-25T04:08:03.273Z
Learning: In the Crane supervisor component (src/Craned/Supervisor/), the supervisor process communicates with Craned through STDOUT using protobuf messages. The supervisor must not send any information to STDOUT before sending the "ready" message to Craned, as this would interfere with the inter-process communication protocol. Therefore, adding logging statements that might write to STDOUT before the ready message is sent could break the communication.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-26T11:04:56.055Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:72-72
Timestamp: 2025-05-26T11:04:56.055Z
Learning: The CraneSched project prefers using global variables (like `g_supervisor_keeper`) over dependency injection patterns. The team does not follow dependency injection approaches for managing singleton instances.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/CtldPublicDefs.hsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/CtldPublicDefs.hsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T01:54:39.465Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1509-1517
Timestamp: 2025-05-09T01:54:39.465Z
Learning: The CraneSched project uses C++23, which supports Class Template Argument Deduction (CTAD) for standard containers like std::set and includes ranges support, making std::ranges::views::keys valid without additional headers.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/CtldPublicDefs.cpp
📚 Learning: 2025-05-08T07:38:42.362Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:0-0
Timestamp: 2025-05-08T07:38:42.362Z
Learning: In CraneSched's PMIx integration, the `g_pmix_server->SetupFork()` function must be called in the child process after fork() and before exec() to properly set up the PMIx environment variables.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/CtldPublicDefs.hsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-23T02:32:43.952Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is updated before calling SetAllocatedRes, so it doesn't need to be updated again within the method itself.
Applied to files:
src/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-23T02:32:43.952Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is handled/updated separately before calling SetAllocatedRes, so it does not need to be updated again within the method itself.
Applied to files:
src/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-04-27T11:52:31.017Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.
Applied to files:
src/CraneCtld/CtldPublicDefs.cpp
📚 Learning: 2025-04-18T02:26:16.113Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched includes assertions in its operator overloads (particularly in operator-=) that verify resources being subtracted are less than or equal to available resources, ensuring no negative values can occur during resource allocation or deallocation operations.
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-03-31T09:29:40.388Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In CraneSched, `phmap::parallel_flat_hash_set` from the Parallel Hashmap library (`parallel_hashmap/phmap.h`) is used for thread-safe containers. This container implements internal sharding with separate locks for different parts of the hash table, making it inherently thread-safe for concurrent operations without requiring external synchronization.
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-04-18T02:26:16.113Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched contains assertions in its overloaded operators (like -=) that prevent negative values from occurring when resources are allocated or deallocated.
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-03-31T09:29:40.388Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In the CraneSched project, the `m_allowed_certs_` in `VaultClientWrapper` is implemented as a `phmap::parallel_flat_hash_set<std::string>`, which is a thread-safe container designed for concurrent access, making it safe to use in multithreaded contexts without additional synchronization.
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-02T07:05:26.012Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:601-602
Timestamp: 2025-05-02T07:05:26.012Z
Learning: In the CraneCtld codebase, the variables m_disconnected_ and m_registered_ in CranedStub class are already defined as std::atomic_bool, making them thread-safe for concurrent access without additional synchronization.
Applied to files:
src/CraneCtld/CtldPublicDefs.hsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.
Applied to files:
src/Craned/Supervisor/TaskManager.h
🧬 Code graph analysis (3)
src/Craned/Supervisor/TaskManager.cpp (2)
src/Utilities/PublicHeader/OS.cpp (4)
CreateFoldersForFileEx(109-148)CreateFoldersForFileEx(109-110)DeleteFile(64-71)DeleteFile(64-64)src/Craned/Core/StepInstance.cpp (6)
ec(278-278)ec(297-297)GotNewStatus(336-408)GotNewStatus(336-336)Prepare(83-90)Prepare(83-83)
src/CraneCtld/CtldPublicDefs.h (1)
src/CraneCtld/CtldPublicDefs.cpp (8)
SetCranedIds(155-159)SetCranedIds(155-155)SetCranedIds(1138-1141)SetCranedIds(1138-1138)StepStatusChange(469-611)StepStatusChange(470-475)StepStatusChange(873-1069)StepStatusChange(874-879)
src/Craned/Supervisor/TaskManager.h (3)
src/Craned/Supervisor/TaskManager.cpp (14)
GotNewStatus(223-298)GotNewStatus(223-223)IsBatch(156-158)IsBatch(156-156)IsCrun(164-167)IsCrun(164-164)SupervisorFinishInit(2037-2039)SupervisorFinishInit(2037-2037)ShutdownSupervisorAsync(2045-2052)ShutdownSupervisorAsync(2045-2047)EvGrpcCheckStatusCb_(2636-2641)EvGrpcCheckStatusCb_(2636-2636)EvGrpcMigrateSshProcToCgroupCb_(2643-2672)EvGrpcMigrateSshProcToCgroupCb_(2643-2643)src/Craned/Common/CgroupManager.h (1)
CgroupInterface(389-421)src/Craned/Core/StepInstance.cpp (2)
GotNewStatus(336-408)GotNewStatus(336-336)
🪛 Cppcheck (2.19.0)
src/Craned/Supervisor/TaskManager.cpp
[warning] 539-539: Iterators to containers from different expressions 'input' and 'groups[0]' are used together.
(mismatchingContainerExpression)
🔇 Additional comments (3)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (3)
130-139: LGTM! Well-documented exit code ranges.The explicit range markers improve code clarity and align correctly with the existing constants (
kExitStatusNum,kTerminationSignalNum,kCraneExitCodeBase). The math checks out:
- Exit codes: 0-255 (256 values)
- Signal termination: 256-319 (64 values)
- Crane-defined codes: 320+
555-561: LGTM! Standard variant visitor pattern.This is a well-established C++17 idiom for visiting
std::varianttypes. The implementation correctly uses variadic templates and fold expressions to combine multiple callable objects. Placement at the end of the file is appropriate for this utility template.
116-117: ConstantskCrunFwdALLandkCrunFwdNONEare actively used.Both constants are referenced in
src/Craned/Supervisor/TaskManager.cpp(lines 588 and 590) within theCrunParseFilePattern_method, where they compare against I/O forwarding pattern strings. The constants follow naming conventions and are not orphaned.
| switch (this->type) { | ||
| case crane::grpc::Batch: | ||
| step_to_d.mutable_batch_meta()->CopyFrom(StepToCtld().batch_meta()); | ||
| break; | ||
| case crane::grpc::Interactive: | ||
| step_to_d.mutable_interactive_meta()->CopyFrom( | ||
| StepToCtld().interactive_meta()); | ||
| break; | ||
| case crane::grpc::Container: | ||
| step_to_d.mutable_container_meta()->CopyFrom(StepToCtld().container_meta()); | ||
| break; | ||
| default: | ||
| std::unreachable(); | ||
| } | ||
| if (StepToCtld().has_io_meta()) { | ||
| auto* mutable_meta = step_to_d.mutable_io_meta(); | ||
| mutable_meta->CopyFrom(StepToCtld().io_meta()); | ||
| } | ||
| step_to_d.set_sh_script(StepToCtld().sh_script()); | ||
| step_to_d.mutable_io_meta()->CopyFrom(StepToCtld().io_meta()); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid duplicating io_meta assignment in CommonStepInCtld::GetStepToD.
io_meta is first copied under if (StepToCtld().has_io_meta()) and then unconditionally copied again:
if (StepToCtld().has_io_meta()) {
auto* mutable_meta = step_to_d.mutable_io_meta();
mutable_meta->CopyFrom(StepToCtld().io_meta());
}
step_to_d.set_sh_script(StepToCtld().sh_script());
step_to_d.mutable_io_meta()->CopyFrom(StepToCtld().io_meta());The second CopyFrom is redundant and also forces step_to_d.io_meta to be present even when StepToCtld().has_io_meta() is false, which subtly changes presence semantics for consumers relying on has_io_meta(). Consider removing the unconditional copy and only propagating io_meta inside the guarded block.
🤖 Prompt for AI Agents
In src/CraneCtld/CtldPublicDefs.cpp around lines 849-869, the code copies
io_meta twice — once guarded by if (StepToCtld().has_io_meta()) and then
unconditionally via step_to_d.mutable_io_meta()->CopyFrom(...), which forces
presence even when absent; remove the unconditional
step_to_d.mutable_io_meta()->CopyFrom(StepToCtld().io_meta()) so io_meta is only
copied inside the has_io_meta() guard (leave the sh_script assignment and the
type-based meta copies unchanged).
| CraneExpected<pid_t> ProcInstance::ForkCrunAndInitIOfd_() { | ||
| auto* meta = GetCrunMeta_(); | ||
| CRANE_DEBUG("Launch crun task #{} pty: {}", task_id, | ||
| m_parent_step_inst_->pty); | ||
|
|
||
| std::set<int> opened_fd{}; | ||
| const auto open_dev_null = [this]() -> int { | ||
| int dev_null_fd = open("/dev/null", O_RDWR); | ||
| if (dev_null_fd == -1) { | ||
| CRANE_ERROR("[Task #{}] Failed to open /dev/null: {}", task_id, | ||
| std::strerror(errno)); | ||
| return -1; | ||
| } | ||
| return dev_null_fd; | ||
| }; | ||
| int stdin_local_fd = -1; | ||
| int stdout_local_fd = -1; | ||
| int stderr_local_fd = -1; | ||
|
|
||
| int to_crun_pipe[2]; | ||
| int from_crun_pipe[2]; | ||
| int from_crun_stdout_pipe[2]; | ||
| int from_crun_stderr_pipe[2]; | ||
| int crun_pty_fd = -1; | ||
|
|
||
| int open_mode = | ||
| GetParentStep().io_meta().open_mode_append() ? O_APPEND : O_TRUNC; | ||
| if (!m_parent_step_inst_->pty) { | ||
| if (pipe(to_crun_pipe) == -1) { | ||
| CRANE_ERROR("[Task #{}] Failed to create pipe for task io forward: {}", | ||
| task_id, strerror(errno)); | ||
| return std::unexpected(CraneErrCode::ERR_SYSTEM_ERR); | ||
| if (meta->fwd_stdin) { | ||
| if (pipe(to_crun_pipe) == -1) { | ||
| CRANE_ERROR("[Task #{}] Failed to create pipe for task io forward: {}", | ||
| task_id, std::strerror(errno)); | ||
| return std::unexpected(CraneErrCode::ERR_SYSTEM_ERR); | ||
| } | ||
| opened_fd.insert(to_crun_pipe[0]); | ||
| opened_fd.insert(to_crun_pipe[1]); | ||
| } else { | ||
| if (meta->parsed_input_file_pattern.empty()) { | ||
| int null_fd = open_dev_null(); | ||
| if (null_fd == -1) { | ||
| return std::unexpected(CraneErrCode::ERR_SYSTEM_ERR); | ||
| } | ||
| stdin_local_fd = null_fd; | ||
| } else { | ||
| int fd = open(meta->parsed_input_file_pattern.c_str(), O_RDONLY); | ||
| if (fd == -1) { | ||
| CRANE_ERROR("[Task #{}] Failed to open input file {}: {}", task_id, | ||
| meta->parsed_input_file_pattern, std::strerror(errno)); | ||
| return std::unexpected(CraneErrCode::ERR_SYSTEM_ERR); | ||
| } | ||
| stdin_local_fd = fd; | ||
| } | ||
| opened_fd.insert(stdin_local_fd); | ||
| } | ||
|
|
||
| if (pipe(from_crun_pipe) == -1) { | ||
| CRANE_ERROR("[Task #{}] Failed to create pipe for task io forward: {}", | ||
| task_id, strerror(errno)); | ||
| close(to_crun_pipe[0]); | ||
| close(to_crun_pipe[1]); | ||
| return std::unexpected(CraneErrCode::ERR_SYSTEM_ERR); | ||
| if (meta->fwd_stdout) { | ||
| if (pipe(from_crun_stdout_pipe) == -1) { | ||
| CRANE_ERROR("[Task #{}] Failed to create pipe for task io forward: {}", | ||
| task_id, strerror(errno)); | ||
| for (int fd : opened_fd) { | ||
| close(fd); | ||
| } | ||
| return std::unexpected(CraneErrCode::ERR_SYSTEM_ERR); | ||
| } | ||
| opened_fd.insert(from_crun_stdout_pipe[0]); | ||
| opened_fd.insert(from_crun_stdout_pipe[1]); | ||
| } else { | ||
| if (meta->parsed_output_file_pattern.empty()) { | ||
| int null_fd = open_dev_null(); | ||
| if (null_fd == -1) { | ||
| return std::unexpected(CraneErrCode::ERR_SYSTEM_ERR); | ||
| } | ||
| stdout_local_fd = null_fd; | ||
| } else { | ||
| int fd = open(meta->parsed_output_file_pattern.c_str(), | ||
| O_RDWR | O_CREAT | open_mode, 0644); | ||
| if (fd == -1) { | ||
| CRANE_ERROR("[Task #{}] Failed to open output file {}: {}", task_id, | ||
| meta->parsed_output_file_pattern, std::strerror(errno)); | ||
| return std::unexpected(CraneErrCode::ERR_SYSTEM_ERR); | ||
| } | ||
| stdout_local_fd = fd; | ||
| } | ||
| opened_fd.insert(stdout_local_fd); | ||
| } | ||
|
|
||
| if (meta->fwd_stderr) { | ||
| if (pipe(from_crun_stderr_pipe) == -1) { | ||
| CRANE_ERROR("[Task #{}] Failed to create pipe for task io forward: {}", | ||
| task_id, strerror(errno)); | ||
| for (int fd : opened_fd) { | ||
| close(fd); | ||
| } | ||
| return std::unexpected(CraneErrCode::ERR_SYSTEM_ERR); | ||
| } | ||
| opened_fd.insert(from_crun_stderr_pipe[0]); | ||
| opened_fd.insert(from_crun_stderr_pipe[1]); | ||
| } else { | ||
| if (meta->parsed_error_file_pattern.empty() && | ||
| meta->parsed_output_file_pattern.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Close local stdin/stdout/stderr FDs in the parent for non-forwarded crun IO.
ForkCrunAndInitIOfd_() opens several FDs before fork():
stdin_local_fdwhen!meta->fwd_stdinstdout_local_fdwhen!meta->fwd_stdoutstderr_local_fdwhen!meta->fwd_stderr
These are used only in the child (via SetupCrunFwdAtChild_() and dup2) and are never needed in the parent. Currently they are:
- Inserted into
opened_fdfor error cleanup onpid == -1 - Never closed in the successful parent path (only pipe ends are selectively closed in
SetupCrunFwdAtParent_())
Over many crun tasks in a long-lived supervisor this can accumulate unnecessary open file descriptors in the supervisor process.
Consider explicitly closing these local-only descriptors in the parent once fork() succeeds, e.g.:
Proposed fix sketch
pid_t pid = -1;
if (m_parent_step_inst_->pty) {
pid = forkpty(&crun_pty_fd, nullptr, nullptr, nullptr);
...
} else {
pid = fork();
...
}
+ if (pid > 0 && !m_parent_step_inst_->pty) {
+ if (!meta->fwd_stdin && stdin_local_fd != -1) {
+ close(stdin_local_fd);
+ }
+ if (!meta->fwd_stdout && stdout_local_fd != -1) {
+ close(stdout_local_fd);
+ }
+ if (!meta->fwd_stderr && stderr_local_fd != -1) {
+ close(stderr_local_fd);
+ }
+ }
if (pid == -1) {
for (int fd : opened_fd) {
close(fd);
}
}This keeps the child behavior unchanged while preventing FD leaks in the supervisor.
Also applies to: 732-775
🤖 Prompt for AI Agents
In src/Craned/Supervisor/TaskManager.cpp around lines 612-710, the function
ForkCrunAndInitIOfd_() opens stdin_local_fd, stdout_local_fd and stderr_local_fd
for use only by the child but never closes them in the successful parent path,
leaking FDs; after fork() succeeds in the parent branch, explicitly close any of
stdin_local_fd, stdout_local_fd and stderr_local_fd that are >= 0 and were
created for non-forwarded IO (i.e., when meta->fwd_stdin/stdout/stderr are
false) and remove them from opened_fd so they are not left open, while leaving
all child behavior unchanged; apply the same fix to the analogous block around
lines 732-775.
| CraneErrCode ProcInstance::Prepare() { | ||
| auto err = ITaskInstance::Prepare(); | ||
| if (err != CraneErrCode::SUCCESS) return err; | ||
|
|
||
| // Write m_meta_ | ||
| if (m_parent_step_inst_->IsBatch()) { | ||
| // Prepare file output name for batch tasks. | ||
| /* Perform file name substitutions | ||
| * %j - Job ID | ||
| * %u - Username | ||
| * %x - Job name | ||
| */ | ||
| auto meta = std::make_unique<BatchInstanceMeta>(); | ||
| meta->parsed_output_file_pattern = ParseFilePathPattern_( | ||
| GetParentStep().batch_meta().output_file_pattern(), | ||
| GetParentStep().cwd()); | ||
|
|
||
| // If -e / --error is not defined, leave | ||
| // batch_meta.parsed_error_file_pattern empty; | ||
| if (!GetParentStep().batch_meta().error_file_pattern().empty()) { | ||
| meta->parsed_error_file_pattern = ParseFilePathPattern_( | ||
| GetParentStep().batch_meta().error_file_pattern(), | ||
| GetParentStep().cwd()); | ||
| } | ||
|
|
||
| m_meta_ = std::move(meta); | ||
| } else { | ||
| m_meta_ = std::make_unique<CrunInstanceMeta>(); | ||
| } | ||
| std::filesystem::path sh_path = m_parent_step_inst_->script_path.value(); | ||
| m_meta_->parsed_sh_script_path = sh_path; | ||
| if (IsCrun()) { | ||
| auto* meta = GetCrunMeta_(); | ||
| { | ||
| auto [path, fwd] = CrunParseFilePattern_( | ||
| GetParentStep().io_meta().output_file_pattern()); | ||
| meta->fwd_stdout = fwd; | ||
| meta->parsed_output_file_pattern = path; | ||
| } | ||
| { | ||
| auto [path, fwd] = | ||
| CrunParseFilePattern_(GetParentStep().io_meta().input_file_pattern()); | ||
| meta->fwd_stdin = fwd; | ||
| meta->parsed_output_file_pattern = path; | ||
| } | ||
| { | ||
| auto [path, fwd] = | ||
| CrunParseFilePattern_(GetParentStep().io_meta().error_file_pattern()); | ||
| meta->fwd_stderr = fwd; | ||
| meta->parsed_error_file_pattern = path; | ||
| } | ||
|
|
||
| } else { | ||
| m_meta_->parsed_output_file_pattern = | ||
| ParseFilePathPattern_(GetParentStep().io_meta().output_file_pattern(), | ||
| GetParentStep().cwd(), true, nullptr); | ||
|
|
||
| // If -e / --error is not defined, leave | ||
| // parsed_error_file_pattern empty; | ||
| m_meta_->parsed_error_file_pattern = | ||
| ParseFilePathPattern_(GetParentStep().io_meta().error_file_pattern(), | ||
| GetParentStep().cwd(), false, nullptr); | ||
|
|
||
| m_meta_->parsed_input_file_pattern = | ||
| ParseFilePathPattern_(GetParentStep().io_meta().input_file_pattern(), | ||
| GetParentStep().cwd(), false, nullptr); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix crun stdin pattern handling: parsed input is written into the wrong field.
In ProcInstance::Prepare(), the crun branch initializes CrunInstanceMeta from io_meta:
{
auto [path, fwd] =
CrunParseFilePattern_(GetParentStep().io_meta().input_file_pattern());
meta->fwd_stdin = fwd;
meta->parsed_output_file_pattern = path; // <- here
}path for stdin is mistakenly stored in parsed_output_file_pattern instead of parsed_input_file_pattern. Later, ForkCrunAndInitIOfd_() decides whether to open a local stdin file based on meta->parsed_input_file_pattern.empty(), so non-forwarded stdin ends up being treated as “no input” and redirected to /dev/null.
This breaks per-task stdin redirection for crun steps.
Proposed fix
{
auto [path, fwd] =
CrunParseFilePattern_(GetParentStep().io_meta().input_file_pattern());
meta->fwd_stdin = fwd;
- meta->parsed_output_file_pattern = path;
+ meta->parsed_input_file_pattern = path;
}
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
✏️ Tip: You can customize this high-level summary in your review settings.