Skip to content

Conversation

@L-Xiafeng
Copy link
Collaborator

@L-Xiafeng L-Xiafeng commented Dec 11, 2025

Summary by CodeRabbit

  • New Features

    • Exposed I/O metadata and LicenseInfo; added node-event/querying, task exit/X11/status reporting, SSH-to-cgroup migration RPC, and a hostname helper.
  • Bug Fixes

    • Consistent script propagation for tasks/steps; unified running semantics (Configured → Starting); added UID/auth checks for submit/cancel.
  • Refactor

    • Reworked I/O/X11 forwarding and multi-task supervisor flows, added per-task cgroup/status tracking, and consolidated recovery/forwarding pipelines.

✏️ Tip: You can customize this high-level summary in your review settings.

@L-Xiafeng L-Xiafeng added the enhancement New feature or request label Dec 11, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 11, 2025

📝 Walkthrough

Walkthrough

Adds IoMeta and LicenseInfo to protobufs; renames TaskStatus value Configured→Starting; propagates io_meta/sh_script and per-task resource maps through Ctld/Craned flows; replaces SupervisorKeeper with SupervisorStub and introduces StepInstance and per-task supervisor/cgroup/I/O lifecycles; refactors cgroup ID parsing and adds util GetHostname.

Changes

Cohort / File(s) Summary
Protobufs / Public API
protos/PublicDefs.proto, protos/Crane.proto, protos/Supervisor.proto
New IoMeta and LicenseInfo; moved script/IO fields into io_meta; added io_meta, sh_script, task_res_map, per-task/task-x11/exit status IO messages; added QueryNodeEvents and MigrateSshProcToCgroup; deprecated field and enum value rename Configured→Starting.
Ctld core & scheduler
src/CraneCtld/CtldPublicDefs.h, src/CraneCtld/CtldPublicDefs.cpp, src/CraneCtld/TaskScheduler.h, src/CraneCtld/TaskScheduler.cpp
Switched craned ID container to std::vector; added ResAllocInfo/res_allocate_expt and per-task task_res_map/craned_task_map; StepStatusChange now returns optional<pair<TaskStatus,uint32_t>>; added HandleUnsetOptionalInStepToCtld to default IO fields and propagate io_meta/sh_script; signature changes for TerminateRunningStepNoLock_.
DB / Ctld RPC surface
src/CraneCtld/DbClient.cpp, src/CraneCtld/EmbeddedDbClient.cpp, src/CraneCtld/RpcService/CtldGrpcServer.cpp, src/CraneCtld/RpcService/CtldGrpcServer.h
Unified document builders to use sh_script() from Step/Task; recovery treats Starting as running; added UID auth checks for CancelTask and SubmitBatchTasks; updated StepResAllocArgs naming to res_info/allocated_res_info_expt.
Cgroup parsing & API
src/Craned/Common/CgroupManager.h, src/Craned/Common/CgroupManager.cpp
Replaced primitive job_id_t keys with CgroupStrParsedIds tuple (job, step, system flag, task); added CgroupStrByParsedIds; changed GetIds/GetMaps/BPF APIs to use parsed-id types; added SetCpuBind and KillAllProcesses(int signum) and adjusted parsing/error messages and recovery callsites.
Craned core: recovery & supervisor stub
src/Craned/Core/Craned.cpp, src/Craned/Core/CranedForPamServer.cpp, src/Craned/Core/CranedServer.cpp, src/Craned/Core/SupervisorStub.h, src/Craned/Core/SupervisorStub.cpp, src/Craned/Core/StepInstance.h, src/Craned/Core/StepInstance.cpp, src/Craned/Core/CMakeLists.txt
Replaced SupervisorKeeper with SupervisorStub; added SupervisorStub::InitAndGetRecoveredMap; added StepInstance (prepare/spawn/cleanup/GotNewStatus/ExecuteStepAsync); recovery now uses parsed-id maps and per-(job,step) SupervisorRecoverInfo; adjusted socket chmod timing and QuerySshStepEnvVariables source.
JobManager / lifecycle
src/Craned/Core/JobManager.h, src/Craned/Core/JobManager.cpp
Removed in-header StepInstance struct; added ChangeStepTimelimit and QuerySshStepEnvVariables APIs; refactored prepare/spawn/termination flows to use StepInstance/GotNewStatus; removed KillPid_/SpawnSupervisor_.
Supervisor internals (craned)
src/Craned/Supervisor/TaskManager.h, src/Craned/Supervisor/TaskManager.cpp, src/Craned/Supervisor/Supervisor.cpp, src/Craned/Supervisor/SupervisorPublicDefs.h, src/Craned/Supervisor/SupervisorServer.h, src/Craned/Supervisor/SupervisorServer.cpp
Reworked StepInstance to support per-step X11 meta, task_ids from step.task_res_map(), status tracking and GotNewStatus; TaskManager supports multiple TaskInstances per step, ITaskInstance lifecycle, async hooks (SupervisorFinishInit, ShutdownSupervisorAsync, CheckStatusAsync, MigrateSshProcToCgroupAsync) and event callbacks; added Supervisor RPC handler for MigrateSshProcToCgroup; removed global runtime status.
Supervisor cfored / X11 / IO forwarding
src/Craned/Supervisor/CforedClient.h, src/Craned/Supervisor/CforedClient.cpp
Reworked X11 forwarding to per-connection local IDs and X11FdInfo map; introduced unified FwdRequest queue with variant payloads (stdout/stderr/X11/exit); updated signatures for InitUvX11FwdHandler, InitFwdMetaAndUvStdoutFwdHandler, TaskOutPutForward; added TaskErrOutPutForward, TaskX11ConnectForward, TaskX11OutPutForward, TaskX11OutputFinish.
Supervisor task lifecycle & TaskManager runtime
src/Craned/Supervisor/TaskManager.*
Large refactor: per-task cgroups, per-task ITaskInstance Prepare/Cleanup/InitEnvMap, updated file-pattern parsing, xauth/X11 setup/cleanup, SSH process migration to cgroups, new async helpers and event-driven lifecycle handling.
Utilities / Public headers
src/Utilities/PublicHeader/include/crane/OS.h, src/Utilities/PublicHeader/OS.cpp, src/Utilities/PublicHeader/include/crane/PublicHeader.h, src/Utilities/PublicHeader/include/crane/String.h
Added util::os::GetHostname() returning CraneExpected<string>; added Crun exit-code constant and public VariantVisitor template + deduction guide; strengthened range/concept constraints for string utilities; changed displayed step status string "Configured""Starting".

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client as Client/API
  participant Scheduler as TaskScheduler
  participant Ctld as Ctld
  participant Craned as Craned
  participant SupervisorStub as SupervisorStub
  participant Supervisor as Supervisor/TaskManager

  Note over Client,Scheduler: Submit step/task (includes io_meta, sh_script)
  Client->>Scheduler: SubmitStep(StepToCtld with io_meta, sh_script)
  Scheduler->>Scheduler: HandleUnsetOptionalInStepToCtld (default IO fields)
  Scheduler->>Ctld: AcquireStepAttributes / Queue step
  Ctld->>Craned: StepToD (includes io_meta, sh_script, task_res_map)
  Craned->>SupervisorStub: InitAndGetRecoveredMap / Create SupervisorStub
  Craned->>SupervisorStub: Spawn/ExecuteStep (per-step)
  SupervisorStub->>Supervisor: ExecuteStep → spawn per-task instances
  Supervisor->>CforedClient: Setup I/O/X11 forwarding (local IDs, FwdRequest queue)
  Supervisor->>CgroupManager: Allocate per-step / per-task cgroups
  Supervisor->>Ctld: StepStatusChangeAsync (final status/exit when ready)
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~180 minutes

Possibly related PRs

Suggested reviewers

  • NamelessOIer
  • Nativu5

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 2.19% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main feature: batch job support for input/output/error file pattern options, which aligns with the extensive changes throughout the codebase for IO handling.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch dev/io

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@L-Xiafeng L-Xiafeng marked this pull request as ready for review December 16, 2025 05:33
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/Utilities/PublicHeader/OS.cpp (1)

277-285: Consider refactoring GetNodeInfo to use this function.

The GetNodeInfo function (lines 36-42) contains duplicate gethostname() logic. You could refactor it to call GetHostname() instead:

 bool GetNodeInfo(NodeSpecInfo* info) {
   if (!info) return false;
 
-  char hostname[HOST_NAME_MAX + 1];
-  if (gethostname(hostname, sizeof(hostname)) != 0) {
-    int err = errno;
-    fmt::print(stderr, "gethostname failed: errno={} ({})\n", err,
-               strerror(err));
+  auto hostname_result = GetHostname();
+  if (!hostname_result) {
     return false;
   }
+  std::string hostname = std::move(*hostname_result);

This would centralize the hostname retrieval logic and improve consistency. Based on learnings, you may prefer to defer this until more usage patterns emerge.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8418cc0 and 2ea3eec.

📒 Files selected for processing (10)
  • protos/PublicDefs.proto (4 hunks)
  • src/CraneCtld/CtldPublicDefs.cpp (4 hunks)
  • src/CraneCtld/CtldPublicDefs.h (2 hunks)
  • src/CraneCtld/DbClient.cpp (4 hunks)
  • src/CraneCtld/TaskScheduler.cpp (3 hunks)
  • src/CraneCtld/TaskScheduler.h (1 hunks)
  • src/Craned/Supervisor/TaskManager.cpp (4 hunks)
  • src/Craned/Supervisor/TaskManager.h (2 hunks)
  • src/Utilities/PublicHeader/OS.cpp (1 hunks)
  • src/Utilities/PublicHeader/include/crane/OS.h (1 hunks)
🧰 Additional context used
🧠 Learnings (13)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/SupervisorPublicDefs.h:32-37
Timestamp: 2025-05-26T11:06:28.796Z
Learning: The user L-Xiafeng prefers to defer refactoring duplicate definitions until they become a larger pattern in the codebase, rather than addressing individual instances immediately.
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/CraneCtld/TaskScheduler.h
  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/CraneCtld/TaskScheduler.h
  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.

Applied to files:

  • src/CraneCtld/TaskScheduler.h
  • src/CraneCtld/CtldPublicDefs.h
  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-12-08T08:11:40.332Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.332Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.

Applied to files:

  • src/CraneCtld/TaskScheduler.h
  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.

Applied to files:

  • src/CraneCtld/TaskScheduler.h
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/CraneCtld/TaskScheduler.h
  • src/Utilities/PublicHeader/OS.cpp
📚 Learning: 2025-05-25T04:11:27.740Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:396-396
Timestamp: 2025-05-25T04:11:27.740Z
Learning: In TaskManager.cpp, GetCrunMeta() calls don't need null checks because they're only called in contexts where the task is guaranteed to be a CRUN task (e.g., SetupChildProcessCrunX11_ is only called when step->IsCrun() && x11() conditions are met), ensuring the metadata will always be valid.

Applied to files:

  • src/CraneCtld/TaskScheduler.h
  • src/CraneCtld/CtldPublicDefs.h
  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.

Applied to files:

  • src/CraneCtld/TaskScheduler.h
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/CraneCtld/TaskScheduler.h
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-25T04:08:03.273Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/Supervisor.cpp:68-69
Timestamp: 2025-05-25T04:08:03.273Z
Learning: In the Crane supervisor component (src/Craned/Supervisor/), the supervisor process communicates with Craned through STDOUT using protobuf messages. The supervisor must not send any information to STDOUT before sending the "ready" message to Craned, as this would interfere with the inter-process communication protocol. Therefore, adding logging statements that might write to STDOUT before the ready message is sent could break the communication.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-04-15T02:17:20.669Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 477
File: src/Craned/TaskManager.cpp:781-798
Timestamp: 2025-04-15T02:17:20.669Z
Learning: The user prefers to implement separate stderr pattern handling for crun tasks only when there's a specific requirement for it, rather than implementing it proactively. The current implementation only supports redirecting stdout to a file with the `--output` option.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
🧬 Code graph analysis (5)
src/Utilities/PublicHeader/include/crane/OS.h (1)
src/Utilities/PublicHeader/OS.cpp (2)
  • GetHostname (277-285)
  • GetHostname (277-277)
src/CraneCtld/TaskScheduler.h (2)
src/CraneCtld/TaskScheduler.cpp (2)
  • HandleUnsetOptionalInStepToCtld (4276-4284)
  • HandleUnsetOptionalInStepToCtld (4276-4277)
src/CraneCtld/CtldPublicDefs.h (2)
  • StepInCtld (647-671)
  • StepInCtld (673-1098)
src/CraneCtld/CtldPublicDefs.h (1)
src/Craned/Supervisor/TaskManager.cpp (6)
  • IsBatch (74-76)
  • IsBatch (74-74)
  • IsCalloc (87-90)
  • IsCalloc (87-87)
  • IsCrun (82-85)
  • IsCrun (82-82)
src/CraneCtld/TaskScheduler.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (160-160)
src/Craned/Supervisor/TaskManager.cpp (2)
src/Utilities/PublicHeader/OS.cpp (2)
  • GetHostname (277-285)
  • GetHostname (277-277)
src/CraneCtld/CtldPublicDefs.cpp (1)
  • result (66-119)
🪛 Cppcheck (2.18.0)
src/Craned/Supervisor/TaskManager.cpp

[warning] 330-330: Iterators to containers from different expressions 'input' and 'groups[0]' are used together.

(mismatchingContainerExpression)

🔇 Additional comments (31)
src/Utilities/PublicHeader/include/crane/OS.h (1)

84-85: LGTM!

The function declaration is well-placed and uses the modern CraneExpected pattern for error handling.

src/Utilities/PublicHeader/OS.cpp (1)

277-285: Implementation is correct and well-structured.

The function properly handles errors using CraneExpected, uses an appropriately sized buffer, and follows the logging conventions of the codebase.

src/CraneCtld/DbClient.cpp (4)

1883-1883: LGTM!

The script field now correctly sources from the unified task_to_ctld.sh_script() location, consistent with the PR's goal to centralize IO-related metadata.


1911-1912: LGTM!

The type check now correctly extracts container metadata only for Container tasks. The script is sourced from the unified location on line 1908, and the Batch-specific path has been appropriately removed as part of the consolidation.


2006-2006: LGTM!

The script field now sources from the unified step->StepToCtld().sh_script() location, eliminating the conditional Batch-specific path and aligning with the centralized IO metadata design.


2076-2076: LGTM!

The script field now sources from the unified step_to_ctld.sh_script() location, consistent with the refactoring to centralize script content and IO metadata.

src/CraneCtld/TaskScheduler.h (1)

899-899: LGTM! Consistent addition for step-level IO metadata handling.

The new HandleUnsetOptionalInStepToCtld method mirrors the existing HandleUnsetOptionalInTaskToCtld pattern and properly handles default values for optional IO metadata fields in steps.

src/CraneCtld/CtldPublicDefs.cpp (3)

399-400: LGTM! Clear documentation of daemon step behavior.

The comments correctly document that daemon steps don't require batch/interactive/IO metadata or script content, which helps maintain clarity in the step initialization logic.


682-685: LGTM! Correct IO metadata and script propagation.

The code properly checks for has_io_meta() before copying, and unconditionally copies sh_script (which is safe for string fields). This ensures IO configuration and script content are correctly propagated from job to primary step.


806-810: LGTM! Consistent propagation to StepToD.

The IO metadata and script handling mirrors the pattern in InitPrimaryStepFromJob (lines 682-685), maintaining consistency in how these fields are transferred from StepToCtld to StepToD.

src/CraneCtld/CtldPublicDefs.h (2)

550-560: LGTM! Useful helper methods following established patterns.

The new IsBatch(), IsCalloc(), and IsCrun() methods provide convenient type checking for steps and follow the same pattern as the existing TaskInCtld helpers (lines 828-834). The logic correctly identifies step types by checking both the type field and the interactive_type subfield where applicable.

Note: Similar helpers exist in StepInstance (src/Craned/Supervisor/TaskManager.cpp lines 73-89), creating some code duplication. Based on learnings, you prefer to defer refactoring duplicates until they become a larger pattern.


835-839: LGTM! Completes the helper method set for TaskInCtld.

The IsCrun() addition to TaskInCtld complements the existing IsCalloc() method and follows the same implementation pattern. This provides symmetry with the step-level helpers added above.

src/CraneCtld/TaskScheduler.cpp (3)

2828-2836: LGTM! Proper integration of the new handler with error propagation.

The chained validation flow correctly:

  1. Initializes optional IO fields via HandleUnsetOptionalInStepToCtld
  2. Acquires step attributes
  3. Checks step validity
  4. Propagates the first error encountered via std::unexpected

4078-4087: LGTM! Migration from batch_meta to io_meta for batch tasks.

The refactoring correctly moves the open_mode_append default initialization from batch_meta to the new io_meta structure, aligning with the centralized IO metadata pattern.


4276-4284: Verify the conditional logic difference between task and step handlers.

The task handler checks task->IsBatch() before setting IO defaults, but the step handler checks step->StepToCtld().has_io_meta(). This means:

  • Tasks: Defaults are set for all batch tasks (even if io_meta is empty)
  • Steps: Defaults are only set if io_meta is explicitly provided

Is this intentional? If a batch step doesn't explicitly provide io_meta, it won't get the open_mode_append default applied.

Consider whether the step handler should also check the step type (e.g., step->IsBatch() if such a method exists) for consistency:

 CraneExpected<void> TaskScheduler::HandleUnsetOptionalInStepToCtld(
     StepInCtld* step) {
-  if (step->StepToCtld().has_io_meta()) {
+  if (step->IsBatch() && step->StepToCtld().has_io_meta()) {
     auto* io_meta = step->MutableStepToCtld()->mutable_io_meta();
     if (!io_meta->has_open_mode_append())
       io_meta->set_open_mode_append(g_config.JobFileOpenModeAppend);
   }
   return {};
 }
protos/PublicDefs.proto (5)

162-163: LGTM! IO metadata centralization at the task level.

Moving io_meta and sh_script to TaskToCtld allows unified access to IO configuration regardless of task type, while still maintaining the type-specific payload via the oneof field.


242-244: LGTM! Consistent structure with TaskToCtld.

The step message mirrors the task message structure for IO metadata, enabling consistent handling across task and step submission paths.


324-325: LGTM! IO metadata propagated to daemon communication.

Adding io_meta and sh_script to StepToD ensures the IO configuration reaches the craned for execution.


360-366: LGTM! Well-structured IO metadata message.

The TaskIoMeta message appropriately:

  • Uses optional for open_mode_append to enable presence tracking
  • Provides fields for input/output/error file patterns as indicated by the PR title

368-370: Verify deployment compatibility for this breaking change.

BatchTaskAdditionalMeta has been significantly restructured:

  • Removed: sh_script, open_mode_append (moved to TaskIoMeta/parent)
  • Changed: interpreter is now field 1

Ensure all components (CraneCtld, Craned, clients) are deployed atomically. Rolling deployments with mixed versions could cause data loss or parsing failures for batch task metadata.

src/Craned/Supervisor/TaskManager.h (3)

173-175: LGTM: IO pattern fields unified in TaskInstanceMeta.

Moving these fields from BatchInstanceMeta to TaskInstanceMeta aligns with the PR's goal of centralizing IO-related metadata. This enables IO pattern handling for both batch and interactive tasks.


178-180: LGTM: BatchInstanceMeta correctly inherits IO pattern fields.

The inheritance structure ensures BatchInstanceMeta retains access to the IO pattern fields now defined in the parent TaskInstanceMeta.


377-379: LGTM: ParseFilePathPattern_ signature enhanced with is_stdout flag.

The additional is_stdout parameter enables differentiated handling for stdout patterns (which get defaults) versus stderr/stdin patterns (which remain empty when unspecified).

src/Craned/Supervisor/TaskManager.cpp (8)

53-53: LGTM: Unified script access simplifies code.

Using the unified sh_script() field removes the need for type-specific branching, aligning with the PR's metadata reorganization.


256-269: LGTM: Default path logic correctly differentiates stdout from other IO patterns.

When the pattern is empty, stdout receives a default path (Crane-<JobID>.<StepID>.out in cwd), while stdin/stderr remain empty. This aligns with common batch system behavior.


271-275: Clarify the backslash escape handling.

The code removes all backslashes from the pattern and returns immediately. This appears to be an escape mechanism, but it's unclear if this is the intended behavior:

  • Should \\ become an empty string?
  • Should this support escaping specific characters like \% to produce a literal %?

Can you clarify the intended behavior of backslash handling in file patterns? The current implementation removes all backslashes, which may not provide the flexibility users expect for escaping special characters.


318-367: Address iterator mismatch warning from static analysis.

The static analysis tool flags line 330 for using iterators from different expressions (input and groups[0]). While this code likely works correctly with RE2's API, the warning suggests potential undefined behavior.

Verify that the RE2 library guarantees groups[0] iterators are valid sub-ranges of input. The code assumes groups[0].begin() points into the same string as input.begin(), which should be documented in RE2's API contract.

As per static analysis hints from Cppcheck.


571-588: LGTM: Stdin handling properly implemented.

The code correctly handles two cases:

  1. Empty stdin pattern: Closes STDIN_FILENO for batch tasks to prevent blocking on input
  2. Non-empty pattern: Opens the specified input file and redirects it to STDIN_FILENO

The logging provides clear visibility into stdin configuration.


593-621: LGTM: Enhanced stdout/stderr logging improves debuggability.

The task-specific logging messages clearly indicate which file is being used for stdout and stderr redirection, making troubleshooting easier.


1264-1279: LGTM: IO pattern parsing correctly uses unified io_meta() and is_stdout flag.

The code properly:

  1. Uses io_meta() for all three patterns (output, error, input) instead of type-specific metadata
  2. Sets is_stdout=true only for output_file_pattern to enable default path generation
  3. Sets is_stdout=false for error and input patterns, which remain empty when unspecified
  4. Conditionally parses error and input patterns only when non-empty

277-290: GetHostname() implementation has adequate error handling.

The util::os::GetHostname() function properly handles edge cases: it uses a correctly-sized buffer (HOST_NAME_MAX + 1), checks the gethostname() return value, logs errors with errno details, and returns a CraneExpected type that signals failures. When hostname resolution fails in TaskManager.cpp, the code correctly checks for the error condition with if (!hostname_expt) and falls back to returning the pattern unmodified. This is intentional design—the function does not leave the system in an undefined state. While placeholders like %N may remain unreplaced on hostname resolution failure, this is a known limitation of the current approach rather than a defect in the GetHostname() implementation itself.

Comment on lines +284 to +494
auto node_it = std::ranges::find(grpc_nodelist, g_config.CranedIdOfThisNode);
if (node_it == grpc_nodelist.end()) {
CRANE_ERROR("Failed to find current node {} in nodelist",
g_config.CranedIdOfThisNode);
return path_to_parse;
}
int node_index = std::distance(grpc_nodelist.begin(), node_it);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Handle missing node in nodelist gracefully.

If the current node is not found in the nodelist (line 285), an error is logged but the function returns path_to_parse with unreplaced %n placeholders. This could lead to file paths with literal %n in them.

Consider handling this case more gracefully:

   auto node_it = std::ranges::find(grpc_nodelist, g_config.CranedIdOfThisNode);
   if (node_it == grpc_nodelist.end()) {
     CRANE_ERROR("Failed to find current node {} in nodelist",
                 g_config.CranedIdOfThisNode);
-    return path_to_parse;
+    // Continue with node_index = 0 as fallback
+    node_index = 0;
+  } else {
+    node_index = std::distance(grpc_nodelist.begin(), node_it);
   }
-  int node_index = std::distance(grpc_nodelist.begin(), node_it);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto node_it = std::ranges::find(grpc_nodelist, g_config.CranedIdOfThisNode);
if (node_it == grpc_nodelist.end()) {
CRANE_ERROR("Failed to find current node {} in nodelist",
g_config.CranedIdOfThisNode);
return path_to_parse;
}
int node_index = std::distance(grpc_nodelist.begin(), node_it);
auto node_it = std::ranges::find(grpc_nodelist, g_config.CranedIdOfThisNode);
int node_index;
if (node_it == grpc_nodelist.end()) {
CRANE_ERROR("Failed to find current node {} in nodelist",
g_config.CranedIdOfThisNode);
// Continue with node_index = 0 as fallback
node_index = 0;
} else {
node_index = std::distance(grpc_nodelist.begin(), node_it);
}
🤖 Prompt for AI Agents
In src/Craned/Supervisor/TaskManager.cpp around lines 284 to 290, when the
current node is not found the function logs an error and returns path_to_parse
with unreplaced %n placeholders; instead, log a warning about the missing node,
set a safe fallback node_index (e.g. 0) to use for placeholder replacement, and
proceed with the normal %n replacement so the returned path contains no literal
"%n" (ensure the warning includes the missing node id and that we continue
rather than returning early).

L-Xiafeng and others added 7 commits December 20, 2025 18:15
Signed-off-by: Li Junlin <[email protected]>

fix: compile

Signed-off-by: Li Junlin <[email protected]>

fix: AI comments

Signed-off-by: Li Junlin <[email protected]>

fix

Signed-off-by: Li Junlin <[email protected]>

fix: Comments

Signed-off-by: Li Junlin <[email protected]>

style: auto format with clang-format.

bugfix X11 multi conn

Signed-off-by: Li Junlin <[email protected]>

fix: X11 multi conn

Signed-off-by: Li Junlin <[email protected]>

WIP: X11

Signed-off-by: Li Junlin <[email protected]>

fix: X11 forwarding

Signed-off-by: Li Junlin <[email protected]>

fix: Subprocess cg

Signed-off-by: Li Junlin <[email protected]>

fix: Only init cfored for ia task

Signed-off-by: Li Junlin <[email protected]>

fix: Exit reason

Signed-off-by: Li Junlin <[email protected]>

fix: Deadlock

Signed-off-by: Li Junlin <[email protected]>

style: auto format with clang-format.

fix

Signed-off-by: Li Junlin <[email protected]>

fix: tasks step exit code

Signed-off-by: Li Junlin <[email protected]>

fix: Share mem for tasks in a same step on node

Signed-off-by: Li Junlin <[email protected]>

feat: Support task execution

Signed-off-by: Li Junlin <[email protected]>

fix: Cgroup recovery

Signed-off-by: Li Junlin <[email protected]>

refactor: Use std::chrono_literals

Signed-off-by: Li Junlin <[email protected]>

refactor

Signed-off-by: Li Junlin <[email protected]>

refactor: PAM

Signed-off-by: Li Junlin <[email protected]>

move user cgroup of daemon step to Supervisor

Signed-off-by: Li Junlin <[email protected]>

Update TaskScheduler.cpp

feat: Step cgroup recovery

Signed-off-by: Li Junlin <[email protected]>

refactor: Supervisor step status.

Signed-off-by: Li Junlin <[email protected]>

refactor: Step instance

Signed-off-by: Li Junlin <[email protected]>

feat: Add CPU binding functionality for cgroup v1 and v2

Signed-off-by: Li Junlin <[email protected]>
Signed-off-by: junlinli <[email protected]>
Signed-off-by: Li Junlin <[email protected]>

fix: submit time,start time, supervisor coredump

Signed-off-by: Li Junlin <[email protected]>
Signed-off-by: junlinli <[email protected]>
Signed-off-by: junlinli <[email protected]>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)
src/CraneCtld/TaskScheduler.cpp (1)

1070-1092: Fix capture-by-reference bug in AllocJobs lambda

In the craned_alloc_job_map loop, the lambda passed to m_rpc_worker_pool_->detach_task captures by reference ([&]) and uses craned_id and jobs, which are loop-local variables:

for (auto&& iter : craned_alloc_job_map) {
  CranedId const& craned_id = iter.first;
  std::vector<crane::grpc::JobToD>& jobs = iter.second;

  m_rpc_worker_pool_->detach_task([&] {
    auto stub = g_craned_keeper->GetCranedStub(craned_id);
    // uses jobs...
  });
}

All detached tasks will see the last values of craned_id and jobs, so all RPCs and failure bookkeeping target the same node and job set, which is a correctness and scheduling bug.

Capture craned_id and jobs by value in the lambda instead:

Proposed fix
-      std::latch alloc_job_latch(craned_alloc_job_map.size());
+      std::latch alloc_job_latch(craned_alloc_job_map.size());
       for (auto&& iter : craned_alloc_job_map) {
         CranedId const& craned_id = iter.first;
         std::vector<crane::grpc::JobToD>& jobs = iter.second;

-        m_rpc_worker_pool_->detach_task([&] {
+        m_rpc_worker_pool_->detach_task([&, craned_id, jobs] {
           auto stub = g_craned_keeper->GetCranedStub(craned_id);
           CRANE_TRACE("Send AllocJobs for {} tasks to {}", jobs.size(),
                       craned_id);
           if (stub == nullptr || stub->Invalid()) {
             CRANE_TRACE(
                 "AllocJobs for jobs [{}] to {} failed: Craned down.",
                 absl::StrJoin(
                     jobs | std::views::transform(
                                [](const crane::grpc::JobToD& job_to_d) {
                                  return std::to_string(job_to_d.job_id());
                                }),
                     ","),
                 craned_id);
             absl::MutexLock lk(&thread_pool_mtx);
             for (const auto& job_to_d : jobs)
               failed_job_id_set.emplace(job_to_d.job_id());
             alloc_job_latch.count_down();
             return;
           }
           // ... rest unchanged ...
         });
       }

Also applies to: 1094-1152, 1153-1157, 1159-1210

src/Craned/Supervisor/TaskManager.cpp (1)

725-752: Close stdin_fd after dup2 in batch FD setup

In ProcInstance::SetChildProcBatchFd_, when stdin_file_path is non-empty you open the input file, dup2 it onto STDIN_FILENO, but never close stdin_fd. Since this happens in the child before execv, that extra FD will remain open across exec, which is unnecessary and slightly increases per-process FD usage.

After a successful dup2(stdin_fd, STDIN_FILENO), it’s safe to close(stdin_fd).

Proposed minimal change
    int stdin_fd = open(stdin_file_path.c_str(), O_RDONLY);
    if (stdin_fd == -1) {
      fmt::println(stderr, "[Task #{}] Subprocess error: open stdin: {}. {}\n",
                   task_id, stdin_file_path, strerror(errno));
      return CraneErrCode::ERR_SYSTEM_ERR;
    }
    fmt::println(stderr, "[Task #{}] Subprocess stdin: {}", task_id,
                 stdin_file_path);
    // STDIN -> input file
-    if (dup2(stdin_fd, STDIN_FILENO) == -1) std::exit(1);
+    if (dup2(stdin_fd, STDIN_FILENO) == -1) std::exit(1);
+    close(stdin_fd);
src/Craned/Core/CranedForPamServer.cpp (1)

242-249: Ensure PAM Unix socket permissions are actually tightened

chmod(listen_conf.UnixSocketForPamListenAddr.c_str(), 0600); is currently called before builder.BuildAndStart(), i.e., likely before gRPC creates the Unix socket file. In that case, chmod will fail and the socket keeps its default permissions, but the error is ignored.

To guarantee restrictive permissions:

  • Move the chmod call after m_server_ = builder.BuildAndStart();, and
  • Check the return value and log on failure.

This keeps the PAM socket properly locked down to root as intended by the comment.

Illustrative adjustment
  builder.RegisterService(m_service_impl_.get());
-  // Pam run as root
-  chmod(listen_conf.UnixSocketForPamListenAddr.c_str(), 0600);
-
-  m_server_ = builder.BuildAndStart();
+  m_server_ = builder.BuildAndStart();
+  // Pam runs as root; restrict Unix socket permissions.
+  if (chmod(listen_conf.UnixSocketForPamListenAddr.c_str(), 0600) != 0) {
+    CRANE_WARN("Failed to chmod PAM Unix socket {}: {}",
+               listen_conf.UnixSocketForPamListenAddr, strerror(errno));
+  }
src/CraneCtld/CtldPublicDefs.cpp (1)

467-523: Daemon step status change: interactive-meta handling and primary-step fallback

The reworked DaemonStepInCtld::StepStatusChange nicely covers:

  • Configuring→Running transition with creation of a primary step and population of craned_step_alloc_map.
  • Configure-failed path that frees daemon resources and now returns a (status, exit_code) pair even when the primary step was never created.
  • Interactive jobs: invoking cb_step_cancel and cb_step_completed when daemon configuration fails.

Two small points to consider:

  • In the interactive configure-failed branch you copy InteractiveMeta out of job->meta (auto meta = std::get<InteractiveMeta>(job->meta);). If any later code relies on has_been_cancelled_on_front_end in job->meta, those flag updates won’t persist. If you need the flag later, switch this to a reference (auto& meta = std::get<InteractiveMeta>(job->meta);).
  • When the daemon finishes successfully but PrimaryStepStatus() is still Invalid, you now return the daemon’s own status/exit code, which is a sensible fallback for “daemon-only” jobs; just ensure callers tolerate both primary and daemon-originated pairs.

Also applies to: 547-600

src/Craned/Supervisor/CforedClient.cpp (1)

117-197: X11 connection bookkeeping: consider attaching X11FdInfo to the handle and avoid double-closing

Two small X11-related issues:

  • In SetupX11forwarding_, you construct x11_fd_info and store it in m_x11_fd_info_map_, but sock->data<X11FdInfo>() is never initialized. The end_event/error_event callbacks therefore see a null data<X11FdInfo>() and never actually set sock_stopped. Either drop sock_stopped entirely or call sock->data(x11_fd_info); when accepting the connection so the callbacks operate on real state.
  • In the TASK_X11_INPUT handling path, you pass eof as close_fd to WriteStringToFd_ and also call x11_fd_info->sock->close() on EOF. That results in the underlying socket fd being closed twice (once by close(fd), once by libuv/uvw). It would be safer to pass close_fd = false for X11 and rely solely on sock->close() for lifetime management.

Also applies to: 724-783

src/Craned/Common/CgroupManager.cpp (2)

494-552: Fix file descriptor leak and complete BPF recovery logic in RecoverFromResInNode

RecoverFromResInNode has two issues:

  1. File descriptor leak: Opens cgroup_fd at line 1468 but never closes it before returning at line 1502. Compare with SetDeviceAccess (line 1376) which properly calls close(cgroup_fd) before return.

  2. Incomplete recovery: Only populates m_cgroup_bpf_devices and sets m_bpf_attached_ = true, but doesn't actually restore device filters to the eBPF map or attach the program. Unlike SetDeviceAccess, it skips the bpf_map__update_elem and bpf_prog_attach calls, meaning the recovered cgroup has no active device restrictions.

Add close(cgroup_fd); before the final return true;, and consider mirroring the full SetDeviceAccess logic (updating the BPF map and conditionally attaching the program) to ensure recovery actually restores device access controls.

Point 2 about GetJobBpfMapCgroupsV2_ using CgroupStrParsedIds keys is correct and the CRANE_ASSERT is appropriate.


279-308: Fix RE2 pattern matching to use plain types instead of std::optional pointers

RE2::FullMatch does not support std::optional<T>* arguments—it expects plain types or pointers to plain types. In ParseIdsFromCgroupStr_, capture into plain temporary variables (std::string, uint32_t) instead, then assign into the optionals after a successful match:

-  std::optional<job_id_t> job_id;
-  std::optional<step_id_t> step_id;
-  std::optional<task_id_t> task_id;
-  std::optional<std::string> system_or_user;
+  std::string job_id_str, step_id_str, system_or_user_str, task_id_str;
   CgroupStrParsedIds parsed_ids{};
-  if (RE2::FullMatch(cgroup_str, *cg_pattern, &job_id, &step_id,
-                     &system_or_user, &task_id)) {
-    std::get<CgConstant::KParsedJobIdIdx>(parsed_ids) = job_id;
-    std::get<CgConstant::KParsedStepIdIdx>(parsed_ids) = step_id;
-    std::get<CgConstant::KParsedSystemFlagIdx>(parsed_ids) =
-        system_or_user.has_value() && system_or_user.value() == "system";
-    std::get<CgConstant::KParsedTaskIdIdx>(parsed_ids) = task_id;
+  if (RE2::FullMatch(cgroup_str, *cg_pattern, &job_id_str, &step_id_str,
+                     &system_or_user_str, &task_id_str)) {
+    if (!job_id_str.empty()) {
+      auto& job_id_opt = std::get<CgConstant::KParsedJobIdIdx>(parsed_ids);
+      absl::SimpleAtoi(job_id_str, &job_id_opt.emplace());
+    }
+    if (!step_id_str.empty()) {
+      auto& step_id_opt = std::get<CgConstant::KParsedStepIdIdx>(parsed_ids);
+      absl::SimpleAtoi(step_id_str, &step_id_opt.emplace());
+    }
+    if (!task_id_str.empty()) {
+      auto& task_id_opt = std::get<CgConstant::KParsedTaskIdIdx>(parsed_ids);
+      absl::SimpleAtoi(task_id_str, &task_id_opt.emplace());
+    }
+    if (!system_or_user_str.empty())
+      std::get<CgConstant::KParsedSystemFlagIdx>(parsed_ids) =
+          (system_or_user_str == "system");

This ensures parsed IDs are correctly populated before being used in GetIdsFromCgroupV1_, GetIdsFromCgroupV2_, GetCgInoJobIdMapCgroupV2_, GetIdsByPid, and GetJobBpfMapCgroupsV2_.

♻️ Duplicate comments (1)
src/Craned/Supervisor/TaskManager.cpp (1)

457-571: Pattern parsing behavior for missing node index and cppcheck warning

Two points in ProcInstance::ParseFilePathPattern_:

  1. When the current node is not found in nodelist, the function logs an error and returns path_to_parse unchanged. This leaves %n (and potentially other % specifiers) in the final path. If this is not intended, you could fall back to a sane default index (e.g. 0) and still perform replacement so the returned path is always fully expanded. This matches the prior suggestion in earlier review feedback.

  2. Cppcheck’s mismatchingContainerExpression warning on the std::distance(input.begin() + cursor, groups[0].begin()) line is a false positive in this RE2/StringPiece context, but if you want to silence it you could rewrite the prefix-length computation using pointer arithmetic on input.data()/groups[0].data() instead of iterators, which keeps the intent but avoids mixing iterator sources.

🧹 Nitpick comments (19)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)

543-549: VariantVisitor utility is appropriate for centralizing std::variant visitation

The VariantVisitor helper and its deduction guide are idiomatic and let call sites (e.g., in TaskScheduler.cpp) compose visitors without local boilerplate. No issues spotted with this addition.

src/CraneCtld/TaskScheduler.h (1)

928-928: Signature change to CommonStepInCtld matches implementation usage*

Updating TerminateRunningStepNoLock_ to accept CommonStepInCtld* (rather than StepInCtld*) is consistent with the implementation, which relies on CommonStepInCtld-specific fields for interactive steps, and matches call sites that only pass common steps.

src/CraneCtld/TaskScheduler.cpp (3)

1668-1693: Interactive-step termination guard and queueing logic look correct

TerminateRunningStepNoLock_ correctly avoids enqueueing duplicate cancel requests for interactive steps by toggling ia_meta.has_been_terminated_on_craned, while always scheduling CancelRunningTaskQueueElem for non-interactive steps across all execution nodes. The function assumes callers hold the appropriate locks (per NoLock_ naming), which is satisfied in the current call sites.


2552-2563: VariantVisitor-based std::visit cleans up cancellation dispatch

Using the shared VariantVisitor helper in std::visit for CancelTaskQueueElem nicely centralizes handling of pending tasks, running tasks, and pending steps, and keeps ownership transfer for each case clear (pending_task_ptr_vec, running_task_craned_id_map, pending_step_ptr_vec). No functional issues spotted here.


2945-2951: Use of StepStatusChange return value and added trace improves observability

Handling the new StepStatusChange return type as an optional {TaskStatus, exit_code} pair and logging

CRANE_TRACE("[Job #{}] Completed with status {}.", task_id,
            job_finished_status.value());

before applying status/exit code to the job improves traceability of job completion. The subsequent end-time adjustment logic remains unchanged and still guards against large drifts beyond the time limit.

src/Utilities/PublicHeader/include/crane/String.h (1)

149-167: Template constraints for StepToD and JobStepsToString are reasonable

Confining StepToDRangeIdString to std::ranges::range<R> where range_value_t<R> == crane::grpc::StepToD, and introducing MapRange plus additional constraints for JobStepsToString, makes these helpers harder to misuse and yields clearer compile-time errors when called with incompatible containers, while preserving existing call patterns in the codebase.

src/Craned/Core/SupervisorStub.cpp (1)

35-92: Supervisor recovery via socket directory scan is well-structured

InitAndGetRecoveredMap():

  • Filters step_*.sock entries under kDefaultSupervisorUnixSockDir via LazyRE2.
  • Spawns one task per socket on g_thread_pool, each creating a SupervisorStub, calling CheckStatus(), and either removing dead sockets or inserting {(job_id, step_id) → {pid, status, stub}} into the map under a mutex.
  • Uses a std::latch sized to files.size() so the function only returns once all probes have completed.

This gives a clear, bounded recovery pass with appropriate logging and cleanup; the function’s CraneExpected error path only triggers on unexpected filesystem exceptions.

src/Craned/Common/CgroupManager.h (1)

53-62: Cpuset controller/file additions and SetCpuBind hooks are coherent

Adding CPUSET_CONTROLLER/CPUSET_CONTROLLER_V2 and their corresponding CPUSET_CPUS/CPUSET_CPUS_V2 entries to Controller/ControllerFile and the string tables keeps all enum→string mappings in sync. The new pure-virtual SetCpuBind in CgroupInterface with overrides in both CgroupV1 and CgroupV2 provides a clear extension point for cpuset binding logic without impacting existing memory/CPU/IO methods.

Also applies to: 78-79, 90-91, 142-149, 167-180, 397-399, 431-432, 459-460

src/Craned/Core/CranedServer.cpp (1)

192-193: Unused system_flag variable.

The system_flag is extracted from the tuple but never used in this function. If it's intentionally unused, consider marking it with [[maybe_unused]] or using a discard pattern.

🔎 Proposed fix
-      auto [job_id_opt, step_id_opt, system_flag, task_id_opt] =
+      auto [job_id_opt, step_id_opt, [[maybe_unused]] system_flag, task_id_opt] =
           pid_to_ids_expt.value();
src/Craned/Core/StepInstance.cpp (1)

251-260: Child process: Consider using _exit() instead of abort() for cgroup migration failure.

Using abort() generates a core dump and SIGABRT, which may be noisy for a controlled failure. Consider using _exit(EXIT_FAILURE) for cleaner shutdown, though abort() aligns with the error handling pattern at line 332.

src/Craned/Core/StepInstance.h (1)

76-78: TerminateStep is declared but marked as not implemented.

Consider adding a TODO comment or removing this declaration if it's not needed for this PR. Having an unimplemented method in the public interface may cause confusion.

Would you like me to open an issue to track the implementation of TerminateStep, or should this declaration be removed for now?

src/CraneCtld/CtldPublicDefs.cpp (2)

155-159: Vector-based CranedIds and node-set reconstruction are coherent with runtime_attr

StepInCtld::SetCranedIds and the updated RecoverFromDb/InitFromJob paths correctly switch to std::vector<CranedId> for ordering while deriving execution/configuring/running node sets from that vector. The runtime attributes are kept in sync, so downstream queries via RuntimeAttr().craned_ids() and ExecutionNodes() stay consistent.

If you ever need deterministic ordering on the node sets for presentation/logging, consider building them as std::set<CranedId> and then materializing to unordered_set only where necessary. Based on learnings, refactoring can be deferred until this pattern grows further.

Also applies to: 233-288, 364-369


678-739: Common step per-task resource mapping and final status propagation

The new logic in CommonStepInCtld looks solid overall:

  • InitPrimaryStepFromJob:
    • Initializes craned_task_map / task_res_map for batch/calloc (single task per node) and crun (per-node × ntasks_per_node), and zeros per-task memory to avoid per-task memory limiting while keeping step-level memory limits. This aligns with the later change in AllocatableResourceAllocator to skip 0-valued memory limits.
  • GetStepToD:
    • Exposes task_res_map via the new task_res_map field and copies the appropriate batch_meta/interactive_meta/container_meta, plus io_meta and sh_script, matching the updated proto surface.
  • StepStatusChange:
    • Correctly treats Configuring as “Starting/Failed/Cancelled until all nodes configured” and then transitions to Running.
    • Properly tears down orphaned steps and daemon steps on configure failure or completion.
    • Uses AllExecutionStepsFinished() to decide when to return a final (PrimaryStepStatus, PrimaryStepExitCode) pair for the whole task.

One behavioral nuance: when a primary step finishes with mixed statuses across nodes, you ultimately set the step’s status/exit code from PrevErrorStatus()/PrevErrorExitCode(), but you still propagate exit_code (from the last node event) into SetPrimaryStepExitCode(exit_code). If you intend the task-level exit code to reflect the aggregated step status, it may be safer to use this->ExitCode() instead. Please double-check which semantics you want at the task level.

Also applies to: 794-861, 913-915, 995-1027, 1055-1058

protos/PublicDefs.proto (1)

109-122: IoMeta and new status/metadata fields align well; confirm TaskStatus numbering history

The introduction of IoMeta and its use across TaskToCtld, StepToCtld, and StepToD (plus moving script content into sh_script) is a clean factoring that matches the Ctld/Supervisor changes and the new per-task task_res_map in StepToD. The new Starting = 8 TaskStatus and LicenseInfo message also look reasonable.

One thing to double-check: ensure enum value 8 was not previously used for a different TaskStatus in older releases, so that persisted data or mixed-version components won’t misinterpret existing values as Starting.

Also applies to: 135-186, 218-255, 311-350, 362-373, 382-395, 823-829

src/CraneCtld/CtldPublicDefs.h (2)

521-560: Step type predicates and vector-based CranedIds are consistent with Supervisor side

StepInCtld’s new IsBatch/IsCalloc/IsCrun helpers mirror the Supervisor’s StepInstance logic and correctly guard interactive_meta() access behind a type == Interactive check. Moving m_craned_ids_ to a std::vector<CranedId> and exposing it via CranedIds() while keeping execution/configuring/running nodes as unordered_set strikes a good balance between ordered lists (for display/regex) and fast membership checks.

If you find yourself often needing just a single “primary” craned id, consider adding a small helper (e.g. PrimaryCranedId()) instead of directly indexing CranedIds().front() in callers, to encode the invariant once. Based on learnings, this can wait until the pattern appears in more places.

Also applies to: 583-597


671-687: Per-task resource bookkeeping in CommonStepInCtld and schedule path

The new members task_res_map and craned_task_map in CommonStepInCtld, plus the SchedulePendingSteps changes, correctly:

  • Track which local task IDs belong to which craned IDs.
  • Build per-task ResourceInNode entries (with memory zeroed so that per-task cgroups don’t enforce their own memory limit).
  • Export this mapping via StepToD::task_res_map.

AllExecutionStepsFinished() matching m_steps_.empty() && !m_primary_step_ is a simple and correct criterion for deciding when all execution steps are done at the task level.

Down the line, you might consider wrapping the task_id_t allocation (the cur_task_id loops) in a small helper to avoid duplicating the same pattern in both InitPrimaryStepFromJob and SchedulePendingSteps.

Also applies to: 829-845, 913-915, 953-1025

src/Craned/Supervisor/CforedClient.cpp (1)

406-469: Unified forwarding queue and AsyncSendRecvThread_ state machine look correct

The new FwdRequest + m_task_fwd_req_queue_ design in CleanOutputQueueAndWriteToStreamThread_ cleanly centralizes all outgoing StreamTaskIORequest payloads (stdout, X11 connect/output/EOF, exit status) and respects the write_pending flag to avoid overlapping writes. The AsyncSendRecvThread_ changes (thread name, TIMEOUT handling, unregister flow, and use of the new X11 input fields) preserve the existing state machine while enabling the new message types.

You could replace the manual m_mtx_.Lock()/Unlock() in the forwarding state with absl::MutexLock for consistency with the rest of the file, but it’s not strictly necessary.

Also applies to: 471-607

src/Craned/Supervisor/CforedClient.h (1)

28-37: Forwarding API and FwdRequest variant are well-structured

The header cleanly exposes the new unified forwarding API (TaskOutPutForward, TaskX11ConnectForward, TaskX11OutPutForward, TaskX11OutputFinish, TaskProcessStop(exit_code, signaled)) and backs it with a single FwdRequest variant/queue, which matches the implementation and proto types.

The x11_id_t alias is currently unused within this class; if it’s not used elsewhere, you could remove it or add a brief comment about its intended external use.

Also applies to: 56-58, 84-103, 106-114, 123-154, 160-167

src/Craned/Supervisor/TaskManager.h (1)

467-467: Minor grammar fix in comment.

🔎 Proposed fix
-  // Should called in uvw thread, otherwise data race may happen.
+  // Should be called in uvw thread, otherwise data race may happen.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ea3eec and d1a1ffa.

📒 Files selected for processing (35)
  • protos/Crane.proto
  • protos/PublicDefs.proto
  • protos/Supervisor.proto
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/CraneCtld/DbClient.cpp
  • src/CraneCtld/EmbeddedDbClient.cpp
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/TaskScheduler.cpp
  • src/CraneCtld/TaskScheduler.h
  • src/Craned/Common/CgroupManager.cpp
  • src/Craned/Common/CgroupManager.h
  • src/Craned/Core/CMakeLists.txt
  • src/Craned/Core/Craned.cpp
  • src/Craned/Core/CranedForPamServer.cpp
  • src/Craned/Core/CranedServer.cpp
  • src/Craned/Core/CtldClient.cpp
  • src/Craned/Core/JobManager.cpp
  • src/Craned/Core/JobManager.h
  • src/Craned/Core/StepInstance.cpp
  • src/Craned/Core/StepInstance.h
  • src/Craned/Core/SupervisorStub.cpp
  • src/Craned/Core/SupervisorStub.h
  • src/Craned/Supervisor/CforedClient.cpp
  • src/Craned/Supervisor/CforedClient.h
  • src/Craned/Supervisor/Supervisor.cpp
  • src/Craned/Supervisor/SupervisorPublicDefs.h
  • src/Craned/Supervisor/SupervisorServer.cpp
  • src/Craned/Supervisor/SupervisorServer.h
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
  • src/Utilities/PublicHeader/OS.cpp
  • src/Utilities/PublicHeader/include/crane/OS.h
  • src/Utilities/PublicHeader/include/crane/PublicHeader.h
  • src/Utilities/PublicHeader/include/crane/String.h
💤 Files with no reviewable changes (1)
  • src/Craned/Supervisor/SupervisorPublicDefs.h
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/Utilities/PublicHeader/include/crane/OS.h
  • src/CraneCtld/DbClient.cpp
🧰 Additional context used
🧠 Learnings (41)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/SupervisorPublicDefs.h:32-37
Timestamp: 2025-05-26T11:06:28.796Z
Learning: The user L-Xiafeng prefers to defer refactoring duplicate definitions until they become a larger pattern in the codebase, rather than addressing individual instances immediately.
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/Utilities/PublicHeader/OS.cpp
  • src/Craned/Core/CMakeLists.txt
  • src/Craned/Core/CtldClient.cpp
  • src/Craned/Core/SupervisorStub.h
  • src/Craned/Core/JobManager.h
  • src/CraneCtld/CtldPublicDefs.h
  • src/CraneCtld/TaskScheduler.h
  • src/Craned/Supervisor/Supervisor.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Supervisor/CforedClient.h
  • src/Craned/Core/CranedServer.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Core/Craned.cpp
  • src/Craned/Core/JobManager.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-12-08T08:11:40.332Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.332Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.

Applied to files:

  • src/Craned/Core/StepInstance.h
  • src/Craned/Core/CMakeLists.txt
  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Core/CranedForPamServer.cpp
  • src/Craned/Core/SupervisorStub.h
  • src/Craned/Core/JobManager.h
  • src/CraneCtld/CtldPublicDefs.h
  • src/CraneCtld/TaskScheduler.h
  • src/Craned/Supervisor/Supervisor.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Core/StepInstance.cpp
  • src/Craned/Core/SupervisorStub.cpp
  • src/Craned/Core/CranedServer.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Core/Craned.cpp
  • src/Craned/Core/JobManager.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.

Applied to files:

  • src/Craned/Core/StepInstance.h
  • src/Craned/Supervisor/SupervisorServer.h
  • src/Craned/Core/CtldClient.cpp
  • src/Craned/Core/SupervisorStub.h
  • src/Craned/Core/JobManager.h
  • src/Craned/Supervisor/SupervisorServer.cpp
  • src/CraneCtld/TaskScheduler.h
  • src/Craned/Supervisor/Supervisor.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Core/StepInstance.cpp
  • protos/Crane.proto
  • src/Craned/Core/SupervisorStub.cpp
  • src/Craned/Core/CranedServer.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Core/Craned.cpp
  • src/Craned/Core/JobManager.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-25T04:08:03.273Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/Supervisor.cpp:68-69
Timestamp: 2025-05-25T04:08:03.273Z
Learning: In the Crane supervisor component (src/Craned/Supervisor/), the supervisor process communicates with Craned through STDOUT using protobuf messages. The supervisor must not send any information to STDOUT before sending the "ready" message to Craned, as this would interfere with the inter-process communication protocol. Therefore, adding logging statements that might write to STDOUT before the ready message is sent could break the communication.

Applied to files:

  • src/Craned/Core/StepInstance.h
  • src/Craned/Core/CMakeLists.txt
  • src/Craned/Core/CtldClient.cpp
  • src/Craned/Core/SupervisorStub.h
  • src/Craned/Core/JobManager.h
  • src/Craned/Supervisor/SupervisorServer.cpp
  • src/Craned/Supervisor/Supervisor.cpp
  • src/Craned/Core/StepInstance.cpp
  • src/Craned/Core/SupervisorStub.cpp
  • src/Craned/Core/CranedServer.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Core/JobManager.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/CraneCtld/EmbeddedDbClient.cpp
  • src/Craned/Core/CMakeLists.txt
  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Core/CranedForPamServer.cpp
  • src/Craned/Core/CtldClient.cpp
  • src/Craned/Core/SupervisorStub.h
  • src/Craned/Core/JobManager.h
  • src/Craned/Supervisor/SupervisorServer.cpp
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/CraneCtld/TaskScheduler.h
  • src/Craned/Supervisor/Supervisor.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Core/StepInstance.cpp
  • src/Craned/Supervisor/CforedClient.h
  • src/Craned/Core/SupervisorStub.cpp
  • src/Craned/Core/CranedServer.cpp
  • src/Craned/Supervisor/CforedClient.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Core/Craned.cpp
  • src/Craned/Core/JobManager.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/CraneCtld/TaskScheduler.h
  • src/Craned/Supervisor/Supervisor.cpp
  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Core/JobManager.h
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/CraneCtld/TaskScheduler.h
  • src/Craned/Supervisor/Supervisor.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Core/StepInstance.cpp
  • src/Craned/Supervisor/CforedClient.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Core/Craned.cpp
  • src/Craned/Core/JobManager.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Core/JobManager.h
  • src/CraneCtld/CtldPublicDefs.h
  • src/CraneCtld/TaskScheduler.h
  • src/Craned/Supervisor/Supervisor.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Core/StepInstance.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Core/Craned.cpp
  • src/Craned/Core/JobManager.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-09-22T07:13:20.635Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 625
File: src/CraneCtld/TaskScheduler.cpp:2811-2874
Timestamp: 2025-09-22T07:13:20.635Z
Learning: In the CraneSched codebase, there are latch mechanisms used to ensure variable lifetime safety when capturing references in detached thread pool tasks, allowing reference captures that might otherwise be unsafe.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Supervisor/SupervisorServer.h
  • src/Craned/Core/CranedForPamServer.cpp
  • src/Craned/Core/CtldClient.cpp
  • src/Craned/Core/JobManager.h
  • src/Craned/Supervisor/SupervisorServer.cpp
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/Craned/Supervisor/Supervisor.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Core/StepInstance.cpp
  • src/Craned/Core/SupervisorStub.cpp
  • src/Craned/Core/CranedServer.cpp
  • src/Craned/Supervisor/CforedClient.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-04-02T09:30:13.014Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/JobManager.cpp:141-149
Timestamp: 2025-04-02T09:30:13.014Z
Learning: In JobManager, if a uid exists in m_uid_to_job_ids_map_, its corresponding task_ids set is guaranteed to be non-empty due to the invariant maintained in AllocJobs and FreeJobs methods.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Core/CranedServer.cpp
  • src/Craned/Common/CgroupManager.cpp
📚 Learning: 2025-05-25T04:11:27.740Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:396-396
Timestamp: 2025-05-25T04:11:27.740Z
Learning: In TaskManager.cpp, GetCrunMeta() calls don't need null checks because they're only called in contexts where the task is guaranteed to be a CRUN task (e.g., SetupChildProcessCrunX11_ is only called when step->IsCrun() && x11() conditions are met), ensuring the metadata will always be valid.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/CraneCtld/TaskScheduler.h
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/Craned/Core/CtldClient.cpp
  • src/Craned/Core/SupervisorStub.h
  • src/CraneCtld/CtldPublicDefs.h
  • src/Craned/Supervisor/CforedClient.h
  • src/Craned/Core/SupervisorStub.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-02T07:05:26.012Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:601-602
Timestamp: 2025-05-02T07:05:26.012Z
Learning: In the CraneCtld codebase, the variables m_disconnected_ and m_registered_ in CranedStub class are already defined as std::atomic_bool, making them thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/Craned/Core/CtldClient.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/Craned/Supervisor/CforedClient.h
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-04-27T06:47:51.751Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.

Applied to files:

  • src/Craned/Core/CtldClient.cpp
  • src/Craned/Supervisor/CforedClient.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.

Applied to files:

  • src/Craned/Core/CtldClient.cpp
  • src/Craned/Core/SupervisorStub.h
  • src/Craned/Supervisor/Supervisor.cpp
  • src/Craned/Core/SupervisorStub.cpp
  • src/Craned/Core/Craned.cpp
📚 Learning: 2025-05-26T11:04:56.055Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:72-72
Timestamp: 2025-05-26T11:04:56.055Z
Learning: The CraneSched project prefers using global variables (like `g_supervisor_keeper`) over dependency injection patterns. The team does not follow dependency injection approaches for managing singleton instances.

Applied to files:

  • src/Craned/Core/CtldClient.cpp
  • src/Craned/Core/SupervisorStub.h
  • src/Craned/Core/SupervisorStub.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Core/Craned.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.

Applied to files:

  • src/Craned/Core/CtldClient.cpp
  • src/Utilities/PublicHeader/include/crane/String.h
  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-09-21T11:26:40.935Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 625
File: src/Craned/Core/SupervisorKeeper.cpp:145-147
Timestamp: 2025-09-21T11:26:40.935Z
Learning: In C++17 and later, Class Template Argument Deduction (CTAD) allows `std::shared_ptr stub = std::make_shared<SupervisorStub>();` syntax to be valid, as the compiler can deduce the template parameter from make_shared. Projects using C++17+ don't need explicit template parameters in this context.

Applied to files:

  • src/Craned/Core/SupervisorStub.h
📚 Learning: 2025-04-02T10:11:33.562Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/CgroupManager.cpp:685-685
Timestamp: 2025-04-02T10:11:33.562Z
Learning: In the CgroupManager's GetJobBpfMapCgroupsV2 method, the developer has confirmed that cg_ino_job_id_map will always contain the key->cgroup_id element, making the CRANE_ASSERT check appropriate rather than requiring additional error handling.

Applied to files:

  • src/Craned/Core/JobManager.h
  • src/Craned/Common/CgroupManager.h
  • src/Craned/Core/CranedServer.cpp
  • src/Craned/Core/Craned.cpp
  • src/Craned/Common/CgroupManager.cpp
  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-09T01:54:39.465Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1509-1517
Timestamp: 2025-05-09T01:54:39.465Z
Learning: The CraneSched project uses C++23, which supports Class Template Argument Deduction (CTAD) for standard containers like std::set and includes ranges support, making std::ranges::views::keys valid without additional headers.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Utilities/PublicHeader/include/crane/String.h
  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Core/Craned.cpp
  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-04-18T02:26:16.113Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched includes assertions in its operator overloads (particularly in operator-=) that verify resources being subtracted are less than or equal to available resources, ensuring no negative values can occur during resource allocation or deallocation operations.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
  • src/Craned/Common/CgroupManager.cpp
📚 Learning: 2025-03-31T09:29:40.388Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In CraneSched, `phmap::parallel_flat_hash_set` from the Parallel Hashmap library (`parallel_hashmap/phmap.h`) is used for thread-safe containers. This container implements internal sharding with separate locks for different parts of the hash table, making it inherently thread-safe for concurrent operations without requiring external synchronization.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-04-18T02:26:16.113Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched contains assertions in its overloaded operators (like -=) that prevent negative values from occurring when resources are allocated or deallocated.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-03-31T09:29:40.388Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In the CraneSched project, the `m_allowed_certs_` in `VaultClientWrapper` is implemented as a `phmap::parallel_flat_hash_set<std::string>`, which is a thread-safe container designed for concurrent access, making it safe to use in multithreaded contexts without additional synchronization.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-23T02:32:43.952Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is handled/updated separately before calling SetAllocatedRes, so it does not need to be updated again within the method itself.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
  • src/CraneCtld/CtldPublicDefs.cpp
📚 Learning: 2025-05-23T02:32:43.952Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is updated before calling SetAllocatedRes, so it doesn't need to be updated again within the method itself.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
  • src/CraneCtld/CtldPublicDefs.cpp
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
  • src/CraneCtld/TaskScheduler.h
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-04-27T11:52:31.017Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Core/Craned.cpp
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.

Applied to files:

  • src/Craned/Supervisor/CforedClient.h
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.

Applied to files:

  • src/Craned/Common/CgroupManager.h
  • src/Craned/Core/CranedServer.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-08-14T02:56:35.503Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 587
File: src/Craned/Supervisor/CforedClient.cpp:449-454
Timestamp: 2025-08-14T02:56:35.503Z
Learning: In CforedClient::AsyncSendRecvThread_(), the guard `if (state <= State::Registering) { continue; }` in the TIMEOUT branch only prevents premature cleanup when stopping before registration completes, but it doesn't block normal gRPC event processing. The completion queue will still deliver Prepare/Write/Read events that advance the state machine normally.

Applied to files:

  • src/Craned/Supervisor/CforedClient.cpp
📚 Learning: 2025-05-08T07:38:42.362Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:0-0
Timestamp: 2025-05-08T07:38:42.362Z
Learning: In CraneSched's PMIx integration, the `g_pmix_server->SetupFork()` function must be called in the child process after fork() and before exec() to properly set up the PMIx environment variables.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-04-15T02:17:20.669Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 477
File: src/Craned/TaskManager.cpp:781-798
Timestamp: 2025-04-15T02:17:20.669Z
Learning: The user prefers to implement separate stderr pattern handling for crun tasks only when there's a specific requirement for it, rather than implementing it proactively. The current implementation only supports redirecting stdout to a file with the `--output` option.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-04-02T09:52:59.318Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/CgroupManager.cpp:697-701
Timestamp: 2025-04-02T09:52:59.318Z
Learning: When using bpf_map__get_next_key function, memory must be properly allocated (e.g., with std::make_unique<BpfKey>()) before passing the pointer to the function, as it writes the key to the provided memory address.

Applied to files:

  • src/Craned/Core/Craned.cpp
  • src/Craned/Common/CgroupManager.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.

Applied to files:

  • src/Craned/Core/Craned.cpp
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。

Applied to files:

  • src/Craned/Core/JobManager.cpp
🧬 Code graph analysis (12)
src/CraneCtld/TaskScheduler.cpp (4)
src/Craned/Core/StepInstance.h (1)
  • exit_code (37-37)
src/Craned/Supervisor/CforedClient.h (1)
  • exit_code (143-143)
src/Craned/Supervisor/SupervisorPublicDefs.h (1)
  • exit_code (37-37)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (160-160)
src/Craned/Supervisor/SupervisorServer.h (2)
src/Craned/Core/CranedForPamServer.cpp (2)
  • MigrateSshProcToCgroup (181-205)
  • MigrateSshProcToCgroup (181-184)
src/Craned/Supervisor/SupervisorServer.cpp (2)
  • MigrateSshProcToCgroup (95-103)
  • MigrateSshProcToCgroup (95-98)
src/Craned/Core/SupervisorStub.h (1)
src/Craned/Core/SupervisorStub.cpp (20)
  • SupervisorStub (25-29)
  • SupervisorStub (31-33)
  • InitAndGetRecoveredMap (35-92)
  • InitAndGetRecoveredMap (37-37)
  • ExecuteStep (94-106)
  • ExecuteStep (94-94)
  • QueryStepEnv (108-121)
  • QueryStepEnv (108-108)
  • CheckStatus (123-137)
  • CheckStatus (124-124)
  • TerminateTask (139-158)
  • TerminateTask (139-140)
  • ChangeTaskTimeLimit (160-172)
  • ChangeTaskTimeLimit (160-160)
  • MigrateSshProcToCg (174-186)
  • MigrateSshProcToCg (174-174)
  • ShutdownSupervisor (188-198)
  • ShutdownSupervisor (188-188)
  • InitChannelAndStub_ (200-204)
  • InitChannelAndStub_ (200-200)
src/Craned/Core/JobManager.h (2)
src/Craned/Core/JobManager.cpp (4)
  • ChangeStepTimelimit (486-511)
  • ChangeStepTimelimit (486-488)
  • QuerySshStepEnvVariables (513-529)
  • QuerySshStepEnvVariables (513-514)
src/Craned/Core/CranedServer.cpp (2)
  • QuerySshStepEnvVariables (295-316)
  • QuerySshStepEnvVariables (295-298)
src/Craned/Supervisor/SupervisorServer.cpp (2)
src/Craned/Core/CranedForPamServer.cpp (2)
  • MigrateSshProcToCgroup (181-205)
  • MigrateSshProcToCgroup (181-184)
src/Craned/Core/SupervisorStub.cpp (2)
  • ShutdownSupervisor (188-198)
  • ShutdownSupervisor (188-188)
src/CraneCtld/CtldPublicDefs.h (2)
src/Craned/Supervisor/TaskManager.cpp (6)
  • IsBatch (156-158)
  • IsBatch (156-156)
  • IsCalloc (169-172)
  • IsCalloc (169-169)
  • IsCrun (164-167)
  • IsCrun (164-164)
src/CraneCtld/CtldPublicDefs.cpp (8)
  • SetCranedIds (155-159)
  • SetCranedIds (155-155)
  • SetCranedIds (1128-1131)
  • SetCranedIds (1128-1128)
  • StepStatusChange (467-603)
  • StepStatusChange (468-473)
  • StepStatusChange (863-1059)
  • StepStatusChange (864-869)
src/Craned/Core/StepInstance.cpp (1)
src/Craned/Common/CgroupManager.cpp (4)
  • AllocateAndGetCgroup (494-552)
  • AllocateAndGetCgroup (495-497)
  • CgroupStrByStepId (283-288)
  • CgroupStrByStepId (283-284)
src/Craned/Supervisor/CforedClient.h (1)
src/Craned/Supervisor/CforedClient.cpp (12)
  • InitUvX11FwdHandler (109-115)
  • InitUvX11FwdHandler (109-109)
  • TaskProcessStop (725-738)
  • TaskProcessStop (725-726)
  • TaskOutPutForward (749-756)
  • TaskOutPutForward (749-750)
  • TaskX11ConnectForward (757-762)
  • TaskX11ConnectForward (757-757)
  • TaskX11OutPutForward (764-774)
  • TaskX11OutPutForward (764-766)
  • TaskX11OutputFinish (776-783)
  • TaskX11OutputFinish (776-776)
src/Craned/Core/SupervisorStub.cpp (2)
src/Craned/Core/SupervisorStub.h (1)
  • SupervisorStub (28-61)
src/Craned/Supervisor/TaskManager.h (2)
  • job_id (51-51)
  • step_id (52-52)
src/Craned/Supervisor/TaskManager.cpp (2)
src/Utilities/PublicHeader/OS.cpp (2)
  • GetHostname (277-285)
  • GetHostname (277-277)
src/Craned/Common/CgroupManager.cpp (2)
  • AllocateAndGetCgroup (494-552)
  • AllocateAndGetCgroup (495-497)
src/Craned/Core/Craned.cpp (4)
src/Craned/Core/StepInstance.cpp (2)
  • StepInstance (28-33)
  • StepInstance (35-43)
src/Craned/Core/SupervisorStub.h (2)
  • Craned (26-62)
  • SupervisorStub (28-61)
src/Craned/Common/CgroupManager.cpp (6)
  • CgroupStrByParsedIds (296-308)
  • CgroupStrByParsedIds (296-296)
  • CgroupStrByStepId (283-288)
  • CgroupStrByStepId (283-284)
  • CgroupStrByJobId (279-281)
  • CgroupStrByJobId (279-279)
src/Craned/Core/SupervisorStub.cpp (4)
  • SupervisorStub (25-29)
  • SupervisorStub (31-33)
  • InitAndGetRecoveredMap (35-92)
  • InitAndGetRecoveredMap (37-37)
src/Craned/Core/JobManager.cpp (3)
src/Craned/Supervisor/TaskManager.h (2)
  • job_id (51-51)
  • step_id (52-52)
src/CraneCtld/RpcService/CranedKeeper.cpp (3)
  • lock (410-410)
  • lock (480-480)
  • lock (839-839)
src/Craned/Core/CranedServer.cpp (2)
  • QuerySshStepEnvVariables (295-316)
  • QuerySshStepEnvVariables (295-298)
🪛 Cppcheck (2.19.0)
src/Craned/Core/StepInstance.cpp

[error] 142-142: Found an exit path from function with non-void return type that has missing return statement

(missingReturn)

src/Craned/Supervisor/TaskManager.cpp

[warning] 534-534: Iterators to containers from different expressions 'input' and 'groups[0]' are used together.

(mismatchingContainerExpression)

src/Craned/Core/Craned.cpp

[error] 142-142: Found an exit path from function with non-void return type that has missing return statement

(missingReturn)

Comment on lines +1862 to 1869
if (!primary_step) {
CRANE_ERROR(
"[Job #{}] Daemon step not found when cancelling running job",
task_id);
return;
}
TerminateRunningStepNoLock_(daemon_step);
TerminateRunningStepNoLock_(primary_step);
auto& cancelled_job_steps = *reply.mutable_cancelled_steps();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Log message still mentions daemon step while now operating on primary step

In the “cancel jobs with filters” branch, the code now uses task->PrimaryStep() as the representative running step:

auto primary_step = task->PrimaryStep();
if (!primary_step) {
  CRANE_ERROR(
      "[Job #{}] Daemon step not found when cancelling running job",
      task_id);
  return;
}
TerminateRunningStepNoLock_(primary_step);

The error message still says “Daemon step not found”, which is misleading if a job intentionally has only a daemon or other common steps. Consider updating the log text to reflect the actual expectation (e.g., “primary step not found” or “no representative step found for running job”), but the functional change to use PrimaryStep() is otherwise sound.

🤖 Prompt for AI Agents
In src/CraneCtld/TaskScheduler.cpp around lines 1862 to 1869, the error log
still says "Daemon step not found" but the code now uses task->PrimaryStep();
update the log text to accurately reflect the expectation (for example, "Primary
step not found when cancelling running job" or "No representative step found for
running job") keeping the task_id in the message; make this single-string change
where CRANE_ERROR is invoked so the log matches the actual check and early
return.

Comment on lines +112 to +116
constexpr int KParsedJobIdIdx = 0;
constexpr int KParsedStepIdIdx = 1;
constexpr int KParsedSystemFlagIdx = 2;
constexpr int KParsedTaskIdIdx = 3;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Parsed ID tuple refactor is a solid foundation; update nearby docstring

Switching to CgroupStrParsedIds as

using CgroupStrParsedIds =
    std::tuple<std::optional<job_id_t>, std::optional<step_id_t>,
               bool /*true if system step*/, std::optional<task_id_t>>;

plus the KParsed*Idx constants, and updating:

  • GetIdsFromCgroupV1_ / GetIdsFromCgroupV2_ to return std::set<CgroupStrParsedIds>,
  • GetCgInoJobIdMapCgroupV2_ and GetJobBpfMapCgroupsV2_ to key on CgroupStrParsedIds,
  • CgroupStrByParsedIds helper,

gives a much richer and type-safe way to carry job/step/system/task identity through cgroup/BPF plumbing.

One minor nit: the AllocateAndGetCgroup comment still mentions an alloc_mem parameter that no longer exists in the signature; consider updating that doc line to avoid confusion.

Also applies to: 528-553, 610-620, 634-636

🤖 Prompt for AI Agents
In src/Craned/Common/CgroupManager.h around lines 112-116 (also check similar
comments at 528-553, 610-620, 634-636), the AllocateAndGetCgroup docstring still
references an alloc_mem parameter that no longer exists; remove that reference
and update the comment to reflect the current function signature and behavior
(briefly describe returned/allocated resources and ownership semantics if
applicable), and ensure any mention of parsed ID handling aligns with the new
CgroupStrParsedIds tuple usage elsewhere.

Comment on lines +139 to 201
for (auto ids : rn_job_ids_with_cg) {
auto [job_id_opt, step_id_opt, system_flag, task_id_opt] = ids;
job_id_t job_id = job_id_opt.value();
// For craned, we don't care task cgroup
if (task_id_opt.has_value()) continue;
if (step_id_opt.has_value()) {
step_id_t step_id = step_id_opt.value();
// Step cgroup recovery
if (rn_step_from_ctld.contains({job_id, step_id})) {
// Step is found in both ctld and cgroup
auto& step_instance = rn_step_from_ctld.at({job_id, step_id});

// For common step cgroup, craned only cares about system cgroup
// For daemon step cgroup, craned cares about both system and user
// cgroup
if (!step_instance->IsDaemonStep() && !system_flag) {
continue;
}
CRANE_DEBUG("Recover existing cgroup for step {} of job #{}", step_id,
job_id);
auto cg_expt = CgroupManager::AllocateAndGetCgroup(
CgroupManager::CgroupStrByStepId(job_id, step_id, system_flag),
step_instance->step_to_d.res(), true);
if (cg_expt.has_value()) {
if (system_flag)
step_instance->crane_cgroup = std::move(cg_expt.value());
continue;
} else {
// If the cgroup is found but is unrecoverable, just logging out.
CRANE_ERROR(
"Cgroup for step {} of job #{} is found but not "
"recoverable.",
step_id, job_id);
}
} else {
// Job is found in cgroup but not in ctld. Cleaning up.
CRANE_DEBUG("Removing cgroup for step #{}.{} not in Ctld.", job_id,
step_id);
clean_invalid_cg(ids);
}
}

// Job cgroup recovery
if (rn_jobs_from_ctld.contains(job_id)) {
// Job is found in both ctld and cgroup
JobInD& job = rn_jobs_from_ctld.at(job_id);

CRANE_DEBUG("Recover existing cgroup for job #{}", job_id);
auto cg_expt = CgroupManager::AllocateAndGetCgroup(
CgroupManager::CgroupStrByJobId(job_id), job.job_to_d.res(), true);
if (cg_expt.has_value()) {
// Job cgroup is recovered.
job.cgroup = std::move(cg_expt.value());
} else {
// If the cgroup is found but is unrecoverable, just logging out.
CRANE_ERROR("Cgroup for job #{} is found but not recoverable.", job_id);
}
continue;
}
clean_invalid_cg(ids);
// Job is found in cgroup but not in ctld. Cleaning up.
CRANE_DEBUG("Removing cgroup for job #{} not in Ctld.", job_id);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Logic issue: step entries may trigger incorrect cleanup in job recovery block.

When processing step entries, the code falls through to the job recovery section (lines 181-200). This causes two problems:

  1. Double cleanup: For step entries where the step is NOT in ctld (lines 173-178), clean_invalid_cg(ids) is called for the step, then potentially called again at line 198 with the same step ids if the parent job is also not in ctld.

  2. Misleading log: Line 200 logs "Removing cgroup for job" but ids may be a step entry.

Consider adding a continue after the step cleanup at line 177, or restructure the loop to handle job entries and step entries separately.

🔎 Proposed fix
       } else {
         // Job is found in cgroup but not in ctld. Cleaning up.
         CRANE_DEBUG("Removing cgroup for step #{}.{} not in Ctld.", job_id,
                     step_id);
         clean_invalid_cg(ids);
+        continue;
       }
     }
 
     // Job cgroup recovery
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for (auto ids : rn_job_ids_with_cg) {
auto [job_id_opt, step_id_opt, system_flag, task_id_opt] = ids;
job_id_t job_id = job_id_opt.value();
// For craned, we don't care task cgroup
if (task_id_opt.has_value()) continue;
if (step_id_opt.has_value()) {
step_id_t step_id = step_id_opt.value();
// Step cgroup recovery
if (rn_step_from_ctld.contains({job_id, step_id})) {
// Step is found in both ctld and cgroup
auto& step_instance = rn_step_from_ctld.at({job_id, step_id});
// For common step cgroup, craned only cares about system cgroup
// For daemon step cgroup, craned cares about both system and user
// cgroup
if (!step_instance->IsDaemonStep() && !system_flag) {
continue;
}
CRANE_DEBUG("Recover existing cgroup for step {} of job #{}", step_id,
job_id);
auto cg_expt = CgroupManager::AllocateAndGetCgroup(
CgroupManager::CgroupStrByStepId(job_id, step_id, system_flag),
step_instance->step_to_d.res(), true);
if (cg_expt.has_value()) {
if (system_flag)
step_instance->crane_cgroup = std::move(cg_expt.value());
continue;
} else {
// If the cgroup is found but is unrecoverable, just logging out.
CRANE_ERROR(
"Cgroup for step {} of job #{} is found but not "
"recoverable.",
step_id, job_id);
}
} else {
// Job is found in cgroup but not in ctld. Cleaning up.
CRANE_DEBUG("Removing cgroup for step #{}.{} not in Ctld.", job_id,
step_id);
clean_invalid_cg(ids);
}
}
// Job cgroup recovery
if (rn_jobs_from_ctld.contains(job_id)) {
// Job is found in both ctld and cgroup
JobInD& job = rn_jobs_from_ctld.at(job_id);
CRANE_DEBUG("Recover existing cgroup for job #{}", job_id);
auto cg_expt = CgroupManager::AllocateAndGetCgroup(
CgroupManager::CgroupStrByJobId(job_id), job.job_to_d.res(), true);
if (cg_expt.has_value()) {
// Job cgroup is recovered.
job.cgroup = std::move(cg_expt.value());
} else {
// If the cgroup is found but is unrecoverable, just logging out.
CRANE_ERROR("Cgroup for job #{} is found but not recoverable.", job_id);
}
continue;
}
clean_invalid_cg(ids);
// Job is found in cgroup but not in ctld. Cleaning up.
CRANE_DEBUG("Removing cgroup for job #{} not in Ctld.", job_id);
}
for (auto ids : rn_job_ids_with_cg) {
auto [job_id_opt, step_id_opt, system_flag, task_id_opt] = ids;
job_id_t job_id = job_id_opt.value();
// For craned, we don't care task cgroup
if (task_id_opt.has_value()) continue;
if (step_id_opt.has_value()) {
step_id_t step_id = step_id_opt.value();
// Step cgroup recovery
if (rn_step_from_ctld.contains({job_id, step_id})) {
// Step is found in both ctld and cgroup
auto& step_instance = rn_step_from_ctld.at({job_id, step_id});
// For common step cgroup, craned only cares about system cgroup
// For daemon step cgroup, craned cares about both system and user
// cgroup
if (!step_instance->IsDaemonStep() && !system_flag) {
continue;
}
CRANE_DEBUG("Recover existing cgroup for step {} of job #{}", step_id,
job_id);
auto cg_expt = CgroupManager::AllocateAndGetCgroup(
CgroupManager::CgroupStrByStepId(job_id, step_id, system_flag),
step_instance->step_to_d.res(), true);
if (cg_expt.has_value()) {
if (system_flag)
step_instance->crane_cgroup = std::move(cg_expt.value());
continue;
} else {
// If the cgroup is found but is unrecoverable, just logging out.
CRANE_ERROR(
"Cgroup for step {} of job #{} is found but not "
"recoverable.",
step_id, job_id);
}
} else {
// Job is found in cgroup but not in ctld. Cleaning up.
CRANE_DEBUG("Removing cgroup for step #{}.{} not in Ctld.", job_id,
step_id);
clean_invalid_cg(ids);
continue;
}
}
// Job cgroup recovery
if (rn_jobs_from_ctld.contains(job_id)) {
// Job is found in both ctld and cgroup
JobInD& job = rn_jobs_from_ctld.at(job_id);
CRANE_DEBUG("Recover existing cgroup for job #{}", job_id);
auto cg_expt = CgroupManager::AllocateAndGetCgroup(
CgroupManager::CgroupStrByJobId(job_id), job.job_to_d.res(), true);
if (cg_expt.has_value()) {
// Job cgroup is recovered.
job.cgroup = std::move(cg_expt.value());
} else {
// If the cgroup is found but is unrecoverable, just logging out.
CRANE_ERROR("Cgroup for job #{} is found but not recoverable.", job_id);
}
continue;
}
clean_invalid_cg(ids);
// Job is found in cgroup but not in ctld. Cleaning up.
CRANE_DEBUG("Removing cgroup for job #{} not in Ctld.", job_id);
}
🧰 Tools
🪛 Cppcheck (2.19.0)

[error] 142-142: Found an exit path from function with non-void return type that has missing return statement

(missingReturn)

🤖 Prompt for AI Agents
In src/Craned/Core/Craned.cpp around lines 139-201 the loop handling
rn_job_ids_with_cg processes step entries then falls through into the
job-recovery block causing double clean_invalid_cg calls and a misleading
"Removing cgroup for job" log for step entries; after the branch that handles
"Job is found in cgroup but not in ctld" for steps (where clean_invalid_cg(ids)
is called around line 177) add a continue to skip the subsequent job-recovery
logic, and ensure the "Removing cgroup for job #{} not in Ctld." debug log is
only emitted for pure job entries (i.e., place or guard that log so it does not
run for step entries).

Comment on lines +305 to +313
auto task_env_map =
g_job_mgr->QuerySshStepEnvVariables(request->task_id(), kDaemonStepId);
if (task_env_map.error()) {
CRANE_ERROR("Failed to get step env of job #{}", request->task_id());
return Status::OK;
}

auto task_env_map = stub->QueryStepEnv();
if (task_env_map.has_value()) {
for (const auto &[name, value] : task_env_map.value())
response->mutable_env_map()->emplace(name, value);
response->set_ok(true);
}
for (const auto &[name, value] : task_env_map.value())
response->mutable_env_map()->emplace(name, value);
response->set_ok(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Incorrect use of .error() for checking CraneExpected failure.

CraneExpected<EnvMap>::error() returns the error value, not a boolean indicating failure. Use !task_env_map.has_value() or check the negation operator instead.

🔎 Proposed fix
   auto task_env_map =
       g_job_mgr->QuerySshStepEnvVariables(request->task_id(), kDaemonStepId);
-  if (task_env_map.error()) {
+  if (!task_env_map.has_value()) {
     CRANE_ERROR("Failed to get step env of job #{}", request->task_id());
     return Status::OK;
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto task_env_map =
g_job_mgr->QuerySshStepEnvVariables(request->task_id(), kDaemonStepId);
if (task_env_map.error()) {
CRANE_ERROR("Failed to get step env of job #{}", request->task_id());
return Status::OK;
}
auto task_env_map = stub->QueryStepEnv();
if (task_env_map.has_value()) {
for (const auto &[name, value] : task_env_map.value())
response->mutable_env_map()->emplace(name, value);
response->set_ok(true);
}
for (const auto &[name, value] : task_env_map.value())
response->mutable_env_map()->emplace(name, value);
response->set_ok(true);
auto task_env_map =
g_job_mgr->QuerySshStepEnvVariables(request->task_id(), kDaemonStepId);
if (!task_env_map.has_value()) {
CRANE_ERROR("Failed to get step env of job #{}", request->task_id());
return Status::OK;
}
for (const auto &[name, value] : task_env_map.value())
response->mutable_env_map()->emplace(name, value);
response->set_ok(true);
🤖 Prompt for AI Agents
In src/Craned/Core/CranedServer.cpp around lines 305-313, the code incorrectly
uses task_env_map.error() as a boolean; change the check to if
(!task_env_map.has_value()) (or if (!task_env_map)) to detect failure, and
include the returned error details in the log (e.g. log task_env_map.error()
alongside the message) before returning; do not attempt to access
task_env_map.value() unless it has a value.

Comment on lines +328 to +336
auto err = g_job_mgr->ChangeStepTimelimit(request->task_id(), kPrimaryStepId,
request->time_limit_seconds());

if (err.error()) {
CRANE_ERROR("[Step #{}.{}] Failed to change task time limit",
request->task_id(), kPrimaryStepId);
return Status::OK;
}
auto err =
stub->ChangeTaskTimeLimit(absl::Seconds(request->time_limit_seconds()));
if (err != CraneErrCode::SUCCESS) {
CRANE_ERROR("[Step #{}.{}] Failed to change time limit: {}.", job_id,
step_id, CraneErrStr(err));
} else {
response->set_ok(true);
}
response->set_ok(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Same issue: Incorrect use of .error() for checking CraneExpected failure.

Similar to the previous issue, err.error() does not return a boolean but the error value. This will not behave as intended.

🔎 Proposed fix
   auto err = g_job_mgr->ChangeStepTimelimit(request->task_id(), kPrimaryStepId,
                                             request->time_limit_seconds());

-  if (err.error()) {
+  if (!err.has_value()) {
     CRANE_ERROR("[Step #{}.{}] Failed to change task time limit",
                 request->task_id(), kPrimaryStepId);
     return Status::OK;
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto err = g_job_mgr->ChangeStepTimelimit(request->task_id(), kPrimaryStepId,
request->time_limit_seconds());
if (err.error()) {
CRANE_ERROR("[Step #{}.{}] Failed to change task time limit",
request->task_id(), kPrimaryStepId);
return Status::OK;
}
auto err =
stub->ChangeTaskTimeLimit(absl::Seconds(request->time_limit_seconds()));
if (err != CraneErrCode::SUCCESS) {
CRANE_ERROR("[Step #{}.{}] Failed to change time limit: {}.", job_id,
step_id, CraneErrStr(err));
} else {
response->set_ok(true);
}
response->set_ok(true);
auto err = g_job_mgr->ChangeStepTimelimit(request->task_id(), kPrimaryStepId,
request->time_limit_seconds());
if (!err.has_value()) {
CRANE_ERROR("[Step #{}.{}] Failed to change task time limit",
request->task_id(), kPrimaryStepId);
return Status::OK;
}
response->set_ok(true);
🤖 Prompt for AI Agents
In src/Craned/Core/CranedServer.cpp around lines 328 to 336, the code
incorrectly uses err.error() as a boolean; change the check to detect failure
with if (!err) and then use err.error() (or the appropriate accessor) only when
logging the failure; update the CRANE_ERROR call to include the actual error
details from err.error() and keep returning Status::OK and setting
response->set_ok(true) on success.

Comment on lines +537 to +538
chmod(g_config.CranedUnixSockPath.c_str(), 0600);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

chmod before server start may fail if socket file doesn't exist yet.

The socket file is typically created by builder.BuildAndStart(). Calling chmod before the socket exists will silently fail (returning -1). Consider moving this after BuildAndStart() or checking the return value.

🔎 Proposed fix
   builder.RegisterService(m_service_impl_.get());

-  chmod(g_config.CranedUnixSockPath.c_str(), 0600);
-
   m_server_ = builder.BuildAndStart();
+
+  if (chmod(g_config.CranedUnixSockPath.c_str(), 0600) != 0) {
+    CRANE_WARN("Failed to chmod Craned Unix socket: {}", strerror(errno));
+  }
+
   CRANE_INFO("Craned is listening on [{}, {}:{}]",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
chmod(g_config.CranedUnixSockPath.c_str(), 0600);
builder.RegisterService(m_service_impl_.get());
m_server_ = builder.BuildAndStart();
if (chmod(g_config.CranedUnixSockPath.c_str(), 0600) != 0) {
CRANE_WARN("Failed to chmod Craned Unix socket: {}", strerror(errno));
}
CRANE_INFO("Craned is listening on [{}, {}:{}]",
🤖 Prompt for AI Agents
In src/Craned/Core/CranedServer.cpp around lines 537-538, the call to
chmod(g_config.CranedUnixSockPath.c_str(), 0600) runs before the server creates
the socket and can fail; move the chmod to immediately after
builder.BuildAndStart() (or after the code path that creates the socket) so the
file exists, and check the return value of chmod — on failure log or return the
errno (and include errno/strerror in the log) so failures are visible instead of
silently ignored.

Comment on lines +53 to 56
} else {
// For now, the action never fail
std::unreachable();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

std::unreachable() may cause undefined behavior if failure path becomes reachable.

Using std::unreachable() here is dangerous. If QueryStepEnvAsync ever fails in the future (due to code changes or edge cases), this will invoke undefined behavior instead of returning a proper error. Consider returning an error status or logging.

🔎 Proposed fix
   } else {
-    // For now, the action never fail
-    std::unreachable();
+    // For now, the action should never fail, but handle gracefully
+    CRANE_ERROR("[Step] QueryStepEnvAsync failed unexpectedly.");
+    return Status::OK;
   }
🤖 Prompt for AI Agents
In src/Craned/Supervisor/SupervisorServer.cpp around lines 53 to 56, replace the
unsafe std::unreachable() in the else branch with proper error handling: detect
the unexpected failure path, log a descriptive error (including any available
error details), and return an appropriate error status or error response
consistent with the function's return type (or throw a well-documented exception
if that matches project conventions) so the failure is handled safely instead of
invoking undefined behavior.

Comment on lines +71 to +107
auto cfored_client = GetCforedClient();
if (x11) {
X11Meta x11_meta{};
if (x11_fwd)
x11_meta.x11_port = cfored_client->InitUvX11FwdHandler();
else
x11_meta.x11_port = GetStep().interactive_meta().x11_meta().port();

CRANE_TRACE("Crun task x11 enabled. Forwarding: {}, X11 Port: {}",
x11_fwd, x11_meta.x11_port);

auto x11_auth_path =
fmt::sprintf("%s/.crane/xauth/.Xauthority-XXXXXX", pwd.HomeDir());

bool ok = util::os::CreateFoldersForFileEx(x11_auth_path, pwd.Uid(),
pwd.Gid(), 0700);
if (!ok) {
CRANE_ERROR("Failed to create xauth source file for");
return CraneErrCode::ERR_SYSTEM_ERR;
}

// Default file permission is 0600.
int xauth_fd = mkstemp(x11_auth_path.data());
if (xauth_fd == -1) {
CRANE_ERROR("mkstemp() for xauth file failed: {}\n", strerror(errno));
return CraneErrCode::ERR_SYSTEM_ERR;
}

int ret = fchown(xauth_fd, pwd.Uid(), pwd.Gid());
if (ret == -1) {
CRANE_ERROR("fchown() for xauth file failed: {}\n", strerror(errno));
return CraneErrCode::ERR_SYSTEM_ERR;
}
x11_meta.x11_auth_path = x11_auth_path;
this->x11_meta = std::move(x11_meta);
close(xauth_fd);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Clean up xauth file on early failures in X11 setup

When mkstemp or fchown fail, the temporary xauth file has already been created but is not recorded in x11_meta and thus will never be removed by StepInstance::CleanUp(). This can leave stray .Xauthority-XXXXXX files in the user's home under repeated failures.

Consider explicitly unlinking x11_auth_path (and closing xauth_fd if open) before returning an error in these branches, so the filesystem state is consistent even on partial failures.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/CraneCtld/CtldPublicDefs.cpp (1)

844-869: Remove the unconditional io_meta copy that defeats the presence guard.

In CommonStepInCtld::GetStepToD(), the code conditionally copies io_meta when present, but then unconditionally copies it again:

if (StepToCtld().has_io_meta()) {
  auto* mutable_meta = step_to_d.mutable_io_meta();
  mutable_meta->CopyFrom(StepToCtld().io_meta());
}
step_to_d.set_sh_script(StepToCtld().sh_script());
step_to_d.mutable_io_meta()->CopyFrom(StepToCtld().io_meta());

The final mutable_io_meta()->CopyFrom(...) forces step_to_d.io_meta to be present even when StepToCtld().has_io_meta() is false, defeating the guard and breaking semantics for callers that rely on the presence of this field.

Delete the final unconditional copy line.

♻️ Duplicate comments (3)
src/Craned/Supervisor/TaskManager.cpp (2)

71-107: Clean up xauth file on early failures in X11 setup

When mkstemp succeeds but fchown fails (lines 99-103), the temporary xauth file has already been created but won't be recorded in x11_meta. This leaves orphaned .Xauthority-XXXXXX files in the user's home directory on repeated failures.

🔎 Proposed fix
       int xauth_fd = mkstemp(x11_auth_path.data());
       if (xauth_fd == -1) {
         CRANE_ERROR("mkstemp() for xauth file failed: {}\n", strerror(errno));
         return CraneErrCode::ERR_SYSTEM_ERR;
       }

       int ret = fchown(xauth_fd, pwd.Uid(), pwd.Gid());
       if (ret == -1) {
         CRANE_ERROR("fchown() for xauth file failed: {}\n", strerror(errno));
+        close(xauth_fd);
+        unlink(x11_auth_path.c_str());
         return CraneErrCode::ERR_SYSTEM_ERR;
       }

489-495: Handle missing node in nodelist gracefully.

If the current node is not found in the nodelist (line 490), the function logs an error and returns path_to_parse with unreplaced %n placeholders. This could result in file paths containing literal %n.

🔎 Proposed fix to use fallback node_index
   auto node_it = std::ranges::find(grpc_nodelist, g_config.CranedIdOfThisNode);
+  int node_index = 0;
   if (node_it == grpc_nodelist.end()) {
-    CRANE_ERROR("Failed to find current node {} in nodelist",
-                g_config.CranedIdOfThisNode);
-    return path_to_parse;
+    CRANE_WARN("Failed to find current node {} in nodelist, using node_index=0",
+               g_config.CranedIdOfThisNode);
+  } else {
+    node_index = std::distance(grpc_nodelist.begin(), node_it);
   }
-  int node_index = std::distance(grpc_nodelist.begin(), node_it);
src/CraneCtld/DbClient.cpp (1)

1830-1905: DB serialization now consistently uses sh_script; duplicate assignment fixed.

Mapping the script BSON field to TaskToCtld::sh_script() / StepToCtld::sh_script() in all four document builders is consistent with the proto changes, and the previous duplicate initialization issue is resolved.

Also applies to: 1907-1913, 2005-2007, 2071-2077

🧹 Nitpick comments (3)
src/Craned/Supervisor/CforedClient.h (2)

172-174: Potential integer overflow on next_x11_id_.

next_x11_id_ is a uint32_t that increments without bounds checking. While unlikely in practice (would require ~4 billion X11 connections), for long-running systems this could eventually wrap around and cause ID collisions. Consider whether this is acceptable for your use case.


60-61: Remove unused type alias x11_id_t.

This type alias is defined but never used in the codebase. Removing it eliminates dead code.

src/Craned/Supervisor/CforedClient.cpp (1)

725-767: Consider holding mutex during WriteStringToFd_ call or document thread safety.

The mutex is held while accessing m_fwd_meta_map and m_x11_fd_info_map_, but WriteStringToFd_ (which writes to the fd) is also called within the locked region. This is fine for correctness but could cause lock contention if writes block. The current approach is acceptable given that the fds should be non-blocking, but consider adding a comment noting this assumption.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d1a1ffa and 2ab2a93.

📒 Files selected for processing (10)
  • protos/Crane.proto
  • protos/PublicDefs.proto
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/CraneCtld/DbClient.cpp
  • src/CraneCtld/RpcService/CtldGrpcServer.h
  • src/Craned/Supervisor/CforedClient.cpp
  • src/Craned/Supervisor/CforedClient.h
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
🧰 Additional context used
🧠 Learnings (32)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.332Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/SupervisorPublicDefs.h:32-37
Timestamp: 2025-05-26T11:06:28.796Z
Learning: The user L-Xiafeng prefers to defer refactoring duplicate definitions until they become a larger pattern in the codebase, rather than addressing individual instances immediately.
📚 Learning: 2025-05-23T02:32:43.952Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is handled/updated separately before calling SetAllocatedRes, so it does not need to be updated again within the method itself.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.h
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-23T02:32:43.952Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is updated before calling SetAllocatedRes, so it doesn't need to be updated again within the method itself.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.h
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.h
  • src/Craned/Supervisor/CforedClient.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/CforedClient.h
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Supervisor/TaskManager.h
  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/Craned/Supervisor/CforedClient.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Supervisor/TaskManager.h
  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/Craned/Supervisor/CforedClient.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Supervisor/TaskManager.h
  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-25T04:11:27.740Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:396-396
Timestamp: 2025-05-25T04:11:27.740Z
Learning: In TaskManager.cpp, GetCrunMeta() calls don't need null checks because they're only called in contexts where the task is guaranteed to be a CRUN task (e.g., SetupChildProcessCrunX11_ is only called when step->IsCrun() && x11() conditions are met), ensuring the metadata will always be valid.

Applied to files:

  • src/Craned/Supervisor/CforedClient.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.

Applied to files:

  • src/Craned/Supervisor/CforedClient.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-08-14T02:56:35.503Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 587
File: src/Craned/Supervisor/CforedClient.cpp:449-454
Timestamp: 2025-08-14T02:56:35.503Z
Learning: In CforedClient::AsyncSendRecvThread_(), the guard `if (state <= State::Registering) { continue; }` in the TIMEOUT branch only prevents premature cleanup when stopping before registration completes, but it doesn't block normal gRPC event processing. The completion queue will still deliver Prepare/Write/Read events that advance the state machine normally.

Applied to files:

  • src/Craned/Supervisor/CforedClient.cpp
📚 Learning: 2025-04-27T06:47:51.751Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.

Applied to files:

  • src/Craned/Supervisor/CforedClient.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/Craned/Supervisor/CforedClient.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Supervisor/TaskManager.h
  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-25T04:08:03.273Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/Supervisor.cpp:68-69
Timestamp: 2025-05-25T04:08:03.273Z
Learning: In the Crane supervisor component (src/Craned/Supervisor/), the supervisor process communicates with Craned through STDOUT using protobuf messages. The supervisor must not send any information to STDOUT before sending the "ready" message to Craned, as this would interfere with the inter-process communication protocol. Therefore, adding logging statements that might write to STDOUT before the ready message is sent could break the communication.

Applied to files:

  • src/Craned/Supervisor/CforedClient.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/CforedClient.h
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-12-08T08:11:40.332Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.332Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Supervisor/TaskManager.h
  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • protos/Crane.proto
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/CforedClient.h
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Supervisor/TaskManager.h
  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-26T11:04:56.055Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:72-72
Timestamp: 2025-05-26T11:04:56.055Z
Learning: The CraneSched project prefers using global variables (like `g_supervisor_keeper`) over dependency injection patterns. The team does not follow dependency injection approaches for managing singleton instances.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/CforedClient.h
  • src/Craned/Supervisor/TaskManager.h
  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-09T01:54:39.465Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1509-1517
Timestamp: 2025-05-09T01:54:39.465Z
Learning: The CraneSched project uses C++23, which supports Class Template Argument Deduction (CTAD) for standard containers like std::set and includes ranges support, making std::ranges::views::keys valid without additional headers.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
📚 Learning: 2025-05-08T07:38:42.362Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:0-0
Timestamp: 2025-05-08T07:38:42.362Z
Learning: In CraneSched's PMIx integration, the `g_pmix_server->SetupFork()` function must be called in the child process after fork() and before exec() to properly set up the PMIx environment variables.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-02T07:05:26.012Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:601-602
Timestamp: 2025-05-02T07:05:26.012Z
Learning: In the CraneCtld codebase, the variables m_disconnected_ and m_registered_ in CranedStub class are already defined as std::atomic_bool, making them thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/Craned/Supervisor/CforedClient.h
  • src/Craned/Supervisor/TaskManager.h
  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.

Applied to files:

  • src/Craned/Supervisor/CforedClient.h
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.cpp
  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-04-27T11:52:31.017Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.cpp
📚 Learning: 2025-04-18T02:26:16.113Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched includes assertions in its operator overloads (particularly in operator-=) that verify resources being subtracted are less than or equal to available resources, ensuring no negative values can occur during resource allocation or deallocation operations.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-03-31T09:29:40.388Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In CraneSched, `phmap::parallel_flat_hash_set` from the Parallel Hashmap library (`parallel_hashmap/phmap.h`) is used for thread-safe containers. This container implements internal sharding with separate locks for different parts of the hash table, making it inherently thread-safe for concurrent operations without requiring external synchronization.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-04-18T02:26:16.113Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched contains assertions in its overloaded operators (like -=) that prevent negative values from occurring when resources are allocated or deallocated.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-03-31T09:29:40.388Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In the CraneSched project, the `m_allowed_certs_` in `VaultClientWrapper` is implemented as a `phmap::parallel_flat_hash_set<std::string>`, which is a thread-safe container designed for concurrent access, making it safe to use in multithreaded contexts without additional synchronization.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
🧬 Code graph analysis (3)
src/Craned/Supervisor/CforedClient.cpp (5)
src/Utilities/PublicHeader/OS.cpp (2)
  • SetFdNonBlocking (153-160)
  • SetFdNonBlocking (153-153)
src/Craned/Supervisor/CforedClient.h (3)
  • next_x11_id_ (172-172)
  • exit_code (151-151)
  • signaled (152-153)
src/Utilities/PublicHeader/String.cpp (2)
  • SetCurrentThreadName (371-379)
  • SetCurrentThreadName (371-371)
src/Craned/Core/StepInstance.h (1)
  • exit_code (37-37)
src/Craned/Supervisor/SupervisorPublicDefs.h (1)
  • exit_code (37-37)
src/Craned/Supervisor/CforedClient.h (1)
src/Craned/Supervisor/CforedClient.cpp (12)
  • TaskProcessStop (825-838)
  • TaskProcessStop (825-826)
  • TaskOutPutForward (849-857)
  • TaskOutPutForward (849-850)
  • TaskErrOutPutForward (858-866)
  • TaskErrOutPutForward (858-859)
  • TaskX11ConnectForward (868-873)
  • TaskX11ConnectForward (868-868)
  • TaskX11OutPutForward (875-885)
  • TaskX11OutPutForward (875-877)
  • TaskX11OutputFinish (887-894)
  • TaskX11OutputFinish (887-887)
src/CraneCtld/CtldPublicDefs.h (2)
src/Craned/Supervisor/TaskManager.cpp (6)
  • IsBatch (156-158)
  • IsBatch (156-156)
  • IsCalloc (169-172)
  • IsCalloc (169-169)
  • IsCrun (164-167)
  • IsCrun (164-164)
src/CraneCtld/CtldPublicDefs.cpp (4)
  • SetCranedIds (155-159)
  • SetCranedIds (155-155)
  • SetCranedIds (1138-1141)
  • SetCranedIds (1138-1138)
🪛 Cppcheck (2.19.0)
src/Craned/Supervisor/TaskManager.cpp

[warning] 536-536: Iterators to containers from different expressions 'input' and 'groups[0]' are used together.

(mismatchingContainerExpression)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (39)
src/Craned/Supervisor/CforedClient.h (3)

29-36: LGTM!

The X11FdInfo struct cleanly captures per-X11-connection state with the new x11_input_stopped flag for tracking input flow control.


38-58: LGTM!

The TaskFwdMeta struct is well-extended to support stderr handling with the new err_handle, stderr_read, and err_stopped fields. The structure maintains symmetry between stdout and stderr handling.


129-161: LGTM - Well-designed unified forwarding queue model.

The FwdRequest structure with std::variant cleanly consolidates all forwarding payload types (IOFwdRequest, X11FwdConnectReq, X11FwdReq, X11FwdEofReq, TaskFinishStatus) into a single queue interface. This design simplifies the forwarding pipeline and ensures type safety.

src/Craned/Supervisor/CforedClient.cpp (9)

83-84: LGTM!

Good defensive assertion to ensure all task forwarding metadata is cleaned up before destruction.


88-113: LGTM!

The function signature and implementation are properly updated to handle the new stderr_read parameter. The early marking of input_stopped when stdin_write == -1 is a sensible optimization.


157-203: LGTM - Clean per-connection X11 lifecycle management.

The X11 forwarding setup properly:

  • Assigns unique local IDs via next_x11_id_++
  • Registers the connection with TaskX11ConnectForward
  • Sets up event handlers for data, EOF, error, and close events
  • Stores connection state in m_x11_fd_info_map_ under mutex protection

340-359: LGTM - Stderr pipe handlers correctly implemented.

The stderr pipe handlers properly set err_stopped = true and conditionally call on_finish() only when output_stopped is also true. This correctly implements the "both streams must finish" logic.


369-407: LGTM - PTY mode correctly marks stderr as stopped.

Since PTY mode combines stdout/stderr into a single stream, marking err_stopped = true at line 370 is the correct approach to ensure the finish logic works properly.


430-464: LGTM - Cleanup logic properly extended for stderr.

The CleanStopTaskIOQueueCb_ function correctly handles the new stderr resources by closing and resetting err_handle and closing stderr_read fd.


490-558: LGTM - Clean variant-based message serialization.

The VariantVisitor pattern cleanly handles all FwdRequest types with appropriate protobuf payload construction. The code properly sets craned_id for X11 messages and handles both stdout/stderr through the is_stdout flag.


840-847: LGTM!

The TaskEnd function properly removes the task from m_fwd_meta_map under mutex protection and triggers the status change callback.


849-894: LGTM - Forward functions correctly implemented.

All the new task forwarding functions (TaskOutPutForward, TaskErrOutPutForward, TaskX11ConnectForward, TaskX11OutPutForward, TaskX11OutputFinish) properly enqueue requests to the unified m_task_fwd_req_queue_ with appropriate payload types. The TaskX11OutputFinish correctly erases the X11 fd info under mutex protection.

src/Craned/Supervisor/TaskManager.cpp (9)

112-150: LGTM on cleanup logic.

The cleanup properly handles X11 auth file removal with a guard against uninitialized template paths, and the cgroup cleanup retry loop is appropriately bounded.


223-298: LGTM on status transition validation.

The state machine properly validates status transitions with helpful warning logs for unexpected states while avoiding blocking errors. The use of std::unreachable() is appropriate for C++23.


371-412: LGTM on per-task cgroup lifecycle.

The per-task cgroup allocation in Prepare() and cleanup in Cleanup() properly support the multi-task execution model with appropriate retry bounds for process termination.


588-700: LGTM on fd management in ForkCrunAndInitMeta_.

The use of opened_fd set to track file descriptors for cleanup on error paths is a good pattern. The function properly handles the complexity of stdin/stdout/stderr pipe setup for both forwarding and file-based IO.


1810-1820: LGTM on child process setup.

The child process correctly migrates to the task cgroup early in the fork path before setting other properties. Using std::abort() on cgroup migration failure is appropriate for the child process context.


2457-2520: LGTM on terminate task queue handling.

The termination logic properly handles the different cases: daemon steps with orphaned flag, empty task maps, and the re-queue pattern for not-ready elements is consistent with other handlers.


2564-2664: LGTM on task execution callback.

The execution flow properly:

  1. Creates task instances before step preparation
  2. Fails all tasks on step preparation error
  3. Allocates step cgroup before launching tasks
  4. Tracks execution IDs for task-to-pid mapping

2680-2709: LGTM on SSH process migration to cgroup.

The implementation correctly:

  1. Validates that only daemon steps can perform SSH process migration
  2. Lazily allocates the step cgroup if not already present
  3. Returns appropriate error codes for different failure modes

2318-2330: LGTM on exit code calculation for crun tasks.

The exit code calculation properly adds KCrunExitCodeStatusNum for signaled processes, maintaining consistency with shell conventions (128 + signal number).

src/Craned/Supervisor/TaskManager.h (6)

65-86: LGTM on new struct definitions.

The X11Meta and TerminationStatus structs are well-defined with clear purpose:

  • X11Meta encapsulates X11 forwarding state
  • TerminationStatus tracks final step termination info for multi-task scenarios

90-106: LGTM on StepInstance constructor.

The initialization of task_ids from step.task_res_map() keys using C++23 ranges is clean and correct.


212-225: LGTM on CrunInstanceMeta extension.

The new fields (stderr_write, stderr_read, fwd_stdint, fwd_stdout, fwd_stderr) properly support the batch input/output/error option feature from the PR objectives.


235-278: LGTM on ITaskInstance changes.

The per-task cgroup member (m_task_cg) and helper methods (IsBatch(), IsCrun()) properly support the multi-task execution model. The constructor correctly accepts task_id as a parameter.


433-501: LGTM on TaskManager async method declarations.

The new methods follow the established async pattern with queue/handle pairs. The default parameters for ShutdownSupervisorAsync provide sensible defaults for the common success case.


549-591: LGTM on async infrastructure additions.

The new async handles and concurrent queues follow the established patterns in the class. The queue types correctly correspond to their respective async method signatures.

protos/PublicDefs.proto (3)

109-121: Status enum rename to Starting looks consistent but check all status-group logic.

TaskStatus gained Starting = 8 (replacing prior “Configured” semantics). Make sure all status-group checks (e.g. “running-like” states in schedulers, DB queries, and recovery code) were updated to treat Starting appropriately (e.g., included with Running where intended) to avoid misclassification.


135-186: IO meta + script threading is coherent; confirm all producers fill them consistently.

IoMeta and sh_script are now carried in TaskToCtld, StepToCtld, and StepToD, with StepToD also adding task_res_map and ntasks_total. The design matches the new per-task IO model, but you should double‑check that:

  • All submission paths populate io_meta/sh_script as expected (batch, crun, container).
  • ntasks_total matches task_res_map.size() for execution steps, and that daemon steps’ approximation doesn’t confuse consumers.

Also applies to: 218-255, 311-352


364-379: IoMeta and LicenseInfo message definitions look sound.

The new IoMeta (with append mode, file patterns, and crun task routing) and LicenseInfo messages are well-structured and align with how they’re consumed in other files; no structural issues spotted.

Also applies to: 829-834

src/CraneCtld/RpcService/CtldGrpcServer.h (1)

73-107: New ResAllocInfo handling in WriteTaskResAllocReply is correct.

The method correctly consumes StepResAllocArgs’s ResAllocInfo, populating allocated_craned_regex, craned_ids, craned_task_map, and ntasks_total, and propagates errors via failure_reason. Provided the producer keeps ntasks_total and craned_task_map consistent, this looks good.

src/CraneCtld/CtldPublicDefs.cpp (3)

149-159: Craned‑ID propagation and daemon step initialization look consistent.

StepInCtld::SetCranedIds now mirrors the vector‑based storage and keeps RuntimeAttr.craned_ids in sync. DaemonStepInCtld::InitFromJob derives a unified node_set from job.CranedIds() for execution/configuring/running nodes, and GetStepToD sets ntasks_total from job->ntasks_per_node * job->node_num (with a FIXME). This is internally consistent; just ensure consumers of ntasks_total tolerate the approximation for daemon steps.

Also applies to: 264-275, 333-415


469-608: New StepStatusChange semantics and AllExecutionStepsFinished() look reasonable.

Both daemon and common steps now return std::optional<std::pair<TaskStatus,uint32_t>>, with non‑nullopt indicating the job has a final status/exit code:

  • Daemon step returns either its own status (if primary step never created) or the primary step’s status.
  • Common step returns primary step status once all execution steps (primary + common) are gone, as detected by AllExecutionStepsFinished().

This matches the intended “job‑level” completion semantics; just verify the RPC handler uses the returned pair (when present) as the single source of truth for updating TaskInCtld and persistence.

Also applies to: 873-911, 918-920


632-747: Per‑task resource mapping is coherent; ensure consistency between maps and counts.

Primary and dynamically scheduled common steps now populate:

  • CommonStepInCtld::task_res_map with per‑task ResourceInNode (zeroing per‑task mem when step‑level mem is used), and
  • CommonStepInCtld::craned_task_map mapping each CranedId to its local task IDs.

CommonStepInCtld::GetStepToD exposes task_res_map and ntasks_total = task_res_map.size() into StepToD, and the scheduler’s SchedulePendingSteps populates the same structures for crun common steps. This is consistent with the new proto schema; just keep an eye on:

  • Batch/Calloc vs crun semantics (single aggregated task vs per‑task mapping), and
  • ntasks_total always matching task_res_map.size() for execution steps.

Also applies to: 802-871, 958-1037

protos/Crane.proto (2)

476-479: New IO/X11/exit‑status payloads are structurally consistent; verify end‑to‑end wiring.

You’ve extended:

  • StreamCtldReply.TaskResAllocatedReply and StreamCrunReply.TaskResAllocatedReply with per‑craned Tasks maps and ntasks_total.
  • StreamCrun{Request,Reply} and StreamTaskIO{Request,Reply} with task‑scoped IO, X11 connection/EOF, error‑output, and exit‑status messages.

The schemas look coherent and use fresh tags; please confirm that:

  • All senders populate craned_task_map and ntasks_total consistently with the server‑side ResAllocInfo, and
  • Supervisors/frontends handle the new X11/IO/exit‑status variants without breaking older behaviors.

Also applies to: 544-563, 704-734, 814-901, 904-965, 968-1003


544-547: QueryNodeEvents* and PluginQueryService additions look fine; check client compatibility.

The QueryNodeEvents{Request,Reply}/NodeEventInfo messages and the new PluginQueryService (with QueryTaskEfficiency and QueryNodeEvents) are cleanly defined. Just ensure:

  • All plugin clients are updated to use PluginQueryService (if this replaces an older service name), and
  • Authorization via the uid fields is enforced on the server side.

Also applies to: 1170-1173

src/CraneCtld/CtldPublicDefs.h (3)

315-353: Richer StepResAllocArgs API is well‑designed; confirm all callbacks updated.

InteractiveMeta::StepResAllocArgs now wraps a ResAllocInfo inside std::expected, carrying regex, ordered allocated_craned_ids, craned_task_map, and ntasks_total. This is a clear improvement over the previous pair‑based API; just ensure all cb_step_res_allocated implementations (Cfored/Crun) have been updated to the new structure.


526-590: Switching steps’ m_craned_ids_ to std::vector and adding type helpers looks good.

Using std::vector<CranedId> for m_craned_ids_ and exposing it via SetCranedIds / CranedIds() aligns step storage with job‑side craned_ids. The inline IsBatch(), IsCalloc(), and IsCrun() helpers mirror the semantics in StepInstance on the Craned side, which helps keep Ctld/Supervisor logic consistent.


676-709: New per‑step/task state in CommonStepInCtld and TaskInCtld is consistent with usage in the .cpp.

  • CommonStepInCtld now tracks task_res_map and craned_task_map, which are exposed in GetStepToD and populated from job allocations and the scheduler.
  • TaskInCtld::IsCrun() and AllExecutionStepsFinished() complement the existing helpers and are used by the updated status‑change logic and scheduler.
  • SchedulePendingSteps’ declaration matches the richer logic implemented in the .cpp.

These additions look coherent and keep the public API compact.

Also applies to: 832-921, 958-1037

Comment on lines +282 to +302
ph->on<uvw::end_event>([this, tid = task_id, on_finish](
uvw::end_event&, uvw::pipe_handle& h) {
// EOF Received
h.close();
CRANE_INFO("[Task #{}] Output pipe received EOF.", tid);
auto& meta = this->m_fwd_meta_map.at(tid);
meta.output_stopped = true;
if (meta.err_stopped) on_finish();
on_finish();
});

ph->on<uvw::error_event>([this, tid = meta.task_id, on_finish](
uvw::error_event& e, uvw::pipe_handle& h) {
CRANE_WARN("[Task #{}] output pipe error: {}. Closing.", tid,
e.what());
h.close();
auto& meta = this->m_fwd_meta_map.at(tid);
meta.output_stopped = true;
if (meta.err_stopped) on_finish();
on_finish();
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Bug: Double call to on_finish() in stdout pipe handlers.

In both the end_event and error_event handlers for the stdout pipe, on_finish() is called twice - once conditionally (if (meta.err_stopped) on_finish();) and once unconditionally on the next line. This will cause duplicate task cleanup attempts.

🔎 Proposed fix
        ph->on<uvw::end_event>([this, tid = task_id, on_finish](
                                   uvw::end_event&, uvw::pipe_handle& h) {
          // EOF Received
          h.close();
          CRANE_INFO("[Task #{}] Output pipe received EOF.", tid);
          auto& meta = this->m_fwd_meta_map.at(tid);
          meta.output_stopped = true;
-          if (meta.err_stopped) on_finish();
-          on_finish();
+          if (meta.err_stopped) on_finish();
        });

        ph->on<uvw::error_event>([this, tid = meta.task_id, on_finish](
                                     uvw::error_event& e, uvw::pipe_handle& h) {
          CRANE_WARN("[Task #{}] output pipe error: {}. Closing.", tid,
                     e.what());
          h.close();
          auto& meta = this->m_fwd_meta_map.at(tid);
          meta.output_stopped = true;
-          if (meta.err_stopped) on_finish();
-          on_finish();
+          if (meta.err_stopped) on_finish();
        });
🤖 Prompt for AI Agents
In src/Craned/Supervisor/CforedClient.cpp around lines 282-302, both the
end_event and error_event stdout pipe handlers call on_finish() twice (once
conditionally and once unconditionally), causing duplicate cleanup; fix by
removing the unconditional on_finish() call in each handler so you only set
meta.output_stopped = true and then call on_finish() only if meta.err_stopped is
true (i.e., call on_finish() exactly once when both stdout and stderr are
stopped).

Comment on lines +825 to 838
bool CforedClient::TaskProcessStop(task_id_t task_id, uint32_t exit_code,
bool signaled) {
CRANE_DEBUG("[Task #{}] Process stopped with exit_code: {}, signaled: {}.",
task_id, exit_code, signaled);
m_task_fwd_req_queue_.enqueue(FwdRequest{
.type = StreamTaskIORequest::TASK_EXIT_STATUS,
.data = TaskFinishStatus{.task_id = task_id,
.exit_code = exit_code,
.signaled = signaled},
});
absl::MutexLock lock(&m_mtx_);
m_fwd_meta_map[task_id].proc_stopped = true;
return m_fwd_meta_map[task_id].output_stopped;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Missing check for task existence in m_fwd_meta_map.

TaskProcessStop directly accesses m_fwd_meta_map[task_id] without checking if the task exists. If called with an unknown task_id, this will create an empty entry and return uninitialized values.

🔎 Proposed fix
 bool CforedClient::TaskProcessStop(task_id_t task_id, uint32_t exit_code,
                                    bool signaled) {
   CRANE_DEBUG("[Task #{}] Process stopped with exit_code: {}, signaled: {}.",
               task_id, exit_code, signaled);
   m_task_fwd_req_queue_.enqueue(FwdRequest{
       .type = StreamTaskIORequest::TASK_EXIT_STATUS,
       .data = TaskFinishStatus{.task_id = task_id,
                                .exit_code = exit_code,
                                .signaled = signaled},
   });
   absl::MutexLock lock(&m_mtx_);
+  auto it = m_fwd_meta_map.find(task_id);
+  if (it == m_fwd_meta_map.end()) {
+    CRANE_ERROR("[Task #{}] Cannot find fwd meta in TaskProcessStop.", task_id);
+    return true;  // Assume output stopped if task not found
+  }
-  m_fwd_meta_map[task_id].proc_stopped = true;
-  return m_fwd_meta_map[task_id].output_stopped;
+  it->second.proc_stopped = true;
+  return it->second.output_stopped && it->second.err_stopped;
 }

Note: The return value should likely also check err_stopped in addition to output_stopped to be consistent with the new stderr handling model.

Comment on lines +2246 to +2267
void TaskManager::EvShutdownSupervisorCb_() {
std::tuple<crane::grpc::TaskStatus, uint32_t, std::string> final_status;
bool got_final_status = false;
do {
got_final_status = m_shutdown_status_queue_.try_dequeue(final_status);
if (!got_final_status) continue;
auto& [status, exit_code, reason] = final_status;
if (m_step_.IsDaemonStep()) {
CRANE_DEBUG("Sending a {} status as daemon step.", status);
m_step_.GotNewStatus(status);
if (!m_step_.orphaned)
g_craned_client->StepStatusChangeAsync(status, exit_code, reason);
}

m_step_.CleanUp();

g_craned_client->Shutdown();
g_server->Shutdown();

this->Shutdown();
} while (!got_final_status);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Potential busy loop in shutdown callback.

The do-while(!got_final_status) loop combined with if (!got_final_status) continue; creates a busy loop that spins until an element is available in the queue. Since this is an async callback triggered by m_shutdown_supervisor_handle_->send(), it should process available elements rather than blocking.

Consider either:

  1. Changing to a simple while (m_shutdown_status_queue_.try_dequeue(...)) pattern
  2. Or returning early if nothing is in the queue
🔎 Proposed fix
 void TaskManager::EvShutdownSupervisorCb_() {
   std::tuple<crane::grpc::TaskStatus, uint32_t, std::string> final_status;
-  bool got_final_status = false;
-  do {
-    got_final_status = m_shutdown_status_queue_.try_dequeue(final_status);
-    if (!got_final_status) continue;
+  while (m_shutdown_status_queue_.try_dequeue(final_status)) {
     auto& [status, exit_code, reason] = final_status;
     if (m_step_.IsDaemonStep()) {
       CRANE_DEBUG("Sending a {} status as daemon step.", status);
       m_step_.GotNewStatus(status);
       if (!m_step_.orphaned)
         g_craned_client->StepStatusChangeAsync(status, exit_code, reason);
     }

     m_step_.CleanUp();

     g_craned_client->Shutdown();
     g_server->Shutdown();

     this->Shutdown();
-  } while (!got_final_status);
+  }
 }
🤖 Prompt for AI Agents
In src/Craned/Supervisor/TaskManager.cpp around lines 2246-2267 the shutdown
callback currently uses a do/while with a continue on !got_final_status which
creates a busy spin waiting for queue items; replace this with a non-blocking
drain: first try to dequeue once and return immediately if none are available,
otherwise use a while(m_shutdown_status_queue_.try_dequeue(final_status)) loop
to process each available final_status exactly once (remove the continue and the
do/while), and after draining the queue perform the existing cleanup,
client/server shutdown and this->Shutdown() calls.

fix: step lost

Signed-off-by: junlinli <[email protected]>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (2)
src/Craned/Supervisor/TaskManager.cpp (2)

65-107: Clean up X11 xauth file on early failures in StepInstance::Prepare().

In the X11 branch, if mkstemp or fchown fails, the temporary xauth file has already been created but x11_meta is never set, and the function returns immediately:

int xauth_fd = mkstemp(x11_auth_path.data());
if (xauth_fd == -1) {
  CRANE_ERROR("mkstemp() for xauth file failed: {}\n", strerror(errno));
  return CraneErrCode::ERR_SYSTEM_ERR;
}

int ret = fchown(xauth_fd, pwd.Uid(), pwd.Gid());
if (ret == -1) {
  CRANE_ERROR("fchown() for xauth file failed: {}\n", strerror(errno));
  return CraneErrCode::ERR_SYSTEM_ERR;
}

Since x11_meta is only populated after these succeed, CleanUp() cannot remove the stray .Xauthority-XXXXXX file on these error paths. This matches a previously raised concern.

Consider explicitly closing xauth_fd and unlinking x11_auth_path before returning in these failure branches so repeated X11 setup failures do not leave garbage files in the user’s home.

Also applies to: 112-123


2199-2230: Eliminate potential busy-wait in EvShutdownSupervisorCb_().

EvShutdownSupervisorCb_() uses a do { ... } while (!got_final_status); loop with:

got_final_status = m_shutdown_status_queue_.try_dequeue(final_status);
if (!got_final_status) continue;

If m_shutdown_supervisor_handle_->send() is ever called without a corresponding enqueued element (or there is a race), this becomes a tight busy loop in the uvw thread, spinning until something shows up in the queue. A non-blocking async callback should just process whatever is available and return.

A safer pattern is:

  • Try to dequeue once; if nothing is available, return immediately.
  • Otherwise, process the dequeued element (or drain all available ones) and then perform shutdown.

For example:

Proposed fix
 void TaskManager::EvShutdownSupervisorCb_() {
   std::tuple<crane::grpc::TaskStatus, uint32_t, std::string> final_status;
-  bool got_final_status = false;
-  do {
-    got_final_status = m_shutdown_status_queue_.try_dequeue(final_status);
-    if (!got_final_status) continue;
+  if (!m_shutdown_status_queue_.try_dequeue(final_status)) {
+    return;
+  }
+  // Optionally: while (m_shutdown_status_queue_.try_dequeue(final_status)) { ... }
     auto& [status, exit_code, reason] = final_status;
     if (m_step_.IsDaemonStep()) {
       CRANE_DEBUG("Sending a {} status as daemon step.", status);
       m_step_.GotNewStatus(status);
       if (!m_step_.orphaned)
         g_craned_client->StepStatusChangeAsync(status, exit_code, reason);
     }

     m_step_.CleanUp();

     g_craned_client->Shutdown();
     g_server->Shutdown();

     this->Shutdown();
-  } while (!got_final_status);
 }
🧹 Nitpick comments (1)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)

121-121: Naming inconsistency: use lowercase k prefix.

The constant KCrunExitCodeStatusNum uses an uppercase K prefix, while all other constants in the ExitCode namespace use a lowercase k prefix (e.g., kExitStatusNum, kTerminationSignalBase). Consider renaming to kCrunExitCodeStatusNum for consistency.

🔎 Proposed fix
-constexpr size_t KCrunExitCodeStatusNum = 128;
+constexpr size_t kCrunExitCodeStatusNum = 128;
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ab2a93 and 9413b9b.

📒 Files selected for processing (5)
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
  • src/Utilities/PublicHeader/include/crane/PublicHeader.h
🧰 Additional context used
🧠 Learnings (30)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.332Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/SupervisorPublicDefs.h:32-37
Timestamp: 2025-05-26T11:06:28.796Z
Learning: The user L-Xiafeng prefers to defer refactoring duplicate definitions until they become a larger pattern in the codebase, rather than addressing individual instances immediately.
📚 Learning: 2025-12-08T08:11:40.332Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.332Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-25T04:11:27.740Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:396-396
Timestamp: 2025-05-25T04:11:27.740Z
Learning: In TaskManager.cpp, GetCrunMeta() calls don't need null checks because they're only called in contexts where the task is guaranteed to be a CRUN task (e.g., SetupChildProcessCrunX11_ is only called when step->IsCrun() && x11() conditions are met), ensuring the metadata will always be valid.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:08:03.273Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/Supervisor.cpp:68-69
Timestamp: 2025-05-25T04:08:03.273Z
Learning: In the Crane supervisor component (src/Craned/Supervisor/), the supervisor process communicates with Craned through STDOUT using protobuf messages. The supervisor must not send any information to STDOUT before sending the "ready" message to Craned, as this would interfere with the inter-process communication protocol. Therefore, adding logging statements that might write to STDOUT before the ready message is sent could break the communication.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-26T11:04:56.055Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:72-72
Timestamp: 2025-05-26T11:04:56.055Z
Learning: The CraneSched project prefers using global variables (like `g_supervisor_keeper`) over dependency injection patterns. The team does not follow dependency injection approaches for managing singleton instances.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T01:54:39.465Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1509-1517
Timestamp: 2025-05-09T01:54:39.465Z
Learning: The CraneSched project uses C++23, which supports Class Template Argument Deduction (CTAD) for standard containers like std::set and includes ranges support, making std::ranges::views::keys valid without additional headers.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
📚 Learning: 2025-05-08T07:38:42.362Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:0-0
Timestamp: 2025-05-08T07:38:42.362Z
Learning: In CraneSched's PMIx integration, the `g_pmix_server->SetupFork()` function must be called in the child process after fork() and before exec() to properly set up the PMIx environment variables.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.cpp
  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-23T02:32:43.952Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is updated before calling SetAllocatedRes, so it doesn't need to be updated again within the method itself.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.cpp
  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-23T02:32:43.952Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is handled/updated separately before calling SetAllocatedRes, so it does not need to be updated again within the method itself.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.cpp
  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-04-27T11:52:31.017Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.cpp
📚 Learning: 2025-04-18T02:26:16.113Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched includes assertions in its operator overloads (particularly in operator-=) that verify resources being subtracted are less than or equal to available resources, ensuring no negative values can occur during resource allocation or deallocation operations.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-03-31T09:29:40.388Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In CraneSched, `phmap::parallel_flat_hash_set` from the Parallel Hashmap library (`parallel_hashmap/phmap.h`) is used for thread-safe containers. This container implements internal sharding with separate locks for different parts of the hash table, making it inherently thread-safe for concurrent operations without requiring external synchronization.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-04-18T02:26:16.113Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched contains assertions in its overloaded operators (like -=) that prevent negative values from occurring when resources are allocated or deallocated.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-03-31T09:29:40.388Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In the CraneSched project, the `m_allowed_certs_` in `VaultClientWrapper` is implemented as a `phmap::parallel_flat_hash_set<std::string>`, which is a thread-safe container designed for concurrent access, making it safe to use in multithreaded contexts without additional synchronization.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-02T07:05:26.012Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:601-602
Timestamp: 2025-05-02T07:05:26.012Z
Learning: In the CraneCtld codebase, the variables m_disconnected_ and m_registered_ in CranedStub class are already defined as std::atomic_bool, making them thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).

Applied to files:

  • src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
🧬 Code graph analysis (3)
src/Craned/Supervisor/TaskManager.cpp (2)
src/Utilities/PublicHeader/OS.cpp (4)
  • CreateFoldersForFileEx (109-148)
  • CreateFoldersForFileEx (109-110)
  • DeleteFile (64-71)
  • DeleteFile (64-64)
src/Craned/Core/StepInstance.cpp (6)
  • ec (278-278)
  • ec (297-297)
  • GotNewStatus (336-408)
  • GotNewStatus (336-336)
  • Prepare (83-90)
  • Prepare (83-83)
src/CraneCtld/CtldPublicDefs.h (1)
src/CraneCtld/CtldPublicDefs.cpp (8)
  • SetCranedIds (155-159)
  • SetCranedIds (155-155)
  • SetCranedIds (1138-1141)
  • SetCranedIds (1138-1138)
  • StepStatusChange (469-611)
  • StepStatusChange (470-475)
  • StepStatusChange (873-1069)
  • StepStatusChange (874-879)
src/Craned/Supervisor/TaskManager.h (3)
src/Craned/Supervisor/TaskManager.cpp (14)
  • GotNewStatus (223-298)
  • GotNewStatus (223-223)
  • IsBatch (156-158)
  • IsBatch (156-156)
  • IsCrun (164-167)
  • IsCrun (164-164)
  • SupervisorFinishInit (2037-2039)
  • SupervisorFinishInit (2037-2037)
  • ShutdownSupervisorAsync (2045-2052)
  • ShutdownSupervisorAsync (2045-2047)
  • EvGrpcCheckStatusCb_ (2636-2641)
  • EvGrpcCheckStatusCb_ (2636-2636)
  • EvGrpcMigrateSshProcToCgroupCb_ (2643-2672)
  • EvGrpcMigrateSshProcToCgroupCb_ (2643-2643)
src/Craned/Common/CgroupManager.h (1)
  • CgroupInterface (389-421)
src/Craned/Core/StepInstance.cpp (2)
  • GotNewStatus (336-408)
  • GotNewStatus (336-336)
🪛 Cppcheck (2.19.0)
src/Craned/Supervisor/TaskManager.cpp

[warning] 539-539: Iterators to containers from different expressions 'input' and 'groups[0]' are used together.

(mismatchingContainerExpression)

🔇 Additional comments (3)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (3)

130-139: LGTM! Well-documented exit code ranges.

The explicit range markers improve code clarity and align correctly with the existing constants (kExitStatusNum, kTerminationSignalNum, kCraneExitCodeBase). The math checks out:

  • Exit codes: 0-255 (256 values)
  • Signal termination: 256-319 (64 values)
  • Crane-defined codes: 320+

555-561: LGTM! Standard variant visitor pattern.

This is a well-established C++17 idiom for visiting std::variant types. The implementation correctly uses variadic templates and fold expressions to combine multiple callable objects. Placement at the end of the file is appropriate for this utility template.


116-117: Constants kCrunFwdALL and kCrunFwdNONE are actively used.

Both constants are referenced in src/Craned/Supervisor/TaskManager.cpp (lines 588 and 590) within the CrunParseFilePattern_ method, where they compare against I/O forwarding pattern strings. The constants follow naming conventions and are not orphaned.

Comment on lines +849 to 869
switch (this->type) {
case crane::grpc::Batch:
step_to_d.mutable_batch_meta()->CopyFrom(StepToCtld().batch_meta());
break;
case crane::grpc::Interactive:
step_to_d.mutable_interactive_meta()->CopyFrom(
StepToCtld().interactive_meta());
break;
case crane::grpc::Container:
step_to_d.mutable_container_meta()->CopyFrom(StepToCtld().container_meta());
break;
default:
std::unreachable();
}
if (StepToCtld().has_io_meta()) {
auto* mutable_meta = step_to_d.mutable_io_meta();
mutable_meta->CopyFrom(StepToCtld().io_meta());
}
step_to_d.set_sh_script(StepToCtld().sh_script());
step_to_d.mutable_io_meta()->CopyFrom(StepToCtld().io_meta());

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Avoid duplicating io_meta assignment in CommonStepInCtld::GetStepToD.

io_meta is first copied under if (StepToCtld().has_io_meta()) and then unconditionally copied again:

if (StepToCtld().has_io_meta()) {
  auto* mutable_meta = step_to_d.mutable_io_meta();
  mutable_meta->CopyFrom(StepToCtld().io_meta());
}
step_to_d.set_sh_script(StepToCtld().sh_script());
step_to_d.mutable_io_meta()->CopyFrom(StepToCtld().io_meta());

The second CopyFrom is redundant and also forces step_to_d.io_meta to be present even when StepToCtld().has_io_meta() is false, which subtly changes presence semantics for consumers relying on has_io_meta(). Consider removing the unconditional copy and only propagating io_meta inside the guarded block.

🤖 Prompt for AI Agents
In src/CraneCtld/CtldPublicDefs.cpp around lines 849-869, the code copies
io_meta twice — once guarded by if (StepToCtld().has_io_meta()) and then
unconditionally via step_to_d.mutable_io_meta()->CopyFrom(...), which forces
presence even when absent; remove the unconditional
step_to_d.mutable_io_meta()->CopyFrom(StepToCtld().io_meta()) so io_meta is only
copied inside the has_io_meta() guard (leave the sh_script assignment and the
type-based meta copies unchanged).

Comment on lines +612 to +710
CraneExpected<pid_t> ProcInstance::ForkCrunAndInitIOfd_() {
auto* meta = GetCrunMeta_();
CRANE_DEBUG("Launch crun task #{} pty: {}", task_id,
m_parent_step_inst_->pty);

std::set<int> opened_fd{};
const auto open_dev_null = [this]() -> int {
int dev_null_fd = open("/dev/null", O_RDWR);
if (dev_null_fd == -1) {
CRANE_ERROR("[Task #{}] Failed to open /dev/null: {}", task_id,
std::strerror(errno));
return -1;
}
return dev_null_fd;
};
int stdin_local_fd = -1;
int stdout_local_fd = -1;
int stderr_local_fd = -1;

int to_crun_pipe[2];
int from_crun_pipe[2];
int from_crun_stdout_pipe[2];
int from_crun_stderr_pipe[2];
int crun_pty_fd = -1;

int open_mode =
GetParentStep().io_meta().open_mode_append() ? O_APPEND : O_TRUNC;
if (!m_parent_step_inst_->pty) {
if (pipe(to_crun_pipe) == -1) {
CRANE_ERROR("[Task #{}] Failed to create pipe for task io forward: {}",
task_id, strerror(errno));
return std::unexpected(CraneErrCode::ERR_SYSTEM_ERR);
if (meta->fwd_stdin) {
if (pipe(to_crun_pipe) == -1) {
CRANE_ERROR("[Task #{}] Failed to create pipe for task io forward: {}",
task_id, std::strerror(errno));
return std::unexpected(CraneErrCode::ERR_SYSTEM_ERR);
}
opened_fd.insert(to_crun_pipe[0]);
opened_fd.insert(to_crun_pipe[1]);
} else {
if (meta->parsed_input_file_pattern.empty()) {
int null_fd = open_dev_null();
if (null_fd == -1) {
return std::unexpected(CraneErrCode::ERR_SYSTEM_ERR);
}
stdin_local_fd = null_fd;
} else {
int fd = open(meta->parsed_input_file_pattern.c_str(), O_RDONLY);
if (fd == -1) {
CRANE_ERROR("[Task #{}] Failed to open input file {}: {}", task_id,
meta->parsed_input_file_pattern, std::strerror(errno));
return std::unexpected(CraneErrCode::ERR_SYSTEM_ERR);
}
stdin_local_fd = fd;
}
opened_fd.insert(stdin_local_fd);
}

if (pipe(from_crun_pipe) == -1) {
CRANE_ERROR("[Task #{}] Failed to create pipe for task io forward: {}",
task_id, strerror(errno));
close(to_crun_pipe[0]);
close(to_crun_pipe[1]);
return std::unexpected(CraneErrCode::ERR_SYSTEM_ERR);
if (meta->fwd_stdout) {
if (pipe(from_crun_stdout_pipe) == -1) {
CRANE_ERROR("[Task #{}] Failed to create pipe for task io forward: {}",
task_id, strerror(errno));
for (int fd : opened_fd) {
close(fd);
}
return std::unexpected(CraneErrCode::ERR_SYSTEM_ERR);
}
opened_fd.insert(from_crun_stdout_pipe[0]);
opened_fd.insert(from_crun_stdout_pipe[1]);
} else {
if (meta->parsed_output_file_pattern.empty()) {
int null_fd = open_dev_null();
if (null_fd == -1) {
return std::unexpected(CraneErrCode::ERR_SYSTEM_ERR);
}
stdout_local_fd = null_fd;
} else {
int fd = open(meta->parsed_output_file_pattern.c_str(),
O_RDWR | O_CREAT | open_mode, 0644);
if (fd == -1) {
CRANE_ERROR("[Task #{}] Failed to open output file {}: {}", task_id,
meta->parsed_output_file_pattern, std::strerror(errno));
return std::unexpected(CraneErrCode::ERR_SYSTEM_ERR);
}
stdout_local_fd = fd;
}
opened_fd.insert(stdout_local_fd);
}

if (meta->fwd_stderr) {
if (pipe(from_crun_stderr_pipe) == -1) {
CRANE_ERROR("[Task #{}] Failed to create pipe for task io forward: {}",
task_id, strerror(errno));
for (int fd : opened_fd) {
close(fd);
}
return std::unexpected(CraneErrCode::ERR_SYSTEM_ERR);
}
opened_fd.insert(from_crun_stderr_pipe[0]);
opened_fd.insert(from_crun_stderr_pipe[1]);
} else {
if (meta->parsed_error_file_pattern.empty() &&
meta->parsed_output_file_pattern.empty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Close local stdin/stdout/stderr FDs in the parent for non-forwarded crun IO.

ForkCrunAndInitIOfd_() opens several FDs before fork():

  • stdin_local_fd when !meta->fwd_stdin
  • stdout_local_fd when !meta->fwd_stdout
  • stderr_local_fd when !meta->fwd_stderr

These are used only in the child (via SetupCrunFwdAtChild_() and dup2) and are never needed in the parent. Currently they are:

  • Inserted into opened_fd for error cleanup on pid == -1
  • Never closed in the successful parent path (only pipe ends are selectively closed in SetupCrunFwdAtParent_())

Over many crun tasks in a long-lived supervisor this can accumulate unnecessary open file descriptors in the supervisor process.

Consider explicitly closing these local-only descriptors in the parent once fork() succeeds, e.g.:

Proposed fix sketch
  pid_t pid = -1;
  if (m_parent_step_inst_->pty) {
    pid = forkpty(&crun_pty_fd, nullptr, nullptr, nullptr);
    ...
  } else {
    pid = fork();
    ...
  }

+ if (pid > 0 && !m_parent_step_inst_->pty) {
+   if (!meta->fwd_stdin && stdin_local_fd != -1) {
+     close(stdin_local_fd);
+   }
+   if (!meta->fwd_stdout && stdout_local_fd != -1) {
+     close(stdout_local_fd);
+   }
+   if (!meta->fwd_stderr && stderr_local_fd != -1) {
+     close(stderr_local_fd);
+   }
+ }

  if (pid == -1) {
    for (int fd : opened_fd) {
      close(fd);
    }
  }

This keeps the child behavior unchanged while preventing FD leaks in the supervisor.

Also applies to: 732-775

🤖 Prompt for AI Agents
In src/Craned/Supervisor/TaskManager.cpp around lines 612-710, the function
ForkCrunAndInitIOfd_() opens stdin_local_fd, stdout_local_fd and stderr_local_fd
for use only by the child but never closes them in the successful parent path,
leaking FDs; after fork() succeeds in the parent branch, explicitly close any of
stdin_local_fd, stdout_local_fd and stderr_local_fd that are >= 0 and were
created for non-forwarded IO (i.e., when meta->fwd_stdin/stdout/stderr are
false) and remove them from opened_fd so they are not left open, while leaving
all child behavior unchanged; apply the same fix to the analogous block around
lines 732-775.

Comment on lines 1605 to +1652
CraneErrCode ProcInstance::Prepare() {
auto err = ITaskInstance::Prepare();
if (err != CraneErrCode::SUCCESS) return err;

// Write m_meta_
if (m_parent_step_inst_->IsBatch()) {
// Prepare file output name for batch tasks.
/* Perform file name substitutions
* %j - Job ID
* %u - Username
* %x - Job name
*/
auto meta = std::make_unique<BatchInstanceMeta>();
meta->parsed_output_file_pattern = ParseFilePathPattern_(
GetParentStep().batch_meta().output_file_pattern(),
GetParentStep().cwd());

// If -e / --error is not defined, leave
// batch_meta.parsed_error_file_pattern empty;
if (!GetParentStep().batch_meta().error_file_pattern().empty()) {
meta->parsed_error_file_pattern = ParseFilePathPattern_(
GetParentStep().batch_meta().error_file_pattern(),
GetParentStep().cwd());
}

m_meta_ = std::move(meta);
} else {
m_meta_ = std::make_unique<CrunInstanceMeta>();
}
std::filesystem::path sh_path = m_parent_step_inst_->script_path.value();
m_meta_->parsed_sh_script_path = sh_path;
if (IsCrun()) {
auto* meta = GetCrunMeta_();
{
auto [path, fwd] = CrunParseFilePattern_(
GetParentStep().io_meta().output_file_pattern());
meta->fwd_stdout = fwd;
meta->parsed_output_file_pattern = path;
}
{
auto [path, fwd] =
CrunParseFilePattern_(GetParentStep().io_meta().input_file_pattern());
meta->fwd_stdin = fwd;
meta->parsed_output_file_pattern = path;
}
{
auto [path, fwd] =
CrunParseFilePattern_(GetParentStep().io_meta().error_file_pattern());
meta->fwd_stderr = fwd;
meta->parsed_error_file_pattern = path;
}

} else {
m_meta_->parsed_output_file_pattern =
ParseFilePathPattern_(GetParentStep().io_meta().output_file_pattern(),
GetParentStep().cwd(), true, nullptr);

// If -e / --error is not defined, leave
// parsed_error_file_pattern empty;
m_meta_->parsed_error_file_pattern =
ParseFilePathPattern_(GetParentStep().io_meta().error_file_pattern(),
GetParentStep().cwd(), false, nullptr);

m_meta_->parsed_input_file_pattern =
ParseFilePathPattern_(GetParentStep().io_meta().input_file_pattern(),
GetParentStep().cwd(), false, nullptr);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix crun stdin pattern handling: parsed input is written into the wrong field.

In ProcInstance::Prepare(), the crun branch initializes CrunInstanceMeta from io_meta:

{
  auto [path, fwd] =
      CrunParseFilePattern_(GetParentStep().io_meta().input_file_pattern());
  meta->fwd_stdin = fwd;
  meta->parsed_output_file_pattern = path;   // <- here
}

path for stdin is mistakenly stored in parsed_output_file_pattern instead of parsed_input_file_pattern. Later, ForkCrunAndInitIOfd_() decides whether to open a local stdin file based on meta->parsed_input_file_pattern.empty(), so non-forwarded stdin ends up being treated as “no input” and redirected to /dev/null.

This breaks per-task stdin redirection for crun steps.

Proposed fix
    {
      auto [path, fwd] =
           CrunParseFilePattern_(GetParentStep().io_meta().input_file_pattern());
      meta->fwd_stdin = fwd;
-     meta->parsed_output_file_pattern = path;
+     meta->parsed_input_file_pattern = path;
    }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants