Skip to content

Conversation

@zhansan114514
Copy link
Contributor

@zhansan114514 zhansan114514 commented Oct 4, 2025

Summary by CodeRabbit

  • New Features
    • Add suspend/resume controls for jobs via task modification attributes, new RPCs, and per-job/step results with failure reasons.
    • Introduce a "Suspended" task status to represent paused tasks.
    • Jobs can be paused mid-execution and later resumed; cluster controllers broadcast suspend/resume and persist state transitions.
    • Supervisors support asynchronous suspend/resume handling (non-blocking).
    • Add cgroup freezer support (freeze/thaw) to enable suspending/resuming containerized tasks.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 4, 2025

Walkthrough

Adds suspend/resume job support across the system: new protobuf messages/enums and RPCs (CraneCtld, Craned, Supervisor), supervisor async suspend/resume queue and task-level suspend/resume implementations, TaskScheduler orchestration for suspend/resume, Craned/CranedKeeper RPC clients, and multiple const-ref/formatting signature normalizations.

Changes

Cohort / File(s) Summary
Protobufs
protos/Crane.proto, protos/Supervisor.proto, protos/PublicDefs.proto
Add Suspend/Resume request/reply messages and RPCs; extend ModifyTaskRequest.TargetAttributes with Suspend/Resume; add TaskStatus.Suspended = 10.
CraneCtld — TaskScheduler
src/CraneCtld/TaskScheduler.h, src/CraneCtld/TaskScheduler.cpp
Add SuspendRunningTasks() and ResumeSuspendedTasks(): validate tasks, collect executing craned nodes, broadcast SuspendJobs/ResumeJobs, aggregate per-node results, handle partial failures (terminate on error), and persist status changes.
CraneCtld — Rpc client / CranedKeeper
src/CraneCtld/RpcService/CranedKeeper.h, src/CraneCtld/RpcService/CranedKeeper.cpp
Add CranedStub::SuspendJobs()/ResumeJobs(); normalize signatures to const-ref/pointer style; minor refactors and error-mapping for new RPCs.
Ctld gRPC server
src/CraneCtld/RpcService/CtldGrpcServer.cpp
Extend ModifyTask to handle Suspend and Resume attributes, invoking TaskScheduler suspend/resume and returning per-task modified/not_modified with reasons.
Craned service (server & impl)
src/Craned/Core/CranedServer.h, src/Craned/Core/CranedServer.cpp
Add SuspendJobs and ResumeJobs RPC handlers; adjust ServerContext/request/response pointer spacing and loop bindings; implement per-job/per-step processing for suspend/resume.
Supervisor stub / keeper
src/Craned/Core/SupervisorKeeper.h, src/Craned/Core/SupervisorKeeper.cpp
Add SupervisorStub::SuspendJob()/ResumeJob() wrappers calling Supervisor RPCs and mapping replies to CraneErrCode.
Supervisor server
src/Craned/Supervisor/SupervisorServer.h, src/Craned/Supervisor/SupervisorServer.cpp
Add SuspendJob and ResumeJob RPC handlers that call TaskManager->SuspendJobAsync()/ResumeJobAsync(), await result, and populate ok/reason fields.
Supervisor TaskManager & instances
src/Craned/Supervisor/TaskManager.h, src/Craned/Supervisor/TaskManager.cpp
Add ITaskInstance::Suspend/Resume; ProcInstance and ContainerInstance implementations (SIGSTOP/SIGCONT or Freeze/Thaw cgroup); TaskManager async API SuspendJobAsync()/ResumeJobAsync(), TaskSignalQueueElem, async handle and queue processing, and internal SuspendRunningTasks_/ResumeSuspendedTasks_.
Cgroup freezer & allocator changes
src/Craned/Common/CgroupManager.cpp, src/Craned/Common/CgroupManager.h
Add freezer (Freeze/Thaw) helpers and public FreezeCgroupByPath/ThawCgroupByPath; many const-ref/pointer-signature normalizations and allocator signature tweaks.
Craned RPC helpers & signature normalizations
src/Craned/Core/CranedServer.*, src/CraneCtld/RpcService/*, src/Craned/Core/*
Add SuspendJobs/ResumeJobs handlers and extensive spacing/const-ref/loop-binding style changes across RPC handlers and signatures.
Misc / formatting & small edits
.clang-format, CMakeLists.txt, src/Craned/Core/JobManager.h
Remove a TODO line; comment out a clang-format parameter; widespread formatting/spacing updates that tighten reference/pointer syntax.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant CraneCtld as CraneCtld\n(TaskScheduler)
    participant Craned as Craned\n(CranedService)
    participant Supervisor as Supervisor\n(TaskManager)
    participant Task as TaskInstance

    Note over Client,CraneCtld: Suspend flow
    Client->>CraneCtld: ModifyTask(task_id, Suspend)
    CraneCtld->>CraneCtld: SuspendRunningTasks()
    CraneCtld->>Craned: SuspendJobs RPC (broadcast)
    Craned->>Supervisor: SuspendJob RPC (per-job)
    Supervisor->>Supervisor: SuspendJobAsync() enqueue/process
    Supervisor->>Task: Suspend() (SIGSTOP / freeze cgroup)
    Task-->>Supervisor: result
    Supervisor-->>Craned: ok/reason
    Craned-->>CraneCtld: aggregated results
    CraneCtld->>CraneCtld: mark Suspended + persist / or terminate on failure
    CraneCtld-->>Client: response (modified/not_modified)

    Note over Client,CraneCtld: Resume flow
    Client->>CraneCtld: ModifyTask(task_id, Resume)
    CraneCtld->>CraneCtld: ResumeSuspendedTasks()
    CraneCtld->>Craned: ResumeJobs RPC (broadcast)
    Craned->>Supervisor: ResumeJob RPC (per-job)
    Supervisor->>Supervisor: ResumeJobAsync() enqueue/process
    Supervisor->>Task: Resume() (SIGCONT / thaw cgroup)
    Task-->>Supervisor: result
    Supervisor-->>Craned: ok/reason
    Craned-->>CraneCtld: aggregated results
    CraneCtld->>CraneCtld: mark Running + persist / or terminate on failure
    CraneCtld-->>Client: response (modified/not_modified)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas needing extra attention:

  • src/CraneCtld/TaskScheduler.cpp — concurrency, RPC broadcast aggregation, partial-failure handling, DB + in-memory state transitions.
  • src/Craned/Supervisor/TaskManager.{h,cpp} — async queue/promise lifetimes, event-loop integration, correct signal delivery for processes and containers.
  • src/Craned/Common/CgroupManager.{cpp,h} — freezer implementation (v1/v2 differences), file IO error handling, and API surface changes.
  • Cross-service protobuf/service compatibility and generated code consistency for new RPCs and enums.

Possibly related PRs

Suggested reviewers

  • L-Xiafeng
  • RileyWen
  • Nativu5

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 2.61% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title check ❓ Inconclusive The title 'feat: add suspend and resume' is vague and generic, using non-descriptive terms that don't convey meaningful information about the specific changeset. Provide a more specific title that clearly indicates the primary change, such as 'feat: add job suspend and resume functionality' or 'feat: implement SuspendJobs and ResumeJobs RPC endpoints'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 71c45bd and bc24c3c.

📒 Files selected for processing (18)
  • protos/Crane.proto (3 hunks)
  • protos/PublicDefs.proto (1 hunks)
  • protos/Supervisor.proto (2 hunks)
  • src/CraneCtld/RpcService/CranedKeeper.cpp (1 hunks)
  • src/CraneCtld/RpcService/CranedKeeper.h (1 hunks)
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp (1 hunks)
  • src/CraneCtld/TaskScheduler.cpp (1 hunks)
  • src/CraneCtld/TaskScheduler.h (1 hunks)
  • src/Craned/Core/CranedServer.cpp (2 hunks)
  • src/Craned/Core/CranedServer.h (1 hunks)
  • src/Craned/Core/JobManager.cpp (3 hunks)
  • src/Craned/Core/JobManager.h (2 hunks)
  • src/Craned/Core/SupervisorKeeper.cpp (2 hunks)
  • src/Craned/Core/SupervisorKeeper.h (1 hunks)
  • src/Craned/Supervisor/SupervisorServer.cpp (2 hunks)
  • src/Craned/Supervisor/SupervisorServer.h (1 hunks)
  • src/Craned/Supervisor/TaskManager.cpp (6 hunks)
  • src/Craned/Supervisor/TaskManager.h (5 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.

Applied to files:

  • src/Craned/Core/SupervisorKeeper.h
  • src/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
PR: PKUHPC/CraneSched#475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/Craned/Core/CranedServer.cpp
🧬 Code graph analysis (12)
src/Craned/Core/SupervisorKeeper.h (2)
src/Craned/Core/SupervisorKeeper.cpp (4)
  • SuspendTask (86-100)
  • SuspendTask (86-86)
  • ResumeTask (102-116)
  • ResumeTask (102-102)
src/Craned/Supervisor/SupervisorServer.cpp (4)
  • SuspendTask (89-103)
  • SuspendTask (89-92)
  • ResumeTask (105-119)
  • ResumeTask (105-108)
src/Craned/Core/JobManager.h (1)
src/Craned/Core/JobManager.cpp (4)
  • SuspendStep (989-1011)
  • SuspendStep (989-989)
  • ResumeStep (1013-1035)
  • ResumeStep (1013-1013)
src/Craned/Supervisor/SupervisorServer.h (2)
src/Craned/Core/SupervisorKeeper.cpp (4)
  • SuspendTask (86-100)
  • SuspendTask (86-86)
  • ResumeTask (102-116)
  • ResumeTask (102-102)
src/Craned/Supervisor/SupervisorServer.cpp (4)
  • SuspendTask (89-103)
  • SuspendTask (89-92)
  • ResumeTask (105-119)
  • ResumeTask (105-108)
src/CraneCtld/RpcService/CranedKeeper.h (2)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
  • SuspendSteps (241-268)
  • SuspendSteps (241-241)
  • ResumeSteps (270-297)
  • ResumeSteps (270-270)
src/Craned/Core/CranedServer.cpp (4)
  • SuspendSteps (88-113)
  • SuspendSteps (88-91)
  • ResumeSteps (115-140)
  • ResumeSteps (115-118)
src/Craned/Core/CranedServer.cpp (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
  • SuspendSteps (241-268)
  • SuspendSteps (241-241)
  • ResumeSteps (270-297)
  • ResumeSteps (270-270)
src/CraneCtld/TaskScheduler.h (1)
src/CraneCtld/TaskScheduler.cpp (4)
  • SuspendRunningTask (999-1062)
  • SuspendRunningTask (999-999)
  • ResumeSuspendedTask (1064-1123)
  • ResumeSuspendedTask (1064-1064)
src/Craned/Supervisor/SupervisorServer.cpp (1)
src/Craned/Core/SupervisorKeeper.cpp (4)
  • SuspendTask (86-100)
  • SuspendTask (86-86)
  • ResumeTask (102-116)
  • ResumeTask (102-102)
src/Craned/Core/SupervisorKeeper.cpp (1)
src/Craned/Supervisor/SupervisorServer.cpp (4)
  • SuspendTask (89-103)
  • SuspendTask (89-92)
  • ResumeTask (105-119)
  • ResumeTask (105-108)
src/Craned/Core/CranedServer.h (2)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
  • SuspendSteps (241-268)
  • SuspendSteps (241-241)
  • ResumeSteps (270-297)
  • ResumeSteps (270-270)
src/Craned/Core/CranedServer.cpp (4)
  • SuspendSteps (88-113)
  • SuspendSteps (88-91)
  • ResumeSteps (115-140)
  • ResumeSteps (115-118)
src/CraneCtld/TaskScheduler.cpp (1)
test/CraneCtld/EmbeddedDbClientTest.cpp (1)
  • g_embedded_db_client (42-45)
src/CraneCtld/RpcService/CranedKeeper.cpp (2)
src/Craned/Core/CranedServer.cpp (4)
  • SuspendSteps (88-113)
  • SuspendSteps (88-91)
  • ResumeSteps (115-140)
  • ResumeSteps (115-118)
src/CraneCtld/RpcService/CranedKeeper.h (1)
  • UpdateLastActiveTime (103-106)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (10)
  • SuspendTaskAsync (1694-1701)
  • SuspendTaskAsync (1694-1694)
  • ResumeTaskAsync (1703-1710)
  • ResumeTaskAsync (1703-1703)
  • EvCleanTaskSignalQueueCb_ (1951-1978)
  • EvCleanTaskSignalQueueCb_ (1951-1951)
  • SuspendRunningTasks_ (1980-1995)
  • SuspendRunningTasks_ (1980-1980)
  • ResumeSuspendedTasks_ (1997-2012)
  • ResumeSuspendedTasks_ (1997-1997)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build
  • GitHub Check: format
🔇 Additional comments (12)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (2)

588-606: LGTM!

The suspend branch correctly calls SuspendRunningTask and provides appropriate error messages for different failure scenarios.


607-625: LGTM!

The resume branch correctly calls ResumeSuspendedTask and provides clear error messages distinguishing between non-existent tasks and tasks that are not currently suspended.

src/Craned/Core/JobManager.h (2)

40-40: LGTM!

The suspended flag addition to StepInstance appropriately tracks the suspended state for each step, and is correctly default-initialized to false.


107-109: LGTM!

The method declarations are well-positioned in the public API and follow the established pattern for task lifecycle management operations.

protos/PublicDefs.proto (1)

117-117: LGTM!

The Suspended status addition is correctly positioned in the TaskStatus enum and aligns with the new suspend/resume lifecycle semantics implemented across the PR.

src/Craned/Core/SupervisorKeeper.h (1)

38-39: LGTM!

The method declarations are appropriately positioned in the SupervisorStub class and follow the established pattern for supervisor RPC wrappers.

src/CraneCtld/TaskScheduler.h (1)

488-489: LGTM!

The method declarations extend the public API surface appropriately and align with the implementations in TaskScheduler.cpp for suspending and resuming running tasks.

src/Craned/Core/SupervisorKeeper.cpp (3)

22-22: LGTM!

The include addition is necessary for the CraneErrStr function used in the error mapping logic of the new RPC wrappers.


86-100: LGTM!

The SuspendTask implementation correctly follows the established RPC wrapper pattern with appropriate error handling and logging.


102-116: LGTM!

The ResumeTask implementation mirrors SuspendTask and correctly handles RPC invocation and error mapping.

src/CraneCtld/RpcService/CranedKeeper.h (1)

89-91: LGTM!

The method declarations appropriately extend the CranedStub public API for suspend/resume operations, following the established pattern for multi-step RPC operations.

src/Craned/Core/CranedServer.h (1)

70-76: LGTM!

The RPC handler declarations correctly extend the CranedServiceImpl interface for suspend/resume operations, following the established gRPC service pattern.

@github-actions github-actions bot added the test-passed Build and test success label Oct 4, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/Craned/Supervisor/TaskManager.cpp (1)

1908-1925: Preserve suspended state through termination flow.

Clearing m_step_.suspended at the top of termination makes concurrent resume requests succeed while the task is still SIGSTOP’ed. Wait until the termination completes (or the status change arrives) before resetting it, otherwise higher layers believe the task is resumable and hit invalid transitions.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bc24c3c and 224d2c7.

📒 Files selected for processing (3)
  • src/CraneCtld/TaskScheduler.cpp (1 hunks)
  • src/Craned/Core/JobManager.cpp (3 hunks)
  • src/Craned/Supervisor/TaskManager.cpp (7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build
  • GitHub Check: format

@github-actions github-actions bot removed the test-passed Build and test success label Oct 5, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
src/Craned/Core/JobManager.cpp (2)

975-977: Don’t clear suspension before termination succeeds

This is the same issue previously flagged: you zero step_inst->suspended for every orphaned step before we know whether TerminateTask will succeed. If the RPC fails we leave the supervisor running but record suspended = false, so a controller retry mis-evaluates state. Only flip the flag after termination completes (or in the status-change callback).


1004-1014: Check supervisor availability before validating state

We’re still dereferencing step_inst and returning ERR_INVALID_PARAM before verifying that the supervisor stub exists. If the supervisor already disappeared we must return ERR_RPC_FAILURE; otherwise upper layers mis-handle the retry. Acquire the stub first, fail with ERR_RPC_FAILURE when it’s missing, then validate step_inst->suspended.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 224d2c7 and 60b6694.

📒 Files selected for processing (3)
  • src/CraneCtld/TaskScheduler.cpp (2 hunks)
  • src/Craned/Core/JobManager.cpp (3 hunks)
  • src/Craned/Supervisor/TaskManager.cpp (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/CraneCtld/TaskScheduler.cpp (1)
test/CraneCtld/EmbeddedDbClientTest.cpp (1)
  • g_embedded_db_client (42-45)
🔇 Additional comments (1)
src/Craned/Supervisor/TaskManager.cpp (1)

1022-1033: Nice improvement

Great to see ContainerInstance::Kill short-circuit SIGSTOP/SIGCONT. This keeps suspend/resume from tearing containers down, matching the earlier feedback.

string reason = 2;
}

message SuspendTaskRequest {}
Copy link
Collaborator

@L-Xiafeng L-Xiafeng Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Task->Job

return CraneErrCode::SUCCESS;
}

CraneErrCode CranedStub::SuspendSteps(const std::vector<task_id_t> &task_ids) {
Copy link
Collaborator

@L-Xiafeng L-Xiafeng Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vector<job_id_t>

return std::move(future);
}

CraneErrCode TaskScheduler::SuspendRunningTask(task_id_t task_id) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不支持批量吗?

std::vector<CranedId> suspended_nodes;
CraneErrCode failure_code = CraneErrCode::SUCCESS;
bool has_failure = false;
for (const auto& craned_id : executing_nodes) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

先查下任务所有节点是不是Up

@L-Xiafeng
Copy link
Collaborator

L-Xiafeng commented Oct 9, 2025

文档需要在仓库doc下更新

struct StepInstance {
StepToD step_to_d;
pid_t supv_pid;
bool suspended{false};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

直接用TaskStatus

bool pty;
bool x11;
bool x11_fwd;
bool suspended{false};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里也用TaskStatus

@L-Xiafeng
Copy link
Collaborator

这两个操作应该都只支持对Job进行操作

return CraneErrCode::ERR_GENERIC_FAILURE;
}

CraneErrCode CranedStub::ResumeSteps(const std::vector<task_id_t> &task_ids) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

job_id_t

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
src/CraneCtld/TaskScheduler.cpp (3)

1055-1072: Track rollback success and update Ctld state consistently after rollback.

After rolling back the suspended nodes (by calling ResumeSteps), the code returns the original failure_code without verifying if the rollback succeeded. If rollback fails partially, the controller's view of task state diverges from the actual craned state. After successful rollback, the task should explicitly remain in Running state, but if rollback fails, the state is now inconsistent and should be logged or handled differently.

Based on learnings from past review comments (lines 1056-1073, 1146-1164).

Track rollback outcome:

   if (has_failure) {
+    bool rollback_failed = false;
     for (const auto& craned_id : suspended_nodes) {
       auto stub = g_craned_keeper->GetCranedStub(craned_id);
       if (!stub || stub->Invalid()) {
         CRANE_ERROR(
             "Failed to rollback suspended state on craned {}: stub unavailable",
             craned_id);
+        rollback_failed = true;
         continue;
       }
 
       CraneErrCode err = stub->ResumeSteps(job_ids);
       if (err != CraneErrCode::SUCCESS) {
         CRANE_ERROR("Rollback ResumeSteps for task #{} on craned {} failed: {}",
                     task_id, craned_id, CraneErrStr(err));
+        rollback_failed = true;
       }
     }
 
+    if (rollback_failed) {
+      CRANE_ERROR(
+          "Task #{} is in inconsistent state after failed rollback. "
+          "Manual intervention may be required.", task_id);
+    }
     return failure_code;
   }

1033-1053: Use thread pool for parallel RPC calls instead of sequential iteration.

The current implementation calls SuspendSteps sequentially on each executing node, which can be slow when a task spans many nodes. For better performance, dispatch these RPCs in parallel using g_thread_pool, similar to how CreateCgroupForTasks is handled (lines 643-696).

Based on learnings from past review comment (line 1050).

Refactor to use parallel execution:

   std::vector<task_id_t> job_ids{task_id};
   std::vector<CranedId> suspended_nodes;
+  Mutex mtx;
   CraneErrCode failure_code = CraneErrCode::SUCCESS;
   bool has_failure = false;
+  absl::BlockingCounter counter(executing_nodes.size());
+
   for (const auto& craned_id : executing_nodes) {
-    auto stub = g_craned_keeper->GetCranedStub(craned_id);
-    if (!stub || stub->Invalid()) {
-      CRANE_WARN("SuspendSteps stub for {} unavailable", craned_id);
-      if (failure_code == CraneErrCode::SUCCESS)
-        failure_code = CraneErrCode::ERR_RPC_FAILURE;
-      has_failure = true;
-      continue;
-    }
-
-    CraneErrCode err = stub->SuspendSteps(job_ids);
-    if (err != CraneErrCode::SUCCESS) {
-      CRANE_ERROR("Failed to suspend task #{} on craned {}: {}", task_id,
-                  craned_id, CraneErrStr(err));
-      if (failure_code == CraneErrCode::SUCCESS) failure_code = err;
-      has_failure = true;
-      continue;
-    }
-
-    suspended_nodes.push_back(craned_id);
+    g_thread_pool->detach_task([&, craned_id]() {
+      auto stub = g_craned_keeper->GetCranedStub(craned_id);
+      if (!stub || stub->Invalid()) {
+        CRANE_WARN("SuspendSteps stub for {} unavailable", craned_id);
+        LockGuard guard(&mtx);
+        if (failure_code == CraneErrCode::SUCCESS)
+          failure_code = CraneErrCode::ERR_RPC_FAILURE;
+        has_failure = true;
+        counter.DecrementCount();
+        return;
+      }
+
+      CraneErrCode err = stub->SuspendSteps(job_ids);
+      LockGuard guard(&mtx);
+      if (err != CraneErrCode::SUCCESS) {
+        CRANE_ERROR("Failed to suspend task #{} on craned {}: {}", task_id,
+                    craned_id, CraneErrStr(err));
+        if (failure_code == CraneErrCode::SUCCESS) failure_code = err;
+        has_failure = true;
+      } else {
+        suspended_nodes.push_back(craned_id);
+      }
+      counter.DecrementCount();
+    });
   }
+  counter.Wait();

1124-1144: Use thread pool for parallel RPC calls in ResumeSuspendedTask.

Same performance issue as SuspendRunningTask: sequential ResumeSteps calls should be parallelized using g_thread_pool for better performance when a task spans many nodes.

Apply the same parallel execution pattern shown in the previous comment:

   std::vector<task_id_t> job_ids{task_id};
   std::vector<CranedId> resumed_nodes;
+  Mutex mtx;
   CraneErrCode failure_code = CraneErrCode::SUCCESS;
   bool has_failure = false;
+  absl::BlockingCounter counter(executing_nodes.size());
+
   for (const auto& craned_id : executing_nodes) {
-    auto stub = g_craned_keeper->GetCranedStub(craned_id);
-    if (!stub || stub->Invalid()) {
-      CRANE_WARN("ResumeSteps stub for {} unavailable", craned_id);
-      if (failure_code == CraneErrCode::SUCCESS)
-        failure_code = CraneErrCode::ERR_RPC_FAILURE;
-      has_failure = true;
-      continue;
-    }
-
-    CraneErrCode err = stub->ResumeSteps(job_ids);
-    if (err != CraneErrCode::SUCCESS) {
-      CRANE_ERROR("Failed to resume task #{} on craned {}: {}", task_id,
-                  craned_id, CraneErrStr(err));
-      if (failure_code == CraneErrCode::SUCCESS) failure_code = err;
-      has_failure = true;
-      continue;
-    }
-
-    resumed_nodes.push_back(craned_id);
+    g_thread_pool->detach_task([&, craned_id]() {
+      auto stub = g_craned_keeper->GetCranedStub(craned_id);
+      if (!stub || stub->Invalid()) {
+        CRANE_WARN("ResumeSteps stub for {} unavailable", craned_id);
+        LockGuard guard(&mtx);
+        if (failure_code == CraneErrCode::SUCCESS)
+          failure_code = CraneErrCode::ERR_RPC_FAILURE;
+        has_failure = true;
+        counter.DecrementCount();
+        return;
+      }
+
+      CraneErrCode err = stub->ResumeSteps(job_ids);
+      LockGuard guard(&mtx);
+      if (err != CraneErrCode::SUCCESS) {
+        CRANE_ERROR("Failed to resume task #{} on craned {}: {}", task_id,
+                    craned_id, CraneErrStr(err));
+        if (failure_code == CraneErrCode::SUCCESS) failure_code = err;
+        has_failure = true;
+      } else {
+        resumed_nodes.push_back(craned_id);
+      }
+      counter.DecrementCount();
+    });
   }
+  counter.Wait();
🧹 Nitpick comments (2)
src/CraneCtld/TaskScheduler.cpp (2)

1029-1032: Validate that all executing nodes are UP before attempting suspend.

Before sending SuspendSteps RPCs, check that all executing nodes are online using g_meta_container->CheckCranedOnline(). This prevents wasting time attempting RPCs to down nodes and provides clearer error messages.

Based on learnings from past review comment (line 1033).

Add node status validation:

   std::vector<task_id_t> job_ids{task_id};
+  // Validate all nodes are online before attempting suspend
+  for (const auto& craned_id : executing_nodes) {
+    if (!g_meta_container->CheckCranedOnline(craned_id)) {
+      CRANE_ERROR("Task #{} cannot be suspended: craned {} is offline", 
+                  task_id, craned_id);
+      return CraneErrCode::ERR_CRANED_DOWN;
+    }
+  }
+
   std::vector<CranedId> suspended_nodes;

999-999: Consider adding batch suspend/resume support.

The current implementation only handles a single task at a time. Based on learnings from the comment on line 999 ("不支持批量吗?" - No batch support?), consider whether batch operations would be useful for administrative workflows (e.g., suspending all tasks in a partition or for a user).

Do you want batch suspend/resume functionality (operating on multiple task IDs in a single call)? If so, I can help design and generate the batch operation methods.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 60b6694 and f87a16b.

📒 Files selected for processing (1)
  • src/CraneCtld/TaskScheduler.cpp (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/CraneCtld/TaskScheduler.cpp (1)
test/CraneCtld/EmbeddedDbClientTest.cpp (1)
  • g_embedded_db_client (42-45)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: format

@github-actions github-actions bot added the test-passed Build and test success label Oct 9, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (5)
src/Craned/Supervisor/SupervisorServer.h (1)

63-71: Same naming inconsistency: SuspendJob vs. ResumeTask.

The declarations mirror the inconsistency noted in the implementation: SuspendJob at line 63 vs. ResumeTask at line 68. These should use consistent terminology.

src/Craned/Core/JobManager.cpp (3)

860-863: Critical: Update only the specific step identified by status_change.

This loop updates the status for every StepInstance in job_ptr->step_map, but status_change should identify a single step. When step A reports a status change, this incorrectly overwrites the status for steps B, C, etc., causing state corruption across unrelated steps.

Use status_change.step_id to locate and update only the target step:

-    for (auto& [step_id, step_inst] : job_ptr->step_map) {
-      if (!step_inst) continue;
-      step_inst->status = status_change.new_status;
-    }
+    auto step_iter = job_ptr->step_map.find(status_change.step_id);
+    if (step_iter == job_ptr->step_map.end()) {
+      // Fallback: scan for matching step_id if not found by key
+      for (auto& [key, step_inst] : job_ptr->step_map) {
+        if (step_inst && step_inst->step_to_d.task_id() == status_change.step_id) {
+          step_iter = job_ptr->step_map.find(key);
+          break;
+        }
+      }
+    }
+    if (step_iter != job_ptr->step_map.end() && step_iter->second) {
+      step_iter->second->status = status_change.new_status;
+    }

968-970: Major: Don't reset step status before termination completes.

Setting all steps to Running before termination occurs clears any Suspended state. If termination later fails (e.g., stub unavailable), the supervisor keeps running but we've lost the suspended flag, so subsequent suspend/resume operations will mis-evaluate state.

Remove this blanket status reset and let the supervisor's termination-confirmation event update the status:

     auto* instance = job_instance.get();
     instance->orphaned = elem.mark_as_orphaned;
-    for (auto& [step_id, step_inst] : instance->step_map) {
-      if (step_inst) step_inst->status = crane::grpc::TaskStatus::Running;
-    }

990-1013: Major: Check supervisor stub availability before validating state.

Checking step_inst->status before verifying the stub exists returns ERR_INVALID_PARAM when the real issue is a missing supervisor (should be ERR_RPC_FAILURE). This masks the actual failure mode and prevents callers from retrying on another node.

Swap the order: acquire stub first, validate state second:

   auto* step_inst = job_instance->step_map.begin()->second.get();
-  if (step_inst->status == crane::grpc::TaskStatus::Suspended) {
-    CRANE_TRACE("Task #{} already suspended", step_id);
-    return CraneErrCode::ERR_INVALID_PARAM;
-  }
-
   auto stub = g_supervisor_keeper->GetStub(step_id);
   if (!stub) {
     CRANE_ERROR("Supervisor for task #{} not found when suspending", step_id);
     return CraneErrCode::ERR_RPC_FAILURE;
   }
+
+  if (step_inst->status == crane::grpc::TaskStatus::Suspended) {
+    CRANE_TRACE("Task #{} already suspended", step_id);
+    return CraneErrCode::ERR_INVALID_PARAM;
+  }

Apply the same fix to ResumeStep (lines 1015-1038).

src/Craned/Supervisor/TaskManager.cpp (1)

1992-2024: Critical: Continue signaling all tasks even after errors.

Both helpers return immediately on the first Kill() error (lines 2002, 2019). If task A's signal succeeds but task B's fails, you've left task A suspended/resumed while reporting failure and not updating m_step_.status. Subsequent suspend/resume calls then see inconsistent state (status says Running but processes are stopped, or vice versa).

Continue sending signals to all tasks, track the first error, and only flip status when all signals succeed:

 CraneErrCode TaskManager::SuspendRunningTasks_() {
   if (m_step_.AllTaskFinished()) return CraneErrCode::ERR_NON_EXISTENT;
 
   bool any_signal_sent = false;
+  CraneErrCode first_error = CraneErrCode::SUCCESS;
   for (const auto& [pid, task_id] : m_pid_task_id_map_) {
     auto task = m_step_.GetTaskInstance(task_id);
     if (!task) continue;
     if (task->GetPid() == 0) continue;
     any_signal_sent = true;
     CraneErrCode err = task->Kill(SIGSTOP);
-    if (err != CraneErrCode::SUCCESS) return err;
+    if (err != CraneErrCode::SUCCESS && first_error == CraneErrCode::SUCCESS)
+      first_error = err;
   }
 
   if (!any_signal_sent) return CraneErrCode::ERR_INVALID_PARAM;
-  return CraneErrCode::SUCCESS;
+  return first_error;
 }

Apply the same fix to ResumeSuspendedTasks_() (lines 2009-2024).

🧹 Nitpick comments (2)
src/CraneCtld/TaskScheduler.cpp (2)

1095-1116: Consider tracking rollback success for better operational visibility.

The rollback loop logs individual node failures but doesn't track overall rollback success. If rollback partially fails, operators lack visibility into the inconsistent controller/craned state.

Apply this diff to track rollback outcomes:

   if (has_failure.load(std::memory_order_relaxed)) {
+    bool rollback_failed = false;
     for (const auto& craned_id : suspended_nodes) {
       auto stub = g_craned_keeper->GetCranedStub(craned_id);
       if (!stub || stub->Invalid()) {
         CRANE_ERROR(
             "Failed to rollback suspended state on craned {}: stub "
             "unavailable",
             craned_id);
+        rollback_failed = true;
         continue;
       }
 
       CraneErrCode err = stub->ResumeSteps(job_ids);
       if (err != CraneErrCode::SUCCESS) {
         CRANE_ERROR(
             "Rollback ResumeSteps for task #{} on craned {} failed: {}",
             task_id, craned_id, CraneErrStr(err));
+        rollback_failed = true;
       }
     }
 
+    if (rollback_failed) {
+      CRANE_ERROR(
+          "Task #{} is in inconsistent state after failed rollback. "
+          "Manual intervention may be required.", task_id);
+    }
     results.emplace_back(failure_code);
     continue;
   }

Based on learnings from past review comments.


1234-1254: Consider tracking rollback success for better operational visibility.

The rollback loop logs individual node failures but doesn't track overall rollback success. If rollback partially fails, operators lack visibility into the inconsistent controller/craned state.

Apply this diff to track rollback outcomes:

   if (has_failure.load(std::memory_order_relaxed)) {
+    bool rollback_failed = false;
     for (const auto& craned_id : resumed_nodes) {
       auto stub = g_craned_keeper->GetCranedStub(craned_id);
       if (!stub || stub->Invalid()) {
         CRANE_ERROR(
             "Failed to rollback resumed state on craned {}: stub unavailable",
             craned_id);
+        rollback_failed = true;
         continue;
       }
 
       CraneErrCode err = stub->SuspendSteps(job_ids);
       if (err != CraneErrCode::SUCCESS) {
         CRANE_ERROR(
             "Rollback SuspendSteps for task #{} on craned {} failed: {}",
             task_id, craned_id, CraneErrStr(err));
+        rollback_failed = true;
       }
     }
 
+    if (rollback_failed) {
+      CRANE_ERROR(
+          "Task #{} is in inconsistent state after failed rollback. "
+          "Manual intervention may be required.", task_id);
+    }
     results.emplace_back(failure_code);
     continue;
   }

Based on learnings from past review comments.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f87a16b and fb9eb76.

📒 Files selected for processing (14)
  • protos/Supervisor.proto (2 hunks)
  • src/CraneCtld/RpcService/CranedKeeper.cpp (1 hunks)
  • src/CraneCtld/RpcService/CranedKeeper.h (1 hunks)
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp (1 hunks)
  • src/CraneCtld/TaskScheduler.cpp (1 hunks)
  • src/CraneCtld/TaskScheduler.h (1 hunks)
  • src/Craned/Core/JobManager.cpp (3 hunks)
  • src/Craned/Core/JobManager.h (2 hunks)
  • src/Craned/Core/SupervisorKeeper.cpp (2 hunks)
  • src/Craned/Core/SupervisorKeeper.h (1 hunks)
  • src/Craned/Supervisor/SupervisorServer.cpp (2 hunks)
  • src/Craned/Supervisor/SupervisorServer.h (1 hunks)
  • src/Craned/Supervisor/TaskManager.cpp (7 hunks)
  • src/Craned/Supervisor/TaskManager.h (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/Craned/Core/JobManager.h
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.

Applied to files:

  • src/Craned/Core/SupervisorKeeper.h
  • src/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
PR: PKUHPC/CraneSched#475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/Craned/Supervisor/SupervisorServer.cpp
🧬 Code graph analysis (8)
src/Craned/Supervisor/SupervisorServer.h (2)
src/Craned/Core/SupervisorKeeper.cpp (2)
  • SuspendJob (86-100)
  • SuspendJob (86-86)
src/Craned/Supervisor/SupervisorServer.cpp (4)
  • SuspendJob (89-103)
  • SuspendJob (89-92)
  • ResumeTask (105-119)
  • ResumeTask (105-108)
src/CraneCtld/RpcService/CranedKeeper.h (2)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
  • SuspendSteps (241-268)
  • SuspendSteps (241-241)
  • ResumeSteps (270-297)
  • ResumeSteps (270-270)
src/Craned/Core/CranedServer.cpp (4)
  • SuspendSteps (88-113)
  • SuspendSteps (88-91)
  • ResumeSteps (115-140)
  • ResumeSteps (115-118)
src/CraneCtld/RpcService/CranedKeeper.cpp (2)
src/Craned/Core/CranedServer.cpp (4)
  • SuspendSteps (88-113)
  • SuspendSteps (88-91)
  • ResumeSteps (115-140)
  • ResumeSteps (115-118)
src/CraneCtld/RpcService/CranedKeeper.h (1)
  • UpdateLastActiveTime (103-106)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (10)
  • SuspendJobAsync (1704-1711)
  • SuspendJobAsync (1704-1704)
  • ResumeJobAsync (1713-1720)
  • ResumeJobAsync (1713-1713)
  • EvCleanTaskSignalQueueCb_ (1961-1990)
  • EvCleanTaskSignalQueueCb_ (1961-1961)
  • SuspendRunningTasks_ (1992-2007)
  • SuspendRunningTasks_ (1992-1992)
  • ResumeSuspendedTasks_ (2009-2024)
  • ResumeSuspendedTasks_ (2009-2009)
src/CraneCtld/TaskScheduler.h (1)
src/CraneCtld/TaskScheduler.cpp (4)
  • SuspendRunningTasks (999-1142)
  • SuspendRunningTasks (999-1000)
  • ResumeSuspendedTasks (1144-1279)
  • ResumeSuspendedTasks (1144-1145)
src/Craned/Core/SupervisorKeeper.h (2)
src/Craned/Core/SupervisorKeeper.cpp (4)
  • SuspendJob (86-100)
  • SuspendJob (86-86)
  • ResumeJob (102-116)
  • ResumeJob (102-102)
src/Craned/Supervisor/SupervisorServer.cpp (2)
  • SuspendJob (89-103)
  • SuspendJob (89-92)
src/Craned/Core/SupervisorKeeper.cpp (1)
src/Craned/Supervisor/SupervisorServer.cpp (2)
  • SuspendJob (89-103)
  • SuspendJob (89-92)
src/Craned/Supervisor/SupervisorServer.cpp (1)
src/Craned/Core/SupervisorKeeper.cpp (2)
  • SuspendJob (86-100)
  • SuspendJob (86-86)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (10)
src/CraneCtld/RpcService/CranedKeeper.cpp (1)

241-297: LGTM: Implementation follows established RPC patterns.

Both SuspendSteps and ResumeSteps correctly follow the existing RPC invocation patterns in this file:

  • Proper deadline configuration
  • Consistent error handling via HandleGrpcErrorCode_
  • Appropriate logging for both success and failure cases
  • Correct use of UpdateLastActiveTime()
src/Craned/Core/SupervisorKeeper.h (1)

38-39: LGTM: Method names align with Job-level operations.

The naming SuspendJob and ResumeJob correctly indicates these methods operate on Job objects, consistent with the PR comment. Since the Supervisor manages a single task per instance (as per learnings), not requiring task parameters is appropriate.

Based on learnings.

src/Craned/Core/SupervisorKeeper.cpp (1)

86-116: LGTM: RPC wrappers follow established patterns.

Both SuspendJob and ResumeJob:

  • Follow the existing RPC invocation pattern
  • Include appropriate error logging
  • Map specific error reasons to error codes (ERR_INVALID_PARAM, ERR_NON_EXISTENT)
  • Return generic failure for unexpected errors

The implementation is consistent with other RPC wrappers in this file.

src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)

588-633: LGTM: Comprehensive error handling for Suspend/Resume operations.

The implementation correctly:

  • Collects task IDs and invokes batch TaskScheduler operations
  • Provides distinct error messages for different failure scenarios:
    • ERR_NON_EXISTENT: "Task not found in running queue"
    • ERR_INVALID_PARAM: "Task is not running/suspended"
    • Generic failures: "Failed to suspend/resume task"
  • Follows the established pattern for other ModifyTask attributes

The error handling clarity will help users diagnose issues quickly.

protos/Supervisor.proto (1)

122-147: Verify proto consistency with implementation.

The proto defines SuspendJob and ResumeTask RPCs. Ensure these names align with:

  1. The method implementations in SupervisorServer.cpp (which currently mirror this inconsistency)
  2. The intended semantics per PR comments (Job-level operations)

Consider standardizing to either "Job" or "Task" terminology throughout.

src/CraneCtld/TaskScheduler.h (1)

488-491: LGTM! Method declarations are consistent with implementation.

The suspend/resume method signatures correctly match their implementations in TaskScheduler.cpp, both returning std::vector<CraneErrCode> to provide per-task error reporting.

src/Craned/Supervisor/TaskManager.h (1)

56-56: LGTM! Suspend/resume infrastructure follows existing patterns.

The additions are well-integrated:

  • The status field (line 56) replaces a boolean flag with a more expressive enum
  • Public async API (lines 359-361) matches existing async method conventions
  • Private signal queue infrastructure (lines 384-387, 421-422) mirrors other concurrent queue handlers in the class

Also applies to: 359-361, 384-387, 397-397, 402-403, 421-422

src/Craned/Supervisor/TaskManager.cpp (3)

25-25: LGTM! Container suspend handling and infrastructure setup.

The changes correctly implement non-destructive suspend/resume for containers:

  • Line 25: signal.h import needed for SIGSTOP/SIGCONT constants
  • Lines 1024-1032: Short-circuits container kill/delete workflow for suspend/resume signals, sending only the raw signal to the process group
  • Lines 1541-1545: Async handle setup follows the existing pattern for other queue handlers
  • Line 1607: Status field assignment replaces the deprecated boolean flag

Also applies to: 1024-1032, 1541-1545, 1607-1607


1704-1720: LGTM! Async suspend/resume API.

The public async methods correctly enqueue signal elements and return futures for the results, following the established pattern used by other async operations in this class.


1961-1990: LGTM! Signal queue handler with proper state guards.

The queue callback correctly validates current status before attempting suspend/resume operations and only updates m_step_.status when the operation succeeds.

Comment on lines 89 to 91
CraneErrCode SuspendSteps(const std::vector<job_id_t> &job_ids);

CraneErrCode ResumeSteps(const std::vector<job_id_t> &job_ids);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

Clarify naming: method names suggest "Steps" but parameters suggest "Jobs".

The method names SuspendSteps and ResumeSteps suggest they operate on steps, but the parameter type std::vector<job_id_t> suggests they operate on jobs. Based on the PR comment stating "The two operations (suspend and resume) should only support operations on Job objects," consider renaming these methods to SuspendJobs and ResumeJobs for consistency.

Additionally, note the inconsistency with TerminateSteps at line 93 which uses std::vector<task_id_t>. If job_id_t and task_id_t are distinct types, clarify the intended semantics.


🏁 Script executed:

#!/bin/bash
# Check the type definitions and usage patterns for job_id_t, task_id_t, and step_id_t
rg -n "using (job_id_t|task_id_t|step_id_t)" --type=cpp --type=h

Length of output: 259


🏁 Script executed:

#!/bin/bash
rg -nP '(SuspendSteps|ResumeSteps|TerminateSteps)' -C3 src/CraneCtld/RpcService/CranedKeeper.h

Length of output: 428


Rename methods to match parameter types

  • SuspendSteps/ResumeSteps take vector<job_id_t> → rename to SuspendJobs/ResumeJobs.
  • TerminateSteps takes vector<task_id_t> → rename to TerminateTasks (and likewise TerminateOrphanedStepsTerminateOrphanedTasks).

auto iter = m_running_task_map_.find(task_id);
if (iter == m_running_task_map_.end()) {
CRANE_TRACE("Task #{} not in Rn queue for suspend", task_id);
results.emplace_back(CraneErrCode::ERR_NON_EXISTENT);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ERR_INVALID_PARAM

offline_nodes.reserve(executing_nodes.size());
for (const auto& craned_id : executing_nodes) {
if (!g_meta_container->CheckCranedOnline(craned_id))
offline_nodes.push_back(craned_id);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里可以直接结束,没必要继续查

// Ensure suspension only proceeds when every craned involved is online.
CRANE_WARN("Task #{} suspend skipped because craned(s) {} are offline.",
task_id, absl::StrJoin(offline_nodes, ","));
results.emplace_back(CraneErrCode::ERR_NO_AVAIL_NODE);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ERR_NO_AVAIL_NODE不是这个意思

continue;
}

std::vector<job_id_t> job_ids{static_cast<job_id_t>(task_id)};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

应该不用cast?这两个类型都是using uint32


CraneErrCode err = stub->SuspendSteps(job_ids);
if (err != CraneErrCode::SUCCESS) {
CRANE_ERROR("Failed to suspend task #{} on craned {}: {}", task_id,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如果部分节点Suspend了,这里Resume了一部分但是现在一部分是Running,一部分节点Down Suspended,节点再上线的时候怎么恢复?

break;
}
result = SuspendRunningTasks_();
if (result == CraneErrCode::SUCCESS)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里加个检查吧, m_step_.status首先应该是Running

}
result = ResumeSuspendedTasks_();
if (result == CraneErrCode::SUCCESS)
m_step_.status = crane::grpc::TaskStatus::Running;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里也加检查

if (!task) continue;
if (task->GetPid() == 0) continue;
any_signal_sent = true;
CraneErrCode err = task->Kill(SIGSTOP);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里改一下,给TaskInstance加Suspend/Resume的函数,要支持容器

@github-actions github-actions bot removed the test-passed Build and test success label Oct 13, 2025
@L-Xiafeng L-Xiafeng added the enhancement New feature or request label Oct 13, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
src/CraneCtld/TaskScheduler.cpp (1)

1238-1254: Track rollback outcome and log inconsistent state.

The rollback loop issues SuspendSteps to revert resumed nodes but doesn't track whether those calls succeed. If rollback fails, controller and craned states diverge—some nodes remain running while Ctld thinks the operation failed. This complicates recovery and diagnostics.

Based on learnings from past review comments.

Apply this diff to track rollback failures:

     if (has_failure.load(std::memory_order_relaxed)) {
+      bool rollback_failed = false;
       for (const auto& craned_id : resumed_nodes) {
         auto stub = g_craned_keeper->GetCranedStub(craned_id);
         if (!stub || stub->Invalid()) {
           CRANE_ERROR(
               "Failed to rollback resumed state on craned {}: stub unavailable",
               craned_id);
+          rollback_failed = true;
           continue;
         }
 
         CraneErrCode err = stub->SuspendSteps(job_ids);
         if (err != CraneErrCode::SUCCESS) {
           CRANE_ERROR(
               "Rollback SuspendSteps for task #{} on craned {} failed: {}",
               task_id, craned_id, CraneErrStr(err));
+          rollback_failed = true;
         }
       }
 
+      if (rollback_failed) {
+        CRANE_ERROR(
+            "Task #{} is in inconsistent state after failed rollback. "
+            "Manual intervention may be required.", task_id);
+      }
       results.emplace_back(failure_code);
       continue;
     }

</review_comment_end>

src/Craned/Supervisor/TaskManager.cpp (1)

2000-2032: Do not short-circuit on first error; continue signaling and aggregate — prevents stranding partial sets

Currently, SuspendRunningTasks_/ResumeSuspendedTasks_ return on first failure, potentially leaving earlier tasks paused/resumed while the step status remains unchanged. Continue signaling every task, collect the first error, and only flip state when ALL succeed. This was previously flagged and appears reintroduced.

Apply these diffs:

@@
 CraneErrCode TaskManager::SuspendRunningTasks_() {
   if (m_step_.AllTaskFinished()) return CraneErrCode::ERR_NON_EXISTENT;

-  bool any_signal_sent = false;
+  bool any_signal_sent = false;
+  CraneErrCode first_err = CraneErrCode::SUCCESS;
   for (const auto& [pid, task_id] : m_pid_task_id_map_) {
     auto task = m_step_.GetTaskInstance(task_id);
     if (!task) continue;
     if (task->GetPid() == 0) continue;
     any_signal_sent = true;
     CraneErrCode err = task->Kill(SIGSTOP);
-    if (err != CraneErrCode::SUCCESS) return err;
+    if (err != CraneErrCode::SUCCESS && first_err == CraneErrCode::SUCCESS)
+      first_err = err;
   }

-  if (!any_signal_sent) return CraneErrCode::ERR_INVALID_PARAM;
-  return CraneErrCode::SUCCESS;
+  if (!any_signal_sent) return CraneErrCode::ERR_INVALID_PARAM;
+  return first_err;
 }
@@
 CraneErrCode TaskManager::ResumeSuspendedTasks_() {
   if (m_step_.AllTaskFinished()) return CraneErrCode::ERR_NON_EXISTENT;

-  bool any_signal_sent = false;
+  bool any_signal_sent = false;
+  CraneErrCode first_err = CraneErrCode::SUCCESS;
   for (const auto& [pid, task_id] : m_pid_task_id_map_) {
     auto task = m_step_.GetTaskInstance(task_id);
     if (!task) continue;
     if (task->GetPid() == 0) continue;
     any_signal_sent = true;
     CraneErrCode err = task->Kill(SIGCONT);
-    if (err != CraneErrCode::SUCCESS) return err;
+    if (err != CraneErrCode::SUCCESS && first_err == CraneErrCode::SUCCESS)
+      first_err = err;
   }

-  if (!any_signal_sent) return CraneErrCode::ERR_INVALID_PARAM;
-  return CraneErrCode::SUCCESS;
+  if (!any_signal_sent) return CraneErrCode::ERR_INVALID_PARAM;
+  return first_err;
 }

Optional: add explicit ITaskInstance::Suspend()/Resume() wrappers and call those here to encapsulate container-specific behavior instead of passing raw signals.

-    CraneErrCode err = task->Kill(SIGSTOP);
+    CraneErrCode err = task->Suspend();
@@
-    CraneErrCode err = task->Kill(SIGCONT);
+    CraneErrCode err = task->Resume()

I can draft the wrapper implementations if desired.

🧹 Nitpick comments (5)
src/CraneCtld/TaskScheduler.cpp (4)

1037-1044: Minor: Offline check breaks early but reserves space for all nodes.

The loop breaks after finding the first offline node (line 1042), so offline_nodes will contain at most one node, yet space is reserved for all executing_nodes.size(). While this works correctly (any offline node is enough to fail), consider either removing the break to collect all offline nodes for better logging, or removing the reserve since only one node is collected.

</review_comment_end>


1129-1138: Simplify redundant status check.

The condition if (task->Status() != crane::grpc::Suspended) at line 1129 should always be true at this point—the task was verified as Running earlier (lines 1021-1026) and no code path between changes it to Suspended. The check adds no value and can be simplified.

Apply this diff:

       TaskInCtld* task = iter->second.get();
-      if (task->Status() != crane::grpc::Suspended) {
-        task->SetStatus(crane::grpc::Suspended);
-        if (!g_embedded_db_client->UpdateRuntimeAttrOfTaskIfExists(
-                0, task->TaskDbId(), task->RuntimeAttr())) {
-          CRANE_ERROR("Failed to persist suspended status for task #{}",
-                      task_id);
-          results.emplace_back(CraneErrCode::ERR_GENERIC_FAILURE);
-          continue;
-        }
+      task->SetStatus(crane::grpc::Suspended);
+      if (!g_embedded_db_client->UpdateRuntimeAttrOfTaskIfExists(
+              0, task->TaskDbId(), task->RuntimeAttr())) {
+        CRANE_ERROR("Failed to persist suspended status for task #{}",
+                    task_id);
+        results.emplace_back(CraneErrCode::ERR_GENERIC_FAILURE);
+        continue;
       }

</review_comment_end>


1179-1186: Minor: Offline check breaks early but reserves space for all nodes.

Same pattern as SuspendRunningTasks: the loop breaks after the first offline node, so offline_nodes will contain at most one node, yet space is reserved for all. Consider either removing the break to collect all offline nodes for comprehensive logging, or removing the reserve.

</review_comment_end>


1269-1277: Simplify redundant status check.

The condition if (task->Status() != crane::grpc::Running) at line 1269 should always be true—the task was verified as Suspended earlier (lines 1163-1168) and no intervening code changes it to Running. This check is redundant and can be removed.

Apply this diff:

       TaskInCtld* task = iter->second.get();
-      if (task->Status() != crane::grpc::Running) {
-        task->SetStatus(crane::grpc::Running);
-        if (!g_embedded_db_client->UpdateRuntimeAttrOfTaskIfExists(
-                0, task->TaskDbId(), task->RuntimeAttr())) {
-          CRANE_ERROR("Failed to persist resumed status for task #{}", task_id);
-          results.emplace_back(CraneErrCode::ERR_GENERIC_FAILURE);
-          continue;
-        }
+      task->SetStatus(crane::grpc::Running);
+      if (!g_embedded_db_client->UpdateRuntimeAttrOfTaskIfExists(
+              0, task->TaskDbId(), task->RuntimeAttr())) {
+        CRANE_ERROR("Failed to persist resumed status for task #{}", task_id);
+        results.emplace_back(CraneErrCode::ERR_GENERIC_FAILURE);
+        continue;
       }

</review_comment_end>

src/Craned/Supervisor/TaskManager.cpp (1)

1961-1998: Suspend/Resume orchestration: confirm external state propagation and timer semantics

  • No status change is reported upstream here. If Ctld/Craned rely on supervisor notifications, wire a StepStatusChange (Suspended/Running) or confirm higher layers already track these transitions.
  • Time-limit timer continues while Suspended. If product semantics require pausing wall-time while suspended, pause/adjust the termination timer when entering Suspended and restore on Resume. Otherwise, document that wall-time continues.

Would you like a small patch to optionally pause the termination timer during Suspended?

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fb9eb76 and 6aea2cf.

📒 Files selected for processing (2)
  • src/CraneCtld/TaskScheduler.cpp (1 hunks)
  • src/Craned/Supervisor/TaskManager.cpp (7 hunks)
🔇 Additional comments (7)
src/Craned/Supervisor/TaskManager.cpp (7)

25-25: Header include for signals — LGTM

Required for SIGSTOP/SIGCONT usage.


1541-1546: Async handle for task signal queue — LGTM

Plumbing is consistent with existing uvw async handles.


1607-1607: State assignment on cleanup — LGTM

Setting m_step_.status here aligns with one-shot cleanup and avoids clearing suspend state elsewhere.


1704-1711: SuspendJobAsync queue plumbing — LGTM

Future-based API and async signaling are appropriate.


1713-1720: ResumeJobAsync queue plumbing — LGTM

Symmetric to suspend; looks good.


1907-1907: Resetting status to Running before termination — explain intent

This may surprise future readers and upstream expectations. Add a brief comment explaining why we force Running (e.g., to pass gating in suspend/resume paths) and confirm no premature status propagation to Ctld occurs here.


1024-1032: Non-destructive suspend/resume for containers — verified runtime mode Default runc run command in etc/config.yaml doesn’t include --detach/--detached, so SIGSTOP/SIGCONT via kill(-m_pid_, …) will correctly pause and resume containers.

@github-actions github-actions bot added the test-passed Build and test success label Oct 13, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/Craned/Core/JobManager.h (1)

128-131: These methods only suspend/resume the first step, not all steps of a job.

Per the PR comment from L-Xiafeng stating "The two operations (suspend and resume) should only support operations on Job objects," and as previously flagged, the current implementation uses step_map.begin()->second.get() which arbitrarily operates on only the first step of a job. To properly support job-level suspend/resume, these methods should iterate over all steps in the job's step_map and suspend/resume each one.

Consider renaming to SuspendJob/ResumeJob to clarify the job-scoped semantics, and update the implementation to iterate all steps:

-  CraneErrCode SuspendStep(task_id_t task_id);
+  CraneErrCode SuspendJob(job_id_t job_id);

-  CraneErrCode ResumeStep(task_id_t task_id);
+  CraneErrCode ResumeJob(job_id_t job_id);

Then in JobManager.cpp, iterate over all steps:

CraneErrCode JobManager::SuspendJob(job_id_t job_id) {
  auto job_instance = m_job_map_.GetValueExclusivePtr(job_id);
  if (!job_instance || job_instance->step_map.empty()) {
    CRANE_DEBUG("Suspend request for nonexistent job #{}", job_id);
    return CraneErrCode::ERR_NON_EXISTENT;
  }

  CraneErrCode result = CraneErrCode::SUCCESS;
  for (auto& [step_id, step_inst] : job_instance->step_map) {
    if (step_inst->status == StepStatus::Suspended) {
      CRANE_TRACE("Step #{} of job #{} already suspended", step_id, job_id);
      continue;
    }
    
    auto stub = g_supervisor_keeper->GetStub(step_id);
    if (!stub) {
      CRANE_ERROR("Supervisor for step #{} of job #{} not found when suspending", step_id, job_id);
      result = CraneErrCode::ERR_RPC_FAILURE;
      continue;
    }
    
    CraneErrCode err = stub->SuspendJob();
    if (err == CraneErrCode::SUCCESS) {
      step_inst->status = StepStatus::Suspended;
    } else {
      result = err;
    }
  }
  return result;
}
🧹 Nitpick comments (2)
src/CraneCtld/TaskScheduler.cpp (2)

1316-1316: Consider using m_rpc_worker_pool_ for consistency with other RPC broadcasts.

The suspend RPC broadcast uses g_thread_pool, but other RPC operations in this file (AllocJobs, AllocSteps, FreeJobs, FreeSteps) consistently use m_rpc_worker_pool_. For consistency and to follow the established pattern where RPC-heavy operations use the dedicated RPC worker pool, consider switching to m_rpc_worker_pool_->detach_task.

Based on learnings from past review comments and coding guidelines.

-      g_thread_pool->detach_task([&, craned_id]() {
+      m_rpc_worker_pool_->detach_task([&, craned_id]() {

1457-1457: Consider using m_rpc_worker_pool_ for consistency with other RPC broadcasts.

Same issue as in SuspendRunningTasks: the resume RPC broadcast uses g_thread_pool, but should use m_rpc_worker_pool_ for consistency with other RPC operations in this file.

-      g_thread_pool->detach_task([&, craned_id]() {
+      m_rpc_worker_pool_->detach_task([&, craned_id]() {
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 17cb1ae and cb1f358.

📒 Files selected for processing (3)
  • src/CraneCtld/TaskScheduler.cpp (1 hunks)
  • src/Craned/Core/JobManager.cpp (1 hunks)
  • src/Craned/Core/JobManager.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/Craned/Core/JobManager.cpp
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Core/JobManager.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Core/JobManager.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Core/JobManager.h
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
🧬 Code graph analysis (2)
src/CraneCtld/TaskScheduler.cpp (1)
src/CraneCtld/TaskScheduler.h (1)
  • TerminateRunningTask (843-851)
src/Craned/Core/JobManager.h (1)
src/Craned/Core/JobManager.cpp (4)
  • SuspendStep (1180-1202)
  • SuspendStep (1180-1180)
  • ResumeStep (1204-1226)
  • ResumeStep (1204-1204)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: format
🔇 Additional comments (1)
src/Craned/Core/JobManager.h (1)

128-128: Parameter type inconsistency and semantic confusion.

The header declares the parameter as task_id_t task_id, but the implementation in JobManager.cpp uses step_id_t step_id. Additionally, the parameter is used to look up in m_job_map_, which is keyed by job_id_t (line 161). This creates confusion about whether the parameter represents a task ID, step ID, or job ID.

Ensure consistency between the header and implementation, and clarify the parameter's semantics by using the appropriate type (job_id_t if operating on jobs).

⛔ Skipped due to learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/Craned/Core/CranedServer.cpp (2)

88-122: LGTM: SuspendJobs handler correctly implements job-level suspend.

The implementation correctly:

  • Validates server readiness
  • Uses job-level GetStub(job_id) for job-only operations (per PR objectives)
  • Accumulates per-job errors and returns aggregated status
  • Returns Status::OK with error details in the response payload (per learnings)

Consider adding CRANE_WARN logging when individual jobs fail to suspend (lines 104-106, 110-112), similar to how success is logged at line 114. This would aid debugging when operations fail:

     if (!stub) {
       all_ok = false;
       reasons.emplace_back(fmt::format("job {}: Supervisor not found", job_id));
+      CRANE_WARN("Job {}: Supervisor not found", job_id);
       continue;
     }

     CraneErrCode err = stub->SuspendJob();
     if (err != CraneErrCode::SUCCESS) {
       all_ok = false;
       reasons.emplace_back(fmt::format("job {}: {}", job_id, CraneErrStr(err)));
+      CRANE_WARN("Job {} suspend failed: {}", job_id, CraneErrStr(err));
     }

124-157: LGTM: ResumeJobs handler correctly implements job-level resume.

The implementation correctly mirrors the SuspendJobs handler structure and follows the same patterns for readiness checks, error accumulation, and response handling.

Similar to SuspendJobs, consider adding CRANE_WARN logging for individual job failures (lines 138-142, 144-147):

     if (!stub) {
       all_ok = false;
       reasons.emplace_back(fmt::format("job {}: Supervisor not found", job_id));
+      CRANE_WARN("Job {}: Supervisor not found", job_id);
       continue;
     }

     CraneErrCode err = stub->ResumeJob();
     if (err != CraneErrCode::SUCCESS) {
       all_ok = false;
       reasons.emplace_back(fmt::format("job {}: {}", job_id, CraneErrStr(err)));
+      CRANE_WARN("Job {} resume failed: {}", job_id, CraneErrStr(err));
     }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb1f358 and f0f44df.

📒 Files selected for processing (3)
  • src/Craned/Core/CranedServer.cpp (14 hunks)
  • src/Craned/Core/JobManager.h (0 hunks)
  • src/Craned/Supervisor/TaskManager.cpp (5 hunks)
💤 Files with no reviewable changes (1)
  • src/Craned/Core/JobManager.h
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/Craned/Core/CranedServer.cpp
  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/Craned/Core/CranedServer.cpp
📚 Learning: 2025-04-27T11:52:31.017Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.

Applied to files:

  • src/Craned/Core/CranedServer.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
🧬 Code graph analysis (2)
src/Craned/Core/CranedServer.cpp (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
  • SuspendJobs (270-297)
  • SuspendJobs (270-270)
  • ResumeJobs (299-326)
  • ResumeJobs (299-299)
src/Craned/Supervisor/TaskManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (163-163)
🔇 Additional comments (3)
src/Craned/Core/CranedServer.cpp (3)

33-35: LGTM: Signature changes follow gRPC conventions.

The refactoring of all RPC handler signatures from reference parameters to pointer parameters aligns with standard gRPC conventions and improves consistency across the codebase.

Also applies to: 43-45, 63-66, 160-162, 184-187, 294-296, 316-318, 327-329, 346-348, 363-366, 388-391, 414-417


52-54: LGTM: Map-based iteration improves batching efficiency.

The refactoring from individual item lists to job_step_ids_map improves efficiency for batched operations and maintains clear job-to-steps relationships. The iteration pattern is consistent across all affected handlers.

Also applies to: 74-76, 169-171, 336-338


222-222: LGTM: const-correctness improvements.

The addition of const qualifiers in range-based for loops and the constructor signature improves const-correctness without changing behavior.

Also applies to: 230-230, 235-235, 380-380, 430-430

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0f44df and c9019ac.

📒 Files selected for processing (4)
  • src/CraneCtld/RpcService/CranedKeeper.cpp (27 hunks)
  • src/CraneCtld/RpcService/CranedKeeper.h (7 hunks)
  • src/CraneCtld/TaskScheduler.cpp (1 hunks)
  • src/Craned/Core/CranedServer.cpp (14 hunks)
🧰 Additional context used
🧠 Learnings (19)
📓 Common learnings
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
  • src/Craned/Core/CranedServer.cpp
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:05:26.012Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:601-602
Timestamp: 2025-05-02T07:05:26.012Z
Learning: In the CraneCtld codebase, the variables m_disconnected_ and m_registered_ in CranedStub class are already defined as std::atomic_bool, making them thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-27T06:47:51.751Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-27T11:52:31.017Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
  • src/Craned/Core/CranedServer.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
  • src/Craned/Core/CranedServer.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-02T09:30:13.014Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/JobManager.cpp:141-149
Timestamp: 2025-04-02T09:30:13.014Z
Learning: In JobManager, if a uid exists in m_uid_to_job_ids_map_, its corresponding task_ids set is guaranteed to be non-empty due to the invariant maintained in AllocJobs and FreeJobs methods.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-08-14T02:56:35.503Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 587
File: src/Craned/Supervisor/CforedClient.cpp:449-454
Timestamp: 2025-08-14T02:56:35.503Z
Learning: In CforedClient::AsyncSendRecvThread_(), the guard `if (state <= State::Registering) { continue; }` in the TIMEOUT branch only prevents premature cleanup when stopping before registration completes, but it doesn't block normal gRPC event processing. The completion queue will still deliver Prepare/Write/Read events that advance the state machine normally.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-23T07:47:44.485Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Utilities/PublicHeader/Logger.cpp:21-21
Timestamp: 2025-06-23T07:47:44.485Z
Learning: In the CraneSched codebase, static logger sinks like default_sinks in src/Utilities/PublicHeader/Logger.cpp are initialized during startup in a single-threaded context and then become read-only during normal operation, eliminating the need for synchronization protection.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.cpp
🧬 Code graph analysis (4)
src/CraneCtld/TaskScheduler.cpp (4)
src/CraneCtld/TaskScheduler.h (1)
  • TerminateRunningTask (843-851)
src/Craned/Core/JobManager.cpp (2)
  • StepStatusChangeAsync (1246-1255)
  • StepStatusChangeAsync (1246-1250)
src/Craned/Core/CtldClient.cpp (2)
  • StepStatusChangeAsync (564-573)
  • StepStatusChangeAsync (564-565)
src/Craned/Supervisor/CranedClient.cpp (2)
  • StepStatusChangeAsync (46-56)
  • StepStatusChangeAsync (46-48)
src/CraneCtld/RpcService/CranedKeeper.h (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (54)
  • CranedStub (28-34)
  • Fini (36-38)
  • Fini (36-36)
  • l (403-403)
  • ConfigureCraned (40-67)
  • ConfigureCraned (40-41)
  • AllocJobs (134-161)
  • AllocJobs (134-135)
  • FreeJobs (163-187)
  • FreeJobs (163-163)
  • AllocSteps (189-207)
  • AllocSteps (189-190)
  • job_id_t (208-242)
  • ExecuteSteps (209-210)
  • FreeSteps (244-268)
  • FreeSteps (244-245)
  • SuspendJobs (270-297)
  • SuspendJobs (270-270)
  • ResumeJobs (299-326)
  • ResumeJobs (299-299)
  • TerminateSteps (69-98)
  • TerminateSteps (69-70)
  • TerminateOrphanedSteps (100-132)
  • TerminateOrphanedSteps (100-101)
  • CranedKeeper (365-386)
  • CranedKeeper (388-395)
  • Shutdown (397-417)
  • Shutdown (397-397)
  • AvailableCranedCount (720-723)
  • AvailableCranedCount (720-720)
  • IsCranedConnected (725-728)
  • IsCranedConnected (725-725)
  • GetCranedStub (730-737)
  • GetCranedStub (730-731)
  • SetCranedConnectedCb (739-742)
  • SetCranedConnectedCb (739-740)
  • SetCranedDisconnectedCb (744-746)
  • SetCranedDisconnectedCb (744-744)
  • PutNodeIntoUnavailSet (748-756)
  • PutNodeIntoUnavailSet (748-749)
  • ConnectCranedNode_ (758-846)
  • ConnectCranedNode_ (758-759)
  • CranedChannelConnFailNoLock_ (848-854)
  • CranedChannelConnFailNoLock_ (848-848)
  • InitCranedStateMachine_ (525-622)
  • InitCranedStateMachine_ (525-526)
  • EstablishedCranedStateMachine_ (624-691)
  • EstablishedCranedStateMachine_ (624-625)
  • CheckNodeTimeoutAndClean (693-718)
  • CheckNodeTimeoutAndClean (693-693)
  • StateMonitorThreadFunc_ (419-523)
  • StateMonitorThreadFunc_ (419-419)
  • PeriodConnectCranedThreadFunc_ (856-892)
  • PeriodConnectCranedThreadFunc_ (856-856)
src/CraneCtld/RpcService/CranedKeeper.cpp (3)
src/Craned/Core/CranedServer.cpp (4)
  • SuspendJobs (88-138)
  • SuspendJobs (88-91)
  • ResumeJobs (140-189)
  • ResumeJobs (140-142)
src/CraneCtld/RpcService/CranedKeeper.h (1)
  • UpdateLastActiveTime (102-105)
src/Craned/Core/CtldClient.h (1)
  • UpdateLastActiveTime (193-195)
src/Craned/Core/CranedServer.cpp (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
  • SuspendJobs (270-297)
  • SuspendJobs (270-270)
  • ResumeJobs (299-326)
  • ResumeJobs (299-299)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: format

Comment on lines 1292 to 1297
StepStatusChangeAsync(task_id, 0, craned_id,
crane::grpc::TaskStatus::Failed,
ExitCode::kExitCodeRpcError, "", absl::Now());
}

results.emplace_back(code);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix bogus step_id in failure cleanup.

Both suspend/resume failure paths call StepStatusChangeAsync(task_id, 0, …); step id 0 doesn’t correspond to any real step, so CleanTaskStatusChangeQueueCb_ drops the notification as “unknown step”. When a craned is offline, that means the controller never marks the job failed even though we already tore it down, leaving m_running_task_map_ inconsistent. Use the actual daemon step id (e.g. kDaemonStepId) or iterate the real executing step IDs when fanning out the status change so the failure path reliably clears the job state.

Also applies to: 1434-1439

🤖 Prompt for AI Agents
In src/CraneCtld/TaskScheduler.cpp around lines 1292-1297 (and similarly
1434-1439), the failure paths call StepStatusChangeAsync(task_id, 0, ...) using
step id 0 which is not a real step and gets dropped by
CleanTaskStatusChangeQueueCb_; replace the hardcoded 0 with the actual daemon
step id (e.g. kDaemonStepId) or, if fanning out to multiple executing steps,
iterate the real executing step IDs and call StepStatusChangeAsync for each so
the controller correctly recognizes and clears the task state; apply the same
fix at both locations and ensure the StepStatusChangeAsync arguments match what
CleanTaskStatusChangeQueueCb_ expects.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (1)

433-436: Free CqTag on early return when m_cq_closed_.

You continue without deleting the pmr-allocated tag, leaking the tag object. Delete the tag before continuing.

-    if (m_cq_closed_) {
-      if (tag->type == CqTag::kInitializingCraned) delete tag->craned;
-      continue;
-    }
+    if (m_cq_closed_) {
+      if (tag->type == CqTag::kInitializingCraned) delete tag->craned;
+      m_tag_sync_allocator_->delete_object(tag);
+      continue;
+    }
♻️ Duplicate comments (1)
src/CraneCtld/TaskScheduler.cpp (1)

1291-1296: Fix bogus step_id in failure fanout—use kDaemonStepId, not 0.

StepStatusChangeAsync with step_id 0 is ignored downstream, so failed suspend/resume never clears Ctld state. Use the daemon step id for the job when broadcasting failure.

Apply this diff at both call sites:

-      for (const auto& craned_id : executing_nodes) {
-        StepStatusChangeAsync(
-            task_id, 0, craned_id, crane::grpc::TaskStatus::Failed,
-            ExitCode::kExitCodeRpcError, "",
-            google::protobuf::util::TimeUtil::GetCurrentTime());
-      }
+      for (const auto& craned_id : executing_nodes) {
+        StepStatusChangeAsync(
+            task_id, kDaemonStepId, craned_id, crane::grpc::TaskStatus::Failed,
+            ExitCode::kExitCodeRpcError, "Suspend/Resume RPC failure",
+            google::protobuf::util::TimeUtil::GetCurrentTime());
+      }

Also applies to: 1435-1439

🧹 Nitpick comments (4)
src/CraneCtld/TaskScheduler.cpp (1)

1315-1343: Unify RPC worker pool for broadcast.

Elsewhere this class uses m_rpc_worker_pool_ for node fanouts; here it uses g_thread_pool. For consistency and easier load control, prefer m_rpc_worker_pool_->detach_task.

-    for (const auto& craned_id : executing_nodes) {
-      g_thread_pool->detach_task([&, craned_id]() {
+    for (const auto& craned_id : executing_nodes) {
+      m_rpc_worker_pool_->detach_task([&, craned_id]() {
         ...
       });
     }

Also applies to: 1458-1485

src/CraneCtld/RpcService/CranedKeeper.cpp (1)

269-297: Optional: align server error semantics to avoid non-OK gRPC status disconnects.

Ctld handles non-OK as RPC failure. If feasible, make Craned return Status::OK with reply.ok=false (see repo practice) to avoid needless channel disruptions. Code here can remain unchanged.

Based on learnings

Also applies to: 299-326

src/CraneCtld/RpcService/CranedKeeper.h (2)

194-197: Remove unused/undefined declaration CranedChannelConnectFail_.

Only CranedChannelConnFailNoLock_ is defined/used. Drop the unused declaration to avoid confusion.

-  static void CranedChannelConnectFail_(CranedStub* stub);
-
   static void CranedChannelConnFailNoLock_(CranedStub* stub);

74-77: Name/type clarity: FreeJobs takes task_id_t but fills job_id_list.

Types are aliases, but parameter naming (task) conflicts with proto field job_id_list(). Consider renaming to job_ids for readability or switch to std::vector<job_id_t>.

-  CraneErrCode FreeJobs(const std::vector<task_id_t>& task);
+  CraneErrCode FreeJobs(const std::vector<job_id_t>& job_ids);

And in the .cpp use job_ids when assigning to request.mutable_job_id_list().

Also applies to: 163-176

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9019ac and b9cd9de.

📒 Files selected for processing (3)
  • src/CraneCtld/RpcService/CranedKeeper.cpp (27 hunks)
  • src/CraneCtld/RpcService/CranedKeeper.h (7 hunks)
  • src/CraneCtld/TaskScheduler.cpp (1 hunks)
🧰 Additional context used
🧠 Learnings (19)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:05:26.012Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:601-602
Timestamp: 2025-05-02T07:05:26.012Z
Learning: In the CraneCtld codebase, the variables m_disconnected_ and m_registered_ in CranedStub class are already defined as std::atomic_bool, making them thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-27T06:47:51.751Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-27T11:52:31.017Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-02T09:30:13.014Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/JobManager.cpp:141-149
Timestamp: 2025-04-02T09:30:13.014Z
Learning: In JobManager, if a uid exists in m_uid_to_job_ids_map_, its corresponding task_ids set is guaranteed to be non-empty due to the invariant maintained in AllocJobs and FreeJobs methods.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-08-14T02:56:35.503Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 587
File: src/Craned/Supervisor/CforedClient.cpp:449-454
Timestamp: 2025-08-14T02:56:35.503Z
Learning: In CforedClient::AsyncSendRecvThread_(), the guard `if (state <= State::Registering) { continue; }` in the TIMEOUT branch only prevents premature cleanup when stopping before registration completes, but it doesn't block normal gRPC event processing. The completion queue will still deliver Prepare/Write/Read events that advance the state machine normally.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.cpp
🧬 Code graph analysis (3)
src/CraneCtld/TaskScheduler.cpp (3)
src/CraneCtld/TaskScheduler.h (1)
  • TerminateRunningTask (843-851)
src/Craned/Core/JobManager.cpp (2)
  • StepStatusChangeAsync (1246-1255)
  • StepStatusChangeAsync (1246-1250)
src/Craned/Core/CtldClient.cpp (2)
  • StepStatusChangeAsync (564-573)
  • StepStatusChangeAsync (564-565)
src/CraneCtld/RpcService/CranedKeeper.h (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (54)
  • CranedStub (28-34)
  • Fini (36-38)
  • Fini (36-36)
  • l (403-403)
  • ConfigureCraned (40-67)
  • ConfigureCraned (40-41)
  • AllocJobs (134-161)
  • AllocJobs (134-135)
  • FreeJobs (163-187)
  • FreeJobs (163-163)
  • AllocSteps (189-207)
  • AllocSteps (189-190)
  • job_id_t (208-242)
  • ExecuteSteps (209-210)
  • FreeSteps (244-268)
  • FreeSteps (244-245)
  • SuspendJobs (270-297)
  • SuspendJobs (270-270)
  • ResumeJobs (299-326)
  • ResumeJobs (299-299)
  • TerminateSteps (69-98)
  • TerminateSteps (69-70)
  • TerminateOrphanedSteps (100-132)
  • TerminateOrphanedSteps (100-101)
  • CranedKeeper (365-386)
  • CranedKeeper (388-395)
  • Shutdown (397-417)
  • Shutdown (397-397)
  • AvailableCranedCount (720-723)
  • AvailableCranedCount (720-720)
  • IsCranedConnected (725-728)
  • IsCranedConnected (725-725)
  • GetCranedStub (730-737)
  • GetCranedStub (730-731)
  • SetCranedConnectedCb (739-742)
  • SetCranedConnectedCb (739-740)
  • SetCranedDisconnectedCb (744-746)
  • SetCranedDisconnectedCb (744-744)
  • PutNodeIntoUnavailSet (748-756)
  • PutNodeIntoUnavailSet (748-749)
  • CranedChannelConnFailNoLock_ (848-854)
  • CranedChannelConnFailNoLock_ (848-848)
  • ConnectCranedNode_ (758-846)
  • ConnectCranedNode_ (758-759)
  • InitCranedStateMachine_ (525-622)
  • InitCranedStateMachine_ (525-526)
  • EstablishedCranedStateMachine_ (624-691)
  • EstablishedCranedStateMachine_ (624-625)
  • CheckNodeTimeoutAndClean (693-718)
  • CheckNodeTimeoutAndClean (693-693)
  • StateMonitorThreadFunc_ (419-523)
  • StateMonitorThreadFunc_ (419-419)
  • PeriodConnectCranedThreadFunc_ (856-892)
  • PeriodConnectCranedThreadFunc_ (856-856)
src/CraneCtld/RpcService/CranedKeeper.cpp (2)
src/Craned/Core/CranedServer.cpp (4)
  • SuspendJobs (88-138)
  • SuspendJobs (88-91)
  • ResumeJobs (140-189)
  • ResumeJobs (140-142)
src/CraneCtld/RpcService/CranedKeeper.h (1)
  • UpdateLastActiveTime (102-105)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/Craned/Supervisor/TaskManager.cpp (1)

2154-2207: Consider preserving the first error instead of the last.

The error aggregation logic (lines 2172, 2199) overwrites result each time a task operation fails. If multiple tasks fail, only the last error is returned, losing information about earlier failures.

Apply this pattern to preserve the first error encountered:

       CraneErrCode err = task->Suspend();
       if (err != CraneErrCode::SUCCESS) {
         CRANE_ERROR("Failed to suspend task #{}: {}", task_id,
                     CraneErrStr(err));
-        result = err;
+        if (result == CraneErrCode::SUCCESS) result = err;
       } else {

Apply the same change in ResumeSuspendedTasks_() at line 2199.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b9cd9de and fa2180c.

📒 Files selected for processing (1)
  • src/Craned/Supervisor/TaskManager.cpp (6 hunks)
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
🧬 Code graph analysis (1)
src/Craned/Supervisor/TaskManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (163-163)
🔇 Additional comments (5)
src/Craned/Supervisor/TaskManager.cpp (5)

113-125: LGTM! GetTaskInstance now safely handles missing tasks.

Changing from at() to find() with a nullptr return properly addresses the previous issue where iterating m_step_.task_ids could throw std::out_of_range for tasks not yet instantiated or already cleaned up.


608-636: LGTM! Process group signaling is correct.

The suspend/resume implementations correctly use negative PID (-m_pid_) to signal the entire process group, which ensures child processes are also suspended/resumed. Error handling is appropriate.


1104-1225: LGTM! Container suspend/resume with appropriate fallback.

The implementation correctly:

  • Checks container state before applying OCI runtime pause/resume commands
  • Falls back to process group signaling when runtime commands are unavailable or container is in an unexpected state
  • Mirrors the defensive pattern used in the Kill() method

1835-1861: LGTM! Async API follows established patterns.

The suspend/resume async methods correctly follow the same pattern as other async operations in TaskManager (enqueue element with promise, send async event, return future).


1662-1666: LGTM! Async handle setup and callback are correct.

The task signal async handle is properly initialized in the constructor and the callback correctly:

  • Dequeues signal requests
  • Dispatches to appropriate helper method based on action type
  • Fulfills the promise with the operation result

Also applies to: 2136-2152

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/Craned/Supervisor/TaskManager.cpp (1)

2163-2177: Error aggregation returns the last error instead of the first.

The current implementation overwrites result for each failed task, meaning the returned error code corresponds to the last failure rather than the first. While continuing to process all tasks is correct (to ensure partial success), returning the first error is typically more useful for debugging.

Consider tracking the first error:

   CraneErrCode result = CraneErrCode::SUCCESS;
 
   for (const auto& task_id : m_step_.task_ids) {
     auto* task = m_step_.GetTaskInstance(task_id);
     if (task) {
       CraneErrCode err = task->Suspend();
       if (err != CraneErrCode::SUCCESS) {
         CRANE_ERROR("Failed to suspend task #{}: {}", task_id,
                     CraneErrStr(err));
-        result = err;
+        if (result == CraneErrCode::SUCCESS) result = err;
       } else {
         CRANE_DEBUG("Task #{} suspended successfully", task_id);
       }
     }
   }

Apply the same change to ResumeSuspendedTasks_() as well.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fa2180c and 26ea975.

📒 Files selected for processing (1)
  • src/Craned/Supervisor/TaskManager.cpp (6 hunks)
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
🧬 Code graph analysis (1)
src/Craned/Supervisor/TaskManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (163-163)
🔇 Additional comments (6)
src/Craned/Supervisor/TaskManager.cpp (6)

112-126: LGTM! Safe lookup pattern.

The change from at() to find() with nullptr return is correct and addresses the past review concern about crashes when task IDs aren't instantiated. Callers now properly handle missing tasks.


608-636: LGTM! Process group suspend/resume is correct.

The implementation correctly uses kill(-m_pid_, SIGSTOP/SIGCONT) to suspend/resume the entire process group. Error handling and logging are appropriate.


1104-1225: LGTM! Container suspend/resume with fallback.

The implementation correctly attempts OCI runtime pause/resume first (if configured), then falls back to process group signaling. The state checking and fallback strategy are sound.


1662-1666: LGTM! Async handle wiring.

The new signal queue handle follows the established pattern used by other async handlers in TaskManager.


1835-1861: LGTM! Async API structure.

The suspend/resume async APIs follow the established pattern for async operations in TaskManager, correctly using promises/futures and queue-based communication.


2182-2207: Add status check before resuming tasks.

Similar to suspend, the resume operation should verify the step state before attempting to resume tasks. The step should be in a suspended or paused state for resume to be valid.

Based on learnings.

Apply this diff to add the status check:

 CraneErrCode TaskManager::ResumeSuspendedTasks_() {
   CRANE_DEBUG("Resuming all suspended tasks");
 
+  if (g_runtime_status.Status.load() != StepStatus::Running) {
+    CRANE_DEBUG("Step is not in Running state, cannot resume.");
+    return CraneErrCode::ERR_INVALID_PARAM;
+  }
+
   // Daemon steps don't have actual task instances to resume
   if (m_step_.IsDaemon()) {
     CRANE_DEBUG("Daemon step cannot be resumed, skipping.");
⛔ Skipped due to learnings
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.

Comment on lines 2154 to 2180
CraneErrCode TaskManager::SuspendRunningTasks_() {
CRANE_DEBUG("Suspending all running tasks");

// Daemon steps don't have actual task instances to suspend
if (m_step_.IsDaemon()) {
CRANE_DEBUG("Daemon step cannot be suspended, skipping.");
return CraneErrCode::ERR_GENERIC_FAILURE;
}

CraneErrCode result = CraneErrCode::SUCCESS;

for (const auto& task_id : m_step_.task_ids) {
auto* task = m_step_.GetTaskInstance(task_id);
if (task) {
CraneErrCode err = task->Suspend();
if (err != CraneErrCode::SUCCESS) {
CRANE_ERROR("Failed to suspend task #{}: {}", task_id,
CraneErrStr(err));
result = err;
} else {
CRANE_DEBUG("Task #{} suspended successfully", task_id);
}
}
}

return result;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Add status check before suspending tasks.

The suspend operation should verify that the step is in a Running state before attempting to suspend tasks. Without this check, the method might attempt to suspend tasks that are in Completing, Pending, or other non-running states, leading to unexpected behavior.

Based on learnings.

Apply this diff to add the status check:

 CraneErrCode TaskManager::SuspendRunningTasks_() {
   CRANE_DEBUG("Suspending all running tasks");
 
+  if (g_runtime_status.Status.load() != StepStatus::Running) {
+    CRANE_DEBUG("Step is not in Running state, cannot suspend.");
+    return CraneErrCode::ERR_INVALID_PARAM;
+  }
+
   // Daemon steps don't have actual task instances to suspend
   if (m_step_.IsDaemon()) {
     CRANE_DEBUG("Daemon step cannot be suspended, skipping.");

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/Craned/Supervisor/TaskManager.cpp around lines 2154 to 2180, add a guard
that checks the step's status before iterating and suspending tasks: if the step
is not in the Running state (e.g. via m_step_.IsRunning() or comparing
m_step_.status), log a debug/info message and return an appropriate error code
(such as CraneErrCode::ERR_INVALID_STATE) to avoid attempting Suspend on tasks
when the step is Completing/Pending/etc.; keep the existing daemon check and
existing task suspension loop unchanged otherwise.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 26ea975 and d881699.

📒 Files selected for processing (11)
  • README_EN.md (1 hunks)
  • docs/en/command/ccontrol.md (4 hunks)
  • docs/zh/command/ccontrol.md (5 hunks)
  • protos/Crane.proto (4 hunks)
  • protos/PublicDefs.proto (1 hunks)
  • protos/Supervisor.proto (2 hunks)
  • src/CraneCtld/TaskScheduler.h (1 hunks)
  • src/Craned/Core/SupervisorKeeper.cpp (2 hunks)
  • src/Craned/Core/SupervisorKeeper.h (1 hunks)
  • src/Craned/Supervisor/SupervisorServer.cpp (2 hunks)
  • src/Craned/Supervisor/SupervisorServer.h (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • README_EN.md
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/Craned/Supervisor/SupervisorServer.h
  • protos/Supervisor.proto
  • protos/PublicDefs.proto
  • src/CraneCtld/TaskScheduler.h
  • protos/Crane.proto
🧰 Additional context used
🧠 Learnings (13)
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.

Applied to files:

  • src/Craned/Core/SupervisorKeeper.h
  • src/Craned/Supervisor/SupervisorServer.cpp
  • src/Craned/Core/SupervisorKeeper.cpp
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/Craned/Core/SupervisorKeeper.h
  • src/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/Craned/Supervisor/SupervisorServer.cpp
  • src/Craned/Core/SupervisorKeeper.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/Craned/Supervisor/SupervisorServer.cpp
  • src/Craned/Core/SupervisorKeeper.cpp
📚 Learning: 2025-05-25T04:08:03.273Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/Supervisor.cpp:68-69
Timestamp: 2025-05-25T04:08:03.273Z
Learning: In the Crane supervisor component (src/Craned/Supervisor/), the supervisor process communicates with Craned through STDOUT using protobuf messages. The supervisor must not send any information to STDOUT before sending the "ready" message to Craned, as this would interfere with the inter-process communication protocol. Therefore, adding logging statements that might write to STDOUT before the ready message is sent could break the communication.

Applied to files:

  • src/Craned/Supervisor/SupervisorServer.cpp
  • src/Craned/Core/SupervisorKeeper.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-05-26T11:04:56.055Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:72-72
Timestamp: 2025-05-26T11:04:56.055Z
Learning: The CraneSched project prefers using global variables (like `g_supervisor_keeper`) over dependency injection patterns. The team does not follow dependency injection approaches for managing singleton instances.

Applied to files:

  • src/Craned/Supervisor/SupervisorServer.cpp
  • src/Craned/Core/SupervisorKeeper.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.

Applied to files:

  • src/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.

Applied to files:

  • src/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).

Applied to files:

  • src/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.

Applied to files:

  • src/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.

Applied to files:

  • src/Craned/Core/SupervisorKeeper.cpp
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.

Applied to files:

  • src/Craned/Core/SupervisorKeeper.cpp
🧬 Code graph analysis (3)
src/Craned/Core/SupervisorKeeper.h (2)
src/Craned/Core/SupervisorKeeper.cpp (4)
  • SuspendJob (93-111)
  • SuspendJob (93-93)
  • ResumeJob (113-131)
  • ResumeJob (113-113)
src/Craned/Supervisor/SupervisorServer.cpp (4)
  • SuspendJob (95-109)
  • SuspendJob (95-98)
  • ResumeJob (111-125)
  • ResumeJob (111-114)
src/Craned/Supervisor/SupervisorServer.cpp (1)
src/Craned/Core/SupervisorKeeper.cpp (4)
  • SuspendJob (93-111)
  • SuspendJob (93-93)
  • ResumeJob (113-131)
  • ResumeJob (113-113)
src/Craned/Core/SupervisorKeeper.cpp (1)
src/Craned/Supervisor/SupervisorServer.cpp (4)
  • SuspendJob (95-109)
  • SuspendJob (95-98)
  • ResumeJob (111-125)
  • ResumeJob (111-114)
🪛 LanguageTool
docs/en/command/ccontrol.md

[grammar] ~265-~265: Ensure spelling is correct
Context: ...job_id (required): ID of the job to suspend Description: - Can only suspend jobs ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~268-~268: Ensure spelling is correct
Context: ...n:** - Can only suspend jobs in RUNNING state - After suspension, job status changes to ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~271-~271: Ensure spelling is correct
Context: ... data and state are preserved for later resumption Example: ```bash # Suspend a job ccon...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~289-~289: Ensure spelling is correct
Context: ...job_id (required): ID of the job to resume Description: - Can only resume jobs i...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~292-~292: Ensure spelling is correct
Context: ...:** - Can only resume jobs in SUSPENDED state - After resumption, the job reallocates re...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~293-~293: Ensure spelling is correct
Context: ...job reallocates resources and continues execution - The job continues from where it was susp...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~307-~307: Ensure spelling is correct
Context: ...iting for resources to become available again ### Create and Delete Commands #### Create...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (6)
src/Craned/Core/SupervisorKeeper.h (1)

39-40: LGTM! Method declarations follow established patterns.

The new SuspendJob() and ResumeJob() method declarations are consistent with the existing RPC wrapper methods in SupervisorStub. The signatures follow the same pattern (no parameters, return CraneErrCode) as other control methods like ExecuteStep() and ShutdownSupervisor().

docs/en/command/ccontrol.md (1)

253-308: Documentation is comprehensive and accurate.

The suspend/resume documentation is well-structured and clearly explains:

  • Command syntax and parameters
  • State requirements (RUNNING for suspend, SUSPENDED for resume)
  • Resource behavior (released on suspend, reallocated on resume)
  • Key differences from hold/release operations

The static analysis grammar warnings are false positives—the flagged words (suspend, resumption, execution) are correctly spelled technical terms.

Note: As with the Chinese documentation, verify that the TaskManager implementation enforces the documented state restrictions.

src/Craned/Supervisor/SupervisorServer.cpp (2)

23-23: Include is correctly added.

The crane/PublicHeader.h include is necessary for the CraneErrStr() function used in the new RPC handlers (lines 106, 122).


95-125: RPC handlers follow established patterns correctly.

Both SuspendJob and ResumeJob implementations follow the established async pattern used throughout the codebase:

  1. Call the async method on g_task_mgr
  2. Wait for the future to complete
  3. Set response fields based on the result
  4. Return Status::OK (per project convention to keep gRPC connections alive)

The previous naming inconsistency issue (ResumeTask vs ResumeJob) has been resolved—both methods now consistently use the "Job" suffix.

Based on learnings.

src/Craned/Core/SupervisorKeeper.cpp (2)

22-22: Include is correctly added.

The crane/PublicHeader.h include is necessary for the CraneErrStr() function used on lines 106, 108, 126, and 128.


93-131: Client stub implementations correctly handle errors.

Both SuspendJob() and ResumeJob() implementations properly handle RPC failures:

  1. Transport errors: Lines 99-102 and 119-122 check status.ok() first and return ERR_RPC_FAILURE for gRPC transport failures
  2. Application errors: Lines 105-110 and 125-130 map specific reply.reason() strings to appropriate error codes

This follows the established pattern in the codebase and addresses the previous review comments about transport error handling.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (6)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)

599-644: Job-only validation for suspend/resume operations.

Per the PR requirements, suspend and resume should only support Job objects, not Steps. The current implementation delegates validation to SuspendRunningTasks() and ResumeSuspendedTasks(), but explicit validation at the RPC handler level would make the Job-only restriction clearer and prevent potential issues if the architecture changes.

Based on PR requirements comment from L-Xiafeng.

src/Craned/Core/Craned.cpp (1)

633-647: Make RuntimePause and RuntimeResume optional to preserve backward compatibility.

The current implementation exits if RuntimePause or RuntimeResume are missing, but the Supervisor's ContainerInstance::Suspend/Resume methods provide fallback to SIGSTOP/SIGCONT when these commands are not configured. Making these fields required unnecessarily breaks existing deployments.

Apply this diff to make the fields optional:

-            if (container_config["RuntimePause"]) {
-              g_config.Container.RuntimePause =
-                  container_config["RuntimePause"].as<std::string>();
-            } else {
-              CRANE_ERROR("RuntimePause is not configured.");
-              std::exit(1);
-            }
-
-            if (container_config["RuntimeResume"]) {
-              g_config.Container.RuntimeResume =
-                  container_config["RuntimeResume"].as<std::string>();
-            } else {
-              CRANE_ERROR("RuntimeResume is not configured.");
-              std::exit(1);
-            }
+            g_config.Container.RuntimePause =
+                YamlValueOr(container_config["RuntimePause"], "");
+            g_config.Container.RuntimeResume =
+                YamlValueOr(container_config["RuntimeResume"], "");
src/CraneCtld/TaskScheduler.cpp (1)

1283-1299: Use real step id instead of 0 in StepStatusChangeAsync on suspend/resume failure

In both handle_failure lambdas (Lines 1283-1299 and 1426-1442), you call:

StepStatusChangeAsync(
    task_id, 0, craned_id, crane::grpc::TaskStatus::Failed,
    ExitCode::kExitCodeRpcError, "",
    google::protobuf::util::TimeUtil::GetCurrentTime());

Step id 0 does not correspond to any real step, so CleanTaskStatusChangeQueueCb_() drops these updates as “unknown step”, and the job’s in‑memory/DB status may never be transitioned to a final state if TerminateSteps RPCs fail. This is the same issue that was raised previously and is still present here.

You already use kDaemonStepId for failure fan‑out in TerminateTasksOnCraned and wire daemon steps specially in CleanTaskStatusChangeQueueCb_(), so reusing that here keeps the failure paths consistent.

Consider updating both suspend and resume failure paths like this:

-      for (const auto& craned_id : executing_nodes) {
-        StepStatusChangeAsync(
-            task_id, 0, craned_id, crane::grpc::TaskStatus::Failed,
-            ExitCode::kExitCodeRpcError, "",
-            google::protobuf::util::TimeUtil::GetCurrentTime());
-      }
+      auto now = google::protobuf::util::TimeUtil::GetCurrentTime();
+      for (const auto& craned_id : executing_nodes) {
+        StepStatusChangeAsync(
+            task_id, kDaemonStepId, craned_id,
+            crane::grpc::TaskStatus::Failed, ExitCode::kExitCodeRpcError, "",
+            now);
+      }

Apply the same change in both SuspendRunningTasks and ResumeSuspendedTasks.

Also applies to: 1426-1442

src/Craned/Supervisor/TaskManager.cpp (2)

1104-1225: Container-specific Suspend/Resume correctly avoid destructive Kill path

The new ContainerInstance::Suspend()/Resume() overrides use the OCI runtime state plus RuntimePause/RuntimeResume commands when the container is in "running"/"paused" states, and fall back to the base ITaskInstance::Suspend()/Resume() (process-group SIGSTOP/SIGCONT) only on parse errors, missing commands, or unexpected states.

This addresses the earlier design problem where suspend/resume went through ContainerInstance::Kill(SIGSTOP/SIGCONT) and triggered the full kill/delete workflow. Container workloads are now paused via the runtime instead of being torn down, with a reasonable best-effort signal fallback.

No additional issues from this change.


1622-1667: Suspend/Resume async pipeline is consistent, but consider guarding on step state

The new m_task_signal_async_handle_ + SuspendJobAsync/ResumeJobAsync + EvCleanTaskSignalQueueCb_ wiring matches the existing async patterns (terminate/time-limit) and looks sound: requests are funneled onto the UV loop, a single TaskSignalQueueElem carries the action and promise, and SuspendRunningTasks_/ResumeSuspendedTasks_ aggregate per-task results correctly (first non-success is preserved).

Two points worth tightening:

  1. Step state guard
    Unlike EvCleanChangeTaskTimeLimitQueueCb_ and EvCleanTerminateTaskQueueCb_, the suspend/resume path does not check g_runtime_status.CanStepOperate() or the step’s own status before iterating tasks. If a suspend arrives while the step is Completing or already finished, you’ll either:

    • get ERR_NON_EXISTENT from tasks with m_pid_ == 0, or
    • silently do nothing but still report SUCCESS when all tasks are gone.

    To make behaviour predictable and consistent with other operations, consider adding a guard near the top of SuspendRunningTasks_() and ResumeSuspendedTasks_() (or before dispatch in EvCleanTaskSignalQueueCb_()) that:

    • returns an explicit invalid-state error if the step is not Running/Suspended, or
    • short-circuits to SUCCESS when m_step_.AllTaskFinished().

    This also aligns with the prior suggestion to ensure suspends only happen from a Running state.

  2. Live-task iteration
    Iterating m_step_.task_ids and skipping nullptr from GetTaskInstance() is safe and avoids the previous at()-throw issue. If you ever want “only actually running tasks”, switching to iterate m_pid_task_id_map_ | std::views::values (similar to terminate) would more directly reflect instantiated tasks, but that’s an optional refinement.

None of this is blocking, but adding the state guard would make the suspend/resume semantics clearer and less error-prone under races.

Also applies to: 1835-1861, 2136-2209

src/CraneCtld/RpcService/CranedKeeper.cpp (1)

163-187: FreeJobs should take job IDs, not task IDs (type/name mismatch)

CranedStub::FreeJobs currently takes const std::vector<task_id_t>& task but fills FreeJobsRequest::job_id_list from it:

request.mutable_job_id_list()->Assign(task.begin(), task.end());

Semantically this RPC operates on job IDs, and a maintainer has already pointed out the expected type (vector<job_id_t>). Keeping task_id_t here is confusing and makes it easier to misuse the API from call sites.

Consider changing both the parameter type and name, e.g.:

-CraneErrCode CranedStub::FreeJobs(const std::vector<task_id_t>& task) {
+CraneErrCode CranedStub::FreeJobs(const std::vector<job_id_t>& job_ids) {
   ...
-  request.mutable_job_id_list()->Assign(task.begin(), task.end());
+  request.mutable_job_id_list()->Assign(job_ids.begin(), job_ids.end());

This aligns the stub’s signature with the proto and the PR’s focus on job-level operations.

🧹 Nitpick comments (2)
src/Craned/Supervisor/TaskManager.cpp (1)

112-126: GetTaskInstance now returns nullptr; ensure deref sites respect the new contract

Switching both StepInstance::GetTaskInstance overloads to find() and returning nullptr on miss is good for robustness; it removes the risk of at() throwing. However, several call sites (e.g. EvCleanSigchldQueueCb_ and EvCleanTaskStopQueueCb_) still immediately dereference the returned pointer without checks, relying on invariants about m_task_map_.

Consider either:

  • adding CRANE_ASSERT(task != nullptr) at those deref sites, or
  • doing a defensive null check with a log+early-return path,

to make the invariants explicit and failure mode clearer if the map gets out of sync in future refactors.

src/Craned/Core/CranedServer.cpp (1)

88-145: SuspendJobs/ResumeJobs job-level handling is sound; verify “job not found on this node” semantics

The new SuspendJobs/ResumeJobs handlers correctly:

  • Enforce ReadyFor(RequestSource::CTLD) like existing control RPCs.
  • Use g_job_mgr->GetAllocatedJobSteps() to discover per-job steps on this Craned.
  • Skip daemon steps (step_id == 0).
  • Call SuspendJob()/ResumeJob() on each supervisor stub and aggregate errors into reasons, returning ok=false plus a detailed reason string when any step fails.

One behavioural detail to double-check with the overall design: when a requested job_id is not present in job_steps, you currently treat this as an error (all_ok = false, reason "job X: Job not found"). That’s correct if Ctld only sends suspend/resume requests to nodes that are known to host the job and the absence indicates an inconsistency. However, if the RPC can reach nodes that never had or no longer have steps for that job (e.g., after cleanup), it may be more appropriate to log and ignore the missing job on that node instead of failing the whole RPC.

Please confirm that the Ctld-side CranedStub::SuspendJobs/ResumeJobs dispatch only targets nodes that still own allocated steps for the given jobs; if not, consider relaxing the “job not found” case to be non-fatal on that craned.

Also applies to: 146-201

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d881699 and 0f3ca65.

📒 Files selected for processing (21)
  • protos/Crane.proto (3 hunks)
  • protos/PublicDefs.proto (1 hunks)
  • protos/Supervisor.proto (3 hunks)
  • src/CraneCtld/RpcService/CranedKeeper.cpp (27 hunks)
  • src/CraneCtld/RpcService/CranedKeeper.h (7 hunks)
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp (1 hunks)
  • src/CraneCtld/TaskScheduler.cpp (1 hunks)
  • src/CraneCtld/TaskScheduler.h (1 hunks)
  • src/Craned/Core/Craned.cpp (1 hunks)
  • src/Craned/Core/CranedPublicDefs.h (1 hunks)
  • src/Craned/Core/CranedServer.cpp (14 hunks)
  • src/Craned/Core/CranedServer.h (1 hunks)
  • src/Craned/Core/JobManager.h (0 hunks)
  • src/Craned/Core/SupervisorKeeper.cpp (2 hunks)
  • src/Craned/Core/SupervisorKeeper.h (1 hunks)
  • src/Craned/Supervisor/Supervisor.cpp (1 hunks)
  • src/Craned/Supervisor/SupervisorPublicDefs.h (1 hunks)
  • src/Craned/Supervisor/SupervisorServer.cpp (2 hunks)
  • src/Craned/Supervisor/SupervisorServer.h (1 hunks)
  • src/Craned/Supervisor/TaskManager.cpp (6 hunks)
  • src/Craned/Supervisor/TaskManager.h (9 hunks)
💤 Files with no reviewable changes (1)
  • src/Craned/Core/JobManager.h
🚧 Files skipped from review as they are similar to previous changes (4)
  • protos/PublicDefs.proto
  • protos/Crane.proto
  • src/CraneCtld/TaskScheduler.h
  • src/Craned/Supervisor/SupervisorServer.h
🧰 Additional context used
🧠 Learnings (25)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/Craned/Core/Craned.cpp
  • src/Craned/Supervisor/TaskManager.h
  • src/Craned/Supervisor/SupervisorServer.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/Craned/Core/CranedServer.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.

Applied to files:

  • src/Craned/Core/Craned.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
  • src/Craned/Supervisor/SupervisorServer.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/Craned/Core/CranedServer.cpp
  • src/Craned/Core/CranedServer.h
  • src/Craned/Core/SupervisorKeeper.cpp
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
  • src/Craned/Supervisor/SupervisorServer.cpp
  • src/Craned/Core/SupervisorKeeper.h
  • src/Craned/Core/SupervisorKeeper.cpp
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
  • src/Craned/Supervisor/SupervisorServer.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/Craned/Core/CranedServer.cpp
  • src/Craned/Core/CranedServer.h
  • src/Craned/Core/SupervisorKeeper.cpp
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
  • src/Craned/Supervisor/SupervisorServer.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/Craned/Core/SupervisorKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-25T04:08:03.273Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/Supervisor.cpp:68-69
Timestamp: 2025-05-25T04:08:03.273Z
Learning: In the Crane supervisor component (src/Craned/Supervisor/), the supervisor process communicates with Craned through STDOUT using protobuf messages. The supervisor must not send any information to STDOUT before sending the "ready" message to Craned, as this would interfere with the inter-process communication protocol. Therefore, adding logging statements that might write to STDOUT before the ready message is sent could break the communication.

Applied to files:

  • src/Craned/Supervisor/SupervisorServer.cpp
  • src/Craned/Core/CranedServer.h
  • src/Craned/Core/SupervisorKeeper.cpp
📚 Learning: 2025-05-26T11:04:56.055Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:72-72
Timestamp: 2025-05-26T11:04:56.055Z
Learning: The CraneSched project prefers using global variables (like `g_supervisor_keeper`) over dependency injection patterns. The team does not follow dependency injection approaches for managing singleton instances.

Applied to files:

  • src/Craned/Supervisor/SupervisorServer.cpp
  • src/Craned/Core/SupervisorKeeper.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.

Applied to files:

  • src/Craned/Supervisor/SupervisorServer.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).

Applied to files:

  • src/Craned/Supervisor/SupervisorServer.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.

Applied to files:

  • src/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-05-02T07:05:26.012Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:601-602
Timestamp: 2025-05-02T07:05:26.012Z
Learning: In the CraneCtld codebase, the variables m_disconnected_ and m_registered_ in CranedStub class are already defined as std::atomic_bool, making them thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-27T06:47:51.751Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-27T11:52:31.017Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/Craned/Core/CranedServer.cpp
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/Craned/Core/SupervisorKeeper.cpp
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.

Applied to files:

  • src/Craned/Core/CranedServer.h
  • src/Craned/Core/SupervisorKeeper.cpp
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-02T09:30:13.014Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/JobManager.cpp:141-149
Timestamp: 2025-04-02T09:30:13.014Z
Learning: In JobManager, if a uid exists in m_uid_to_job_ids_map_, its corresponding task_ids set is guaranteed to be non-empty due to the invariant maintained in AllocJobs and FreeJobs methods.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-08-14T02:56:35.503Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 587
File: src/Craned/Supervisor/CforedClient.cpp:449-454
Timestamp: 2025-08-14T02:56:35.503Z
Learning: In CforedClient::AsyncSendRecvThread_(), the guard `if (state <= State::Registering) { continue; }` in the TIMEOUT branch only prevents premature cleanup when stopping before registration completes, but it doesn't block normal gRPC event processing. The completion queue will still deliver Prepare/Write/Read events that advance the state machine normally.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.cpp
🧬 Code graph analysis (13)
src/Craned/Core/Craned.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (163-163)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)
src/Craned/Supervisor/TaskManager.cpp (8)
  • Suspend (608-621)
  • Suspend (608-608)
  • Suspend (1104-1164)
  • Suspend (1104-1104)
  • Resume (623-636)
  • Resume (623-623)
  • Resume (1166-1225)
  • Resume (1166-1166)
src/CraneCtld/TaskScheduler.cpp (3)
src/CraneCtld/TaskScheduler.h (1)
  • TerminateRunningTask (843-851)
src/Craned/Core/JobManager.cpp (2)
  • StepStatusChangeAsync (1246-1255)
  • StepStatusChangeAsync (1246-1250)
src/Craned/Supervisor/CranedClient.cpp (2)
  • StepStatusChangeAsync (46-56)
  • StepStatusChangeAsync (46-48)
src/Craned/Supervisor/TaskManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (163-163)
src/Craned/Supervisor/TaskManager.h (2)
src/Utilities/PublicHeader/include/crane/Logger.h (1)
  • crane (200-212)
src/Craned/Supervisor/TaskManager.cpp (18)
  • Suspend (608-621)
  • Suspend (608-608)
  • Suspend (1104-1164)
  • Suspend (1104-1104)
  • Resume (623-636)
  • Resume (623-623)
  • Resume (1166-1225)
  • Resume (1166-1166)
  • SuspendJobAsync (1835-1847)
  • SuspendJobAsync (1835-1835)
  • ResumeJobAsync (1849-1861)
  • ResumeJobAsync (1849-1849)
  • EvCleanTaskSignalQueueCb_ (2136-2152)
  • EvCleanTaskSignalQueueCb_ (2136-2136)
  • SuspendRunningTasks_ (2154-2181)
  • SuspendRunningTasks_ (2154-2154)
  • ResumeSuspendedTasks_ (2183-2209)
  • ResumeSuspendedTasks_ (2183-2183)
src/Craned/Supervisor/SupervisorServer.cpp (1)
src/Craned/Core/SupervisorKeeper.cpp (4)
  • SuspendJob (93-111)
  • SuspendJob (93-93)
  • ResumeJob (113-131)
  • ResumeJob (113-113)
src/Craned/Supervisor/Supervisor.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (163-163)
src/CraneCtld/RpcService/CranedKeeper.h (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (17)
  • CranedStub (28-34)
  • Fini (36-38)
  • Fini (36-36)
  • l (403-403)
  • ConfigureCraned (40-67)
  • ConfigureCraned (40-41)
  • job_id_t (208-242)
  • SuspendJobs (270-297)
  • SuspendJobs (270-270)
  • ResumeJobs (299-326)
  • ResumeJobs (299-299)
  • CranedKeeper (365-386)
  • CranedKeeper (388-395)
  • CranedChannelConnFailNoLock_ (848-854)
  • CranedChannelConnFailNoLock_ (848-848)
  • ConnectCranedNode_ (758-846)
  • ConnectCranedNode_ (758-759)
src/Craned/Core/SupervisorKeeper.h (2)
src/Craned/Core/SupervisorKeeper.cpp (4)
  • SuspendJob (93-111)
  • SuspendJob (93-93)
  • ResumeJob (113-131)
  • ResumeJob (113-113)
src/Craned/Supervisor/SupervisorServer.cpp (4)
  • SuspendJob (95-109)
  • SuspendJob (95-98)
  • ResumeJob (111-125)
  • ResumeJob (111-114)
src/Craned/Core/CranedServer.cpp (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
  • SuspendJobs (270-297)
  • SuspendJobs (270-270)
  • ResumeJobs (299-326)
  • ResumeJobs (299-299)
src/Craned/Core/CranedServer.h (2)
src/Craned/Core/CranedServer.cpp (5)
  • SuspendJobs (88-144)
  • SuspendJobs (88-91)
  • ResumeJobs (146-201)
  • ResumeJobs (146-148)
  • CranedServer (474-502)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
  • SuspendJobs (270-297)
  • SuspendJobs (270-270)
  • ResumeJobs (299-326)
  • ResumeJobs (299-299)
src/Craned/Core/SupervisorKeeper.cpp (1)
src/Craned/Supervisor/SupervisorServer.cpp (4)
  • SuspendJob (95-109)
  • SuspendJob (95-98)
  • ResumeJob (111-125)
  • ResumeJob (111-114)
src/CraneCtld/RpcService/CranedKeeper.cpp (2)
src/Craned/Core/CranedServer.cpp (4)
  • SuspendJobs (88-144)
  • SuspendJobs (88-91)
  • ResumeJobs (146-201)
  • ResumeJobs (146-148)
src/CraneCtld/RpcService/CranedKeeper.h (1)
  • UpdateLastActiveTime (102-105)
🔇 Additional comments (18)
src/Craned/Core/CranedPublicDefs.h (1)

109-110: LGTM!

The addition of RuntimePause and RuntimeResume fields to ContainerConfig is straightforward and follows the existing pattern for other runtime command fields.

src/Craned/Supervisor/Supervisor.cpp (1)

110-111: LGTM!

The initialization of RuntimePause and RuntimeResume from the container config message is consistent with the existing pattern for other runtime commands.

src/Craned/Supervisor/SupervisorPublicDefs.h (1)

61-62: LGTM!

The addition of RuntimePause and RuntimeResume fields mirrors the changes in CranedPublicDefs.h and maintains consistency across the supervisor configuration.

src/Craned/Core/SupervisorKeeper.cpp (1)

93-131: LGTM!

The SuspendJob() and ResumeJob() RPC wrappers correctly handle both transport failures (returning ERR_RPC_FAILURE) and application-level errors (mapping specific error reasons to appropriate CraneErrCode values). The implementation follows the established pattern used by other RPC wrappers in this class.

src/Craned/Core/SupervisorKeeper.h (1)

39-40: LGTM!

The method declarations for SuspendJob() and ResumeJob() are consistent with the existing API design, returning CraneErrCode to indicate success or failure.

src/Craned/Supervisor/SupervisorServer.cpp (1)

95-125: LGTM!

The SuspendJob and ResumeJob RPC handlers correctly follow the established async pattern: they invoke the corresponding async methods on g_task_mgr, wait for completion, and populate the response payload with success/failure information. Returning Status::OK with error details in the response is consistent with CraneSched conventions.

Based on learnings.

protos/Supervisor.proto (2)

44-53: Container pause/resume commands fit existing ContainerConfig pattern

pause_cmd and resume_cmd mirror the existing state/run/kill/delete commands and keep numbering contiguous; no issues from the proto/API perspective.


127-139: Job-level suspend/resume RPCs are consistent with existing API

SuspendJob/ResumeJob request/reply shapes (ok + reason) and naming align with TerminateTaskReply and the “Job-only” semantics requested in the PR comments. The service methods match the message names cleanly; looks good.

Also applies to: 151-152

src/CraneCtld/TaskScheduler.cpp (1)

1237-1383: Suspend/resume orchestration and rollback logic are otherwise well-structured

Apart from the step-id issue above, the flows look solid:

  • Strict validation of task presence and status (Running vs Suspended).
  • Pre-check of all executing nodes’ online state before issuing RPCs.
  • Thread-pool–based broadcast of SuspendJobs/ResumeJobs with shared error aggregation.
  • On any per-node RPC failure, you terminate the job and propagate a uniform controller-side failure code.
  • On success, you update in-memory status and persist it with rollback on DB failure.

This matches prior review guidance on using thread pools for broadcast RPCs and keeping Ctld/craned state consistent.

Also applies to: 1385-1524

src/CraneCtld/RpcService/CranedKeeper.h (1)

37-43: CranedStub suspend/resume API and const-correct signatures look consistent

The tightened signatures (const references for IDs/tokens/containers) and the new SuspendJobs/ResumeJobs methods using std::vector<job_id_t> line up with the job-level suspend/resume flow implemented in CranedKeeper.cpp. Using CranedStub* inside CqTag and function callbacks is consistent with existing lifetime guarantees in CranedKeeper; no additional issues from the header side.

Also applies to: 47-52, 74-90, 115-116, 155-159, 166-167, 178-181, 184-185, 192-199, 200-205, 211-212, 225-244

src/Craned/Supervisor/TaskManager.h (2)

59-60: Per-instance suspend/resume hooks and status field align with new semantics

Adding StepInstance::status as a crane::grpc::TaskStatus and extending ITaskInstance/ContainerInstance with Suspend()/Resume() gives Supervisor a clean hook to implement job-level suspend/resume. The interface surface matches the implementations in TaskManager.cpp (process-group SIGSTOP/SIGCONT with container-specific overrides), and types are consistent with the rest of the codebase.

Also applies to: 164-177, 260-271, 277-282


364-367: TaskManager’s suspend/resume async queue is consistent with existing event model

SuspendJobAsync/ResumeJobAsync plus TaskSignalQueueElem, m_task_signal_async_handle_, and EvCleanTaskSignalQueueCb_() reuse the same uvw + ConcurrentQueue pattern as other TaskManager actions (execute, terminate, change time limit). Running SuspendRunningTasks_()/ResumeSuspendedTasks_() on the loop thread avoids extra synchronization around m_step_ and task instances. Behavior and threading look correct from the header’s point of view.

Also applies to: 389-392, 402-403, 407-408, 428-429

src/Craned/Supervisor/TaskManager.cpp (1)

608-636: Per-task Suspend/Resume implementation for non-container tasks looks correct

ITaskInstance::Suspend()/Resume() sending SIGSTOP/SIGCONT to -m_pid_ (the process group) matches the earlier setpgid(0, 0) setup, so the whole task group is stopped/resumed atomically. Returning ERR_NON_EXISTENT when m_pid_ == 0 is a reasonable way to signal “not spawned yet / already gone”.

I don’t see correctness issues here; behaviour aligns with typical job-control semantics.

src/Craned/Core/CranedServer.cpp (1)

32-61: Signature and loop-formatting cleanups preserve existing behaviour

The pointer-style updates on ServerContext/request/response parameters and the structured-binding for (const auto& [...]) loops in Configure, ExecuteSteps, TerminateSteps, TerminateOrphanedStep, AllocJobs, AllocSteps, FreeSteps, FreeJobs, QuerySshStepEnvVariables, ChangeJobTimeLimit, and StepStatusChange are purely stylistic. The underlying control-flow and request/response handling remain unchanged and consistent with existing patterns in this file.

No functional concerns from these changes.

Also applies to: 63-86, 203-226, 338-405, 407-472, 474-502

src/Craned/Core/CranedServer.h (1)

53-118: New SuspendJobs/ResumeJobs RPCs and signature cleanups look consistent

The added SuspendJobs/ResumeJobs declarations on CranedServiceImpl align with the implementations in CranedServer.cpp and the generated crane::grpc::Craned service. The pointer-style and const-correctness updates across the other RPC declarations, plus the CranedServer constructor signature tweak, are consistent formatting changes with no behavioural impact.

Header/implementation stay in sync; no issues here.

Also applies to: 120-123

src/CraneCtld/RpcService/CranedKeeper.cpp (3)

69-99: Job/step ID marshalling in CranedStub RPCs is consistent and correct

The refactors in:

  • TerminateSteps,
  • TerminateOrphanedSteps,
  • AllocJobs,
  • AllocSteps,
  • ExecuteSteps,
  • FreeSteps

all use the proto’s job_step_ids_map fields with mutable_job_step_ids_map() and Assign() over the std::set<step_id_t> collections. This matches the corresponding server-side handlers and avoids redundant copies.

These changes improve const-correctness on the input containers and keep the wire format unchanged. No issues here.

Also applies to: 100-133, 189-207, 209-242, 244-268


270-297: SuspendJobs/ResumeJobs stubs mirror existing patterns and look correct

CranedStub::SuspendJobs and CranedStub::ResumeJobs:

  • Build the respective request types and assign job_id_list from the provided std::vector<job_id_t>.
  • Set per-RPC deadlines using kCtldRpcTimeoutSeconds, consistent with other methods.
  • Map non-OK gRPC status to ERR_RPC_FAILURE via HandleGrpcErrorCode_, and update m_last_active_time_ on success.
  • Interpret reply.ok()/reply.reason() the same way as other control operations.

These wrappers are consistent with the existing Terminate* / Alloc* / Free* stubs; no functional issues spotted.

Also applies to: 299-326


525-622: CqTag-based state machines and connection management changes look coherent

The refactors to:

  • InitCranedStateMachine_,
  • EstablishedCranedStateMachine_,
  • CheckNodeTimeoutAndClean,
  • IsCranedConnected / GetCranedStub,
  • PutNodeIntoUnavailSet,
  • ConnectCranedNode_,
  • CranedChannelConnFailNoLock_,

all consistently use CqTag carrying a CranedStub*, and the m_clean_up_cb_ hookup allows CranedStub::Fini() to notify CranedKeeper when a channel fails during initial connection. The read/write locking patterns on m_connect_craned_mtx_ and m_unavail_craned_set_mtx_ are preserved, and the timeouts and connection-backoff behaviour are unchanged.

Aside from the StateMonitorThreadFunc_ tag handling noted separately, these changes look correct and maintain the existing channel-state semantics.

Also applies to: 624-691, 725-757, 758-846, 848-855

Comment on lines +391 to 392
for (auto& cq_thread : m_cq_thread_vec_) cq_thread.join();
if (m_period_connect_thread_.joinable()) m_period_connect_thread_.join();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Avoid dereferencing CQ tag pointer on TIMEOUT during shutdown

In CranedKeeper::StateMonitorThreadFunc_ you now use CqTag* tag and call:

next_status = m_cq_vec_[thread_id].AsyncNext((void**)&tag, &ok, ddl);
if (next_status == grpc::CompletionQueue::SHUTDOWN) break;

if (m_cq_closed_) {
  if (tag->type == CqTag::kInitializingCraned) delete tag->craned;
  continue;
}

if (next_status == grpc::CompletionQueue::TIMEOUT) continue;
if (next_status == grpc::CompletionQueue::GOT_EVENT) {
  ...
  if (CheckNodeTimeoutAndClean(tag)) continue;
  ...
  craned->m_channel_->NotifyOnStateChange(..., tag);
}

When AsyncNext returns TIMEOUT, the contents of tag are unspecified; dereferencing tag in the m_cq_closed_ block before the TIMEOUT check is therefore unsafe. If Shutdown() has been called (setting m_cq_closed_ = true) but the CQ returns a timeout, this can lead to undefined behaviour or crashes on shutdown.

A safer pattern is:

  • Only interpret tag as a valid CqTag* when next_status == GOT_EVENT.
  • Handle TIMEOUT and SHUTDOWN without touching tag.
  • Apply the m_cq_closed_ cleanup only in the GOT_EVENT branch.

For example:

-  CqTag* tag;
+  CqTag* tag = nullptr;
   ...
-  next_status = m_cq_vec_[thread_id].AsyncNext((void**)&tag, &ok, ddl);
-  if (next_status == grpc::CompletionQueue::SHUTDOWN) break;
-
-  if (m_cq_closed_) {
-    if (tag->type == CqTag::kInitializingCraned) delete tag->craned;
-    continue;
-  }
-
-  if (next_status == grpc::CompletionQueue::TIMEOUT) continue;
-  if (next_status == grpc::CompletionQueue::GOT_EVENT) {
+  next_status = m_cq_vec_[thread_id].AsyncNext((void**)&tag, &ok, ddl);
+
+  if (next_status == grpc::CompletionQueue::SHUTDOWN) break;
+  if (next_status == grpc::CompletionQueue::TIMEOUT) continue;
+
+  if (next_status == grpc::CompletionQueue::GOT_EVENT) {
+    if (m_cq_closed_) {
+      if (tag->type == CqTag::kInitializingCraned) delete tag->craned;
+      m_tag_sync_allocator_->delete_object(tag);
+      continue;
+    }
@@
-      auto* craned = tag->craned;
+      auto* craned = tag->craned;
@@
-      if (next_tag) {
+      if (next_tag) {
         if (!m_cq_closed_) {
           if (CheckNodeTimeoutAndClean(tag)) continue;
           ...

(Exact cleanup behaviour for the closed case can be adapted to your needs, but the key is to never dereference tag on TIMEOUT.)

This change keeps the state machine logic intact while eliminating a shutdown-time race/UB risk.

Also applies to: 424-439, 693-718


I'll verify the review comment by examining the actual code in the repository.


---

🏁 Script executed:

```shell
#!/bin/bash
# Locate and examine CranedKeeper.cpp
fd -t f "CranedKeeper.cpp" | head -20

Length of output: 41



---

🏁 Script executed:

```shell
#!/bin/bash
# Examine the lines mentioned in the review comment
echo "=== Lines 391-392 ==="
sed -n '385,400p' src/CraneCtld/RpcService/CranedKeeper.cpp

echo ""
echo "=== Lines 424-439 ==="
sed -n '418,445p' src/CraneCtld/RpcService/CranedKeeper.cpp

echo ""
echo "=== Lines 693-718 ==="
sed -n '687,725p' src/CraneCtld/RpcService/CranedKeeper.cpp

Length of output: 2754



---

🏁 Script executed:

```shell
#!/bin/bash
# Get broader context around lines 693-718 to see if similar pattern exists
echo "=== Context around lines 693-718 (looking for StateMonitorThreadFunc_ or similar) ==="
sed -n '650,750p' src/CraneCtld/RpcService/CranedKeeper.cpp | head -100

Length of output: 3082



---

🏁 Script executed:

```shell
#!/bin/bash
# Search for all AsyncNext calls and surrounding context to find similar patterns
echo "=== All AsyncNext occurrences in CranedKeeper.cpp ==="
rg -n "AsyncNext" src/CraneCtld/RpcService/CranedKeeper.cpp -B 5 -A 20

Length of output: 1048


Unsafe tag pointer deref on TIMEOUT confirmed, but scope reference is incorrect

The unsafe pattern is confirmed at lines 424-439 in StateMonitorThreadFunc_, where tag->type is dereferenced in the m_cq_closed_ block (line 434) before checking if next_status == TIMEOUT (line 438). Per gRPC API contract, tag contents are unspecified on TIMEOUT, making this a genuine shutdown-time race condition that can cause undefined behavior or crashes.

However, the review comment's reference to "also applies to: 693-718" is incorrect—those lines contain state machine logic without similar AsyncNext patterns. The unsafe dereference pattern exists only at lines 424-439.

The fix recommended in the review comment (moving m_cq_closed_ cleanup into the GOT_EVENT branch after TIMEOUT and SHUTDOWN checks) is the correct approach.

🤖 Prompt for AI Agents
In src/CraneCtld/RpcService/CranedKeeper.cpp around lines 424-439,
StateMonitorThreadFunc_ currently dereferences tag->type inside the m_cq_closed_
block before checking if next_status == TIMEOUT, which is unsafe because gRPC
leaves tag contents unspecified on TIMEOUT; move the m_cq_closed_ cleanup logic
out of the branch that may run on TIMEOUT and into the GOT_EVENT branch after
you check for TIMEOUT and SHUTDOWN so that tag is only dereferenced when
next_status == GOT_EVENT (and therefore valid); ensure TIMEOUT and SHUTDOWN are
checked first, skip any tag access on TIMEOUT, and only perform tag-based
cleanup when next_status indicates a real event.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/Craned/Supervisor/TaskManager.cpp (1)

868-873: ContainerInstance::Cleanup is defined twice and the new one doesn’t return.

There are now two CraneErrCode ContainerInstance::Cleanup() definitions (Lines 868–873 and 1284–1289). The second one also omits a return, which is UB for a non-void function.

You need a single, merged implementation that both:

  • tears down pod/container as before, and
  • asynchronously deletes the temporary bundle directory.

For example, fold the new logic into the existing definition and remove the duplicate:

CraneErrCode ContainerInstance::Cleanup() {
-  // For container tasks, Kill() is idempotent.
-  // It's ok to call it (at most) twice to remove pod and container.
-  CraneErrCode err = Kill(0);
-  return err;
+  // Delete temporary bundle directory if present.
+  if ((m_parent_step_inst_->IsBatch() || m_parent_step_inst_->IsCrun()) &&
+      !m_temp_path_.empty()) {
+    g_thread_pool->detach_task(
+        [p = m_temp_path_]() { util::os::DeleteFolders(p); });
+  }
+
+  // For container tasks, Kill() is idempotent; safe to call during cleanup.
+  return Kill(0);
}

and drop the later Cleanup definition entirely.

Also applies to: 1284-1289

♻️ Duplicate comments (3)
protos/Supervisor.proto (1)

122-135: Supervisor suspend/resume RPCs align with single-job supervisor model

SuspendJob/ResumeJob and their replies are minimal and job-scoped, consistent with the existing Supervisor API where the instance already knows its job/step. Naming matches prior “Task→Job” feedback and the PR goal of job-level operations.

Also applies to: 140-148

src/CraneCtld/TaskScheduler.cpp (1)

1297-1313: Avoid StepStatusChangeAsync with step_id == 0 in failure handlers

In both suspend and resume failure handlers you call:

StepStatusChangeAsync(task_id, 0, craned_id, crane::grpc::TaskStatus::Failed,
                      ExitCode::kExitCodeRpcError, "", timestamp);

CleanTaskStatusChangeQueueCb_ looks up steps by real step IDs (daemon/primary/common) and logs Ignoring unknown step when the ID doesn’t match any step. Using 0 here thus just produces warnings without contributing to cleanup, which is already handled via TerminateRunningTaskNoLock_ and the cancel queue.

Consider either:

  • Switching to kDaemonStepId if you really want an explicit failure status from this path, or
  • Dropping this extra StepStatusChangeAsync loop and relying solely on TerminateRunningTask for cleanup and status propagation.

Functionality is currently correct, but cleaning this up would reduce log noise and redundant status events.

Also applies to: 1440-1456

src/CraneCtld/RpcService/CranedKeeper.cpp (1)

476-580: Still dereferencing CQ tag on TIMEOUT when m_cq_closed_ is set.

This is the same issue noted previously: AsyncNext leaves tag unspecified on NextStatus::TIMEOUT, but the m_cq_closed_ branch dereferences tag->type before you check for TIMEOUT. If Shutdown() has set m_cq_closed_ = true while the CQ returns TIMEOUT, this is undefined behavior.

You should only touch tag when next_status == GOT_EVENT, e.g.:

-  bool ok;
-  CqTag* tag;
+  bool ok;
+  CqTag* tag = nullptr;
   ...
-    next_status = m_cq_vec_[thread_id].AsyncNext((void**)&tag, &ok, ddl);
-
-    // If Shutdown() is called, return immediately.
-    if (next_status == grpc::CompletionQueue::SHUTDOWN) break;
-
-    if (m_cq_closed_) {
-      if (tag->type == CqTag::kInitializingCraned) delete tag->craned;
-      continue;
-    }
-
-    if (next_status == grpc::CompletionQueue::TIMEOUT) continue;
-    if (next_status == grpc::CompletionQueue::GOT_EVENT) {
+    next_status = m_cq_vec_[thread_id].AsyncNext((void**)&tag, &ok, ddl);
+
+    if (next_status == grpc::CompletionQueue::SHUTDOWN) break;
+    if (next_status == grpc::CompletionQueue::TIMEOUT) continue;
+
+    if (next_status == grpc::CompletionQueue::GOT_EVENT) {
+      if (m_cq_closed_) {
+        if (tag->type == CqTag::kInitializingCraned) delete tag->craned;
+        m_tag_sync_allocator_->delete_object(tag);
+        continue;
+      }
       // existing state-machine logic …

so that tag is never dereferenced on TIMEOUT.

In gRPC C++'s `grpc::CompletionQueue::AsyncNext`, what are the guarantees about the `tag` pointer when `NextStatus::TIMEOUT` is returned? Is it valid to dereference `tag` in that case, or is the tag value unspecified?
🧹 Nitpick comments (4)
src/Craned/Supervisor/TaskManager.h (1)

454-457: TaskManager async suspend/resume wiring matches existing event-loop patterns

The new job-level APIs (SuspendJobAsync / ResumeJobAsync), TaskSignalQueueElem queue, m_task_signal_async_handle_, EvCleanTaskSignalQueueCb_, and SuspendRunningTasks_ / ResumeSuspendedTasks_ all follow the established uvw + ConcurrentQueue + promise pattern used elsewhere in TaskManager. They correctly:

  • Enqueue suspend/resume actions with a promise and trigger the async handle.
  • Drain the queue in EvCleanTaskSignalQueueCb_ and route to the appropriate internal method.
  • Apply suspend/resume across all m_step_.task_ids via the task instances.

Assuming m_task_signal_async_handle_ is initialized and bound to EvCleanTaskSignalQueueCb_ in TaskManager.cpp (as with the other async handles), this looks solid.

Also applies to: 479-482, 498-505, 531-533

src/Craned/Supervisor/TaskManager.cpp (2)

1161-1221: Container suspend/resume logic is reasonable with OCI-first then POSIX fallback.

The new ContainerInstance::Suspend/Resume methods correctly:

  • query runtime state via RuntimeState,
  • call RuntimePause/RuntimeResume when the container is in running/paused,
  • log and fall back to ITaskInstance::Suspend/Resume when the runtime path fails or is misconfigured.

This avoids the previous destructive Kill(SIGSTOP/SIGCONT) behavior and keeps container semantics aligned with processes.

For readability you might prefer explicit JSON conversion:

status = jret["status"].get<std::string>();

instead of relying on status = std::move(jret["status"]);, but behavior is fine as-is.


1706-1711: Suspend/resume async pipeline wiring looks consistent; consider state guards.

The new m_task_signal_async_handle_, SuspendJobAsync/ResumeJobAsync, EvCleanTaskSignalQueueCb_, and SuspendRunningTasks_/ResumeSuspendedTasks_ follow the same queue + uvw-async pattern used for terminate/time-limit changes, and they now aggregate per-task errors instead of short-circuiting, which is good.

Two behavioral points to consider:

  • SuspendRunningTasks_/ResumeSuspendedTasks_ return SUCCESS even if no task instance is actually suspended/resumed (e.g., all IDs missing from m_task_map_). If you want to surface “nothing to operate on” as an error, track whether at least one task was acted on and return an appropriate code when not.
  • Per earlier reviewer feedback (“m_step_.status首先应该是Running”), you may want to add a simple status guard (e.g. check g_runtime_status.Status or m_step_.IsRunning() / “is suspended”) before iterating, similar to how EvCleanChangeTaskTimeLimitQueueCb_ uses CanStepOperate(). That would avoid silently “succeeding” on already-completing/completed steps.

If the current behavior is intentional, no functional change is strictly required.

Also applies to: 1899-1925, 2217-2233, 2235-2290

src/CraneCtld/RpcService/CranedKeeper.h (1)

76-77: Minor naming/type mismatch in FreeJobs parameter.

FreeJobs takes const std::vector<task_id_t>& task, but the RPC it drives uses a job_id_list. For consistency with AllocJobs/SuspendJobs/ResumeJobs and the “job-level” semantics, consider changing this to const std::vector<job_id_t>& job_ids (and rename the variable) in both declaration and definition.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0f3ca65 and df67cee.

📒 Files selected for processing (14)
  • protos/Crane.proto (3 hunks)
  • protos/PublicDefs.proto (1 hunks)
  • protos/Supervisor.proto (2 hunks)
  • src/CraneCtld/RpcService/CranedKeeper.cpp (27 hunks)
  • src/CraneCtld/RpcService/CranedKeeper.h (7 hunks)
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp (1 hunks)
  • src/CraneCtld/TaskScheduler.cpp (1 hunks)
  • src/CraneCtld/TaskScheduler.h (1 hunks)
  • src/Craned/Core/CranedServer.cpp (14 hunks)
  • src/Craned/Core/CranedServer.h (2 hunks)
  • src/Craned/Core/JobManager.h (0 hunks)
  • src/Craned/Supervisor/SupervisorServer.h (1 hunks)
  • src/Craned/Supervisor/TaskManager.cpp (6 hunks)
  • src/Craned/Supervisor/TaskManager.h (6 hunks)
💤 Files with no reviewable changes (1)
  • src/Craned/Core/JobManager.h
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/Craned/Supervisor/SupervisorServer.h
  • protos/PublicDefs.proto
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
🧰 Additional context used
🧠 Learnings (23)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 625
File: src/CraneCtld/TaskScheduler.cpp:2811-2874
Timestamp: 2025-09-22T07:13:20.635Z
Learning: In the CraneSched codebase, there are latch mechanisms used to ensure variable lifetime safety when capturing references in detached thread pool tasks, allowing reference captures that might otherwise be unsafe.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/CraneCtld/TaskScheduler.h
  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Supervisor/TaskManager.h
  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/CraneCtld/TaskScheduler.h
  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Supervisor/TaskManager.h
  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.

Applied to files:

  • src/CraneCtld/TaskScheduler.h
  • src/CraneCtld/RpcService/CranedKeeper.h
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
  • src/Craned/Core/CranedServer.h
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Core/CranedServer.h
  • src/Craned/Core/CranedServer.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/Craned/Core/CranedServer.h
  • src/Craned/Core/CranedServer.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-27T11:52:31.017Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.

Applied to files:

  • src/Craned/Core/CranedServer.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:05:26.012Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:601-602
Timestamp: 2025-05-02T07:05:26.012Z
Learning: In the CraneCtld codebase, the variables m_disconnected_ and m_registered_ in CranedStub class are already defined as std::atomic_bool, making them thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-27T06:47:51.751Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-02T09:30:13.014Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/JobManager.cpp:141-149
Timestamp: 2025-04-02T09:30:13.014Z
Learning: In JobManager, if a uid exists in m_uid_to_job_ids_map_, its corresponding task_ids set is guaranteed to be non-empty due to the invariant maintained in AllocJobs and FreeJobs methods.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-08-14T02:56:35.503Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 587
File: src/Craned/Supervisor/CforedClient.cpp:449-454
Timestamp: 2025-08-14T02:56:35.503Z
Learning: In CforedClient::AsyncSendRecvThread_(), the guard `if (state <= State::Registering) { continue; }` in the TIMEOUT branch only prevents premature cleanup when stopping before registration completes, but it doesn't block normal gRPC event processing. The completion queue will still deliver Prepare/Write/Read events that advance the state machine normally.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.cpp
🧬 Code graph analysis (8)
src/CraneCtld/TaskScheduler.h (1)
src/CraneCtld/TaskScheduler.cpp (4)
  • SuspendRunningTasks (1251-1397)
  • SuspendRunningTasks (1251-1252)
  • ResumeSuspendedTasks (1399-1538)
  • ResumeSuspendedTasks (1399-1400)
src/CraneCtld/TaskScheduler.cpp (4)
src/CraneCtld/TaskScheduler.h (1)
  • TerminateRunningTask (849-857)
src/Craned/Core/JobManager.cpp (2)
  • StepStatusChangeAsync (1231-1240)
  • StepStatusChangeAsync (1231-1235)
src/Craned/Core/CtldClient.cpp (2)
  • StepStatusChangeAsync (564-573)
  • StepStatusChangeAsync (564-565)
src/Craned/Supervisor/CranedClient.cpp (2)
  • StepStatusChangeAsync (46-56)
  • StepStatusChangeAsync (46-48)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (18)
  • Suspend (606-619)
  • Suspend (606-606)
  • Suspend (1161-1221)
  • Suspend (1161-1161)
  • Resume (621-634)
  • Resume (621-621)
  • Resume (1223-1282)
  • Resume (1223-1223)
  • SuspendJobAsync (1899-1911)
  • SuspendJobAsync (1899-1899)
  • ResumeJobAsync (1913-1925)
  • ResumeJobAsync (1913-1913)
  • EvCleanTaskSignalQueueCb_ (2217-2233)
  • EvCleanTaskSignalQueueCb_ (2217-2217)
  • SuspendRunningTasks_ (2235-2262)
  • SuspendRunningTasks_ (2235-2235)
  • ResumeSuspendedTasks_ (2264-2290)
  • ResumeSuspendedTasks_ (2264-2264)
src/Craned/Supervisor/TaskManager.cpp (3)
src/Craned/Supervisor/TaskManager.h (3)
  • GetTaskInstance (133-136)
  • GetTaskInstance (138-141)
  • m_pid_ (378-378)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (156-156)
src/Utilities/PublicHeader/OS.cpp (2)
  • DeleteFolders (73-82)
  • DeleteFolders (73-73)
src/Craned/Core/CranedServer.h (2)
src/Craned/Core/CranedServer.cpp (5)
  • SuspendJobs (92-148)
  • SuspendJobs (92-95)
  • ResumeJobs (150-205)
  • ResumeJobs (150-152)
  • CranedServer (640-668)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
  • SuspendJobs (272-299)
  • SuspendJobs (272-272)
  • ResumeJobs (301-328)
  • ResumeJobs (301-301)
src/Craned/Core/CranedServer.cpp (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
  • SuspendJobs (272-299)
  • SuspendJobs (272-272)
  • ResumeJobs (301-328)
  • ResumeJobs (301-301)
src/CraneCtld/RpcService/CranedKeeper.h (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (54)
  • CranedStub (29-35)
  • Fini (37-39)
  • Fini (37-37)
  • l (460-460)
  • ConfigureCraned (41-68)
  • ConfigureCraned (41-42)
  • AllocJobs (135-162)
  • AllocJobs (135-136)
  • FreeJobs (164-188)
  • FreeJobs (164-164)
  • AllocSteps (190-208)
  • AllocSteps (190-191)
  • job_id_t (210-244)
  • ExecuteSteps (211-212)
  • FreeSteps (246-270)
  • FreeSteps (246-247)
  • SuspendJobs (272-299)
  • SuspendJobs (272-272)
  • ResumeJobs (301-328)
  • ResumeJobs (301-301)
  • TerminateSteps (70-99)
  • TerminateSteps (70-71)
  • TerminateOrphanedSteps (101-133)
  • TerminateOrphanedSteps (101-102)
  • CranedKeeper (422-443)
  • CranedKeeper (445-452)
  • Shutdown (454-474)
  • Shutdown (454-454)
  • AvailableCranedCount (777-780)
  • AvailableCranedCount (777-777)
  • IsCranedConnected (782-785)
  • IsCranedConnected (782-782)
  • GetCranedStub (787-794)
  • GetCranedStub (787-788)
  • SetCranedConnectedCb (796-799)
  • SetCranedConnectedCb (796-797)
  • SetCranedDisconnectedCb (801-803)
  • SetCranedDisconnectedCb (801-801)
  • PutNodeIntoUnavailSet (805-813)
  • PutNodeIntoUnavailSet (805-806)
  • CranedChannelConnFailNoLock_ (905-911)
  • CranedChannelConnFailNoLock_ (905-905)
  • ConnectCranedNode_ (815-903)
  • ConnectCranedNode_ (815-816)
  • InitCranedStateMachine_ (582-679)
  • InitCranedStateMachine_ (582-583)
  • EstablishedCranedStateMachine_ (681-748)
  • EstablishedCranedStateMachine_ (681-682)
  • CheckNodeTimeoutAndClean (750-775)
  • CheckNodeTimeoutAndClean (750-750)
  • StateMonitorThreadFunc_ (476-580)
  • StateMonitorThreadFunc_ (476-476)
  • PeriodConnectCranedThreadFunc_ (913-949)
  • PeriodConnectCranedThreadFunc_ (913-913)
src/CraneCtld/RpcService/CranedKeeper.cpp (2)
src/Craned/Core/CranedServer.cpp (4)
  • SuspendJobs (92-148)
  • SuspendJobs (92-95)
  • ResumeJobs (150-205)
  • ResumeJobs (150-152)
src/CraneCtld/RpcService/CranedKeeper.h (1)
  • UpdateLastActiveTime (108-111)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (15)
src/CraneCtld/TaskScheduler.h (1)

762-765: Suspend/Resume TaskScheduler APIs look consistent and well‑scoped

The new SuspendRunningTasks / ResumeSuspendedTasks APIs match the implementations in TaskScheduler.cpp and expose a clear per‑task error vector for higher layers. No issues from the header side.

protos/Crane.proto (1)

143-161: Job-level suspend/resume protobufs and Craned RPCs are well designed

The new SuspendJobs*/ResumeJobs* messages, TargetAttributes.Suspend/Resume enum values, and Craned.SuspendJobs/ResumeJobs RPCs are all job-centric and consistent with existing naming and control patterns. The wire format remains backward-compatible and matches the scheduler-side usage.

Also applies to: 249-271, 1093-1094

src/CraneCtld/TaskScheduler.cpp (1)

1251-1397: Suspend/Resume workflows are logically sound and consistent with scheduler design

The new SuspendRunningTasks / ResumeSuspendedTasks implementations correctly:

  • Guard access to m_running_task_map_ with the existing mutex and validate task state (RunningSuspended) before acting.
  • Require all executing craneds to be online before issuing suspend/resume, and treat any offline craned as an RPC failure.
  • Broadcast SuspendJobs / ResumeJobs via the thread pool with BlockingCounter, aggregating per-node failures into a single failure_code.
  • On any RPC failure, terminate the task via TerminateRunningTask to avoid mixed states across nodes.
  • On success, update in‑memory status and persist to the embedded DB, rolling back the in‑memory status if persistence fails to keep RAM/DB consistent.

This matches prior resource‑lifecycle patterns in TaskScheduler and should behave robustly under partial failures.

Also applies to: 1399-1538

src/Craned/Supervisor/TaskManager.h (1)

217-223: Base and container-specific Suspend/Resume contracts are coherent

Adding default ITaskInstance::Suspend/Resume (SIGSTOP/SIGCONT on the process group) and overriding them in ContainerInstance to use OCI runtime pause/resume with a fallback to the base behavior is a clean extension of the existing abstraction. Header and cpp implementations are consistent.

Also applies to: 252-257

src/Craned/Supervisor/TaskManager.cpp (1)

606-634: Base ITaskInstance::Suspend/Resume look correct.

Sending SIGSTOP/SIGCONT to the process group via kill(-m_pid_, …) with clear success/error codes is consistent with existing Kill semantics and provides a clean default for non-container tasks.

src/CraneCtld/RpcService/CranedKeeper.h (2)

36-42: CranedStub suspend/resume APIs now clearly job-centric.

The new SuspendJobs(const std::vector<job_id_t>&) and ResumeJobs(const std::vector<job_id_t>&) methods, plus the consistent use of job_id_t in their signatures, match the requirement that suspend/resume operate on Job objects, not steps. Deleting copy/move on CranedStub also keeps ownership clear.

Also applies to: 74-88


121-139: Const-ref and callback signature cleanups look good.

The switch to const references for CranedId/RegToken, the guarded optional token field, and the updated callback types (std::function<void(CranedId, const RegToken&)>) improve clarity and avoid unnecessary copies without changing behavior.

Also applies to: 144-145, 172-188, 190-191, 200-211, 217-239

src/Craned/Core/CranedServer.cpp (2)

92-205: Suspend/Resume job RPC handlers are well-aligned with Supervisor layer.

  • You gate on ReadyFor(CTLD) and return UNAVAILABLE with a reason, matching existing CTLD-facing RPCs.
  • For each requested job, you:
    • look up allocated steps via GetAllocatedJobSteps(),
    • ignore daemon steps (step_id == 0),
    • fan out to per-step SupervisorKeeper::GetStub(job_id, step_id) and invoke SuspendJob/ResumeJob,
    • aggregate per-step errors into reasons and set ok only when all steps succeed.

This matches the “job-level” suspend/resume requirement while still operating per-step internally and avoids touching daemon steps that have no actual tasks.


46-65: Other RPC signature and loop updates are mechanical and safe.

The pointer-style signature adjustments (ServerContext* context, const request/response pointers) and the use of structured bindings / range-based loops in ExecuteSteps, TerminateSteps, TerminateOrphanedStep, FreeSteps, QuerySshStepEnvVariables, and StepStatusChange are all stylistic; they preserve existing logic and error-handling behavior.

Also applies to: 67-90, 207-230, 232-340, 342-363, 364-393, 395-410, 412-436, 438-464, 624-638

src/Craned/Core/CranedServer.h (2)

53-78: CranedServiceImpl interface correctly exposes SuspendJobs/ResumeJobs.

The new SuspendJobs and ResumeJobs declarations (with ServerContext* and const request/response pointers) match the implementations in CranedServer.cpp and integrate cleanly with the existing CTLD-oriented RPC set.


89-109: Signature and constructor tweaks are consistent with implementation.

Switching RPCs like AllocJobs/FreeJobs/AllocSteps/FreeSteps/ChangeJobTimeLimit to use ServerContext* context and taking CranedListenConf by const reference in the constructor simply aligns the header with the definitions; no behavior change, and the API remains clear.

Also applies to: 126-129

src/CraneCtld/RpcService/CranedKeeper.cpp (4)

272-299: CranedStub Suspend/Resume job RPC helpers are correct.

Both helpers follow the existing RPC pattern:

  • Build SuspendJobsRequest/ResumeJobsRequest with job_id_list from the provided std::vector<job_id_t>.
  • Apply a standard deadline, call the stub, handle non-OK Status as ERR_RPC_FAILURE (with HandleGrpcErrorCode_ + UpdateLastActiveTime()).
  • Interpret reply.ok() vs reply.reason() as success vs ERR_GENERIC_FAILURE.

This matches the new Craned-side RPCs and the desired job-level suspend/resume semantics.

Also applies to: 301-328


422-443: CranedKeeper ctor/destructor/Shutdown changes are mechanical.

Using std::pmr::synchronized_pool_resource, allocating m_tag_sync_allocator_, computing thread_num via std::bit_ceil, and the updated joins/Shutdown loop are consistent with prior behavior; the only changes are pointer/loop style and don’t affect semantics.

Also applies to: 445-474, 459-465


582-679: Channel state-machine helpers and timeout cleanup remain consistent.

InitCranedStateMachine_, EstablishedCranedStateMachine_, and CheckNodeTimeoutAndClean all use the updated CqTag fields (CranedStub* craned) and continue to manage connected/connecting sets, tokens, and callbacks as before. The timeout path now deletes the tag inside CheckNodeTimeoutAndClean, which is correctly accounted for by the caller’s if (CheckNodeTimeoutAndClean(tag)) continue; logic.

Also applies to: 681-748, 750-775


782-813: Connection management const-ref and cleanup callback wiring look good.

  • IsCranedConnected/GetCranedStub now take const CranedId&, and the connected callback uses std::function<void(CranedId, const RegToken&)>, which reduces copying.
  • ConnectCranedNode_ takes CranedId const& and const RegToken&, caches resolved IPs, creates the channel, and sets m_clean_up_cb_ to CranedChannelConnFailNoLock_, which assumes m_connect_craned_mtx_ is already held (as enforced by its AssertHeld()).
  • The cleanup callback itself just decrements the channel count and erases the craned ID from the connecting set under the lock.

All of this matches the surrounding locking and lifecycle expectations.

Also applies to: 815-903, 905-911

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (1)

480-496: Unsafe CQ tag dereference on TIMEOUT; move closed-queue cleanup under GOT_EVENT

On TIMEOUT, tag is unspecified. You dereference tag in the m_cq_closed_ branch before checking TIMEOUT, risking UB during shutdown. Initialize tag, check TIMEOUT first, and only touch tag inside GOT_EVENT. Example fix:

-  bool ok;
-  CqTag* tag;
+  bool ok;
+  CqTag* tag = nullptr;
   grpc::CompletionQueue::NextStatus next_status;
@@
-  next_status = m_cq_vec_[thread_id].AsyncNext((void**)&tag, &ok, ddl);
-
-  // If Shutdown() is called, return immediately.
-  if (next_status == grpc::CompletionQueue::SHUTDOWN) break;
-
-  if (m_cq_closed_) {
-    if (tag->type == CqTag::kInitializingCraned) delete tag->craned;
-    continue;
-  }
-
-  if (next_status == grpc::CompletionQueue::TIMEOUT) continue;
-  if (next_status == grpc::CompletionQueue::GOT_EVENT) {
+  next_status = m_cq_vec_[thread_id].AsyncNext((void**)&tag, &ok, ddl);
+  if (next_status == grpc::CompletionQueue::SHUTDOWN) break;
+  if (next_status == grpc::CompletionQueue::TIMEOUT) continue;
+  if (next_status == grpc::CompletionQueue::GOT_EVENT) {
+    if (m_cq_closed_) {
+      if (tag->type == CqTag::kInitializingCraned) delete tag->craned;
+      m_tag_sync_allocator_->delete_object(tag);
+      continue;
+    }
     // ... existing logic ...

Also applies to: 503-506, 517-524

♻️ Duplicate comments (1)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)

640-662: Resume branch integrates cleanly

Per-task result handling and messages match ResumeSuspendedTasks semantics.

🧹 Nitpick comments (7)
src/CraneCtld/RpcService/CranedKeeper.cpp (2)

272-299: Map reply.reason to specific error codes (parity with SupervisorStub)

Currently both helpers collapse declines to ERR_GENERIC_FAILURE. Improve fidelity by mapping known reasons (e.g., ERR_INVALID_PARAM, ERR_NON_EXISTENT) to their codes, matching SupervisorStub’s pattern. Minimal change:

-  if (reply.ok()) return CraneErrCode::SUCCESS;
-  CRANE_WARN("SuspendJobs RPC for Node {} declined: {}", m_craned_id_, reply.reason());
-  return CraneErrCode::ERR_GENERIC_FAILURE;
+  if (reply.ok()) return CraneErrCode::SUCCESS;
+  CRANE_WARN("SuspendJobs RPC for Node {} declined: {}", m_craned_id_, reply.reason());
+  if (reply.reason() == CraneErrStr(CraneErrCode::ERR_INVALID_PARAM)) return CraneErrCode::ERR_INVALID_PARAM;
+  if (reply.reason() == CraneErrStr(CraneErrCode::ERR_NON_EXISTENT))  return CraneErrCode::ERR_NON_EXISTENT;
+  return CraneErrCode::ERR_GENERIC_FAILURE;

Apply similarly to ResumeJobs.

Also applies to: 301-328


164-176: Rename FreeJobs param for clarity

Parameter is named task but represents job IDs. Rename to job_ids to avoid confusion.

-CraneErrCode CranedStub::FreeJobs(const std::vector<task_id_t>& task) {
+CraneErrCode CranedStub::FreeJobs(const std::vector<task_id_t>& job_ids) {
@@
-  request.mutable_job_id_list()->Assign(task.begin(), task.end());
+  request.mutable_job_id_list()->Assign(job_ids.begin(), job_ids.end());
src/CraneCtld/TaskScheduler.h (1)

762-766: Consider documenting job-only contract and result ordering

Add brief comments that inputs are Job IDs (not Steps) and results align 1:1 by index. Helps callers and future maintainers.

src/Craned/Supervisor/TaskManager.cpp (3)

2005-2021: EvCleanTaskSignalQueueCb_ correctly drains queue but lacks operation guard

The dispatcher correctly drains m_task_signal_queue_, dispatches to suspend/resume, and fulfills the promise once per element. However, unlike termination/time-limit handlers, it does not check g_runtime_status.CanStepOperate() or step completion before acting, so suspend/resume may silently no-op on already-completing/finished steps.

You might align this with EvCleanTerminateTaskQueueCb_ / EvCleanChangeTaskTimeLimitQueueCb_ by:

  • Early-returning or re-enqueueing when !g_runtime_status.CanStepOperate() or m_step_.AllTaskFinished(), and
  • Returning an appropriate CraneErrCode instead of SUCCESS in those cases.

This keeps suspend/resume semantics consistent with other step-level operations.


2023-2050: Iterate over live task instances instead of m_step_.task_ids in SuspendRunningTasks_

m_step_.task_ids represents the static task-id list, while m_step_.GetTaskIds() iterates the keys of m_task_map_ (i.e., currently instantiated tasks). Using task_ids here may include tasks that were never started or already cleaned up; you guard with if (task) so it won’t crash, but it does extra lookups and diverges from how termination uses GetTaskIds().

Consider:

-  for (const auto& task_id : m_step_.task_ids) {
+  for (auto task_id : m_step_.GetTaskIds()) {
     auto* task = m_step_.GetTaskInstance(task_id);

This keeps suspend behavior aligned with the termination path and guarantees you only touch live instances.


2052-2078: Mirror live-task iteration and error aggregation in ResumeSuspendedTasks_

Same observation as SuspendRunningTasks_: iterating m_step_.task_ids will probe tasks that may not be instantiated anymore; the if (task) guard prevents crashes, but it’s less precise than iterating live tasks. Error aggregation strategy (result holds last failure) is acceptable given no state flip occurs here.

Apply the same pattern as suggested for SuspendRunningTasks_:

-  for (const auto& task_id : m_step_.task_ids) {
+  for (auto task_id : m_step_.GetTaskIds()) {
     auto* task = m_step_.GetTaskInstance(task_id);

This keeps resume behavior consistent with other lifecycle handlers.

src/Craned/Core/CranedServer.cpp (1)

92-148: SuspendJobs/ResumeJobs implement job-level control as intended; consider error semantics

The new Suspend/Resume RPCs:

  • Enforce ReadyFor(RequestSource::CTLD) with UNAVAILABLE status, consistent with nearby RPCs.
  • Operate strictly at job granularity (job_id_list()), derive local steps via GetAllocatedJobSteps(), and fan out to per-step Supervisor stubs while skipping daemon step step_id == 0, matching the “job-only” requirement.
  • Accumulate human-readable reasons and expose aggregate ok plus reason in the reply.

Behavior is functionally correct, but note that per-project learnings generally prefer returning Status::OK even on errors and carrying failure information in the reply to avoid disconnecting clients; these methods reuse the existing UNAVAILABLE-on-not-ready pattern instead. Based on learnings.

If you want to fully align with that convention, you could in a future change:

  • Always return Status::OK, and
  • For the “not ready” case, set ok = false and a descriptive reason instead of returning UNAVAILABLE.

Also applies to: 150-205

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between df67cee and 6ffd32e.

📒 Files selected for processing (14)
  • protos/Crane.proto (3 hunks)
  • protos/PublicDefs.proto (1 hunks)
  • protos/Supervisor.proto (2 hunks)
  • src/CraneCtld/RpcService/CranedKeeper.cpp (27 hunks)
  • src/CraneCtld/RpcService/CranedKeeper.h (7 hunks)
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp (1 hunks)
  • src/CraneCtld/TaskScheduler.cpp (1 hunks)
  • src/CraneCtld/TaskScheduler.h (1 hunks)
  • src/Craned/Core/CranedServer.cpp (14 hunks)
  • src/Craned/Core/CranedServer.h (2 hunks)
  • src/Craned/Core/JobManager.h (0 hunks)
  • src/Craned/Supervisor/SupervisorServer.h (1 hunks)
  • src/Craned/Supervisor/TaskManager.cpp (3 hunks)
  • src/Craned/Supervisor/TaskManager.h (5 hunks)
💤 Files with no reviewable changes (1)
  • src/Craned/Core/JobManager.h
🚧 Files skipped from review as they are similar to previous changes (2)
  • protos/Supervisor.proto
  • src/CraneCtld/TaskScheduler.cpp
🧰 Additional context used
🧠 Learnings (23)
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/Craned/Core/CranedServer.h
  • src/Craned/Core/CranedServer.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/TaskScheduler.h
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/Craned/Supervisor/TaskManager.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/Craned/Core/CranedServer.h
  • src/Craned/Core/CranedServer.cpp
  • src/Craned/Supervisor/SupervisorServer.h
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/Craned/Core/CranedServer.h
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/Craned/Supervisor/TaskManager.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-27T11:52:31.017Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.

Applied to files:

  • src/Craned/Core/CranedServer.cpp
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/SupervisorServer.h
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/TaskScheduler.h
  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/Craned/Supervisor/TaskManager.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.

Applied to files:

  • src/CraneCtld/TaskScheduler.h
  • src/CraneCtld/RpcService/CranedKeeper.h
📚 Learning: 2025-05-25T04:08:03.273Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/Supervisor.cpp:68-69
Timestamp: 2025-05-25T04:08:03.273Z
Learning: In the Crane supervisor component (src/Craned/Supervisor/), the supervisor process communicates with Craned through STDOUT using protobuf messages. The supervisor must not send any information to STDOUT before sending the "ready" message to Craned, as this would interfere with the inter-process communication protocol. Therefore, adding logging statements that might write to STDOUT before the ready message is sent could break the communication.

Applied to files:

  • src/Craned/Supervisor/SupervisorServer.h
📚 Learning: 2025-05-02T07:05:26.012Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:601-602
Timestamp: 2025-05-02T07:05:26.012Z
Learning: In the CraneCtld codebase, the variables m_disconnected_ and m_registered_ in CranedStub class are already defined as std::atomic_bool, making them thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-27T06:47:51.751Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.h
  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-04-02T09:30:13.014Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/JobManager.cpp:141-149
Timestamp: 2025-04-02T09:30:13.014Z
Learning: In JobManager, if a uid exists in m_uid_to_job_ids_map_, its corresponding task_ids set is guaranteed to be non-empty due to the invariant maintained in AllocJobs and FreeJobs methods.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-08-14T02:56:35.503Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 587
File: src/Craned/Supervisor/CforedClient.cpp:449-454
Timestamp: 2025-08-14T02:56:35.503Z
Learning: In CforedClient::AsyncSendRecvThread_(), the guard `if (state <= State::Registering) { continue; }` in the TIMEOUT branch only prevents premature cleanup when stopping before registration completes, but it doesn't block normal gRPC event processing. The completion queue will still deliver Prepare/Write/Read events that advance the state machine normally.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.

Applied to files:

  • src/CraneCtld/RpcService/CranedKeeper.cpp
🧬 Code graph analysis (7)
src/Craned/Core/CranedServer.h (2)
src/Craned/Core/CranedServer.cpp (5)
  • SuspendJobs (92-148)
  • SuspendJobs (92-95)
  • ResumeJobs (150-205)
  • ResumeJobs (150-152)
  • CranedServer (640-668)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
  • SuspendJobs (272-299)
  • SuspendJobs (272-272)
  • ResumeJobs (301-328)
  • ResumeJobs (301-301)
src/Craned/Core/CranedServer.cpp (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (5)
  • ExecuteSteps (211-212)
  • SuspendJobs (272-299)
  • SuspendJobs (272-272)
  • ResumeJobs (301-328)
  • ResumeJobs (301-301)
src/CraneCtld/TaskScheduler.h (1)
src/CraneCtld/TaskScheduler.cpp (4)
  • SuspendRunningTasks (1251-1397)
  • SuspendRunningTasks (1251-1252)
  • ResumeSuspendedTasks (1399-1538)
  • ResumeSuspendedTasks (1399-1400)
src/Craned/Supervisor/SupervisorServer.h (2)
src/Craned/Core/SupervisorKeeper.cpp (4)
  • SuspendJob (93-111)
  • SuspendJob (93-93)
  • ResumeJob (113-131)
  • ResumeJob (113-113)
src/Craned/Supervisor/SupervisorServer.cpp (4)
  • SuspendJob (95-109)
  • SuspendJob (95-98)
  • ResumeJob (111-125)
  • ResumeJob (111-114)
src/CraneCtld/RpcService/CranedKeeper.h (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (54)
  • CranedStub (29-35)
  • Fini (37-39)
  • Fini (37-37)
  • l (460-460)
  • ConfigureCraned (41-68)
  • ConfigureCraned (41-42)
  • AllocJobs (135-162)
  • AllocJobs (135-136)
  • FreeJobs (164-188)
  • FreeJobs (164-164)
  • AllocSteps (190-208)
  • AllocSteps (190-191)
  • job_id_t (210-244)
  • ExecuteSteps (211-212)
  • FreeSteps (246-270)
  • FreeSteps (246-247)
  • SuspendJobs (272-299)
  • SuspendJobs (272-272)
  • ResumeJobs (301-328)
  • ResumeJobs (301-301)
  • TerminateSteps (70-99)
  • TerminateSteps (70-71)
  • TerminateOrphanedSteps (101-133)
  • TerminateOrphanedSteps (101-102)
  • CranedKeeper (422-443)
  • CranedKeeper (445-452)
  • Shutdown (454-474)
  • Shutdown (454-454)
  • AvailableCranedCount (777-780)
  • AvailableCranedCount (777-777)
  • IsCranedConnected (782-785)
  • IsCranedConnected (782-782)
  • GetCranedStub (787-794)
  • GetCranedStub (787-788)
  • SetCranedConnectedCb (796-799)
  • SetCranedConnectedCb (796-797)
  • SetCranedDisconnectedCb (801-803)
  • SetCranedDisconnectedCb (801-801)
  • PutNodeIntoUnavailSet (805-813)
  • PutNodeIntoUnavailSet (805-806)
  • CranedChannelConnFailNoLock_ (905-911)
  • CranedChannelConnFailNoLock_ (905-905)
  • ConnectCranedNode_ (815-903)
  • ConnectCranedNode_ (815-816)
  • InitCranedStateMachine_ (582-679)
  • InitCranedStateMachine_ (582-583)
  • EstablishedCranedStateMachine_ (681-748)
  • EstablishedCranedStateMachine_ (681-682)
  • CheckNodeTimeoutAndClean (750-775)
  • CheckNodeTimeoutAndClean (750-750)
  • StateMonitorThreadFunc_ (476-580)
  • StateMonitorThreadFunc_ (476-476)
  • PeriodConnectCranedThreadFunc_ (913-949)
  • PeriodConnectCranedThreadFunc_ (913-913)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (10)
  • SuspendJobAsync (1687-1699)
  • SuspendJobAsync (1687-1687)
  • ResumeJobAsync (1701-1713)
  • ResumeJobAsync (1701-1701)
  • EvCleanTaskSignalQueueCb_ (2005-2021)
  • EvCleanTaskSignalQueueCb_ (2005-2005)
  • SuspendRunningTasks_ (2023-2050)
  • SuspendRunningTasks_ (2023-2023)
  • ResumeSuspendedTasks_ (2052-2078)
  • ResumeSuspendedTasks_ (2052-2052)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
src/Craned/Core/CranedServer.cpp (8)
  • FreeJobs (395-410)
  • FreeJobs (395-397)
  • FreeSteps (376-393)
  • FreeSteps (376-378)
  • SuspendJobs (92-148)
  • SuspendJobs (92-95)
  • ResumeJobs (150-205)
  • ResumeJobs (150-152)
src/Craned/Core/JobManager.cpp (4)
  • FreeJobs (240-275)
  • FreeJobs (240-240)
  • FreeSteps (305-332)
  • FreeSteps (305-306)
src/CraneCtld/RpcService/CranedKeeper.h (1)
  • UpdateLastActiveTime (108-111)
src/Craned/Core/CtldClient.h (1)
  • UpdateLastActiveTime (193-195)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (17)
protos/PublicDefs.proto (1)

120-123: TaskStatus::Suspended added — OK

Enum addition looks consistent with new flows. Ensure any UI/query filters include Suspended.

protos/Crane.proto (3)

143-161: Suspend/Resume messages are minimal and clear

Job-granularity clarified in comments; fields match usage sites.


254-256: ModifyTask TargetAttributes extended — OK

Suspend/Resume entries align with Ctld handler branches.


1092-1095: Craned service: new SuspendJobs/ResumeJobs RPCs — OK

Surface matches CranedServer implementation and Ctld caller expectations.

src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)

617-639: Suspend branch integrates cleanly

Per-task result handling and messages look correct relative to TaskScheduler::SuspendRunningTasks.

src/Craned/Supervisor/TaskManager.cpp (2)

1494-1499: Async handle for task-signal queue is consistent with existing uvw pattern

Initialization and callback wiring for m_task_signal_async_handle_ mirror other async handles in this class; no behavioral concerns.


1687-1713: SuspendJobAsync/ResumeJobAsync queuing pattern looks correct

Promise/future creation, move into TaskSignalQueueElem, enqueue, and async signal follow the same pattern as other async RPC paths; lifetime and ownership look sound.

src/Craned/Supervisor/TaskManager.h (1)

454-457: Public SuspendJobAsync/ResumeJobAsync surface aligns with async pattern

The new public methods return std::future<CraneErrCode> and match how other async operations (ExecuteTaskAsync, ChangeTaskTimeLimitAsync) are exposed; signatures look appropriate for callers via SupervisorServer.

src/Craned/Core/CranedServer.cpp (2)

36-44: Signature/loop normalizations are cosmetic and consistent

Changes here are limited to pointer spacing (grpc::ServerContext*), use of const auto& in range-for loops, and minor formatting; they don’t alter behavior and keep style consistent across RPC handlers and server construction.

Also applies to: 46-50, 67-75, 207-215, 232-240, 342-370, 376-382, 395-401, 412-420, 438-446, 624-638, 640-668


430-431: Range-for in QuerySshStepEnvVariables now uses const-reference pair

Using const auto& [name, value] when copying the env map slightly reduces copying and is idiomatic; no functional concerns.

src/Craned/Core/CranedServer.h (1)

53-60: Header RPC declarations match implementation and normalize style

All RPC signatures now consistently take grpc::ServerContext* and const pointer request/response types, and the new SuspendJobs/ResumeJobs declarations align with the implementations in CranedServer.cpp. CranedServer’s constructor signature tweak is purely stylistic.

Also applies to: 61-69, 71-77, 79-83, 85-87, 89-103, 105-109, 110-124, 128-129

src/CraneCtld/RpcService/CranedKeeper.h (5)

38-45: CranedStub lifetime and token handling improvements look solid

Marking the constructor explicit and deleting copy/move operations matches the ownership expectations for stubs managed by CranedKeeper. SetRegToken now takes a const reference and CheckToken adds logging on mismatch while remaining mutex-guarded, which should help diagnose connection/token issues without changing semantics.

Also applies to: 47-52


74-80: New CranedStub::SuspendJobs/ResumeJobs APIs align with RPC layer

The stub now exposes SuspendJobs and ResumeJobs taking const std::vector<job_id_t>&, which matches the corresponding RPC implementations in the .cpp and the CTLD-level job-centric semantics. This is consistent with existing Alloc/Free/Execute/Terminate methods.

Also applies to: 86-88


121-145: Internal CranedStub members and cleanup callback remain consistent with keeper usage

m_craned_keeper_ pointer, m_clean_up_cb_ (now typed as std::function<void(CranedStub*)>), and the atomic/disconnected flags align with how CranedKeeper’s state machine and cleanup paths use them in the .cpp snippets; no issues spotted.


160-167: CranedKeeper interface tightening (const refs, deleted copy/move) is appropriate

Deleting copy/move for CranedKeeper and taking const CranedId& / const RegToken& in getters and callbacks (IsCranedConnected, GetCranedStub, SetCranedConnectedCb, PutNodeIntoUnavailSet) improves safety and avoids unnecessary copies. Signatures match the corresponding definitions in CranedKeeper.cpp.

Also applies to: 170-172, 184-191


194-213: State-machine helper signatures and callbacks are consistent with implementation

CqTag now stores a CranedStub*, and helpers like CranedChannelConnectFail_, CranedChannelConnFailNoLock_, ConnectCranedNode_, InitCranedStateMachine_, EstablishedCranedStateMachine_, and CheckNodeTimeoutAndClean all take strongly typed CranedStub*/CranedId/RegToken& parameters. This matches the .cpp snippets and clarifies the ownership model for channel-state transitions and connection lifecycle.

Also applies to: 217-250

src/Craned/Supervisor/SupervisorServer.h (1)

63-71: LGTM! New suspend/resume RPC handlers follow established patterns.

The method declarations are correctly structured and consistent with other gRPC service methods in this class. The implementations (visible in the relevant code snippets) properly return Status::OK and delegate to the task manager's async methods, which aligns with the system's error handling approach.

Note: The naming uses "Job" terminology (SuspendJob/ResumeJob) rather than "Task" (like TerminateTask), which aligns with the PR comment requirement that these operations should only support Job objects.

Comment on lines 221 to 223
virtual CraneErrCode Suspend();
virtual CraneErrCode Resume();
virtual CraneErrCode Cleanup() = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Virtual Suspend/Resume declared but no visible definitions for base or ContainerInstance override

ITaskInstance now declares virtual CraneErrCode Suspend(); and Resume();, and ContainerInstance overrides them, but there are no corresponding definitions in this header or in TaskManager.cpp as provided. Since TaskManager::SuspendRunningTasks_() / ResumeSuspendedTasks_() invoke task->Suspend() / task->Resume(), this will lead to undefined references at link time unless these methods are defined in another translation unit.

To make this safe and explicit:

class ITaskInstance {
  …
-  virtual CraneErrCode Prepare() = 0;
-  virtual CraneErrCode Spawn() = 0;
-  virtual CraneErrCode Kill(int signum) = 0;
-  virtual CraneErrCode Suspend();
-  virtual CraneErrCode Resume();
-  virtual CraneErrCode Cleanup() = 0;
+  virtual CraneErrCode Prepare() = 0;
+  virtual CraneErrCode Spawn() = 0;
+  virtual CraneErrCode Kill(int signum) = 0;
+  // Provide a concrete default or make these pure virtual and implement in all subclasses.
+  virtual CraneErrCode Suspend();
+  virtual CraneErrCode Resume();
+  virtual CraneErrCode Cleanup() = 0;

and then add concrete implementations (e.g., in TaskManager.cpp) for:

  • CraneErrCode ITaskInstance::Suspend()/Resume() (default behavior or error), and
  • CraneErrCode ContainerInstance::Suspend()/Resume() (container-specific behavior), plus ProcInstance::Suspend()/Resume() if processes also support suspend/resume.

If these definitions already exist in another file, ensure they’re committed and that their signatures match these declarations exactly.

Also applies to: 255-257

🤖 Prompt for AI Agents
In src/Craned/Supervisor/TaskManager.h around lines 221-223 (and similarly
255-257), ITaskInstance now declares virtual CraneErrCode Suspend() and Resume()
and ContainerInstance/ProcInstance override them but no definitions are present,
causing potential undefined references; add concrete definitions in the
corresponding .cpp (e.g., TaskManager.cpp or a new TaskInstance.cpp): implement
ITaskInstance::Suspend() and ITaskInstance::Resume() with a safe default return
(e.g., a not-supported error or success), and implement
ContainerInstance::Suspend()/Resume() and ProcInstance::Suspend()/Resume() with
the container/process-specific suspend/resume logic; ensure the function
signatures exactly match the header, compile into the same target, and commit
the new/updated .cpp file so linkers find the symbols.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
src/Craned/Supervisor/TaskManager.cpp (1)

2023-2050: Add status check before suspending tasks.

The method should verify that the step is in a Running state before attempting to suspend tasks. Without this check, the method might attempt to suspend tasks in Pending, Completing, or other invalid states, leading to unexpected behavior.

Apply this diff to add the status check:

 CraneErrCode TaskManager::SuspendRunningTasks_() {
   CRANE_DEBUG("Suspending all running tasks");
 
+  if (g_runtime_status.Status.load() != StepStatus::Running) {
+    CRANE_DEBUG("Step is not in Running state, cannot suspend.");
+    return CraneErrCode::ERR_INVALID_PARAM;
+  }
+
   // Daemon steps don't have actual task instances to suspend

Based on learnings.

src/CraneCtld/TaskScheduler.cpp (1)

1297-1313: Use a real step id (kDaemonStepId) instead of 0 in StepStatusChangeAsync.

handle_failure in both suspend and resume paths calls:

  • StepStatusChangeAsync(task_id, 0, craned_id, ...)

CleanTaskStatusChangeQueueCb_ looks up a step by step_id (daemon or common step) and drops unknown steps as “Ignoring unknown step…”, so step id 0 will never be matched. That means these synthetic failure notifications won’t drive job cleanup: resources remain allocated and m_running_task_map_ may keep the job indefinitely if no other status change arrives. This is the same issue previously raised for the suspend/resume failure path.

To align with existing failure injection (see TerminateTasksOnCraned, which uses kDaemonStepId), use the daemon step id here as well:

-      for (const auto& craned_id : executing_nodes) {
-        StepStatusChangeAsync(
-            task_id, 0, craned_id, crane::grpc::TaskStatus::Failed,
-            ExitCode::EC_RPC_ERR, "",
-            google::protobuf::util::TimeUtil::GetCurrentTime());
-      }
+      for (const auto& craned_id : executing_nodes) {
+        StepStatusChangeAsync(
+            task_id, kDaemonStepId, craned_id,
+            crane::grpc::TaskStatus::Failed,
+            ExitCode::EC_RPC_ERR, "",
+            google::protobuf::util::TimeUtil::GetCurrentTime());
+      }

Apply the same change in both SuspendRunningTasks and ResumeSuspendedTasks.

Based on learnings

Also applies to: 1440-1456

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6ffd32e and 29201ec.

📒 Files selected for processing (3)
  • src/CraneCtld/TaskScheduler.cpp (1 hunks)
  • src/Craned/Supervisor/TaskManager.cpp (3 hunks)
  • src/Craned/Supervisor/TaskManager.h (6 hunks)
🧰 Additional context used
🧠 Learnings (12)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
🧬 Code graph analysis (2)
src/CraneCtld/TaskScheduler.cpp (3)
src/CraneCtld/TaskScheduler.h (1)
  • TerminateRunningTask (849-857)
src/Craned/Core/JobManager.cpp (2)
  • StepStatusChangeAsync (1231-1240)
  • StepStatusChangeAsync (1231-1235)
src/Craned/Core/CtldClient.cpp (2)
  • StepStatusChangeAsync (564-573)
  • StepStatusChangeAsync (564-565)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (10)
  • SuspendJobAsync (1687-1699)
  • SuspendJobAsync (1687-1687)
  • ResumeJobAsync (1701-1713)
  • ResumeJobAsync (1701-1701)
  • EvCleanTaskSignalQueueCb_ (2005-2021)
  • EvCleanTaskSignalQueueCb_ (2005-2005)
  • SuspendRunningTasks_ (2023-2050)
  • SuspendRunningTasks_ (2023-2023)
  • ResumeSuspendedTasks_ (2052-2078)
  • ResumeSuspendedTasks_ (2052-2052)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (7)
src/Craned/Supervisor/TaskManager.h (4)

255-257: Ensure ContainerInstance suspend/resume uses CRI pause/unpause APIs.

Container suspend/resume should leverage CRI runtime pause/unpause capabilities rather than POSIX signals (SIGSTOP/SIGCONT). Verify that the implementations call the appropriate CRI client methods (e.g., PauseContainer/UnpauseContainer) instead of routing through Kill(), which performs destructive teardown operations.

Based on learnings.


479-483: LGTM!

The TaskSignalQueueElem structure is properly defined with the Action enum and promise field, addressing the previous concern about missing type definitions.


454-457: LGTM: Async API design follows existing patterns.

The suspend/resume API design is consistent with the existing async patterns in TaskManager (e.g., ChangeTaskTimeLimitAsync). The use of promise/future, concurrent queue, and uvw async handle ensures thread-safe operation.

Also applies to: 500-500, 505-506, 533-534


221-222: Based on my thorough investigation, I found a critical issue that requires your verification:

ITaskInstance::Suspend() and Resume() lack visible implementations, yet code calls them polymorphically.

Key findings:

  1. ITaskInstance base class (lines 221-222): Declares virtual CraneErrCode Suspend(); and virtual CraneErrCode Resume(); as non-pure virtual (no = 0), meaning they require implementations.

  2. ContainerInstance (lines 255-256): Declares CraneErrCode Suspend() override; and CraneErrCode Resume() override; but no implementations found in the codebase.

  3. ProcInstance (lines 315-382): Does NOT declare or override Suspend/Resume, so it depends entirely on base class implementations.

  4. Runtime calls (TaskManager.cpp:2038, 2067): Code calls task->Suspend() and task->Resume() polymorphically on ITaskInstance pointers.

Despite exhaustive searches across all .cpp files in src/Craned/Supervisor/, no implementations with opening braces were found for either the base class or derived classes.

This should cause linker errors if implementations are genuinely missing. Since the repository apparently builds successfully, implementations likely exist but weren't detected. They may be:

  • In a separate source file outside src/Craned/Supervisor/
  • Defined through macro expansion
  • Provided as inline empty stubs elsewhere
src/Craned/Supervisor/TaskManager.cpp (3)

1494-1498: LGTM!

The async handle initialization follows the established pattern used for other async operations in the TaskManager constructor.


1687-1713: LGTM!

Both SuspendJobAsync() and ResumeJobAsync() correctly implement the async pattern: enqueue a signal element with a promise, trigger the async handle, and return a future. The implementation is thread-safe and consistent with other async methods.


2005-2021: LGTM!

The queue processing callback correctly dispatches to the appropriate suspend/resume helper and fulfills the promise with the result. The default case properly handles unexpected action values.

Comment on lines +1261 to +1277
if (iter == m_running_task_map_.end()) {
CRANE_TRACE("Task #{} not in Rn queue for suspend", task_id);
results.emplace_back(CraneErrCode::ERR_INVALID_PARAM);
continue;
}

TaskInCtld* task = iter->second.get();
if (task->Status() == crane::grpc::Suspended) {
CRANE_TRACE("Task #{} already suspended", task_id);
results.emplace_back(CraneErrCode::ERR_INVALID_PARAM);
continue;
}
if (task->Status() != crane::grpc::Running) {
CRANE_TRACE("Task #{} is not running (status {}) for suspend", task_id,
static_cast<int>(task->Status()));
results.emplace_back(CraneErrCode::ERR_INVALID_PARAM);
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Align error codes for “task not found” vs. “bad state” between suspend and resume.

In SuspendRunningTasks a missing task in m_running_task_map_ returns ERR_INVALID_PARAM, while ResumeSuspendedTasks returns ERR_NON_EXISTENT for the same condition. Both functions also use ERR_INVALID_PARAM for “wrong current status” (already suspended / not running).

For consistency with other APIs (e.g. ChangeTaskTimeLimit) and clearer client behavior, it would be better to:

  • Use ERR_NON_EXISTENT when the task id is not present.
  • Reserve ERR_INVALID_PARAM for “task exists but is in an incompatible state”.

That implies changing the suspend branch for iter == end() to return ERR_NON_EXISTENT.

Also applies to: 1409-1420

🤖 Prompt for AI Agents
In src/CraneCtld/TaskScheduler.cpp around lines 1261-1277 (and similarly for
lines 1409-1420), the branch that handles a missing task in m_running_task_map_
currently returns CraneErrCode::ERR_INVALID_PARAM; change that to
CraneErrCode::ERR_NON_EXISTENT to match ResumeSuspendedTasks and other APIs,
while leaving the branches that detect an existing task in an incompatible state
(already suspended / not running) returning ERR_INVALID_PARAM unchanged; apply
the same change in the second location (1409-1420).

Comment on lines +2052 to +2078
CraneErrCode TaskManager::ResumeSuspendedTasks_() {
CRANE_DEBUG("Resuming all suspended tasks");

// Daemon steps don't have actual task instances to resume
// Note: This should not be reached as daemon steps are filtered at RPC level
if (m_step_.IsDaemonStep()) {
CRANE_DEBUG("Daemon step has no tasks to resume, skipping.");
return CraneErrCode::SUCCESS;
}

CraneErrCode result = CraneErrCode::SUCCESS;

for (const auto& task_id : m_step_.task_ids) {
auto* task = m_step_.GetTaskInstance(task_id);
if (task) {
CraneErrCode err = task->Resume();
if (err != CraneErrCode::SUCCESS) {
CRANE_ERROR("Failed to resume task #{}: {}", task_id, CraneErrStr(err));
result = err;
} else {
CRANE_DEBUG("Task #{} resumed successfully", task_id);
}
}
}

return result;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Add status check before resuming tasks.

Similar to suspend, the resume operation should verify that the step is in an appropriate state (e.g., Suspended) before attempting to resume tasks. This prevents attempting to resume tasks that are not suspended or are in other invalid states.

Apply this diff to add the status check:

 CraneErrCode TaskManager::ResumeSuspendedTasks_() {
   CRANE_DEBUG("Resuming all suspended tasks");
 
+  if (g_runtime_status.Status.load() != StepStatus::Suspended) {
+    CRANE_DEBUG("Step is not in Suspended state, cannot resume.");
+    return CraneErrCode::ERR_INVALID_PARAM;
+  }
+
   // Daemon steps don't have actual task instances to resume

Based on learnings.

🤖 Prompt for AI Agents
In src/Craned/Supervisor/TaskManager.cpp around lines 2052 to 2078, add a state
check before resuming tasks: verify the step is in the Suspended state (e.g.,
via m_step_.GetStatus() == StepStatus::Suspended or the project’s equivalent
accessor) and if not, log a debug/error and return an appropriate CraneErrCode
(e.g., CraneErrCode::INVALID_STATE or CraneErrCode::SUCCESS if you prefer a
no-op) instead of iterating over task_ids; only enter the loop and call
task->Resume() when the step is confirmed Suspended.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
src/Craned/Supervisor/TaskManager.cpp (2)

2108-2134: Missing state validation before resume operation.

Similar to suspend, this method should verify that the step is in Suspended state before attempting to resume tasks.

Consider adding a state check:

 CraneErrCode TaskManager::ResumeSuspendedTasks_() {
   CRANE_DEBUG("Resuming all suspended tasks");
 
+  if (g_runtime_status.Status.load() != StepStatus::Suspended) {
+    CRANE_DEBUG("Step is not in Suspended state (current: {}), cannot resume.",
+                g_runtime_status.Status.load());
+    return CraneErrCode::ERR_INVALID_PARAM;
+  }
+
   // Daemon steps don't have actual task instances to resume

Based on learnings.


2079-2106: Missing state validation before suspend operation.

The method doesn't verify that the step is in Running state before attempting to suspend tasks. While daemon steps are filtered, other invalid states (e.g., Completing, Pending) should also be checked to prevent unexpected behavior.

Consider adding a state check at the beginning:

 CraneErrCode TaskManager::SuspendRunningTasks_() {
   CRANE_DEBUG("Suspending all running tasks");
 
+  if (g_runtime_status.Status.load() != StepStatus::Running) {
+    CRANE_DEBUG("Step is not in Running state (current: {}), cannot suspend.",
+                g_runtime_status.Status.load());
+    return CraneErrCode::ERR_INVALID_PARAM;
+  }
+
   // Daemon steps don't have actual task instances to suspend

Based on learnings.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5f32f58 and 4cb0364.

📒 Files selected for processing (2)
  • src/Craned/Supervisor/TaskManager.cpp (5 hunks)
  • src/Craned/Supervisor/TaskManager.h (7 hunks)
🧰 Additional context used
🧠 Learnings (12)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
🧬 Code graph analysis (2)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (18)
  • Suspend (1393-1405)
  • Suspend (1393-1393)
  • Suspend (1432-1444)
  • Suspend (1432-1432)
  • Resume (1407-1419)
  • Resume (1407-1407)
  • Resume (1446-1458)
  • Resume (1446-1446)
  • SuspendJobAsync (1743-1755)
  • SuspendJobAsync (1743-1743)
  • ResumeJobAsync (1757-1769)
  • ResumeJobAsync (1757-1757)
  • EvCleanTaskSignalQueueCb_ (2061-2077)
  • EvCleanTaskSignalQueueCb_ (2061-2061)
  • SuspendRunningTasks_ (2079-2106)
  • SuspendRunningTasks_ (2079-2079)
  • ResumeSuspendedTasks_ (2108-2134)
  • ResumeSuspendedTasks_ (2108-2108)
src/Craned/Supervisor/TaskManager.cpp (1)
src/Craned/Supervisor/TaskManager.h (3)
  • Suspend (255-255)
  • m_pid_ (380-380)
  • Resume (256-256)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (8)
src/Craned/Supervisor/TaskManager.h (4)

221-222: LGTM!

The pure virtual Suspend() and Resume() methods are correctly declared in the base interface, with appropriate overrides in derived classes (ContainerInstance and ProcInstance). Implementations are present in the .cpp file.


255-256: Container suspend/resume returns ERR_NOT_SUPPORTED - acceptable for this PR.

Containers do not currently support suspend/resume operations. This is a reasonable limitation given the PR scope focuses on process-based job suspend/resume. Future enhancement could add CRI-level pause/unpause support.


456-458: LGTM!

The asynchronous suspend/resume API follows the established pattern in TaskManager (consistent with ExecuteTaskAsync, ChangeTaskTimeLimitAsync, etc.) and properly returns futures for error handling.


481-485: LGTM!

The TaskSignalQueueElem structure is now properly defined (addresses previous review concern). The design with an action enum and promise for async result delivery is clean and follows the existing queue element patterns in this class.

src/Craned/Supervisor/TaskManager.cpp (4)

1393-1419: Process suspend/resume implementation is correct.

The implementations correctly use:

  • kill(-m_pid_, SIGSTOP) and kill(-m_pid_, SIGCONT) to signal the entire process group (negative PID)
  • Appropriate error checking and return codes
  • Guard against zero PID

1550-1554: LGTM!

The async handle initialization for task signal processing follows the established pattern used for other async operations in TaskManager.


1743-1769: LGTM!

Both SuspendJobAsync() and ResumeJobAsync() correctly implement the async queue pattern:

  1. Create promise/future pair
  2. Enqueue request with promise
  3. Trigger async handler
  4. Return future to caller

2061-2077: LGTM!

The queue callback correctly dispatches suspend/resume actions to the appropriate helper methods and fulfills the associated promises with results.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/Craned/Supervisor/TaskManager.cpp (1)

1566-1571: Suspend/resume queue wiring is fine, but step state should be validated before acting

The async wiring—m_task_signal_async_handle_ driving EvCleanTaskSignalQueueCb_(), which dispatches Suspend/Resume actions to SuspendRunningTasks_() / ResumeSuspendedTasks_()—is consistent with the rest of TaskManager’s design and looks correct from a concurrency perspective.

However, unlike terminate/time-limit operations, SuspendRunningTasks_() and ResumeSuspendedTasks_() do not check the step’s runtime status (e.g., Running vs Suspended vs Completing) before sending signals. That means suspend/resume can be invoked while the step is in an incompatible state, which may conflict with the intended lifecycle gating enforced elsewhere via g_runtime_status and CanStepOperate().

Consider:

  • For suspend: ensure the step is currently Running (and not Completing/Completed/Pending) before iterating tasks.
  • For resume: ensure the step is currently Suspended before attempting to resume.

Aligning these helpers with the existing state checks used in terminate/time-limit handlers would make suspend/resume behavior more predictable and avoid surprising no-op or partial effects when the step is mid-transition.

Also applies to: 1759-1785, 2077-2150

🧹 Nitpick comments (4)
src/Craned/Supervisor/TaskManager.h (1)

456-458: Async suspend/resume plumbing fits existing queue/event pattern

The new TaskSignalQueueElem, m_task_signal_async_handle_, and SuspendJobAsync / ResumeJobAsync mirror the existing async patterns (ChangeTaskTimeLimit, terminate queue, etc.) and look correct: promises are moved into the queue element and fulfilled in the event-loop callback.

One thing to keep in mind is the semantic naming: SuspendJobAsync / ResumeJobAsync operate at the step/supervisor level here. If you intend suspend/resume to be job-only at the external API, it may be worth clarifying in comments or docs that this is the per-step handler invoked for a job-level operation.

Also applies to: 481-485, 502-503, 507-509, 535-537

src/Craned/Common/CgroupManager.cpp (2)

32-150: Freezer implementation and public Freeze/Thaw API look good, but rely on cg_path being a controller-agnostic path

The new ChangeFreezerStateByPath() plus FreezeCgroupByPath / ThawCgroupByPath correctly:

  • Handle v2 via cg_path / "cgroup.freeze" with a short-circuit when the state is already desired.
  • Handle v1 by mapping a generic cgroup path under kSystemCgPathPrefix to the freezer hierarchy and using freezer.state, again with a short-circuit when already FROZEN/THAWED.

This is a solid design, but it assumes that cg_path is a controller-agnostic path under kSystemCgPathPrefix (e.g., /sys/fs/cgroup/crane/job_...), not an already controller-specific path (like /sys/fs/cgroup/freezer/...). Passing a controller-specific path would effectively prepend the controller twice.

As long as all callers (e.g., StepInstance::cgroup_path used in Supervisor suspend/resume) follow the generic-path convention, this is fine; it’s worth keeping that invariant in mind when adding future call sites.

Also applies to: 894-900


832-892: OOM-counts-by-path helper aligns with freezer path assumptions

ReadOomCountsFromCgroupPath() now:

  • For v2, directly uses the provided cg_path and reads memory.events.
  • For v1, derives a memory-controller-specific path under kSystemCgPathPrefix and reads memory.oom_control.

This matches the same “controller-agnostic cg_path” assumption as the freezer helpers. The logic itself is straightforward and preserves the previous behavior of zeroing outputs and returning false on any failure. If you ever need to support callers passing controller-specific paths, you may want to factor this to reuse BuildControllerSpecificPath() and explicitly validate the relationship, but it’s not required given current usage.

src/Craned/Common/CgroupManager.h (1)

113-123: New freezer and BPF constants are coherent with implementation

The additions for:

  • kFreezerStateFileV1 / kCgroupFreezeFileV2, and
  • BPF paths/names (kBpfObjectFilePath, kBpfDeviceMapFilePath, kBpfMapName, kBpfProgramName)

match the usage patterns in CgroupManager.cpp. Hard-coding these paths is consistent with the existing design; if deployment environments diverge, configuration knobs might be a future enhancement, but nothing is blocking here.

Also applies to: 307-337

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4cb0364 and b52209d.

📒 Files selected for processing (4)
  • src/Craned/Common/CgroupManager.cpp (41 hunks)
  • src/Craned/Common/CgroupManager.h (16 hunks)
  • src/Craned/Supervisor/TaskManager.cpp (5 hunks)
  • src/Craned/Supervisor/TaskManager.h (7 hunks)
🧰 Additional context used
🧠 Learnings (16)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-04-02T10:11:33.562Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/CgroupManager.cpp:685-685
Timestamp: 2025-04-02T10:11:33.562Z
Learning: In the CgroupManager's GetJobBpfMapCgroupsV2 method, the developer has confirmed that cg_ino_job_id_map will always contain the key->cgroup_id element, making the CRANE_ASSERT check appropriate rather than requiring additional error handling.

Applied to files:

  • src/Craned/Common/CgroupManager.cpp
  • src/Craned/Common/CgroupManager.h
📚 Learning: 2025-04-02T09:52:59.318Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/CgroupManager.cpp:697-701
Timestamp: 2025-04-02T09:52:59.318Z
Learning: When using bpf_map__get_next_key function, memory must be properly allocated (e.g., with std::make_unique<BpfKey>()) before passing the pointer to the function, as it writes the key to the provided memory address.

Applied to files:

  • src/Craned/Common/CgroupManager.cpp
  • src/Craned/Common/CgroupManager.h
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/Craned/Common/CgroupManager.cpp
📚 Learning: 2025-04-18T02:26:16.113Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched includes assertions in its operator overloads (particularly in operator-=) that verify resources being subtracted are less than or equal to available resources, ensuring no negative values can occur during resource allocation or deallocation operations.

Applied to files:

  • src/Craned/Common/CgroupManager.h
📚 Learning: 2025-04-18T02:26:16.113Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched contains assertions in its overloaded operators (like -=) that prevent negative values from occurring when resources are allocated or deallocated.

Applied to files:

  • src/Craned/Common/CgroupManager.h
🧬 Code graph analysis (4)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (18)
  • Suspend (786-806)
  • Suspend (786-786)
  • Suspend (1437-1449)
  • Suspend (1437-1437)
  • Resume (808-828)
  • Resume (808-808)
  • Resume (1451-1463)
  • Resume (1451-1451)
  • SuspendJobAsync (1759-1771)
  • SuspendJobAsync (1759-1759)
  • ResumeJobAsync (1773-1785)
  • ResumeJobAsync (1773-1773)
  • EvCleanTaskSignalQueueCb_ (2077-2093)
  • EvCleanTaskSignalQueueCb_ (2077-2077)
  • SuspendRunningTasks_ (2095-2122)
  • SuspendRunningTasks_ (2095-2095)
  • ResumeSuspendedTasks_ (2124-2150)
  • ResumeSuspendedTasks_ (2124-2124)
src/Craned/Supervisor/TaskManager.cpp (2)
src/Craned/Supervisor/TaskManager.h (2)
  • GetParentStepInstance (210-210)
  • m_pid_ (380-380)
src/Craned/Common/CgroupManager.cpp (4)
  • FreezeCgroupByPath (894-896)
  • FreezeCgroupByPath (894-894)
  • ThawCgroupByPath (898-900)
  • ThawCgroupByPath (898-898)
src/Craned/Common/CgroupManager.cpp (1)
src/Craned/Common/CgroupManager.h (1)
  • handle (413-414)
src/Craned/Common/CgroupManager.h (1)
src/Craned/Common/CgroupManager.cpp (18)
  • ReadOomCountsFromCgroupPath (832-892)
  • ReadOomCountsFromCgroupPath (832-834)
  • FreezeCgroupByPath (894-896)
  • FreezeCgroupByPath (894-894)
  • ThawCgroupByPath (898-900)
  • ThawCgroupByPath (898-898)
  • CreateOrOpen_ (437-584)
  • CreateOrOpen_ (437-439)
  • GetJobIdsFromCgroupV2_ (680-700)
  • GetJobIdsFromCgroupV2_ (680-681)
  • task_id_t (735-775)
  • GetJobBpfMapCgroupsV2_ (736-737)
  • ParseIdsFromCgroupStr_ (410-427)
  • ParseIdsFromCgroupStr_ (410-411)
  • InitializeController_ (348-391)
  • InitializeController_ (348-351)
  • GetCgJobIdMapCgroupV2_ (702-731)
  • GetCgJobIdMapCgroupV2_ (702-703)
🔇 Additional comments (5)
src/Craned/Supervisor/TaskManager.h (1)

221-222: Suspend/Resume interface and overrides are consistent and safe

Making Suspend() / Resume() pure virtual on ITaskInstance and overriding them in both ContainerInstance and ProcInstance matches the concrete implementations in TaskManager.cpp and keeps the abstraction clean. No issues from an interface or inheritance standpoint.

Also applies to: 255-256, 330-331

src/Craned/Supervisor/TaskManager.cpp (2)

786-828: Container suspend/resume via cgroup freezer looks correct

Using CgroupManager::FreezeCgroupByPath / ThawCgroupByPath on step->cgroup_path with existence checks and clear logging is a reasonable container suspend/resume implementation for both cgroup v1 and v2. Returning ERR_NON_EXISTENT when cgroup_path is empty or missing should give callers a clear signal that suspend/resume isn’t available for this step.


1437-1463: Process-based suspend/resume via SIGSTOP/SIGCONT is straightforward

ProcInstance::Suspend() / Resume() send SIGSTOP / SIGCONT to the process group (-m_pid_), handle errors, and distinguish between “no pid” and generic failure. This matches the expected semantics for pausing/resuming batch/interactive tasks without changing existing kill behavior.

src/Craned/Common/CgroupManager.cpp (1)

653-775: Const-correctness and BPF/resource allocator adjustments are consistent with existing design

The various updates here:

  • Using std::string_view::data() for controller names passed into libcgroup (cgroup_walk_tree_*, cgroup_get_procs) and keeping the views backed by static arrays.
  • The BPF helpers (GetJobBpfMapCgroupsV2_, BpfRuntimeInfo methods) now work with std::make_unique<BpfKey>() for key storage and reuse, consistent with prior learnings.
  • AllocatableResourceAllocator::Allocate and DedicatedResourceAllocator::Allocate now take resources by const& (both internal and gRPC types), with no change to semantics.

All of this looks internally consistent and doesn’t introduce new behavioral risks.

Also applies to: 1150-1728

src/Craned/Common/CgroupManager.h (1)

340-377: Header declarations for cgroup classes and freezer API match the cpp and improve const-correctness

  • Cgroup, CgroupInterface, CgroupV1, and CgroupV2 constructors now take const std::string& and explicit struct cgroup*, matching the implementations.
  • SetDeviceAccess now takes const std::unordered_set<SlotId>&, and allocator methods take resources by const&, avoiding unnecessary copies.
  • The new CgroupManager::FreezeCgroupByPath / ThawCgroupByPath declarations align with the definitions in the .cpp.

Everything here is consistent with the implementation and preserves the existing public API semantics while making usage slightly cleaner.

Also applies to: 378-390, 497-503, 561-565, 586-590

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants