-
Notifications
You must be signed in to change notification settings - Fork 32
feat: add suspend and resume #632
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
WalkthroughAdds suspend/resume job support across the system: new protobuf messages/enums and RPCs (CraneCtld, Craned, Supervisor), supervisor async suspend/resume queue and task-level suspend/resume implementations, TaskScheduler orchestration for suspend/resume, Craned/CranedKeeper RPC clients, and multiple const-ref/formatting signature normalizations. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant CraneCtld as CraneCtld\n(TaskScheduler)
participant Craned as Craned\n(CranedService)
participant Supervisor as Supervisor\n(TaskManager)
participant Task as TaskInstance
Note over Client,CraneCtld: Suspend flow
Client->>CraneCtld: ModifyTask(task_id, Suspend)
CraneCtld->>CraneCtld: SuspendRunningTasks()
CraneCtld->>Craned: SuspendJobs RPC (broadcast)
Craned->>Supervisor: SuspendJob RPC (per-job)
Supervisor->>Supervisor: SuspendJobAsync() enqueue/process
Supervisor->>Task: Suspend() (SIGSTOP / freeze cgroup)
Task-->>Supervisor: result
Supervisor-->>Craned: ok/reason
Craned-->>CraneCtld: aggregated results
CraneCtld->>CraneCtld: mark Suspended + persist / or terminate on failure
CraneCtld-->>Client: response (modified/not_modified)
Note over Client,CraneCtld: Resume flow
Client->>CraneCtld: ModifyTask(task_id, Resume)
CraneCtld->>CraneCtld: ResumeSuspendedTasks()
CraneCtld->>Craned: ResumeJobs RPC (broadcast)
Craned->>Supervisor: ResumeJob RPC (per-job)
Supervisor->>Supervisor: ResumeJobAsync() enqueue/process
Supervisor->>Task: Resume() (SIGCONT / thaw cgroup)
Task-->>Supervisor: result
Supervisor-->>Craned: ok/reason
Craned-->>CraneCtld: aggregated results
CraneCtld->>CraneCtld: mark Running + persist / or terminate on failure
CraneCtld-->>Client: response (modified/not_modified)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Areas needing extra attention:
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
protos/Crane.proto(3 hunks)protos/PublicDefs.proto(1 hunks)protos/Supervisor.proto(2 hunks)src/CraneCtld/RpcService/CranedKeeper.cpp(1 hunks)src/CraneCtld/RpcService/CranedKeeper.h(1 hunks)src/CraneCtld/RpcService/CtldGrpcServer.cpp(1 hunks)src/CraneCtld/TaskScheduler.cpp(1 hunks)src/CraneCtld/TaskScheduler.h(1 hunks)src/Craned/Core/CranedServer.cpp(2 hunks)src/Craned/Core/CranedServer.h(1 hunks)src/Craned/Core/JobManager.cpp(3 hunks)src/Craned/Core/JobManager.h(2 hunks)src/Craned/Core/SupervisorKeeper.cpp(2 hunks)src/Craned/Core/SupervisorKeeper.h(1 hunks)src/Craned/Supervisor/SupervisorServer.cpp(2 hunks)src/Craned/Supervisor/SupervisorServer.h(1 hunks)src/Craned/Supervisor/TaskManager.cpp(6 hunks)src/Craned/Supervisor/TaskManager.h(5 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/Craned/Core/SupervisorKeeper.hsrc/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
PR: PKUHPC/CraneSched#475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Applied to files:
src/Craned/Core/CranedServer.cpp
🧬 Code graph analysis (12)
src/Craned/Core/SupervisorKeeper.h (2)
src/Craned/Core/SupervisorKeeper.cpp (4)
SuspendTask(86-100)SuspendTask(86-86)ResumeTask(102-116)ResumeTask(102-102)src/Craned/Supervisor/SupervisorServer.cpp (4)
SuspendTask(89-103)SuspendTask(89-92)ResumeTask(105-119)ResumeTask(105-108)
src/Craned/Core/JobManager.h (1)
src/Craned/Core/JobManager.cpp (4)
SuspendStep(989-1011)SuspendStep(989-989)ResumeStep(1013-1035)ResumeStep(1013-1013)
src/Craned/Supervisor/SupervisorServer.h (2)
src/Craned/Core/SupervisorKeeper.cpp (4)
SuspendTask(86-100)SuspendTask(86-86)ResumeTask(102-116)ResumeTask(102-102)src/Craned/Supervisor/SupervisorServer.cpp (4)
SuspendTask(89-103)SuspendTask(89-92)ResumeTask(105-119)ResumeTask(105-108)
src/CraneCtld/RpcService/CranedKeeper.h (2)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
SuspendSteps(241-268)SuspendSteps(241-241)ResumeSteps(270-297)ResumeSteps(270-270)src/Craned/Core/CranedServer.cpp (4)
SuspendSteps(88-113)SuspendSteps(88-91)ResumeSteps(115-140)ResumeSteps(115-118)
src/Craned/Core/CranedServer.cpp (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
SuspendSteps(241-268)SuspendSteps(241-241)ResumeSteps(270-297)ResumeSteps(270-270)
src/CraneCtld/TaskScheduler.h (1)
src/CraneCtld/TaskScheduler.cpp (4)
SuspendRunningTask(999-1062)SuspendRunningTask(999-999)ResumeSuspendedTask(1064-1123)ResumeSuspendedTask(1064-1064)
src/Craned/Supervisor/SupervisorServer.cpp (1)
src/Craned/Core/SupervisorKeeper.cpp (4)
SuspendTask(86-100)SuspendTask(86-86)ResumeTask(102-116)ResumeTask(102-102)
src/Craned/Core/SupervisorKeeper.cpp (1)
src/Craned/Supervisor/SupervisorServer.cpp (4)
SuspendTask(89-103)SuspendTask(89-92)ResumeTask(105-119)ResumeTask(105-108)
src/Craned/Core/CranedServer.h (2)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
SuspendSteps(241-268)SuspendSteps(241-241)ResumeSteps(270-297)ResumeSteps(270-270)src/Craned/Core/CranedServer.cpp (4)
SuspendSteps(88-113)SuspendSteps(88-91)ResumeSteps(115-140)ResumeSteps(115-118)
src/CraneCtld/TaskScheduler.cpp (1)
test/CraneCtld/EmbeddedDbClientTest.cpp (1)
g_embedded_db_client(42-45)
src/CraneCtld/RpcService/CranedKeeper.cpp (2)
src/Craned/Core/CranedServer.cpp (4)
SuspendSteps(88-113)SuspendSteps(88-91)ResumeSteps(115-140)ResumeSteps(115-118)src/CraneCtld/RpcService/CranedKeeper.h (1)
UpdateLastActiveTime(103-106)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (10)
SuspendTaskAsync(1694-1701)SuspendTaskAsync(1694-1694)ResumeTaskAsync(1703-1710)ResumeTaskAsync(1703-1703)EvCleanTaskSignalQueueCb_(1951-1978)EvCleanTaskSignalQueueCb_(1951-1951)SuspendRunningTasks_(1980-1995)SuspendRunningTasks_(1980-1980)ResumeSuspendedTasks_(1997-2012)ResumeSuspendedTasks_(1997-1997)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build
- GitHub Check: format
🔇 Additional comments (12)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (2)
588-606: LGTM!The suspend branch correctly calls
SuspendRunningTaskand provides appropriate error messages for different failure scenarios.
607-625: LGTM!The resume branch correctly calls
ResumeSuspendedTaskand provides clear error messages distinguishing between non-existent tasks and tasks that are not currently suspended.src/Craned/Core/JobManager.h (2)
40-40: LGTM!The
suspendedflag addition toStepInstanceappropriately tracks the suspended state for each step, and is correctly default-initialized tofalse.
107-109: LGTM!The method declarations are well-positioned in the public API and follow the established pattern for task lifecycle management operations.
protos/PublicDefs.proto (1)
117-117: LGTM!The
Suspendedstatus addition is correctly positioned in theTaskStatusenum and aligns with the new suspend/resume lifecycle semantics implemented across the PR.src/Craned/Core/SupervisorKeeper.h (1)
38-39: LGTM!The method declarations are appropriately positioned in the
SupervisorStubclass and follow the established pattern for supervisor RPC wrappers.src/CraneCtld/TaskScheduler.h (1)
488-489: LGTM!The method declarations extend the public API surface appropriately and align with the implementations in TaskScheduler.cpp for suspending and resuming running tasks.
src/Craned/Core/SupervisorKeeper.cpp (3)
22-22: LGTM!The include addition is necessary for the
CraneErrStrfunction used in the error mapping logic of the new RPC wrappers.
86-100: LGTM!The
SuspendTaskimplementation correctly follows the established RPC wrapper pattern with appropriate error handling and logging.
102-116: LGTM!The
ResumeTaskimplementation mirrorsSuspendTaskand correctly handles RPC invocation and error mapping.src/CraneCtld/RpcService/CranedKeeper.h (1)
89-91: LGTM!The method declarations appropriately extend the
CranedStubpublic API for suspend/resume operations, following the established pattern for multi-step RPC operations.src/Craned/Core/CranedServer.h (1)
70-76: LGTM!The RPC handler declarations correctly extend the
CranedServiceImplinterface for suspend/resume operations, following the established gRPC service pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/Craned/Supervisor/TaskManager.cpp (1)
1908-1925: Preserve suspended state through termination flow.Clearing
m_step_.suspendedat the top of termination makes concurrent resume requests succeed while the task is still SIGSTOP’ed. Wait until the termination completes (or the status change arrives) before resetting it, otherwise higher layers believe the task is resumable and hit invalid transitions.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/CraneCtld/TaskScheduler.cpp(1 hunks)src/Craned/Core/JobManager.cpp(3 hunks)src/Craned/Supervisor/TaskManager.cpp(7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build
- GitHub Check: format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (2)
src/Craned/Core/JobManager.cpp (2)
975-977: Don’t clear suspension before termination succeedsThis is the same issue previously flagged: you zero
step_inst->suspendedfor every orphaned step before we know whetherTerminateTaskwill succeed. If the RPC fails we leave the supervisor running but recordsuspended = false, so a controller retry mis-evaluates state. Only flip the flag after termination completes (or in the status-change callback).
1004-1014: Check supervisor availability before validating stateWe’re still dereferencing
step_instand returningERR_INVALID_PARAMbefore verifying that the supervisor stub exists. If the supervisor already disappeared we must returnERR_RPC_FAILURE; otherwise upper layers mis-handle the retry. Acquire the stub first, fail withERR_RPC_FAILUREwhen it’s missing, then validatestep_inst->suspended.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/CraneCtld/TaskScheduler.cpp(2 hunks)src/Craned/Core/JobManager.cpp(3 hunks)src/Craned/Supervisor/TaskManager.cpp(6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/CraneCtld/TaskScheduler.cpp (1)
test/CraneCtld/EmbeddedDbClientTest.cpp (1)
g_embedded_db_client(42-45)
🔇 Additional comments (1)
src/Craned/Supervisor/TaskManager.cpp (1)
1022-1033: Nice improvementGreat to see
ContainerInstance::Killshort-circuit SIGSTOP/SIGCONT. This keeps suspend/resume from tearing containers down, matching the earlier feedback.
protos/Supervisor.proto
Outdated
| string reason = 2; | ||
| } | ||
|
|
||
| message SuspendTaskRequest {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Task->Job
| return CraneErrCode::SUCCESS; | ||
| } | ||
|
|
||
| CraneErrCode CranedStub::SuspendSteps(const std::vector<task_id_t> &task_ids) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vector<job_id_t>
src/CraneCtld/TaskScheduler.cpp
Outdated
| return std::move(future); | ||
| } | ||
|
|
||
| CraneErrCode TaskScheduler::SuspendRunningTask(task_id_t task_id) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
不支持批量吗?
src/CraneCtld/TaskScheduler.cpp
Outdated
| std::vector<CranedId> suspended_nodes; | ||
| CraneErrCode failure_code = CraneErrCode::SUCCESS; | ||
| bool has_failure = false; | ||
| for (const auto& craned_id : executing_nodes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
先查下任务所有节点是不是Up
|
文档需要在仓库doc下更新 |
60b6694 to
224d2c7
Compare
src/Craned/Core/JobManager.h
Outdated
| struct StepInstance { | ||
| StepToD step_to_d; | ||
| pid_t supv_pid; | ||
| bool suspended{false}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
直接用TaskStatus
src/Craned/Supervisor/TaskManager.h
Outdated
| bool pty; | ||
| bool x11; | ||
| bool x11_fwd; | ||
| bool suspended{false}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里也用TaskStatus
|
这两个操作应该都只支持对Job进行操作 |
| return CraneErrCode::ERR_GENERIC_FAILURE; | ||
| } | ||
|
|
||
| CraneErrCode CranedStub::ResumeSteps(const std::vector<task_id_t> &task_ids) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
job_id_t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (3)
src/CraneCtld/TaskScheduler.cpp (3)
1055-1072: Track rollback success and update Ctld state consistently after rollback.After rolling back the suspended nodes (by calling
ResumeSteps), the code returns the originalfailure_codewithout verifying if the rollback succeeded. If rollback fails partially, the controller's view of task state diverges from the actual craned state. After successful rollback, the task should explicitly remain inRunningstate, but if rollback fails, the state is now inconsistent and should be logged or handled differently.Based on learnings from past review comments (lines 1056-1073, 1146-1164).
Track rollback outcome:
if (has_failure) { + bool rollback_failed = false; for (const auto& craned_id : suspended_nodes) { auto stub = g_craned_keeper->GetCranedStub(craned_id); if (!stub || stub->Invalid()) { CRANE_ERROR( "Failed to rollback suspended state on craned {}: stub unavailable", craned_id); + rollback_failed = true; continue; } CraneErrCode err = stub->ResumeSteps(job_ids); if (err != CraneErrCode::SUCCESS) { CRANE_ERROR("Rollback ResumeSteps for task #{} on craned {} failed: {}", task_id, craned_id, CraneErrStr(err)); + rollback_failed = true; } } + if (rollback_failed) { + CRANE_ERROR( + "Task #{} is in inconsistent state after failed rollback. " + "Manual intervention may be required.", task_id); + } return failure_code; }
1033-1053: Use thread pool for parallel RPC calls instead of sequential iteration.The current implementation calls
SuspendStepssequentially on each executing node, which can be slow when a task spans many nodes. For better performance, dispatch these RPCs in parallel usingg_thread_pool, similar to howCreateCgroupForTasksis handled (lines 643-696).Based on learnings from past review comment (line 1050).
Refactor to use parallel execution:
std::vector<task_id_t> job_ids{task_id}; std::vector<CranedId> suspended_nodes; + Mutex mtx; CraneErrCode failure_code = CraneErrCode::SUCCESS; bool has_failure = false; + absl::BlockingCounter counter(executing_nodes.size()); + for (const auto& craned_id : executing_nodes) { - auto stub = g_craned_keeper->GetCranedStub(craned_id); - if (!stub || stub->Invalid()) { - CRANE_WARN("SuspendSteps stub for {} unavailable", craned_id); - if (failure_code == CraneErrCode::SUCCESS) - failure_code = CraneErrCode::ERR_RPC_FAILURE; - has_failure = true; - continue; - } - - CraneErrCode err = stub->SuspendSteps(job_ids); - if (err != CraneErrCode::SUCCESS) { - CRANE_ERROR("Failed to suspend task #{} on craned {}: {}", task_id, - craned_id, CraneErrStr(err)); - if (failure_code == CraneErrCode::SUCCESS) failure_code = err; - has_failure = true; - continue; - } - - suspended_nodes.push_back(craned_id); + g_thread_pool->detach_task([&, craned_id]() { + auto stub = g_craned_keeper->GetCranedStub(craned_id); + if (!stub || stub->Invalid()) { + CRANE_WARN("SuspendSteps stub for {} unavailable", craned_id); + LockGuard guard(&mtx); + if (failure_code == CraneErrCode::SUCCESS) + failure_code = CraneErrCode::ERR_RPC_FAILURE; + has_failure = true; + counter.DecrementCount(); + return; + } + + CraneErrCode err = stub->SuspendSteps(job_ids); + LockGuard guard(&mtx); + if (err != CraneErrCode::SUCCESS) { + CRANE_ERROR("Failed to suspend task #{} on craned {}: {}", task_id, + craned_id, CraneErrStr(err)); + if (failure_code == CraneErrCode::SUCCESS) failure_code = err; + has_failure = true; + } else { + suspended_nodes.push_back(craned_id); + } + counter.DecrementCount(); + }); } + counter.Wait();
1124-1144: Use thread pool for parallel RPC calls in ResumeSuspendedTask.Same performance issue as
SuspendRunningTask: sequentialResumeStepscalls should be parallelized usingg_thread_poolfor better performance when a task spans many nodes.Apply the same parallel execution pattern shown in the previous comment:
std::vector<task_id_t> job_ids{task_id}; std::vector<CranedId> resumed_nodes; + Mutex mtx; CraneErrCode failure_code = CraneErrCode::SUCCESS; bool has_failure = false; + absl::BlockingCounter counter(executing_nodes.size()); + for (const auto& craned_id : executing_nodes) { - auto stub = g_craned_keeper->GetCranedStub(craned_id); - if (!stub || stub->Invalid()) { - CRANE_WARN("ResumeSteps stub for {} unavailable", craned_id); - if (failure_code == CraneErrCode::SUCCESS) - failure_code = CraneErrCode::ERR_RPC_FAILURE; - has_failure = true; - continue; - } - - CraneErrCode err = stub->ResumeSteps(job_ids); - if (err != CraneErrCode::SUCCESS) { - CRANE_ERROR("Failed to resume task #{} on craned {}: {}", task_id, - craned_id, CraneErrStr(err)); - if (failure_code == CraneErrCode::SUCCESS) failure_code = err; - has_failure = true; - continue; - } - - resumed_nodes.push_back(craned_id); + g_thread_pool->detach_task([&, craned_id]() { + auto stub = g_craned_keeper->GetCranedStub(craned_id); + if (!stub || stub->Invalid()) { + CRANE_WARN("ResumeSteps stub for {} unavailable", craned_id); + LockGuard guard(&mtx); + if (failure_code == CraneErrCode::SUCCESS) + failure_code = CraneErrCode::ERR_RPC_FAILURE; + has_failure = true; + counter.DecrementCount(); + return; + } + + CraneErrCode err = stub->ResumeSteps(job_ids); + LockGuard guard(&mtx); + if (err != CraneErrCode::SUCCESS) { + CRANE_ERROR("Failed to resume task #{} on craned {}: {}", task_id, + craned_id, CraneErrStr(err)); + if (failure_code == CraneErrCode::SUCCESS) failure_code = err; + has_failure = true; + } else { + resumed_nodes.push_back(craned_id); + } + counter.DecrementCount(); + }); } + counter.Wait();
🧹 Nitpick comments (2)
src/CraneCtld/TaskScheduler.cpp (2)
1029-1032: Validate that all executing nodes are UP before attempting suspend.Before sending
SuspendStepsRPCs, check that all executing nodes are online usingg_meta_container->CheckCranedOnline(). This prevents wasting time attempting RPCs to down nodes and provides clearer error messages.Based on learnings from past review comment (line 1033).
Add node status validation:
std::vector<task_id_t> job_ids{task_id}; + // Validate all nodes are online before attempting suspend + for (const auto& craned_id : executing_nodes) { + if (!g_meta_container->CheckCranedOnline(craned_id)) { + CRANE_ERROR("Task #{} cannot be suspended: craned {} is offline", + task_id, craned_id); + return CraneErrCode::ERR_CRANED_DOWN; + } + } + std::vector<CranedId> suspended_nodes;
999-999: Consider adding batch suspend/resume support.The current implementation only handles a single task at a time. Based on learnings from the comment on line 999 ("不支持批量吗?" - No batch support?), consider whether batch operations would be useful for administrative workflows (e.g., suspending all tasks in a partition or for a user).
Do you want batch suspend/resume functionality (operating on multiple task IDs in a single call)? If so, I can help design and generate the batch operation methods.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/CraneCtld/TaskScheduler.cpp(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/CraneCtld/TaskScheduler.cpp (1)
test/CraneCtld/EmbeddedDbClientTest.cpp (1)
g_embedded_db_client(42-45)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
♻️ Duplicate comments (5)
src/Craned/Supervisor/SupervisorServer.h (1)
63-71: Same naming inconsistency:SuspendJobvs.ResumeTask.The declarations mirror the inconsistency noted in the implementation:
SuspendJobat line 63 vs.ResumeTaskat line 68. These should use consistent terminology.src/Craned/Core/JobManager.cpp (3)
860-863: Critical: Update only the specific step identified by status_change.This loop updates the status for every
StepInstanceinjob_ptr->step_map, butstatus_changeshould identify a single step. When step A reports a status change, this incorrectly overwrites the status for steps B, C, etc., causing state corruption across unrelated steps.Use
status_change.step_idto locate and update only the target step:- for (auto& [step_id, step_inst] : job_ptr->step_map) { - if (!step_inst) continue; - step_inst->status = status_change.new_status; - } + auto step_iter = job_ptr->step_map.find(status_change.step_id); + if (step_iter == job_ptr->step_map.end()) { + // Fallback: scan for matching step_id if not found by key + for (auto& [key, step_inst] : job_ptr->step_map) { + if (step_inst && step_inst->step_to_d.task_id() == status_change.step_id) { + step_iter = job_ptr->step_map.find(key); + break; + } + } + } + if (step_iter != job_ptr->step_map.end() && step_iter->second) { + step_iter->second->status = status_change.new_status; + }
968-970: Major: Don't reset step status before termination completes.Setting all steps to
Runningbefore termination occurs clears anySuspendedstate. If termination later fails (e.g., stub unavailable), the supervisor keeps running but we've lost the suspended flag, so subsequent suspend/resume operations will mis-evaluate state.Remove this blanket status reset and let the supervisor's termination-confirmation event update the status:
auto* instance = job_instance.get(); instance->orphaned = elem.mark_as_orphaned; - for (auto& [step_id, step_inst] : instance->step_map) { - if (step_inst) step_inst->status = crane::grpc::TaskStatus::Running; - }
990-1013: Major: Check supervisor stub availability before validating state.Checking
step_inst->statusbefore verifying the stub exists returnsERR_INVALID_PARAMwhen the real issue is a missing supervisor (should beERR_RPC_FAILURE). This masks the actual failure mode and prevents callers from retrying on another node.Swap the order: acquire stub first, validate state second:
auto* step_inst = job_instance->step_map.begin()->second.get(); - if (step_inst->status == crane::grpc::TaskStatus::Suspended) { - CRANE_TRACE("Task #{} already suspended", step_id); - return CraneErrCode::ERR_INVALID_PARAM; - } - auto stub = g_supervisor_keeper->GetStub(step_id); if (!stub) { CRANE_ERROR("Supervisor for task #{} not found when suspending", step_id); return CraneErrCode::ERR_RPC_FAILURE; } + + if (step_inst->status == crane::grpc::TaskStatus::Suspended) { + CRANE_TRACE("Task #{} already suspended", step_id); + return CraneErrCode::ERR_INVALID_PARAM; + }Apply the same fix to
ResumeStep(lines 1015-1038).src/Craned/Supervisor/TaskManager.cpp (1)
1992-2024: Critical: Continue signaling all tasks even after errors.Both helpers return immediately on the first
Kill()error (lines 2002, 2019). If task A's signal succeeds but task B's fails, you've left task A suspended/resumed while reporting failure and not updatingm_step_.status. Subsequent suspend/resume calls then see inconsistent state (status says Running but processes are stopped, or vice versa).Continue sending signals to all tasks, track the first error, and only flip status when all signals succeed:
CraneErrCode TaskManager::SuspendRunningTasks_() { if (m_step_.AllTaskFinished()) return CraneErrCode::ERR_NON_EXISTENT; bool any_signal_sent = false; + CraneErrCode first_error = CraneErrCode::SUCCESS; for (const auto& [pid, task_id] : m_pid_task_id_map_) { auto task = m_step_.GetTaskInstance(task_id); if (!task) continue; if (task->GetPid() == 0) continue; any_signal_sent = true; CraneErrCode err = task->Kill(SIGSTOP); - if (err != CraneErrCode::SUCCESS) return err; + if (err != CraneErrCode::SUCCESS && first_error == CraneErrCode::SUCCESS) + first_error = err; } if (!any_signal_sent) return CraneErrCode::ERR_INVALID_PARAM; - return CraneErrCode::SUCCESS; + return first_error; }Apply the same fix to
ResumeSuspendedTasks_()(lines 2009-2024).
🧹 Nitpick comments (2)
src/CraneCtld/TaskScheduler.cpp (2)
1095-1116: Consider tracking rollback success for better operational visibility.The rollback loop logs individual node failures but doesn't track overall rollback success. If rollback partially fails, operators lack visibility into the inconsistent controller/craned state.
Apply this diff to track rollback outcomes:
if (has_failure.load(std::memory_order_relaxed)) { + bool rollback_failed = false; for (const auto& craned_id : suspended_nodes) { auto stub = g_craned_keeper->GetCranedStub(craned_id); if (!stub || stub->Invalid()) { CRANE_ERROR( "Failed to rollback suspended state on craned {}: stub " "unavailable", craned_id); + rollback_failed = true; continue; } CraneErrCode err = stub->ResumeSteps(job_ids); if (err != CraneErrCode::SUCCESS) { CRANE_ERROR( "Rollback ResumeSteps for task #{} on craned {} failed: {}", task_id, craned_id, CraneErrStr(err)); + rollback_failed = true; } } + if (rollback_failed) { + CRANE_ERROR( + "Task #{} is in inconsistent state after failed rollback. " + "Manual intervention may be required.", task_id); + } results.emplace_back(failure_code); continue; }Based on learnings from past review comments.
1234-1254: Consider tracking rollback success for better operational visibility.The rollback loop logs individual node failures but doesn't track overall rollback success. If rollback partially fails, operators lack visibility into the inconsistent controller/craned state.
Apply this diff to track rollback outcomes:
if (has_failure.load(std::memory_order_relaxed)) { + bool rollback_failed = false; for (const auto& craned_id : resumed_nodes) { auto stub = g_craned_keeper->GetCranedStub(craned_id); if (!stub || stub->Invalid()) { CRANE_ERROR( "Failed to rollback resumed state on craned {}: stub unavailable", craned_id); + rollback_failed = true; continue; } CraneErrCode err = stub->SuspendSteps(job_ids); if (err != CraneErrCode::SUCCESS) { CRANE_ERROR( "Rollback SuspendSteps for task #{} on craned {} failed: {}", task_id, craned_id, CraneErrStr(err)); + rollback_failed = true; } } + if (rollback_failed) { + CRANE_ERROR( + "Task #{} is in inconsistent state after failed rollback. " + "Manual intervention may be required.", task_id); + } results.emplace_back(failure_code); continue; }Based on learnings from past review comments.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
protos/Supervisor.proto(2 hunks)src/CraneCtld/RpcService/CranedKeeper.cpp(1 hunks)src/CraneCtld/RpcService/CranedKeeper.h(1 hunks)src/CraneCtld/RpcService/CtldGrpcServer.cpp(1 hunks)src/CraneCtld/TaskScheduler.cpp(1 hunks)src/CraneCtld/TaskScheduler.h(1 hunks)src/Craned/Core/JobManager.cpp(3 hunks)src/Craned/Core/JobManager.h(2 hunks)src/Craned/Core/SupervisorKeeper.cpp(2 hunks)src/Craned/Core/SupervisorKeeper.h(1 hunks)src/Craned/Supervisor/SupervisorServer.cpp(2 hunks)src/Craned/Supervisor/SupervisorServer.h(1 hunks)src/Craned/Supervisor/TaskManager.cpp(7 hunks)src/Craned/Supervisor/TaskManager.h(5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/Craned/Core/JobManager.h
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/Craned/Core/SupervisorKeeper.hsrc/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
PR: PKUHPC/CraneSched#475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Applied to files:
src/Craned/Supervisor/SupervisorServer.cpp
🧬 Code graph analysis (8)
src/Craned/Supervisor/SupervisorServer.h (2)
src/Craned/Core/SupervisorKeeper.cpp (2)
SuspendJob(86-100)SuspendJob(86-86)src/Craned/Supervisor/SupervisorServer.cpp (4)
SuspendJob(89-103)SuspendJob(89-92)ResumeTask(105-119)ResumeTask(105-108)
src/CraneCtld/RpcService/CranedKeeper.h (2)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
SuspendSteps(241-268)SuspendSteps(241-241)ResumeSteps(270-297)ResumeSteps(270-270)src/Craned/Core/CranedServer.cpp (4)
SuspendSteps(88-113)SuspendSteps(88-91)ResumeSteps(115-140)ResumeSteps(115-118)
src/CraneCtld/RpcService/CranedKeeper.cpp (2)
src/Craned/Core/CranedServer.cpp (4)
SuspendSteps(88-113)SuspendSteps(88-91)ResumeSteps(115-140)ResumeSteps(115-118)src/CraneCtld/RpcService/CranedKeeper.h (1)
UpdateLastActiveTime(103-106)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (10)
SuspendJobAsync(1704-1711)SuspendJobAsync(1704-1704)ResumeJobAsync(1713-1720)ResumeJobAsync(1713-1713)EvCleanTaskSignalQueueCb_(1961-1990)EvCleanTaskSignalQueueCb_(1961-1961)SuspendRunningTasks_(1992-2007)SuspendRunningTasks_(1992-1992)ResumeSuspendedTasks_(2009-2024)ResumeSuspendedTasks_(2009-2009)
src/CraneCtld/TaskScheduler.h (1)
src/CraneCtld/TaskScheduler.cpp (4)
SuspendRunningTasks(999-1142)SuspendRunningTasks(999-1000)ResumeSuspendedTasks(1144-1279)ResumeSuspendedTasks(1144-1145)
src/Craned/Core/SupervisorKeeper.h (2)
src/Craned/Core/SupervisorKeeper.cpp (4)
SuspendJob(86-100)SuspendJob(86-86)ResumeJob(102-116)ResumeJob(102-102)src/Craned/Supervisor/SupervisorServer.cpp (2)
SuspendJob(89-103)SuspendJob(89-92)
src/Craned/Core/SupervisorKeeper.cpp (1)
src/Craned/Supervisor/SupervisorServer.cpp (2)
SuspendJob(89-103)SuspendJob(89-92)
src/Craned/Supervisor/SupervisorServer.cpp (1)
src/Craned/Core/SupervisorKeeper.cpp (2)
SuspendJob(86-100)SuspendJob(86-86)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (10)
src/CraneCtld/RpcService/CranedKeeper.cpp (1)
241-297: LGTM: Implementation follows established RPC patterns.Both
SuspendStepsandResumeStepscorrectly follow the existing RPC invocation patterns in this file:
- Proper deadline configuration
- Consistent error handling via
HandleGrpcErrorCode_- Appropriate logging for both success and failure cases
- Correct use of
UpdateLastActiveTime()src/Craned/Core/SupervisorKeeper.h (1)
38-39: LGTM: Method names align with Job-level operations.The naming
SuspendJobandResumeJobcorrectly indicates these methods operate on Job objects, consistent with the PR comment. Since the Supervisor manages a single task per instance (as per learnings), not requiring task parameters is appropriate.Based on learnings.
src/Craned/Core/SupervisorKeeper.cpp (1)
86-116: LGTM: RPC wrappers follow established patterns.Both
SuspendJobandResumeJob:
- Follow the existing RPC invocation pattern
- Include appropriate error logging
- Map specific error reasons to error codes (ERR_INVALID_PARAM, ERR_NON_EXISTENT)
- Return generic failure for unexpected errors
The implementation is consistent with other RPC wrappers in this file.
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)
588-633: LGTM: Comprehensive error handling for Suspend/Resume operations.The implementation correctly:
- Collects task IDs and invokes batch TaskScheduler operations
- Provides distinct error messages for different failure scenarios:
- ERR_NON_EXISTENT: "Task not found in running queue"
- ERR_INVALID_PARAM: "Task is not running/suspended"
- Generic failures: "Failed to suspend/resume task"
- Follows the established pattern for other ModifyTask attributes
The error handling clarity will help users diagnose issues quickly.
protos/Supervisor.proto (1)
122-147: Verify proto consistency with implementation.The proto defines
SuspendJobandResumeTaskRPCs. Ensure these names align with:
- The method implementations in SupervisorServer.cpp (which currently mirror this inconsistency)
- The intended semantics per PR comments (Job-level operations)
Consider standardizing to either "Job" or "Task" terminology throughout.
src/CraneCtld/TaskScheduler.h (1)
488-491: LGTM! Method declarations are consistent with implementation.The suspend/resume method signatures correctly match their implementations in TaskScheduler.cpp, both returning
std::vector<CraneErrCode>to provide per-task error reporting.src/Craned/Supervisor/TaskManager.h (1)
56-56: LGTM! Suspend/resume infrastructure follows existing patterns.The additions are well-integrated:
- The
statusfield (line 56) replaces a boolean flag with a more expressive enum- Public async API (lines 359-361) matches existing async method conventions
- Private signal queue infrastructure (lines 384-387, 421-422) mirrors other concurrent queue handlers in the class
Also applies to: 359-361, 384-387, 397-397, 402-403, 421-422
src/Craned/Supervisor/TaskManager.cpp (3)
25-25: LGTM! Container suspend handling and infrastructure setup.The changes correctly implement non-destructive suspend/resume for containers:
- Line 25:
signal.himport needed for SIGSTOP/SIGCONT constants- Lines 1024-1032: Short-circuits container kill/delete workflow for suspend/resume signals, sending only the raw signal to the process group
- Lines 1541-1545: Async handle setup follows the existing pattern for other queue handlers
- Line 1607: Status field assignment replaces the deprecated boolean flag
Also applies to: 1024-1032, 1541-1545, 1607-1607
1704-1720: LGTM! Async suspend/resume API.The public async methods correctly enqueue signal elements and return futures for the results, following the established pattern used by other async operations in this class.
1961-1990: LGTM! Signal queue handler with proper state guards.The queue callback correctly validates current status before attempting suspend/resume operations and only updates
m_step_.statuswhen the operation succeeds.
| CraneErrCode SuspendSteps(const std::vector<job_id_t> &job_ids); | ||
|
|
||
| CraneErrCode ResumeSteps(const std::vector<job_id_t> &job_ids); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
Clarify naming: method names suggest "Steps" but parameters suggest "Jobs".
The method names SuspendSteps and ResumeSteps suggest they operate on steps, but the parameter type std::vector<job_id_t> suggests they operate on jobs. Based on the PR comment stating "The two operations (suspend and resume) should only support operations on Job objects," consider renaming these methods to SuspendJobs and ResumeJobs for consistency.
Additionally, note the inconsistency with TerminateSteps at line 93 which uses std::vector<task_id_t>. If job_id_t and task_id_t are distinct types, clarify the intended semantics.
🏁 Script executed:
#!/bin/bash
# Check the type definitions and usage patterns for job_id_t, task_id_t, and step_id_t
rg -n "using (job_id_t|task_id_t|step_id_t)" --type=cpp --type=hLength of output: 259
🏁 Script executed:
#!/bin/bash
rg -nP '(SuspendSteps|ResumeSteps|TerminateSteps)' -C3 src/CraneCtld/RpcService/CranedKeeper.hLength of output: 428
Rename methods to match parameter types
- SuspendSteps/ResumeSteps take
vector<job_id_t>→ rename toSuspendJobs/ResumeJobs. - TerminateSteps takes
vector<task_id_t>→ rename toTerminateTasks(and likewiseTerminateOrphanedSteps→TerminateOrphanedTasks).
src/CraneCtld/TaskScheduler.cpp
Outdated
| auto iter = m_running_task_map_.find(task_id); | ||
| if (iter == m_running_task_map_.end()) { | ||
| CRANE_TRACE("Task #{} not in Rn queue for suspend", task_id); | ||
| results.emplace_back(CraneErrCode::ERR_NON_EXISTENT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ERR_INVALID_PARAM
| offline_nodes.reserve(executing_nodes.size()); | ||
| for (const auto& craned_id : executing_nodes) { | ||
| if (!g_meta_container->CheckCranedOnline(craned_id)) | ||
| offline_nodes.push_back(craned_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里可以直接结束,没必要继续查
src/CraneCtld/TaskScheduler.cpp
Outdated
| // Ensure suspension only proceeds when every craned involved is online. | ||
| CRANE_WARN("Task #{} suspend skipped because craned(s) {} are offline.", | ||
| task_id, absl::StrJoin(offline_nodes, ",")); | ||
| results.emplace_back(CraneErrCode::ERR_NO_AVAIL_NODE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ERR_NO_AVAIL_NODE不是这个意思
src/CraneCtld/TaskScheduler.cpp
Outdated
| continue; | ||
| } | ||
|
|
||
| std::vector<job_id_t> job_ids{static_cast<job_id_t>(task_id)}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
应该不用cast?这两个类型都是using uint32
|
|
||
| CraneErrCode err = stub->SuspendSteps(job_ids); | ||
| if (err != CraneErrCode::SUCCESS) { | ||
| CRANE_ERROR("Failed to suspend task #{} on craned {}: {}", task_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
如果部分节点Suspend了,这里Resume了一部分但是现在一部分是Running,一部分节点Down Suspended,节点再上线的时候怎么恢复?
| break; | ||
| } | ||
| result = SuspendRunningTasks_(); | ||
| if (result == CraneErrCode::SUCCESS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里加个检查吧, m_step_.status首先应该是Running
| } | ||
| result = ResumeSuspendedTasks_(); | ||
| if (result == CraneErrCode::SUCCESS) | ||
| m_step_.status = crane::grpc::TaskStatus::Running; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里也加检查
| if (!task) continue; | ||
| if (task->GetPid() == 0) continue; | ||
| any_signal_sent = true; | ||
| CraneErrCode err = task->Kill(SIGSTOP); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里改一下,给TaskInstance加Suspend/Resume的函数,要支持容器
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/CraneCtld/TaskScheduler.cpp (1)
1238-1254: Track rollback outcome and log inconsistent state.The rollback loop issues
SuspendStepsto revert resumed nodes but doesn't track whether those calls succeed. If rollback fails, controller and craned states diverge—some nodes remain running while Ctld thinks the operation failed. This complicates recovery and diagnostics.Based on learnings from past review comments.
Apply this diff to track rollback failures:
if (has_failure.load(std::memory_order_relaxed)) { + bool rollback_failed = false; for (const auto& craned_id : resumed_nodes) { auto stub = g_craned_keeper->GetCranedStub(craned_id); if (!stub || stub->Invalid()) { CRANE_ERROR( "Failed to rollback resumed state on craned {}: stub unavailable", craned_id); + rollback_failed = true; continue; } CraneErrCode err = stub->SuspendSteps(job_ids); if (err != CraneErrCode::SUCCESS) { CRANE_ERROR( "Rollback SuspendSteps for task #{} on craned {} failed: {}", task_id, craned_id, CraneErrStr(err)); + rollback_failed = true; } } + if (rollback_failed) { + CRANE_ERROR( + "Task #{} is in inconsistent state after failed rollback. " + "Manual intervention may be required.", task_id); + } results.emplace_back(failure_code); continue; }</review_comment_end>
src/Craned/Supervisor/TaskManager.cpp (1)
2000-2032: Do not short-circuit on first error; continue signaling and aggregate — prevents stranding partial setsCurrently, SuspendRunningTasks_/ResumeSuspendedTasks_ return on first failure, potentially leaving earlier tasks paused/resumed while the step status remains unchanged. Continue signaling every task, collect the first error, and only flip state when ALL succeed. This was previously flagged and appears reintroduced.
Apply these diffs:
@@ CraneErrCode TaskManager::SuspendRunningTasks_() { if (m_step_.AllTaskFinished()) return CraneErrCode::ERR_NON_EXISTENT; - bool any_signal_sent = false; + bool any_signal_sent = false; + CraneErrCode first_err = CraneErrCode::SUCCESS; for (const auto& [pid, task_id] : m_pid_task_id_map_) { auto task = m_step_.GetTaskInstance(task_id); if (!task) continue; if (task->GetPid() == 0) continue; any_signal_sent = true; CraneErrCode err = task->Kill(SIGSTOP); - if (err != CraneErrCode::SUCCESS) return err; + if (err != CraneErrCode::SUCCESS && first_err == CraneErrCode::SUCCESS) + first_err = err; } - if (!any_signal_sent) return CraneErrCode::ERR_INVALID_PARAM; - return CraneErrCode::SUCCESS; + if (!any_signal_sent) return CraneErrCode::ERR_INVALID_PARAM; + return first_err; } @@ CraneErrCode TaskManager::ResumeSuspendedTasks_() { if (m_step_.AllTaskFinished()) return CraneErrCode::ERR_NON_EXISTENT; - bool any_signal_sent = false; + bool any_signal_sent = false; + CraneErrCode first_err = CraneErrCode::SUCCESS; for (const auto& [pid, task_id] : m_pid_task_id_map_) { auto task = m_step_.GetTaskInstance(task_id); if (!task) continue; if (task->GetPid() == 0) continue; any_signal_sent = true; CraneErrCode err = task->Kill(SIGCONT); - if (err != CraneErrCode::SUCCESS) return err; + if (err != CraneErrCode::SUCCESS && first_err == CraneErrCode::SUCCESS) + first_err = err; } - if (!any_signal_sent) return CraneErrCode::ERR_INVALID_PARAM; - return CraneErrCode::SUCCESS; + if (!any_signal_sent) return CraneErrCode::ERR_INVALID_PARAM; + return first_err; }Optional: add explicit ITaskInstance::Suspend()/Resume() wrappers and call those here to encapsulate container-specific behavior instead of passing raw signals.
- CraneErrCode err = task->Kill(SIGSTOP); + CraneErrCode err = task->Suspend(); @@ - CraneErrCode err = task->Kill(SIGCONT); + CraneErrCode err = task->Resume()I can draft the wrapper implementations if desired.
🧹 Nitpick comments (5)
src/CraneCtld/TaskScheduler.cpp (4)
1037-1044: Minor: Offline check breaks early but reserves space for all nodes.The loop breaks after finding the first offline node (line 1042), so
offline_nodeswill contain at most one node, yet space is reserved for allexecuting_nodes.size(). While this works correctly (any offline node is enough to fail), consider either removing thebreakto collect all offline nodes for better logging, or removing thereservesince only one node is collected.</review_comment_end>
1129-1138: Simplify redundant status check.The condition
if (task->Status() != crane::grpc::Suspended)at line 1129 should always be true at this point—the task was verified as Running earlier (lines 1021-1026) and no code path between changes it to Suspended. The check adds no value and can be simplified.Apply this diff:
TaskInCtld* task = iter->second.get(); - if (task->Status() != crane::grpc::Suspended) { - task->SetStatus(crane::grpc::Suspended); - if (!g_embedded_db_client->UpdateRuntimeAttrOfTaskIfExists( - 0, task->TaskDbId(), task->RuntimeAttr())) { - CRANE_ERROR("Failed to persist suspended status for task #{}", - task_id); - results.emplace_back(CraneErrCode::ERR_GENERIC_FAILURE); - continue; - } + task->SetStatus(crane::grpc::Suspended); + if (!g_embedded_db_client->UpdateRuntimeAttrOfTaskIfExists( + 0, task->TaskDbId(), task->RuntimeAttr())) { + CRANE_ERROR("Failed to persist suspended status for task #{}", + task_id); + results.emplace_back(CraneErrCode::ERR_GENERIC_FAILURE); + continue; }</review_comment_end>
1179-1186: Minor: Offline check breaks early but reserves space for all nodes.Same pattern as
SuspendRunningTasks: the loop breaks after the first offline node, sooffline_nodeswill contain at most one node, yet space is reserved for all. Consider either removing thebreakto collect all offline nodes for comprehensive logging, or removing thereserve.</review_comment_end>
1269-1277: Simplify redundant status check.The condition
if (task->Status() != crane::grpc::Running)at line 1269 should always be true—the task was verified as Suspended earlier (lines 1163-1168) and no intervening code changes it to Running. This check is redundant and can be removed.Apply this diff:
TaskInCtld* task = iter->second.get(); - if (task->Status() != crane::grpc::Running) { - task->SetStatus(crane::grpc::Running); - if (!g_embedded_db_client->UpdateRuntimeAttrOfTaskIfExists( - 0, task->TaskDbId(), task->RuntimeAttr())) { - CRANE_ERROR("Failed to persist resumed status for task #{}", task_id); - results.emplace_back(CraneErrCode::ERR_GENERIC_FAILURE); - continue; - } + task->SetStatus(crane::grpc::Running); + if (!g_embedded_db_client->UpdateRuntimeAttrOfTaskIfExists( + 0, task->TaskDbId(), task->RuntimeAttr())) { + CRANE_ERROR("Failed to persist resumed status for task #{}", task_id); + results.emplace_back(CraneErrCode::ERR_GENERIC_FAILURE); + continue; }</review_comment_end>
src/Craned/Supervisor/TaskManager.cpp (1)
1961-1998: Suspend/Resume orchestration: confirm external state propagation and timer semantics
- No status change is reported upstream here. If Ctld/Craned rely on supervisor notifications, wire a StepStatusChange (Suspended/Running) or confirm higher layers already track these transitions.
- Time-limit timer continues while Suspended. If product semantics require pausing wall-time while suspended, pause/adjust the termination timer when entering Suspended and restore on Resume. Otherwise, document that wall-time continues.
Would you like a small patch to optionally pause the termination timer during Suspended?
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/CraneCtld/TaskScheduler.cpp(1 hunks)src/Craned/Supervisor/TaskManager.cpp(7 hunks)
🔇 Additional comments (7)
src/Craned/Supervisor/TaskManager.cpp (7)
25-25: Header include for signals — LGTMRequired for SIGSTOP/SIGCONT usage.
1541-1546: Async handle for task signal queue — LGTMPlumbing is consistent with existing uvw async handles.
1607-1607: State assignment on cleanup — LGTMSetting m_step_.status here aligns with one-shot cleanup and avoids clearing suspend state elsewhere.
1704-1711: SuspendJobAsync queue plumbing — LGTMFuture-based API and async signaling are appropriate.
1713-1720: ResumeJobAsync queue plumbing — LGTMSymmetric to suspend; looks good.
1907-1907: Resetting status to Running before termination — explain intentThis may surprise future readers and upstream expectations. Add a brief comment explaining why we force Running (e.g., to pass gating in suspend/resume paths) and confirm no premature status propagation to Ctld occurs here.
1024-1032: Non-destructive suspend/resume for containers — verified runtime mode Default runc run command in etc/config.yaml doesn’t include--detach/--detached, so SIGSTOP/SIGCONT via kill(-m_pid_, …) will correctly pause and resume containers.
17cb1ae to
cb1f358
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/Craned/Core/JobManager.h (1)
128-131: These methods only suspend/resume the first step, not all steps of a job.Per the PR comment from L-Xiafeng stating "The two operations (suspend and resume) should only support operations on Job objects," and as previously flagged, the current implementation uses
step_map.begin()->second.get()which arbitrarily operates on only the first step of a job. To properly support job-level suspend/resume, these methods should iterate over all steps in the job'sstep_mapand suspend/resume each one.Consider renaming to
SuspendJob/ResumeJobto clarify the job-scoped semantics, and update the implementation to iterate all steps:- CraneErrCode SuspendStep(task_id_t task_id); + CraneErrCode SuspendJob(job_id_t job_id); - CraneErrCode ResumeStep(task_id_t task_id); + CraneErrCode ResumeJob(job_id_t job_id);Then in
JobManager.cpp, iterate over all steps:CraneErrCode JobManager::SuspendJob(job_id_t job_id) { auto job_instance = m_job_map_.GetValueExclusivePtr(job_id); if (!job_instance || job_instance->step_map.empty()) { CRANE_DEBUG("Suspend request for nonexistent job #{}", job_id); return CraneErrCode::ERR_NON_EXISTENT; } CraneErrCode result = CraneErrCode::SUCCESS; for (auto& [step_id, step_inst] : job_instance->step_map) { if (step_inst->status == StepStatus::Suspended) { CRANE_TRACE("Step #{} of job #{} already suspended", step_id, job_id); continue; } auto stub = g_supervisor_keeper->GetStub(step_id); if (!stub) { CRANE_ERROR("Supervisor for step #{} of job #{} not found when suspending", step_id, job_id); result = CraneErrCode::ERR_RPC_FAILURE; continue; } CraneErrCode err = stub->SuspendJob(); if (err == CraneErrCode::SUCCESS) { step_inst->status = StepStatus::Suspended; } else { result = err; } } return result; }
🧹 Nitpick comments (2)
src/CraneCtld/TaskScheduler.cpp (2)
1316-1316: Consider usingm_rpc_worker_pool_for consistency with other RPC broadcasts.The suspend RPC broadcast uses
g_thread_pool, but other RPC operations in this file (AllocJobs,AllocSteps,FreeJobs,FreeSteps) consistently usem_rpc_worker_pool_. For consistency and to follow the established pattern where RPC-heavy operations use the dedicated RPC worker pool, consider switching tom_rpc_worker_pool_->detach_task.Based on learnings from past review comments and coding guidelines.
- g_thread_pool->detach_task([&, craned_id]() { + m_rpc_worker_pool_->detach_task([&, craned_id]() {
1457-1457: Consider usingm_rpc_worker_pool_for consistency with other RPC broadcasts.Same issue as in
SuspendRunningTasks: the resume RPC broadcast usesg_thread_pool, but should usem_rpc_worker_pool_for consistency with other RPC operations in this file.- g_thread_pool->detach_task([&, craned_id]() { + m_rpc_worker_pool_->detach_task([&, craned_id]() {
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/CraneCtld/TaskScheduler.cpp(1 hunks)src/Craned/Core/JobManager.cpp(1 hunks)src/Craned/Core/JobManager.h(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/Craned/Core/JobManager.cpp
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/Craned/Core/JobManager.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/Craned/Core/JobManager.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/Craned/Core/JobManager.h
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
🧬 Code graph analysis (2)
src/CraneCtld/TaskScheduler.cpp (1)
src/CraneCtld/TaskScheduler.h (1)
TerminateRunningTask(843-851)
src/Craned/Core/JobManager.h (1)
src/Craned/Core/JobManager.cpp (4)
SuspendStep(1180-1202)SuspendStep(1180-1180)ResumeStep(1204-1226)ResumeStep(1204-1204)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: format
🔇 Additional comments (1)
src/Craned/Core/JobManager.h (1)
128-128: Parameter type inconsistency and semantic confusion.The header declares the parameter as
task_id_t task_id, but the implementation inJobManager.cppusesstep_id_t step_id. Additionally, the parameter is used to look up inm_job_map_, which is keyed byjob_id_t(line 161). This creates confusion about whether the parameter represents a task ID, step ID, or job ID.Ensure consistency between the header and implementation, and clarify the parameter's semantics by using the appropriate type (
job_id_tif operating on jobs).⛔ Skipped due to learnings
Learnt from: L-Xiafeng Repo: PKUHPC/CraneSched PR: 399 File: src/Craned/Craned/SupervisorKeeper.h:32-32 Timestamp: 2025-05-26T11:04:30.580Z Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.Learnt from: huerni Repo: PKUHPC/CraneSched PR: 469 File: src/Craned/TaskManager.cpp:1162-1175 Timestamp: 2025-05-08T07:39:29.207Z Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.Learnt from: huerni Repo: PKUHPC/CraneSched PR: 469 File: src/Craned/TaskManager.cpp:1162-1175 Timestamp: 2025-05-08T07:39:29.207Z Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/Craned/Core/CranedServer.cpp (2)
88-122: LGTM: SuspendJobs handler correctly implements job-level suspend.The implementation correctly:
- Validates server readiness
- Uses job-level
GetStub(job_id)for job-only operations (per PR objectives)- Accumulates per-job errors and returns aggregated status
- Returns
Status::OKwith error details in the response payload (per learnings)Consider adding
CRANE_WARNlogging when individual jobs fail to suspend (lines 104-106, 110-112), similar to how success is logged at line 114. This would aid debugging when operations fail:if (!stub) { all_ok = false; reasons.emplace_back(fmt::format("job {}: Supervisor not found", job_id)); + CRANE_WARN("Job {}: Supervisor not found", job_id); continue; } CraneErrCode err = stub->SuspendJob(); if (err != CraneErrCode::SUCCESS) { all_ok = false; reasons.emplace_back(fmt::format("job {}: {}", job_id, CraneErrStr(err))); + CRANE_WARN("Job {} suspend failed: {}", job_id, CraneErrStr(err)); }
124-157: LGTM: ResumeJobs handler correctly implements job-level resume.The implementation correctly mirrors the
SuspendJobshandler structure and follows the same patterns for readiness checks, error accumulation, and response handling.Similar to
SuspendJobs, consider addingCRANE_WARNlogging for individual job failures (lines 138-142, 144-147):if (!stub) { all_ok = false; reasons.emplace_back(fmt::format("job {}: Supervisor not found", job_id)); + CRANE_WARN("Job {}: Supervisor not found", job_id); continue; } CraneErrCode err = stub->ResumeJob(); if (err != CraneErrCode::SUCCESS) { all_ok = false; reasons.emplace_back(fmt::format("job {}: {}", job_id, CraneErrStr(err))); + CRANE_WARN("Job {} resume failed: {}", job_id, CraneErrStr(err)); }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/Craned/Core/CranedServer.cpp(14 hunks)src/Craned/Core/JobManager.h(0 hunks)src/Craned/Supervisor/TaskManager.cpp(5 hunks)
💤 Files with no reviewable changes (1)
- src/Craned/Core/JobManager.h
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/Craned/Core/CranedServer.cppsrc/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Applied to files:
src/Craned/Core/CranedServer.cpp
📚 Learning: 2025-04-27T11:52:31.017Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.
Applied to files:
src/Craned/Core/CranedServer.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
🧬 Code graph analysis (2)
src/Craned/Core/CranedServer.cpp (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
SuspendJobs(270-297)SuspendJobs(270-270)ResumeJobs(299-326)ResumeJobs(299-299)
src/Craned/Supervisor/TaskManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(163-163)
🔇 Additional comments (3)
src/Craned/Core/CranedServer.cpp (3)
33-35: LGTM: Signature changes follow gRPC conventions.The refactoring of all RPC handler signatures from reference parameters to pointer parameters aligns with standard gRPC conventions and improves consistency across the codebase.
Also applies to: 43-45, 63-66, 160-162, 184-187, 294-296, 316-318, 327-329, 346-348, 363-366, 388-391, 414-417
52-54: LGTM: Map-based iteration improves batching efficiency.The refactoring from individual item lists to
job_step_ids_mapimproves efficiency for batched operations and maintains clear job-to-steps relationships. The iteration pattern is consistent across all affected handlers.Also applies to: 74-76, 169-171, 336-338
222-222: LGTM: const-correctness improvements.The addition of
constqualifiers in range-based for loops and the constructor signature improves const-correctness without changing behavior.Also applies to: 230-230, 235-235, 380-380, 430-430
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
src/CraneCtld/RpcService/CranedKeeper.cpp(27 hunks)src/CraneCtld/RpcService/CranedKeeper.h(7 hunks)src/CraneCtld/TaskScheduler.cpp(1 hunks)src/Craned/Core/CranedServer.cpp(14 hunks)
🧰 Additional context used
🧠 Learnings (19)
📓 Common learnings
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/CraneCtld/RpcService/CranedKeeper.h
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cppsrc/Craned/Core/CranedServer.cpp
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:05:26.012Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:601-602
Timestamp: 2025-05-02T07:05:26.012Z
Learning: In the CraneCtld codebase, the variables m_disconnected_ and m_registered_ in CranedStub class are already defined as std::atomic_bool, making them thread-safe for concurrent access without additional synchronization.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-27T06:47:51.751Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-27T11:52:31.017Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cppsrc/Craned/Core/CranedServer.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cppsrc/Craned/Core/CranedServer.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.h
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-02T09:30:13.014Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/JobManager.cpp:141-149
Timestamp: 2025-04-02T09:30:13.014Z
Learning: In JobManager, if a uid exists in m_uid_to_job_ids_map_, its corresponding task_ids set is guaranteed to be non-empty due to the invariant maintained in AllocJobs and FreeJobs methods.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-08-14T02:56:35.503Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 587
File: src/Craned/Supervisor/CforedClient.cpp:449-454
Timestamp: 2025-08-14T02:56:35.503Z
Learning: In CforedClient::AsyncSendRecvThread_(), the guard `if (state <= State::Registering) { continue; }` in the TIMEOUT branch only prevents premature cleanup when stopping before registration completes, but it doesn't block normal gRPC event processing. The completion queue will still deliver Prepare/Write/Read events that advance the state machine normally.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-23T07:47:44.485Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Utilities/PublicHeader/Logger.cpp:21-21
Timestamp: 2025-06-23T07:47:44.485Z
Learning: In the CraneSched codebase, static logger sinks like default_sinks in src/Utilities/PublicHeader/Logger.cpp are initialized during startup in a single-threaded context and then become read-only during normal operation, eliminating the need for synchronization protection.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.cpp
🧬 Code graph analysis (4)
src/CraneCtld/TaskScheduler.cpp (4)
src/CraneCtld/TaskScheduler.h (1)
TerminateRunningTask(843-851)src/Craned/Core/JobManager.cpp (2)
StepStatusChangeAsync(1246-1255)StepStatusChangeAsync(1246-1250)src/Craned/Core/CtldClient.cpp (2)
StepStatusChangeAsync(564-573)StepStatusChangeAsync(564-565)src/Craned/Supervisor/CranedClient.cpp (2)
StepStatusChangeAsync(46-56)StepStatusChangeAsync(46-48)
src/CraneCtld/RpcService/CranedKeeper.h (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (54)
CranedStub(28-34)Fini(36-38)Fini(36-36)l(403-403)ConfigureCraned(40-67)ConfigureCraned(40-41)AllocJobs(134-161)AllocJobs(134-135)FreeJobs(163-187)FreeJobs(163-163)AllocSteps(189-207)AllocSteps(189-190)job_id_t(208-242)ExecuteSteps(209-210)FreeSteps(244-268)FreeSteps(244-245)SuspendJobs(270-297)SuspendJobs(270-270)ResumeJobs(299-326)ResumeJobs(299-299)TerminateSteps(69-98)TerminateSteps(69-70)TerminateOrphanedSteps(100-132)TerminateOrphanedSteps(100-101)CranedKeeper(365-386)CranedKeeper(388-395)Shutdown(397-417)Shutdown(397-397)AvailableCranedCount(720-723)AvailableCranedCount(720-720)IsCranedConnected(725-728)IsCranedConnected(725-725)GetCranedStub(730-737)GetCranedStub(730-731)SetCranedConnectedCb(739-742)SetCranedConnectedCb(739-740)SetCranedDisconnectedCb(744-746)SetCranedDisconnectedCb(744-744)PutNodeIntoUnavailSet(748-756)PutNodeIntoUnavailSet(748-749)ConnectCranedNode_(758-846)ConnectCranedNode_(758-759)CranedChannelConnFailNoLock_(848-854)CranedChannelConnFailNoLock_(848-848)InitCranedStateMachine_(525-622)InitCranedStateMachine_(525-526)EstablishedCranedStateMachine_(624-691)EstablishedCranedStateMachine_(624-625)CheckNodeTimeoutAndClean(693-718)CheckNodeTimeoutAndClean(693-693)StateMonitorThreadFunc_(419-523)StateMonitorThreadFunc_(419-419)PeriodConnectCranedThreadFunc_(856-892)PeriodConnectCranedThreadFunc_(856-856)
src/CraneCtld/RpcService/CranedKeeper.cpp (3)
src/Craned/Core/CranedServer.cpp (4)
SuspendJobs(88-138)SuspendJobs(88-91)ResumeJobs(140-189)ResumeJobs(140-142)src/CraneCtld/RpcService/CranedKeeper.h (1)
UpdateLastActiveTime(102-105)src/Craned/Core/CtldClient.h (1)
UpdateLastActiveTime(193-195)
src/Craned/Core/CranedServer.cpp (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
SuspendJobs(270-297)SuspendJobs(270-270)ResumeJobs(299-326)ResumeJobs(299-299)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: format
src/CraneCtld/TaskScheduler.cpp
Outdated
| StepStatusChangeAsync(task_id, 0, craned_id, | ||
| crane::grpc::TaskStatus::Failed, | ||
| ExitCode::kExitCodeRpcError, "", absl::Now()); | ||
| } | ||
|
|
||
| results.emplace_back(code); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix bogus step_id in failure cleanup.
Both suspend/resume failure paths call StepStatusChangeAsync(task_id, 0, …); step id 0 doesn’t correspond to any real step, so CleanTaskStatusChangeQueueCb_ drops the notification as “unknown step”. When a craned is offline, that means the controller never marks the job failed even though we already tore it down, leaving m_running_task_map_ inconsistent. Use the actual daemon step id (e.g. kDaemonStepId) or iterate the real executing step IDs when fanning out the status change so the failure path reliably clears the job state.
Also applies to: 1434-1439
🤖 Prompt for AI Agents
In src/CraneCtld/TaskScheduler.cpp around lines 1292-1297 (and similarly
1434-1439), the failure paths call StepStatusChangeAsync(task_id, 0, ...) using
step id 0 which is not a real step and gets dropped by
CleanTaskStatusChangeQueueCb_; replace the hardcoded 0 with the actual daemon
step id (e.g. kDaemonStepId) or, if fanning out to multiple executing steps,
iterate the real executing step IDs and call StepStatusChangeAsync for each so
the controller correctly recognizes and clears the task state; apply the same
fix at both locations and ensure the StepStatusChangeAsync arguments match what
CleanTaskStatusChangeQueueCb_ expects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (1)
433-436: Free CqTag on early return when m_cq_closed_.You
continuewithout deleting the pmr-allocated tag, leaking the tag object. Delete the tag before continuing.- if (m_cq_closed_) { - if (tag->type == CqTag::kInitializingCraned) delete tag->craned; - continue; - } + if (m_cq_closed_) { + if (tag->type == CqTag::kInitializingCraned) delete tag->craned; + m_tag_sync_allocator_->delete_object(tag); + continue; + }
♻️ Duplicate comments (1)
src/CraneCtld/TaskScheduler.cpp (1)
1291-1296: Fix bogus step_id in failure fanout—use kDaemonStepId, not 0.StepStatusChangeAsync with step_id 0 is ignored downstream, so failed suspend/resume never clears Ctld state. Use the daemon step id for the job when broadcasting failure.
Apply this diff at both call sites:
- for (const auto& craned_id : executing_nodes) { - StepStatusChangeAsync( - task_id, 0, craned_id, crane::grpc::TaskStatus::Failed, - ExitCode::kExitCodeRpcError, "", - google::protobuf::util::TimeUtil::GetCurrentTime()); - } + for (const auto& craned_id : executing_nodes) { + StepStatusChangeAsync( + task_id, kDaemonStepId, craned_id, crane::grpc::TaskStatus::Failed, + ExitCode::kExitCodeRpcError, "Suspend/Resume RPC failure", + google::protobuf::util::TimeUtil::GetCurrentTime()); + }Also applies to: 1435-1439
🧹 Nitpick comments (4)
src/CraneCtld/TaskScheduler.cpp (1)
1315-1343: Unify RPC worker pool for broadcast.Elsewhere this class uses m_rpc_worker_pool_ for node fanouts; here it uses g_thread_pool. For consistency and easier load control, prefer m_rpc_worker_pool_->detach_task.
- for (const auto& craned_id : executing_nodes) { - g_thread_pool->detach_task([&, craned_id]() { + for (const auto& craned_id : executing_nodes) { + m_rpc_worker_pool_->detach_task([&, craned_id]() { ... }); }Also applies to: 1458-1485
src/CraneCtld/RpcService/CranedKeeper.cpp (1)
269-297: Optional: align server error semantics to avoid non-OK gRPC status disconnects.Ctld handles non-OK as RPC failure. If feasible, make Craned return Status::OK with reply.ok=false (see repo practice) to avoid needless channel disruptions. Code here can remain unchanged.
Based on learnings
Also applies to: 299-326
src/CraneCtld/RpcService/CranedKeeper.h (2)
194-197: Remove unused/undefined declarationCranedChannelConnectFail_.Only
CranedChannelConnFailNoLock_is defined/used. Drop the unused declaration to avoid confusion.- static void CranedChannelConnectFail_(CranedStub* stub); - static void CranedChannelConnFailNoLock_(CranedStub* stub);
74-77: Name/type clarity:FreeJobstakes task_id_t but fills job_id_list.Types are aliases, but parameter naming (
task) conflicts with proto fieldjob_id_list(). Consider renaming tojob_idsfor readability or switch tostd::vector<job_id_t>.- CraneErrCode FreeJobs(const std::vector<task_id_t>& task); + CraneErrCode FreeJobs(const std::vector<job_id_t>& job_ids);And in the .cpp use
job_idswhen assigning torequest.mutable_job_id_list().Also applies to: 163-176
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/CraneCtld/RpcService/CranedKeeper.cpp(27 hunks)src/CraneCtld/RpcService/CranedKeeper.h(7 hunks)src/CraneCtld/TaskScheduler.cpp(1 hunks)
🧰 Additional context used
🧠 Learnings (19)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/CraneCtld/RpcService/CranedKeeper.h
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:05:26.012Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:601-602
Timestamp: 2025-05-02T07:05:26.012Z
Learning: In the CraneCtld codebase, the variables m_disconnected_ and m_registered_ in CranedStub class are already defined as std::atomic_bool, making them thread-safe for concurrent access without additional synchronization.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-27T06:47:51.751Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-27T11:52:31.017Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.h
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-02T09:30:13.014Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/JobManager.cpp:141-149
Timestamp: 2025-04-02T09:30:13.014Z
Learning: In JobManager, if a uid exists in m_uid_to_job_ids_map_, its corresponding task_ids set is guaranteed to be non-empty due to the invariant maintained in AllocJobs and FreeJobs methods.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-08-14T02:56:35.503Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 587
File: src/Craned/Supervisor/CforedClient.cpp:449-454
Timestamp: 2025-08-14T02:56:35.503Z
Learning: In CforedClient::AsyncSendRecvThread_(), the guard `if (state <= State::Registering) { continue; }` in the TIMEOUT branch only prevents premature cleanup when stopping before registration completes, but it doesn't block normal gRPC event processing. The completion queue will still deliver Prepare/Write/Read events that advance the state machine normally.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.cpp
🧬 Code graph analysis (3)
src/CraneCtld/TaskScheduler.cpp (3)
src/CraneCtld/TaskScheduler.h (1)
TerminateRunningTask(843-851)src/Craned/Core/JobManager.cpp (2)
StepStatusChangeAsync(1246-1255)StepStatusChangeAsync(1246-1250)src/Craned/Core/CtldClient.cpp (2)
StepStatusChangeAsync(564-573)StepStatusChangeAsync(564-565)
src/CraneCtld/RpcService/CranedKeeper.h (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (54)
CranedStub(28-34)Fini(36-38)Fini(36-36)l(403-403)ConfigureCraned(40-67)ConfigureCraned(40-41)AllocJobs(134-161)AllocJobs(134-135)FreeJobs(163-187)FreeJobs(163-163)AllocSteps(189-207)AllocSteps(189-190)job_id_t(208-242)ExecuteSteps(209-210)FreeSteps(244-268)FreeSteps(244-245)SuspendJobs(270-297)SuspendJobs(270-270)ResumeJobs(299-326)ResumeJobs(299-299)TerminateSteps(69-98)TerminateSteps(69-70)TerminateOrphanedSteps(100-132)TerminateOrphanedSteps(100-101)CranedKeeper(365-386)CranedKeeper(388-395)Shutdown(397-417)Shutdown(397-397)AvailableCranedCount(720-723)AvailableCranedCount(720-720)IsCranedConnected(725-728)IsCranedConnected(725-725)GetCranedStub(730-737)GetCranedStub(730-731)SetCranedConnectedCb(739-742)SetCranedConnectedCb(739-740)SetCranedDisconnectedCb(744-746)SetCranedDisconnectedCb(744-744)PutNodeIntoUnavailSet(748-756)PutNodeIntoUnavailSet(748-749)CranedChannelConnFailNoLock_(848-854)CranedChannelConnFailNoLock_(848-848)ConnectCranedNode_(758-846)ConnectCranedNode_(758-759)InitCranedStateMachine_(525-622)InitCranedStateMachine_(525-526)EstablishedCranedStateMachine_(624-691)EstablishedCranedStateMachine_(624-625)CheckNodeTimeoutAndClean(693-718)CheckNodeTimeoutAndClean(693-693)StateMonitorThreadFunc_(419-523)StateMonitorThreadFunc_(419-419)PeriodConnectCranedThreadFunc_(856-892)PeriodConnectCranedThreadFunc_(856-856)
src/CraneCtld/RpcService/CranedKeeper.cpp (2)
src/Craned/Core/CranedServer.cpp (4)
SuspendJobs(88-138)SuspendJobs(88-91)ResumeJobs(140-189)ResumeJobs(140-142)src/CraneCtld/RpcService/CranedKeeper.h (1)
UpdateLastActiveTime(102-105)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/Craned/Supervisor/TaskManager.cpp (1)
2154-2207: Consider preserving the first error instead of the last.The error aggregation logic (lines 2172, 2199) overwrites
resulteach time a task operation fails. If multiple tasks fail, only the last error is returned, losing information about earlier failures.Apply this pattern to preserve the first error encountered:
CraneErrCode err = task->Suspend(); if (err != CraneErrCode::SUCCESS) { CRANE_ERROR("Failed to suspend task #{}: {}", task_id, CraneErrStr(err)); - result = err; + if (result == CraneErrCode::SUCCESS) result = err; } else {Apply the same change in
ResumeSuspendedTasks_()at line 2199.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/Craned/Supervisor/TaskManager.cpp(6 hunks)
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
🧬 Code graph analysis (1)
src/Craned/Supervisor/TaskManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(163-163)
🔇 Additional comments (5)
src/Craned/Supervisor/TaskManager.cpp (5)
113-125: LGTM! GetTaskInstance now safely handles missing tasks.Changing from
at()tofind()with a nullptr return properly addresses the previous issue where iteratingm_step_.task_idscould throwstd::out_of_rangefor tasks not yet instantiated or already cleaned up.
608-636: LGTM! Process group signaling is correct.The suspend/resume implementations correctly use negative PID (
-m_pid_) to signal the entire process group, which ensures child processes are also suspended/resumed. Error handling is appropriate.
1104-1225: LGTM! Container suspend/resume with appropriate fallback.The implementation correctly:
- Checks container state before applying OCI runtime pause/resume commands
- Falls back to process group signaling when runtime commands are unavailable or container is in an unexpected state
- Mirrors the defensive pattern used in the
Kill()method
1835-1861: LGTM! Async API follows established patterns.The suspend/resume async methods correctly follow the same pattern as other async operations in TaskManager (enqueue element with promise, send async event, return future).
1662-1666: LGTM! Async handle setup and callback are correct.The task signal async handle is properly initialized in the constructor and the callback correctly:
- Dequeues signal requests
- Dispatches to appropriate helper method based on action type
- Fulfills the promise with the operation result
Also applies to: 2136-2152
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/Craned/Supervisor/TaskManager.cpp (1)
2163-2177: Error aggregation returns the last error instead of the first.The current implementation overwrites
resultfor each failed task, meaning the returned error code corresponds to the last failure rather than the first. While continuing to process all tasks is correct (to ensure partial success), returning the first error is typically more useful for debugging.Consider tracking the first error:
CraneErrCode result = CraneErrCode::SUCCESS; for (const auto& task_id : m_step_.task_ids) { auto* task = m_step_.GetTaskInstance(task_id); if (task) { CraneErrCode err = task->Suspend(); if (err != CraneErrCode::SUCCESS) { CRANE_ERROR("Failed to suspend task #{}: {}", task_id, CraneErrStr(err)); - result = err; + if (result == CraneErrCode::SUCCESS) result = err; } else { CRANE_DEBUG("Task #{} suspended successfully", task_id); } } }Apply the same change to
ResumeSuspendedTasks_()as well.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/Craned/Supervisor/TaskManager.cpp(6 hunks)
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
🧬 Code graph analysis (1)
src/Craned/Supervisor/TaskManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(163-163)
🔇 Additional comments (6)
src/Craned/Supervisor/TaskManager.cpp (6)
112-126: LGTM! Safe lookup pattern.The change from
at()tofind()with nullptr return is correct and addresses the past review concern about crashes when task IDs aren't instantiated. Callers now properly handle missing tasks.
608-636: LGTM! Process group suspend/resume is correct.The implementation correctly uses
kill(-m_pid_, SIGSTOP/SIGCONT)to suspend/resume the entire process group. Error handling and logging are appropriate.
1104-1225: LGTM! Container suspend/resume with fallback.The implementation correctly attempts OCI runtime pause/resume first (if configured), then falls back to process group signaling. The state checking and fallback strategy are sound.
1662-1666: LGTM! Async handle wiring.The new signal queue handle follows the established pattern used by other async handlers in TaskManager.
1835-1861: LGTM! Async API structure.The suspend/resume async APIs follow the established pattern for async operations in TaskManager, correctly using promises/futures and queue-based communication.
2182-2207: Add status check before resuming tasks.Similar to suspend, the resume operation should verify the step state before attempting to resume tasks. The step should be in a suspended or paused state for resume to be valid.
Based on learnings.
Apply this diff to add the status check:
CraneErrCode TaskManager::ResumeSuspendedTasks_() { CRANE_DEBUG("Resuming all suspended tasks"); + if (g_runtime_status.Status.load() != StepStatus::Running) { + CRANE_DEBUG("Step is not in Running state, cannot resume."); + return CraneErrCode::ERR_INVALID_PARAM; + } + // Daemon steps don't have actual task instances to resume if (m_step_.IsDaemon()) { CRANE_DEBUG("Daemon step cannot be resumed, skipping.");⛔ Skipped due to learnings
Learnt from: huerni Repo: PKUHPC/CraneSched PR: 469 File: src/Craned/TaskManager.cpp:1162-1175 Timestamp: 2025-05-08T07:39:29.207Z Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.Learnt from: huerni Repo: PKUHPC/CraneSched PR: 469 File: src/Craned/TaskManager.cpp:1162-1175 Timestamp: 2025-05-08T07:39:29.207Z Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().Learnt from: L-Xiafeng Repo: PKUHPC/CraneSched PR: 399 File: src/Craned/Craned/SupervisorKeeper.h:32-32 Timestamp: 2025-05-26T11:04:30.580Z Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
| CraneErrCode TaskManager::SuspendRunningTasks_() { | ||
| CRANE_DEBUG("Suspending all running tasks"); | ||
|
|
||
| // Daemon steps don't have actual task instances to suspend | ||
| if (m_step_.IsDaemon()) { | ||
| CRANE_DEBUG("Daemon step cannot be suspended, skipping."); | ||
| return CraneErrCode::ERR_GENERIC_FAILURE; | ||
| } | ||
|
|
||
| CraneErrCode result = CraneErrCode::SUCCESS; | ||
|
|
||
| for (const auto& task_id : m_step_.task_ids) { | ||
| auto* task = m_step_.GetTaskInstance(task_id); | ||
| if (task) { | ||
| CraneErrCode err = task->Suspend(); | ||
| if (err != CraneErrCode::SUCCESS) { | ||
| CRANE_ERROR("Failed to suspend task #{}: {}", task_id, | ||
| CraneErrStr(err)); | ||
| result = err; | ||
| } else { | ||
| CRANE_DEBUG("Task #{} suspended successfully", task_id); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return result; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add status check before suspending tasks.
The suspend operation should verify that the step is in a Running state before attempting to suspend tasks. Without this check, the method might attempt to suspend tasks that are in Completing, Pending, or other non-running states, leading to unexpected behavior.
Based on learnings.
Apply this diff to add the status check:
CraneErrCode TaskManager::SuspendRunningTasks_() {
CRANE_DEBUG("Suspending all running tasks");
+ if (g_runtime_status.Status.load() != StepStatus::Running) {
+ CRANE_DEBUG("Step is not in Running state, cannot suspend.");
+ return CraneErrCode::ERR_INVALID_PARAM;
+ }
+
// Daemon steps don't have actual task instances to suspend
if (m_step_.IsDaemon()) {
CRANE_DEBUG("Daemon step cannot be suspended, skipping.");Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/Craned/Supervisor/TaskManager.cpp around lines 2154 to 2180, add a guard
that checks the step's status before iterating and suspending tasks: if the step
is not in the Running state (e.g. via m_step_.IsRunning() or comparing
m_step_.status), log a debug/info message and return an appropriate error code
(such as CraneErrCode::ERR_INVALID_STATE) to avoid attempting Suspend on tasks
when the step is Completing/Pending/etc.; keep the existing daemon check and
existing task suspension loop unchanged otherwise.
0f3ca65 to
d881699
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
README_EN.md(1 hunks)docs/en/command/ccontrol.md(4 hunks)docs/zh/command/ccontrol.md(5 hunks)protos/Crane.proto(4 hunks)protos/PublicDefs.proto(1 hunks)protos/Supervisor.proto(2 hunks)src/CraneCtld/TaskScheduler.h(1 hunks)src/Craned/Core/SupervisorKeeper.cpp(2 hunks)src/Craned/Core/SupervisorKeeper.h(1 hunks)src/Craned/Supervisor/SupervisorServer.cpp(2 hunks)src/Craned/Supervisor/SupervisorServer.h(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- README_EN.md
🚧 Files skipped from review as they are similar to previous changes (5)
- src/Craned/Supervisor/SupervisorServer.h
- protos/Supervisor.proto
- protos/PublicDefs.proto
- src/CraneCtld/TaskScheduler.h
- protos/Crane.proto
🧰 Additional context used
🧠 Learnings (13)
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/Craned/Core/SupervisorKeeper.hsrc/Craned/Supervisor/SupervisorServer.cppsrc/Craned/Core/SupervisorKeeper.cpp
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.
Applied to files:
src/Craned/Core/SupervisorKeeper.hsrc/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/Craned/Supervisor/SupervisorServer.cppsrc/Craned/Core/SupervisorKeeper.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Applied to files:
src/Craned/Supervisor/SupervisorServer.cppsrc/Craned/Core/SupervisorKeeper.cpp
📚 Learning: 2025-05-25T04:08:03.273Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/Supervisor.cpp:68-69
Timestamp: 2025-05-25T04:08:03.273Z
Learning: In the Crane supervisor component (src/Craned/Supervisor/), the supervisor process communicates with Craned through STDOUT using protobuf messages. The supervisor must not send any information to STDOUT before sending the "ready" message to Craned, as this would interfere with the inter-process communication protocol. Therefore, adding logging statements that might write to STDOUT before the ready message is sent could break the communication.
Applied to files:
src/Craned/Supervisor/SupervisorServer.cppsrc/Craned/Core/SupervisorKeeper.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-05-26T11:04:56.055Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:72-72
Timestamp: 2025-05-26T11:04:56.055Z
Learning: The CraneSched project prefers using global variables (like `g_supervisor_keeper`) over dependency injection patterns. The team does not follow dependency injection approaches for managing singleton instances.
Applied to files:
src/Craned/Supervisor/SupervisorServer.cppsrc/Craned/Core/SupervisorKeeper.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
Applied to files:
src/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.
Applied to files:
src/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).
Applied to files:
src/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.
Applied to files:
src/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.
Applied to files:
src/Craned/Core/SupervisorKeeper.cpp
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.
Applied to files:
src/Craned/Core/SupervisorKeeper.cpp
🧬 Code graph analysis (3)
src/Craned/Core/SupervisorKeeper.h (2)
src/Craned/Core/SupervisorKeeper.cpp (4)
SuspendJob(93-111)SuspendJob(93-93)ResumeJob(113-131)ResumeJob(113-113)src/Craned/Supervisor/SupervisorServer.cpp (4)
SuspendJob(95-109)SuspendJob(95-98)ResumeJob(111-125)ResumeJob(111-114)
src/Craned/Supervisor/SupervisorServer.cpp (1)
src/Craned/Core/SupervisorKeeper.cpp (4)
SuspendJob(93-111)SuspendJob(93-93)ResumeJob(113-131)ResumeJob(113-113)
src/Craned/Core/SupervisorKeeper.cpp (1)
src/Craned/Supervisor/SupervisorServer.cpp (4)
SuspendJob(95-109)SuspendJob(95-98)ResumeJob(111-125)ResumeJob(111-114)
🪛 LanguageTool
docs/en/command/ccontrol.md
[grammar] ~265-~265: Ensure spelling is correct
Context: ...job_id (required): ID of the job to suspend Description: - Can only suspend jobs ...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~268-~268: Ensure spelling is correct
Context: ...n:** - Can only suspend jobs in RUNNING state - After suspension, job status changes to ...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~271-~271: Ensure spelling is correct
Context: ... data and state are preserved for later resumption Example: ```bash # Suspend a job ccon...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~289-~289: Ensure spelling is correct
Context: ...job_id (required): ID of the job to resume Description: - Can only resume jobs i...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~292-~292: Ensure spelling is correct
Context: ...:** - Can only resume jobs in SUSPENDED state - After resumption, the job reallocates re...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~293-~293: Ensure spelling is correct
Context: ...job reallocates resources and continues execution - The job continues from where it was susp...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~307-~307: Ensure spelling is correct
Context: ...iting for resources to become available again ### Create and Delete Commands #### Create...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (6)
src/Craned/Core/SupervisorKeeper.h (1)
39-40: LGTM! Method declarations follow established patterns.The new
SuspendJob()andResumeJob()method declarations are consistent with the existing RPC wrapper methods inSupervisorStub. The signatures follow the same pattern (no parameters, returnCraneErrCode) as other control methods likeExecuteStep()andShutdownSupervisor().docs/en/command/ccontrol.md (1)
253-308: Documentation is comprehensive and accurate.The suspend/resume documentation is well-structured and clearly explains:
- Command syntax and parameters
- State requirements (RUNNING for suspend, SUSPENDED for resume)
- Resource behavior (released on suspend, reallocated on resume)
- Key differences from hold/release operations
The static analysis grammar warnings are false positives—the flagged words (suspend, resumption, execution) are correctly spelled technical terms.
Note: As with the Chinese documentation, verify that the TaskManager implementation enforces the documented state restrictions.
src/Craned/Supervisor/SupervisorServer.cpp (2)
23-23: Include is correctly added.The
crane/PublicHeader.hinclude is necessary for theCraneErrStr()function used in the new RPC handlers (lines 106, 122).
95-125: RPC handlers follow established patterns correctly.Both
SuspendJobandResumeJobimplementations follow the established async pattern used throughout the codebase:
- Call the async method on
g_task_mgr- Wait for the future to complete
- Set response fields based on the result
- Return
Status::OK(per project convention to keep gRPC connections alive)The previous naming inconsistency issue (ResumeTask vs ResumeJob) has been resolved—both methods now consistently use the "Job" suffix.
Based on learnings.
src/Craned/Core/SupervisorKeeper.cpp (2)
22-22: Include is correctly added.The
crane/PublicHeader.hinclude is necessary for theCraneErrStr()function used on lines 106, 108, 126, and 128.
93-131: Client stub implementations correctly handle errors.Both
SuspendJob()andResumeJob()implementations properly handle RPC failures:
- Transport errors: Lines 99-102 and 119-122 check
status.ok()first and returnERR_RPC_FAILUREfor gRPC transport failures- Application errors: Lines 105-110 and 125-130 map specific
reply.reason()strings to appropriate error codesThis follows the established pattern in the codebase and addresses the previous review comments about transport error handling.
d881699 to
0f3ca65
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (6)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)
599-644: Job-only validation for suspend/resume operations.Per the PR requirements, suspend and resume should only support Job objects, not Steps. The current implementation delegates validation to
SuspendRunningTasks()andResumeSuspendedTasks(), but explicit validation at the RPC handler level would make the Job-only restriction clearer and prevent potential issues if the architecture changes.Based on PR requirements comment from L-Xiafeng.
src/Craned/Core/Craned.cpp (1)
633-647: Make RuntimePause and RuntimeResume optional to preserve backward compatibility.The current implementation exits if
RuntimePauseorRuntimeResumeare missing, but the Supervisor'sContainerInstance::Suspend/Resumemethods provide fallback to SIGSTOP/SIGCONT when these commands are not configured. Making these fields required unnecessarily breaks existing deployments.Apply this diff to make the fields optional:
- if (container_config["RuntimePause"]) { - g_config.Container.RuntimePause = - container_config["RuntimePause"].as<std::string>(); - } else { - CRANE_ERROR("RuntimePause is not configured."); - std::exit(1); - } - - if (container_config["RuntimeResume"]) { - g_config.Container.RuntimeResume = - container_config["RuntimeResume"].as<std::string>(); - } else { - CRANE_ERROR("RuntimeResume is not configured."); - std::exit(1); - } + g_config.Container.RuntimePause = + YamlValueOr(container_config["RuntimePause"], ""); + g_config.Container.RuntimeResume = + YamlValueOr(container_config["RuntimeResume"], "");src/CraneCtld/TaskScheduler.cpp (1)
1283-1299: Use real step id instead of0inStepStatusChangeAsyncon suspend/resume failureIn both
handle_failurelambdas (Lines 1283-1299 and 1426-1442), you call:StepStatusChangeAsync( task_id, 0, craned_id, crane::grpc::TaskStatus::Failed, ExitCode::kExitCodeRpcError, "", google::protobuf::util::TimeUtil::GetCurrentTime());Step id
0does not correspond to any real step, soCleanTaskStatusChangeQueueCb_()drops these updates as “unknown step”, and the job’s in‑memory/DB status may never be transitioned to a final state ifTerminateStepsRPCs fail. This is the same issue that was raised previously and is still present here.You already use
kDaemonStepIdfor failure fan‑out inTerminateTasksOnCranedand wire daemon steps specially inCleanTaskStatusChangeQueueCb_(), so reusing that here keeps the failure paths consistent.Consider updating both suspend and resume failure paths like this:
- for (const auto& craned_id : executing_nodes) { - StepStatusChangeAsync( - task_id, 0, craned_id, crane::grpc::TaskStatus::Failed, - ExitCode::kExitCodeRpcError, "", - google::protobuf::util::TimeUtil::GetCurrentTime()); - } + auto now = google::protobuf::util::TimeUtil::GetCurrentTime(); + for (const auto& craned_id : executing_nodes) { + StepStatusChangeAsync( + task_id, kDaemonStepId, craned_id, + crane::grpc::TaskStatus::Failed, ExitCode::kExitCodeRpcError, "", + now); + }Apply the same change in both
SuspendRunningTasksandResumeSuspendedTasks.Also applies to: 1426-1442
src/Craned/Supervisor/TaskManager.cpp (2)
1104-1225: Container-specific Suspend/Resume correctly avoid destructive Kill pathThe new
ContainerInstance::Suspend()/Resume()overrides use the OCI runtimestateplusRuntimePause/RuntimeResumecommands when the container is in"running"/"paused"states, and fall back to the baseITaskInstance::Suspend()/Resume()(process-groupSIGSTOP/SIGCONT) only on parse errors, missing commands, or unexpected states.This addresses the earlier design problem where suspend/resume went through
ContainerInstance::Kill(SIGSTOP/SIGCONT)and triggered the full kill/delete workflow. Container workloads are now paused via the runtime instead of being torn down, with a reasonable best-effort signal fallback.No additional issues from this change.
1622-1667: Suspend/Resume async pipeline is consistent, but consider guarding on step stateThe new
m_task_signal_async_handle_+SuspendJobAsync/ResumeJobAsync+EvCleanTaskSignalQueueCb_wiring matches the existing async patterns (terminate/time-limit) and looks sound: requests are funneled onto the UV loop, a singleTaskSignalQueueElemcarries the action and promise, andSuspendRunningTasks_/ResumeSuspendedTasks_aggregate per-task results correctly (first non-success is preserved).Two points worth tightening:
Step state guard
UnlikeEvCleanChangeTaskTimeLimitQueueCb_andEvCleanTerminateTaskQueueCb_, the suspend/resume path does not checkg_runtime_status.CanStepOperate()or the step’s own status before iterating tasks. If a suspend arrives while the step isCompletingor already finished, you’ll either:
- get
ERR_NON_EXISTENTfrom tasks withm_pid_ == 0, or- silently do nothing but still report
SUCCESSwhen all tasks are gone.To make behaviour predictable and consistent with other operations, consider adding a guard near the top of
SuspendRunningTasks_()andResumeSuspendedTasks_()(or before dispatch inEvCleanTaskSignalQueueCb_()) that:
- returns an explicit invalid-state error if the step is not
Running/Suspended, or- short-circuits to
SUCCESSwhenm_step_.AllTaskFinished().This also aligns with the prior suggestion to ensure suspends only happen from a Running state.
Live-task iteration
Iteratingm_step_.task_idsand skippingnullptrfromGetTaskInstance()is safe and avoids the previousat()-throw issue. If you ever want “only actually running tasks”, switching to iteratem_pid_task_id_map_ | std::views::values(similar to terminate) would more directly reflect instantiated tasks, but that’s an optional refinement.None of this is blocking, but adding the state guard would make the suspend/resume semantics clearer and less error-prone under races.
Also applies to: 1835-1861, 2136-2209
src/CraneCtld/RpcService/CranedKeeper.cpp (1)
163-187: FreeJobs should take job IDs, not task IDs (type/name mismatch)
CranedStub::FreeJobscurrently takesconst std::vector<task_id_t>& taskbut fillsFreeJobsRequest::job_id_listfrom it:request.mutable_job_id_list()->Assign(task.begin(), task.end());Semantically this RPC operates on job IDs, and a maintainer has already pointed out the expected type (
vector<job_id_t>). Keepingtask_id_there is confusing and makes it easier to misuse the API from call sites.Consider changing both the parameter type and name, e.g.:
-CraneErrCode CranedStub::FreeJobs(const std::vector<task_id_t>& task) { +CraneErrCode CranedStub::FreeJobs(const std::vector<job_id_t>& job_ids) { ... - request.mutable_job_id_list()->Assign(task.begin(), task.end()); + request.mutable_job_id_list()->Assign(job_ids.begin(), job_ids.end());This aligns the stub’s signature with the proto and the PR’s focus on job-level operations.
🧹 Nitpick comments (2)
src/Craned/Supervisor/TaskManager.cpp (1)
112-126: GetTaskInstance now returns nullptr; ensure deref sites respect the new contractSwitching both
StepInstance::GetTaskInstanceoverloads tofind()and returningnullptron miss is good for robustness; it removes the risk ofat()throwing. However, several call sites (e.g.EvCleanSigchldQueueCb_andEvCleanTaskStopQueueCb_) still immediately dereference the returned pointer without checks, relying on invariants aboutm_task_map_.Consider either:
- adding
CRANE_ASSERT(task != nullptr)at those deref sites, or- doing a defensive null check with a log+early-return path,
to make the invariants explicit and failure mode clearer if the map gets out of sync in future refactors.
src/Craned/Core/CranedServer.cpp (1)
88-145: SuspendJobs/ResumeJobs job-level handling is sound; verify “job not found on this node” semanticsThe new
SuspendJobs/ResumeJobshandlers correctly:
- Enforce
ReadyFor(RequestSource::CTLD)like existing control RPCs.- Use
g_job_mgr->GetAllocatedJobSteps()to discover per-job steps on this Craned.- Skip daemon steps (
step_id == 0).- Call
SuspendJob()/ResumeJob()on each supervisor stub and aggregate errors intoreasons, returningok=falseplus a detailed reason string when any step fails.One behavioural detail to double-check with the overall design: when a requested
job_idis not present injob_steps, you currently treat this as an error (all_ok = false, reason"job X: Job not found"). That’s correct if Ctld only sends suspend/resume requests to nodes that are known to host the job and the absence indicates an inconsistency. However, if the RPC can reach nodes that never had or no longer have steps for that job (e.g., after cleanup), it may be more appropriate to log and ignore the missing job on that node instead of failing the whole RPC.Please confirm that the Ctld-side
CranedStub::SuspendJobs/ResumeJobsdispatch only targets nodes that still own allocated steps for the given jobs; if not, consider relaxing the “job not found” case to be non-fatal on that craned.Also applies to: 146-201
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (21)
protos/Crane.proto(3 hunks)protos/PublicDefs.proto(1 hunks)protos/Supervisor.proto(3 hunks)src/CraneCtld/RpcService/CranedKeeper.cpp(27 hunks)src/CraneCtld/RpcService/CranedKeeper.h(7 hunks)src/CraneCtld/RpcService/CtldGrpcServer.cpp(1 hunks)src/CraneCtld/TaskScheduler.cpp(1 hunks)src/CraneCtld/TaskScheduler.h(1 hunks)src/Craned/Core/Craned.cpp(1 hunks)src/Craned/Core/CranedPublicDefs.h(1 hunks)src/Craned/Core/CranedServer.cpp(14 hunks)src/Craned/Core/CranedServer.h(1 hunks)src/Craned/Core/JobManager.h(0 hunks)src/Craned/Core/SupervisorKeeper.cpp(2 hunks)src/Craned/Core/SupervisorKeeper.h(1 hunks)src/Craned/Supervisor/Supervisor.cpp(1 hunks)src/Craned/Supervisor/SupervisorPublicDefs.h(1 hunks)src/Craned/Supervisor/SupervisorServer.cpp(2 hunks)src/Craned/Supervisor/SupervisorServer.h(1 hunks)src/Craned/Supervisor/TaskManager.cpp(6 hunks)src/Craned/Supervisor/TaskManager.h(9 hunks)
💤 Files with no reviewable changes (1)
- src/Craned/Core/JobManager.h
🚧 Files skipped from review as they are similar to previous changes (4)
- protos/PublicDefs.proto
- protos/Crane.proto
- src/CraneCtld/TaskScheduler.h
- src/Craned/Supervisor/SupervisorServer.h
🧰 Additional context used
🧠 Learnings (25)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/Craned/Core/Craned.cppsrc/Craned/Supervisor/TaskManager.hsrc/Craned/Supervisor/SupervisorServer.cppsrc/CraneCtld/RpcService/CranedKeeper.hsrc/Craned/Core/CranedServer.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.
Applied to files:
src/Craned/Core/Craned.cppsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/TaskScheduler.cppsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.hsrc/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/TaskScheduler.cppsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.hsrc/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.hsrc/Craned/Supervisor/SupervisorServer.cppsrc/CraneCtld/RpcService/CranedKeeper.hsrc/Craned/Core/CranedServer.cppsrc/Craned/Core/CranedServer.hsrc/Craned/Core/SupervisorKeeper.cppsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/CraneCtld/RpcService/CranedKeeper.h
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.hsrc/Craned/Supervisor/SupervisorServer.cppsrc/Craned/Core/SupervisorKeeper.hsrc/Craned/Core/SupervisorKeeper.cpp
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.
Applied to files:
src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Applied to files:
src/Craned/Supervisor/TaskManager.hsrc/Craned/Supervisor/SupervisorServer.cppsrc/CraneCtld/RpcService/CranedKeeper.hsrc/Craned/Core/CranedServer.cppsrc/Craned/Core/CranedServer.hsrc/Craned/Core/SupervisorKeeper.cppsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.
Applied to files:
src/Craned/Supervisor/TaskManager.hsrc/Craned/Supervisor/SupervisorServer.cppsrc/CraneCtld/RpcService/CranedKeeper.hsrc/Craned/Core/SupervisorKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-25T04:08:03.273Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/Supervisor.cpp:68-69
Timestamp: 2025-05-25T04:08:03.273Z
Learning: In the Crane supervisor component (src/Craned/Supervisor/), the supervisor process communicates with Craned through STDOUT using protobuf messages. The supervisor must not send any information to STDOUT before sending the "ready" message to Craned, as this would interfere with the inter-process communication protocol. Therefore, adding logging statements that might write to STDOUT before the ready message is sent could break the communication.
Applied to files:
src/Craned/Supervisor/SupervisorServer.cppsrc/Craned/Core/CranedServer.hsrc/Craned/Core/SupervisorKeeper.cpp
📚 Learning: 2025-05-26T11:04:56.055Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:72-72
Timestamp: 2025-05-26T11:04:56.055Z
Learning: The CraneSched project prefers using global variables (like `g_supervisor_keeper`) over dependency injection patterns. The team does not follow dependency injection approaches for managing singleton instances.
Applied to files:
src/Craned/Supervisor/SupervisorServer.cppsrc/Craned/Core/SupervisorKeeper.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
Applied to files:
src/Craned/Supervisor/SupervisorServer.cppsrc/CraneCtld/RpcService/CranedKeeper.h
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).
Applied to files:
src/Craned/Supervisor/SupervisorServer.cppsrc/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.
Applied to files:
src/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-05-02T07:05:26.012Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:601-602
Timestamp: 2025-05-02T07:05:26.012Z
Learning: In the CraneCtld codebase, the variables m_disconnected_ and m_registered_ in CranedStub class are already defined as std::atomic_bool, making them thread-safe for concurrent access without additional synchronization.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-27T06:47:51.751Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-27T11:52:31.017Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/Craned/Core/CranedServer.cppsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/Craned/Core/SupervisorKeeper.cppsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.
Applied to files:
src/Craned/Core/CranedServer.hsrc/Craned/Core/SupervisorKeeper.cppsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-02T09:30:13.014Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/JobManager.cpp:141-149
Timestamp: 2025-04-02T09:30:13.014Z
Learning: In JobManager, if a uid exists in m_uid_to_job_ids_map_, its corresponding task_ids set is guaranteed to be non-empty due to the invariant maintained in AllocJobs and FreeJobs methods.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-08-14T02:56:35.503Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 587
File: src/Craned/Supervisor/CforedClient.cpp:449-454
Timestamp: 2025-08-14T02:56:35.503Z
Learning: In CforedClient::AsyncSendRecvThread_(), the guard `if (state <= State::Registering) { continue; }` in the TIMEOUT branch only prevents premature cleanup when stopping before registration completes, but it doesn't block normal gRPC event processing. The completion queue will still deliver Prepare/Write/Read events that advance the state machine normally.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.cpp
🧬 Code graph analysis (13)
src/Craned/Core/Craned.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(163-163)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)
src/Craned/Supervisor/TaskManager.cpp (8)
Suspend(608-621)Suspend(608-608)Suspend(1104-1164)Suspend(1104-1104)Resume(623-636)Resume(623-623)Resume(1166-1225)Resume(1166-1166)
src/CraneCtld/TaskScheduler.cpp (3)
src/CraneCtld/TaskScheduler.h (1)
TerminateRunningTask(843-851)src/Craned/Core/JobManager.cpp (2)
StepStatusChangeAsync(1246-1255)StepStatusChangeAsync(1246-1250)src/Craned/Supervisor/CranedClient.cpp (2)
StepStatusChangeAsync(46-56)StepStatusChangeAsync(46-48)
src/Craned/Supervisor/TaskManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(163-163)
src/Craned/Supervisor/TaskManager.h (2)
src/Utilities/PublicHeader/include/crane/Logger.h (1)
crane(200-212)src/Craned/Supervisor/TaskManager.cpp (18)
Suspend(608-621)Suspend(608-608)Suspend(1104-1164)Suspend(1104-1104)Resume(623-636)Resume(623-623)Resume(1166-1225)Resume(1166-1166)SuspendJobAsync(1835-1847)SuspendJobAsync(1835-1835)ResumeJobAsync(1849-1861)ResumeJobAsync(1849-1849)EvCleanTaskSignalQueueCb_(2136-2152)EvCleanTaskSignalQueueCb_(2136-2136)SuspendRunningTasks_(2154-2181)SuspendRunningTasks_(2154-2154)ResumeSuspendedTasks_(2183-2209)ResumeSuspendedTasks_(2183-2183)
src/Craned/Supervisor/SupervisorServer.cpp (1)
src/Craned/Core/SupervisorKeeper.cpp (4)
SuspendJob(93-111)SuspendJob(93-93)ResumeJob(113-131)ResumeJob(113-113)
src/Craned/Supervisor/Supervisor.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(163-163)
src/CraneCtld/RpcService/CranedKeeper.h (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (17)
CranedStub(28-34)Fini(36-38)Fini(36-36)l(403-403)ConfigureCraned(40-67)ConfigureCraned(40-41)job_id_t(208-242)SuspendJobs(270-297)SuspendJobs(270-270)ResumeJobs(299-326)ResumeJobs(299-299)CranedKeeper(365-386)CranedKeeper(388-395)CranedChannelConnFailNoLock_(848-854)CranedChannelConnFailNoLock_(848-848)ConnectCranedNode_(758-846)ConnectCranedNode_(758-759)
src/Craned/Core/SupervisorKeeper.h (2)
src/Craned/Core/SupervisorKeeper.cpp (4)
SuspendJob(93-111)SuspendJob(93-93)ResumeJob(113-131)ResumeJob(113-113)src/Craned/Supervisor/SupervisorServer.cpp (4)
SuspendJob(95-109)SuspendJob(95-98)ResumeJob(111-125)ResumeJob(111-114)
src/Craned/Core/CranedServer.cpp (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
SuspendJobs(270-297)SuspendJobs(270-270)ResumeJobs(299-326)ResumeJobs(299-299)
src/Craned/Core/CranedServer.h (2)
src/Craned/Core/CranedServer.cpp (5)
SuspendJobs(88-144)SuspendJobs(88-91)ResumeJobs(146-201)ResumeJobs(146-148)CranedServer(474-502)src/CraneCtld/RpcService/CranedKeeper.cpp (4)
SuspendJobs(270-297)SuspendJobs(270-270)ResumeJobs(299-326)ResumeJobs(299-299)
src/Craned/Core/SupervisorKeeper.cpp (1)
src/Craned/Supervisor/SupervisorServer.cpp (4)
SuspendJob(95-109)SuspendJob(95-98)ResumeJob(111-125)ResumeJob(111-114)
src/CraneCtld/RpcService/CranedKeeper.cpp (2)
src/Craned/Core/CranedServer.cpp (4)
SuspendJobs(88-144)SuspendJobs(88-91)ResumeJobs(146-201)ResumeJobs(146-148)src/CraneCtld/RpcService/CranedKeeper.h (1)
UpdateLastActiveTime(102-105)
🔇 Additional comments (18)
src/Craned/Core/CranedPublicDefs.h (1)
109-110: LGTM!The addition of
RuntimePauseandRuntimeResumefields toContainerConfigis straightforward and follows the existing pattern for other runtime command fields.src/Craned/Supervisor/Supervisor.cpp (1)
110-111: LGTM!The initialization of
RuntimePauseandRuntimeResumefrom the container config message is consistent with the existing pattern for other runtime commands.src/Craned/Supervisor/SupervisorPublicDefs.h (1)
61-62: LGTM!The addition of
RuntimePauseandRuntimeResumefields mirrors the changes inCranedPublicDefs.hand maintains consistency across the supervisor configuration.src/Craned/Core/SupervisorKeeper.cpp (1)
93-131: LGTM!The
SuspendJob()andResumeJob()RPC wrappers correctly handle both transport failures (returningERR_RPC_FAILURE) and application-level errors (mapping specific error reasons to appropriateCraneErrCodevalues). The implementation follows the established pattern used by other RPC wrappers in this class.src/Craned/Core/SupervisorKeeper.h (1)
39-40: LGTM!The method declarations for
SuspendJob()andResumeJob()are consistent with the existing API design, returningCraneErrCodeto indicate success or failure.src/Craned/Supervisor/SupervisorServer.cpp (1)
95-125: LGTM!The
SuspendJobandResumeJobRPC handlers correctly follow the established async pattern: they invoke the corresponding async methods ong_task_mgr, wait for completion, and populate the response payload with success/failure information. ReturningStatus::OKwith error details in the response is consistent with CraneSched conventions.Based on learnings.
protos/Supervisor.proto (2)
44-53: Container pause/resume commands fit existing ContainerConfig pattern
pause_cmdandresume_cmdmirror the existingstate/run/kill/deletecommands and keep numbering contiguous; no issues from the proto/API perspective.
127-139: Job-level suspend/resume RPCs are consistent with existing API
SuspendJob/ResumeJobrequest/reply shapes (ok+reason) and naming align withTerminateTaskReplyand the “Job-only” semantics requested in the PR comments. The service methods match the message names cleanly; looks good.Also applies to: 151-152
src/CraneCtld/TaskScheduler.cpp (1)
1237-1383: Suspend/resume orchestration and rollback logic are otherwise well-structuredApart from the step-id issue above, the flows look solid:
- Strict validation of task presence and status (
RunningvsSuspended).- Pre-check of all executing nodes’ online state before issuing RPCs.
- Thread-pool–based broadcast of
SuspendJobs/ResumeJobswith shared error aggregation.- On any per-node RPC failure, you terminate the job and propagate a uniform controller-side failure code.
- On success, you update in-memory status and persist it with rollback on DB failure.
This matches prior review guidance on using thread pools for broadcast RPCs and keeping Ctld/craned state consistent.
Also applies to: 1385-1524
src/CraneCtld/RpcService/CranedKeeper.h (1)
37-43: CranedStub suspend/resume API and const-correct signatures look consistentThe tightened signatures (
constreferences for IDs/tokens/containers) and the newSuspendJobs/ResumeJobsmethods usingstd::vector<job_id_t>line up with the job-level suspend/resume flow implemented inCranedKeeper.cpp. UsingCranedStub*insideCqTagand function callbacks is consistent with existing lifetime guarantees in CranedKeeper; no additional issues from the header side.Also applies to: 47-52, 74-90, 115-116, 155-159, 166-167, 178-181, 184-185, 192-199, 200-205, 211-212, 225-244
src/Craned/Supervisor/TaskManager.h (2)
59-60: Per-instance suspend/resume hooks and status field align with new semanticsAdding
StepInstance::statusas acrane::grpc::TaskStatusand extendingITaskInstance/ContainerInstancewithSuspend()/Resume()gives Supervisor a clean hook to implement job-level suspend/resume. The interface surface matches the implementations inTaskManager.cpp(process-group SIGSTOP/SIGCONT with container-specific overrides), and types are consistent with the rest of the codebase.Also applies to: 164-177, 260-271, 277-282
364-367: TaskManager’s suspend/resume async queue is consistent with existing event model
SuspendJobAsync/ResumeJobAsyncplusTaskSignalQueueElem,m_task_signal_async_handle_, andEvCleanTaskSignalQueueCb_()reuse the same uvw + ConcurrentQueue pattern as other TaskManager actions (execute, terminate, change time limit). RunningSuspendRunningTasks_()/ResumeSuspendedTasks_()on the loop thread avoids extra synchronization aroundm_step_and task instances. Behavior and threading look correct from the header’s point of view.Also applies to: 389-392, 402-403, 407-408, 428-429
src/Craned/Supervisor/TaskManager.cpp (1)
608-636: Per-task Suspend/Resume implementation for non-container tasks looks correct
ITaskInstance::Suspend()/Resume()sendingSIGSTOP/SIGCONTto-m_pid_(the process group) matches the earliersetpgid(0, 0)setup, so the whole task group is stopped/resumed atomically. ReturningERR_NON_EXISTENTwhenm_pid_ == 0is a reasonable way to signal “not spawned yet / already gone”.I don’t see correctness issues here; behaviour aligns with typical job-control semantics.
src/Craned/Core/CranedServer.cpp (1)
32-61: Signature and loop-formatting cleanups preserve existing behaviourThe pointer-style updates on
ServerContext/request/response parameters and the structured-bindingfor (const auto& [...])loops inConfigure,ExecuteSteps,TerminateSteps,TerminateOrphanedStep,AllocJobs,AllocSteps,FreeSteps,FreeJobs,QuerySshStepEnvVariables,ChangeJobTimeLimit, andStepStatusChangeare purely stylistic. The underlying control-flow and request/response handling remain unchanged and consistent with existing patterns in this file.No functional concerns from these changes.
Also applies to: 63-86, 203-226, 338-405, 407-472, 474-502
src/Craned/Core/CranedServer.h (1)
53-118: New SuspendJobs/ResumeJobs RPCs and signature cleanups look consistentThe added
SuspendJobs/ResumeJobsdeclarations onCranedServiceImplalign with the implementations inCranedServer.cppand the generatedcrane::grpc::Cranedservice. The pointer-style and const-correctness updates across the other RPC declarations, plus theCranedServerconstructor signature tweak, are consistent formatting changes with no behavioural impact.Header/implementation stay in sync; no issues here.
Also applies to: 120-123
src/CraneCtld/RpcService/CranedKeeper.cpp (3)
69-99: Job/step ID marshalling in CranedStub RPCs is consistent and correctThe refactors in:
TerminateSteps,TerminateOrphanedSteps,AllocJobs,AllocSteps,ExecuteSteps,FreeStepsall use the proto’s
job_step_ids_mapfields withmutable_job_step_ids_map()andAssign()over thestd::set<step_id_t>collections. This matches the corresponding server-side handlers and avoids redundant copies.These changes improve const-correctness on the input containers and keep the wire format unchanged. No issues here.
Also applies to: 100-133, 189-207, 209-242, 244-268
270-297: SuspendJobs/ResumeJobs stubs mirror existing patterns and look correct
CranedStub::SuspendJobsandCranedStub::ResumeJobs:
- Build the respective request types and assign
job_id_listfrom the providedstd::vector<job_id_t>.- Set per-RPC deadlines using
kCtldRpcTimeoutSeconds, consistent with other methods.- Map non-OK gRPC status to
ERR_RPC_FAILUREviaHandleGrpcErrorCode_, and updatem_last_active_time_on success.- Interpret
reply.ok()/reply.reason()the same way as other control operations.These wrappers are consistent with the existing
Terminate*/Alloc*/Free*stubs; no functional issues spotted.Also applies to: 299-326
525-622: CqTag-based state machines and connection management changes look coherentThe refactors to:
InitCranedStateMachine_,EstablishedCranedStateMachine_,CheckNodeTimeoutAndClean,IsCranedConnected/GetCranedStub,PutNodeIntoUnavailSet,ConnectCranedNode_,CranedChannelConnFailNoLock_,all consistently use
CqTagcarrying aCranedStub*, and them_clean_up_cb_hookup allowsCranedStub::Fini()to notifyCranedKeeperwhen a channel fails during initial connection. The read/write locking patterns onm_connect_craned_mtx_andm_unavail_craned_set_mtx_are preserved, and the timeouts and connection-backoff behaviour are unchanged.Aside from the
StateMonitorThreadFunc_tag handling noted separately, these changes look correct and maintain the existing channel-state semantics.Also applies to: 624-691, 725-757, 758-846, 848-855
| for (auto& cq_thread : m_cq_thread_vec_) cq_thread.join(); | ||
| if (m_period_connect_thread_.joinable()) m_period_connect_thread_.join(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Avoid dereferencing CQ tag pointer on TIMEOUT during shutdown
In CranedKeeper::StateMonitorThreadFunc_ you now use CqTag* tag and call:
next_status = m_cq_vec_[thread_id].AsyncNext((void**)&tag, &ok, ddl);
if (next_status == grpc::CompletionQueue::SHUTDOWN) break;
if (m_cq_closed_) {
if (tag->type == CqTag::kInitializingCraned) delete tag->craned;
continue;
}
if (next_status == grpc::CompletionQueue::TIMEOUT) continue;
if (next_status == grpc::CompletionQueue::GOT_EVENT) {
...
if (CheckNodeTimeoutAndClean(tag)) continue;
...
craned->m_channel_->NotifyOnStateChange(..., tag);
}When AsyncNext returns TIMEOUT, the contents of tag are unspecified; dereferencing tag in the m_cq_closed_ block before the TIMEOUT check is therefore unsafe. If Shutdown() has been called (setting m_cq_closed_ = true) but the CQ returns a timeout, this can lead to undefined behaviour or crashes on shutdown.
A safer pattern is:
- Only interpret
tagas a validCqTag*whennext_status == GOT_EVENT. - Handle
TIMEOUTandSHUTDOWNwithout touchingtag. - Apply the
m_cq_closed_cleanup only in the GOT_EVENT branch.
For example:
- CqTag* tag;
+ CqTag* tag = nullptr;
...
- next_status = m_cq_vec_[thread_id].AsyncNext((void**)&tag, &ok, ddl);
- if (next_status == grpc::CompletionQueue::SHUTDOWN) break;
-
- if (m_cq_closed_) {
- if (tag->type == CqTag::kInitializingCraned) delete tag->craned;
- continue;
- }
-
- if (next_status == grpc::CompletionQueue::TIMEOUT) continue;
- if (next_status == grpc::CompletionQueue::GOT_EVENT) {
+ next_status = m_cq_vec_[thread_id].AsyncNext((void**)&tag, &ok, ddl);
+
+ if (next_status == grpc::CompletionQueue::SHUTDOWN) break;
+ if (next_status == grpc::CompletionQueue::TIMEOUT) continue;
+
+ if (next_status == grpc::CompletionQueue::GOT_EVENT) {
+ if (m_cq_closed_) {
+ if (tag->type == CqTag::kInitializingCraned) delete tag->craned;
+ m_tag_sync_allocator_->delete_object(tag);
+ continue;
+ }
@@
- auto* craned = tag->craned;
+ auto* craned = tag->craned;
@@
- if (next_tag) {
+ if (next_tag) {
if (!m_cq_closed_) {
if (CheckNodeTimeoutAndClean(tag)) continue;
...(Exact cleanup behaviour for the closed case can be adapted to your needs, but the key is to never dereference tag on TIMEOUT.)
This change keeps the state machine logic intact while eliminating a shutdown-time race/UB risk.
Also applies to: 424-439, 693-718
I'll verify the review comment by examining the actual code in the repository.
---
🏁 Script executed:
```shell
#!/bin/bash
# Locate and examine CranedKeeper.cpp
fd -t f "CranedKeeper.cpp" | head -20
Length of output: 41
---
🏁 Script executed:
```shell
#!/bin/bash
# Examine the lines mentioned in the review comment
echo "=== Lines 391-392 ==="
sed -n '385,400p' src/CraneCtld/RpcService/CranedKeeper.cpp
echo ""
echo "=== Lines 424-439 ==="
sed -n '418,445p' src/CraneCtld/RpcService/CranedKeeper.cpp
echo ""
echo "=== Lines 693-718 ==="
sed -n '687,725p' src/CraneCtld/RpcService/CranedKeeper.cpp
Length of output: 2754
---
🏁 Script executed:
```shell
#!/bin/bash
# Get broader context around lines 693-718 to see if similar pattern exists
echo "=== Context around lines 693-718 (looking for StateMonitorThreadFunc_ or similar) ==="
sed -n '650,750p' src/CraneCtld/RpcService/CranedKeeper.cpp | head -100
Length of output: 3082
---
🏁 Script executed:
```shell
#!/bin/bash
# Search for all AsyncNext calls and surrounding context to find similar patterns
echo "=== All AsyncNext occurrences in CranedKeeper.cpp ==="
rg -n "AsyncNext" src/CraneCtld/RpcService/CranedKeeper.cpp -B 5 -A 20
Length of output: 1048
Unsafe tag pointer deref on TIMEOUT confirmed, but scope reference is incorrect
The unsafe pattern is confirmed at lines 424-439 in StateMonitorThreadFunc_, where tag->type is dereferenced in the m_cq_closed_ block (line 434) before checking if next_status == TIMEOUT (line 438). Per gRPC API contract, tag contents are unspecified on TIMEOUT, making this a genuine shutdown-time race condition that can cause undefined behavior or crashes.
However, the review comment's reference to "also applies to: 693-718" is incorrect—those lines contain state machine logic without similar AsyncNext patterns. The unsafe dereference pattern exists only at lines 424-439.
The fix recommended in the review comment (moving m_cq_closed_ cleanup into the GOT_EVENT branch after TIMEOUT and SHUTDOWN checks) is the correct approach.
🤖 Prompt for AI Agents
In src/CraneCtld/RpcService/CranedKeeper.cpp around lines 424-439,
StateMonitorThreadFunc_ currently dereferences tag->type inside the m_cq_closed_
block before checking if next_status == TIMEOUT, which is unsafe because gRPC
leaves tag contents unspecified on TIMEOUT; move the m_cq_closed_ cleanup logic
out of the branch that may run on TIMEOUT and into the GOT_EVENT branch after
you check for TIMEOUT and SHUTDOWN so that tag is only dereferenced when
next_status == GOT_EVENT (and therefore valid); ensure TIMEOUT and SHUTDOWN are
checked first, skip any tag access on TIMEOUT, and only perform tag-based
cleanup when next_status indicates a real event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/Craned/Supervisor/TaskManager.cpp (1)
868-873:ContainerInstance::Cleanupis defined twice and the new one doesn’t return.There are now two
CraneErrCode ContainerInstance::Cleanup()definitions (Lines 868–873 and 1284–1289). The second one also omits areturn, which is UB for a non-void function.You need a single, merged implementation that both:
- tears down pod/container as before, and
- asynchronously deletes the temporary bundle directory.
For example, fold the new logic into the existing definition and remove the duplicate:
CraneErrCode ContainerInstance::Cleanup() { - // For container tasks, Kill() is idempotent. - // It's ok to call it (at most) twice to remove pod and container. - CraneErrCode err = Kill(0); - return err; + // Delete temporary bundle directory if present. + if ((m_parent_step_inst_->IsBatch() || m_parent_step_inst_->IsCrun()) && + !m_temp_path_.empty()) { + g_thread_pool->detach_task( + [p = m_temp_path_]() { util::os::DeleteFolders(p); }); + } + + // For container tasks, Kill() is idempotent; safe to call during cleanup. + return Kill(0); }and drop the later
Cleanupdefinition entirely.Also applies to: 1284-1289
♻️ Duplicate comments (3)
protos/Supervisor.proto (1)
122-135: Supervisor suspend/resume RPCs align with single-job supervisor model
SuspendJob/ResumeJoband their replies are minimal and job-scoped, consistent with the existing Supervisor API where the instance already knows its job/step. Naming matches prior “Task→Job” feedback and the PR goal of job-level operations.Also applies to: 140-148
src/CraneCtld/TaskScheduler.cpp (1)
1297-1313: Avoid StepStatusChangeAsync withstep_id == 0in failure handlersIn both suspend and resume failure handlers you call:
StepStatusChangeAsync(task_id, 0, craned_id, crane::grpc::TaskStatus::Failed, ExitCode::kExitCodeRpcError, "", timestamp);
CleanTaskStatusChangeQueueCb_looks up steps by real step IDs (daemon/primary/common) and logsIgnoring unknown stepwhen the ID doesn’t match any step. Using0here thus just produces warnings without contributing to cleanup, which is already handled viaTerminateRunningTaskNoLock_and the cancel queue.Consider either:
- Switching to
kDaemonStepIdif you really want an explicit failure status from this path, or- Dropping this extra
StepStatusChangeAsyncloop and relying solely onTerminateRunningTaskfor cleanup and status propagation.Functionality is currently correct, but cleaning this up would reduce log noise and redundant status events.
Also applies to: 1440-1456
src/CraneCtld/RpcService/CranedKeeper.cpp (1)
476-580: Still dereferencing CQ tag on TIMEOUT whenm_cq_closed_is set.This is the same issue noted previously:
AsyncNextleavestagunspecified onNextStatus::TIMEOUT, but them_cq_closed_branch dereferencestag->typebefore you check for TIMEOUT. IfShutdown()has setm_cq_closed_ = truewhile the CQ returnsTIMEOUT, this is undefined behavior.You should only touch
tagwhennext_status == GOT_EVENT, e.g.:- bool ok; - CqTag* tag; + bool ok; + CqTag* tag = nullptr; ... - next_status = m_cq_vec_[thread_id].AsyncNext((void**)&tag, &ok, ddl); - - // If Shutdown() is called, return immediately. - if (next_status == grpc::CompletionQueue::SHUTDOWN) break; - - if (m_cq_closed_) { - if (tag->type == CqTag::kInitializingCraned) delete tag->craned; - continue; - } - - if (next_status == grpc::CompletionQueue::TIMEOUT) continue; - if (next_status == grpc::CompletionQueue::GOT_EVENT) { + next_status = m_cq_vec_[thread_id].AsyncNext((void**)&tag, &ok, ddl); + + if (next_status == grpc::CompletionQueue::SHUTDOWN) break; + if (next_status == grpc::CompletionQueue::TIMEOUT) continue; + + if (next_status == grpc::CompletionQueue::GOT_EVENT) { + if (m_cq_closed_) { + if (tag->type == CqTag::kInitializingCraned) delete tag->craned; + m_tag_sync_allocator_->delete_object(tag); + continue; + } // existing state-machine logic …so that
tagis never dereferenced on TIMEOUT.In gRPC C++'s `grpc::CompletionQueue::AsyncNext`, what are the guarantees about the `tag` pointer when `NextStatus::TIMEOUT` is returned? Is it valid to dereference `tag` in that case, or is the tag value unspecified?
🧹 Nitpick comments (4)
src/Craned/Supervisor/TaskManager.h (1)
454-457: TaskManager async suspend/resume wiring matches existing event-loop patternsThe new job-level APIs (
SuspendJobAsync/ResumeJobAsync),TaskSignalQueueElemqueue,m_task_signal_async_handle_,EvCleanTaskSignalQueueCb_, andSuspendRunningTasks_/ResumeSuspendedTasks_all follow the established uvw + ConcurrentQueue + promise pattern used elsewhere inTaskManager. They correctly:
- Enqueue suspend/resume actions with a promise and trigger the async handle.
- Drain the queue in
EvCleanTaskSignalQueueCb_and route to the appropriate internal method.- Apply suspend/resume across all
m_step_.task_idsvia the task instances.Assuming
m_task_signal_async_handle_is initialized and bound toEvCleanTaskSignalQueueCb_inTaskManager.cpp(as with the other async handles), this looks solid.Also applies to: 479-482, 498-505, 531-533
src/Craned/Supervisor/TaskManager.cpp (2)
1161-1221: Container suspend/resume logic is reasonable with OCI-first then POSIX fallback.The new
ContainerInstance::Suspend/Resumemethods correctly:
- query runtime state via
RuntimeState,- call
RuntimePause/RuntimeResumewhen the container is inrunning/paused,- log and fall back to
ITaskInstance::Suspend/Resumewhen the runtime path fails or is misconfigured.This avoids the previous destructive
Kill(SIGSTOP/SIGCONT)behavior and keeps container semantics aligned with processes.For readability you might prefer explicit JSON conversion:
status = jret["status"].get<std::string>();instead of relying on
status = std::move(jret["status"]);, but behavior is fine as-is.
1706-1711: Suspend/resume async pipeline wiring looks consistent; consider state guards.The new
m_task_signal_async_handle_,SuspendJobAsync/ResumeJobAsync,EvCleanTaskSignalQueueCb_, andSuspendRunningTasks_/ResumeSuspendedTasks_follow the same queue + uvw-async pattern used for terminate/time-limit changes, and they now aggregate per-task errors instead of short-circuiting, which is good.Two behavioral points to consider:
SuspendRunningTasks_/ResumeSuspendedTasks_returnSUCCESSeven if no task instance is actually suspended/resumed (e.g., all IDs missing fromm_task_map_). If you want to surface “nothing to operate on” as an error, track whether at least one task was acted on and return an appropriate code when not.- Per earlier reviewer feedback (“m_step_.status首先应该是Running”), you may want to add a simple status guard (e.g. check
g_runtime_status.Statusorm_step_.IsRunning()/ “is suspended”) before iterating, similar to howEvCleanChangeTaskTimeLimitQueueCb_usesCanStepOperate(). That would avoid silently “succeeding” on already-completing/completed steps.If the current behavior is intentional, no functional change is strictly required.
Also applies to: 1899-1925, 2217-2233, 2235-2290
src/CraneCtld/RpcService/CranedKeeper.h (1)
76-77: Minor naming/type mismatch inFreeJobsparameter.
FreeJobstakesconst std::vector<task_id_t>& task, but the RPC it drives uses ajob_id_list. For consistency withAllocJobs/SuspendJobs/ResumeJobsand the “job-level” semantics, consider changing this toconst std::vector<job_id_t>& job_ids(and rename the variable) in both declaration and definition.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
protos/Crane.proto(3 hunks)protos/PublicDefs.proto(1 hunks)protos/Supervisor.proto(2 hunks)src/CraneCtld/RpcService/CranedKeeper.cpp(27 hunks)src/CraneCtld/RpcService/CranedKeeper.h(7 hunks)src/CraneCtld/RpcService/CtldGrpcServer.cpp(1 hunks)src/CraneCtld/TaskScheduler.cpp(1 hunks)src/CraneCtld/TaskScheduler.h(1 hunks)src/Craned/Core/CranedServer.cpp(14 hunks)src/Craned/Core/CranedServer.h(2 hunks)src/Craned/Core/JobManager.h(0 hunks)src/Craned/Supervisor/SupervisorServer.h(1 hunks)src/Craned/Supervisor/TaskManager.cpp(6 hunks)src/Craned/Supervisor/TaskManager.h(6 hunks)
💤 Files with no reviewable changes (1)
- src/Craned/Core/JobManager.h
🚧 Files skipped from review as they are similar to previous changes (3)
- src/Craned/Supervisor/SupervisorServer.h
- protos/PublicDefs.proto
- src/CraneCtld/RpcService/CtldGrpcServer.cpp
🧰 Additional context used
🧠 Learnings (23)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 625
File: src/CraneCtld/TaskScheduler.cpp:2811-2874
Timestamp: 2025-09-22T07:13:20.635Z
Learning: In the CraneSched codebase, there are latch mechanisms used to ensure variable lifetime safety when capturing references in detached thread pool tasks, allowing reference captures that might otherwise be unsafe.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cppsrc/Craned/Supervisor/TaskManager.hsrc/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cppsrc/Craned/Supervisor/TaskManager.hsrc/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/RpcService/CranedKeeper.h
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/CraneCtld/RpcService/CranedKeeper.h
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/Craned/Supervisor/TaskManager.hsrc/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.
Applied to files:
src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.
Applied to files:
src/Craned/Supervisor/TaskManager.hsrc/Craned/Core/CranedServer.hsrc/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/Craned/Core/CranedServer.hsrc/Craned/Core/CranedServer.cppsrc/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Applied to files:
src/Craned/Core/CranedServer.hsrc/Craned/Core/CranedServer.cppsrc/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-27T11:52:31.017Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.
Applied to files:
src/Craned/Core/CranedServer.cppsrc/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:05:26.012Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:601-602
Timestamp: 2025-05-02T07:05:26.012Z
Learning: In the CraneCtld codebase, the variables m_disconnected_ and m_registered_ in CranedStub class are already defined as std::atomic_bool, making them thread-safe for concurrent access without additional synchronization.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-27T06:47:51.751Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-02T09:30:13.014Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/JobManager.cpp:141-149
Timestamp: 2025-04-02T09:30:13.014Z
Learning: In JobManager, if a uid exists in m_uid_to_job_ids_map_, its corresponding task_ids set is guaranteed to be non-empty due to the invariant maintained in AllocJobs and FreeJobs methods.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-08-14T02:56:35.503Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 587
File: src/Craned/Supervisor/CforedClient.cpp:449-454
Timestamp: 2025-08-14T02:56:35.503Z
Learning: In CforedClient::AsyncSendRecvThread_(), the guard `if (state <= State::Registering) { continue; }` in the TIMEOUT branch only prevents premature cleanup when stopping before registration completes, but it doesn't block normal gRPC event processing. The completion queue will still deliver Prepare/Write/Read events that advance the state machine normally.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.cpp
🧬 Code graph analysis (8)
src/CraneCtld/TaskScheduler.h (1)
src/CraneCtld/TaskScheduler.cpp (4)
SuspendRunningTasks(1251-1397)SuspendRunningTasks(1251-1252)ResumeSuspendedTasks(1399-1538)ResumeSuspendedTasks(1399-1400)
src/CraneCtld/TaskScheduler.cpp (4)
src/CraneCtld/TaskScheduler.h (1)
TerminateRunningTask(849-857)src/Craned/Core/JobManager.cpp (2)
StepStatusChangeAsync(1231-1240)StepStatusChangeAsync(1231-1235)src/Craned/Core/CtldClient.cpp (2)
StepStatusChangeAsync(564-573)StepStatusChangeAsync(564-565)src/Craned/Supervisor/CranedClient.cpp (2)
StepStatusChangeAsync(46-56)StepStatusChangeAsync(46-48)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (18)
Suspend(606-619)Suspend(606-606)Suspend(1161-1221)Suspend(1161-1161)Resume(621-634)Resume(621-621)Resume(1223-1282)Resume(1223-1223)SuspendJobAsync(1899-1911)SuspendJobAsync(1899-1899)ResumeJobAsync(1913-1925)ResumeJobAsync(1913-1913)EvCleanTaskSignalQueueCb_(2217-2233)EvCleanTaskSignalQueueCb_(2217-2217)SuspendRunningTasks_(2235-2262)SuspendRunningTasks_(2235-2235)ResumeSuspendedTasks_(2264-2290)ResumeSuspendedTasks_(2264-2264)
src/Craned/Supervisor/TaskManager.cpp (3)
src/Craned/Supervisor/TaskManager.h (3)
GetTaskInstance(133-136)GetTaskInstance(138-141)m_pid_(378-378)src/Craned/Core/CranedPublicDefs.h (1)
g_config(156-156)src/Utilities/PublicHeader/OS.cpp (2)
DeleteFolders(73-82)DeleteFolders(73-73)
src/Craned/Core/CranedServer.h (2)
src/Craned/Core/CranedServer.cpp (5)
SuspendJobs(92-148)SuspendJobs(92-95)ResumeJobs(150-205)ResumeJobs(150-152)CranedServer(640-668)src/CraneCtld/RpcService/CranedKeeper.cpp (4)
SuspendJobs(272-299)SuspendJobs(272-272)ResumeJobs(301-328)ResumeJobs(301-301)
src/Craned/Core/CranedServer.cpp (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
SuspendJobs(272-299)SuspendJobs(272-272)ResumeJobs(301-328)ResumeJobs(301-301)
src/CraneCtld/RpcService/CranedKeeper.h (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (54)
CranedStub(29-35)Fini(37-39)Fini(37-37)l(460-460)ConfigureCraned(41-68)ConfigureCraned(41-42)AllocJobs(135-162)AllocJobs(135-136)FreeJobs(164-188)FreeJobs(164-164)AllocSteps(190-208)AllocSteps(190-191)job_id_t(210-244)ExecuteSteps(211-212)FreeSteps(246-270)FreeSteps(246-247)SuspendJobs(272-299)SuspendJobs(272-272)ResumeJobs(301-328)ResumeJobs(301-301)TerminateSteps(70-99)TerminateSteps(70-71)TerminateOrphanedSteps(101-133)TerminateOrphanedSteps(101-102)CranedKeeper(422-443)CranedKeeper(445-452)Shutdown(454-474)Shutdown(454-454)AvailableCranedCount(777-780)AvailableCranedCount(777-777)IsCranedConnected(782-785)IsCranedConnected(782-782)GetCranedStub(787-794)GetCranedStub(787-788)SetCranedConnectedCb(796-799)SetCranedConnectedCb(796-797)SetCranedDisconnectedCb(801-803)SetCranedDisconnectedCb(801-801)PutNodeIntoUnavailSet(805-813)PutNodeIntoUnavailSet(805-806)CranedChannelConnFailNoLock_(905-911)CranedChannelConnFailNoLock_(905-905)ConnectCranedNode_(815-903)ConnectCranedNode_(815-816)InitCranedStateMachine_(582-679)InitCranedStateMachine_(582-583)EstablishedCranedStateMachine_(681-748)EstablishedCranedStateMachine_(681-682)CheckNodeTimeoutAndClean(750-775)CheckNodeTimeoutAndClean(750-750)StateMonitorThreadFunc_(476-580)StateMonitorThreadFunc_(476-476)PeriodConnectCranedThreadFunc_(913-949)PeriodConnectCranedThreadFunc_(913-913)
src/CraneCtld/RpcService/CranedKeeper.cpp (2)
src/Craned/Core/CranedServer.cpp (4)
SuspendJobs(92-148)SuspendJobs(92-95)ResumeJobs(150-205)ResumeJobs(150-152)src/CraneCtld/RpcService/CranedKeeper.h (1)
UpdateLastActiveTime(108-111)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (15)
src/CraneCtld/TaskScheduler.h (1)
762-765: Suspend/Resume TaskScheduler APIs look consistent and well‑scopedThe new
SuspendRunningTasks/ResumeSuspendedTasksAPIs match the implementations inTaskScheduler.cppand expose a clear per‑task error vector for higher layers. No issues from the header side.protos/Crane.proto (1)
143-161: Job-level suspend/resume protobufs and Craned RPCs are well designedThe new
SuspendJobs*/ResumeJobs*messages,TargetAttributes.Suspend/Resumeenum values, andCraned.SuspendJobs/ResumeJobsRPCs are all job-centric and consistent with existing naming and control patterns. The wire format remains backward-compatible and matches the scheduler-side usage.Also applies to: 249-271, 1093-1094
src/CraneCtld/TaskScheduler.cpp (1)
1251-1397: Suspend/Resume workflows are logically sound and consistent with scheduler designThe new
SuspendRunningTasks/ResumeSuspendedTasksimplementations correctly:
- Guard access to
m_running_task_map_with the existing mutex and validate task state (Running↔Suspended) before acting.- Require all executing craneds to be online before issuing suspend/resume, and treat any offline craned as an RPC failure.
- Broadcast
SuspendJobs/ResumeJobsvia the thread pool withBlockingCounter, aggregating per-node failures into a singlefailure_code.- On any RPC failure, terminate the task via
TerminateRunningTaskto avoid mixed states across nodes.- On success, update in‑memory status and persist to the embedded DB, rolling back the in‑memory status if persistence fails to keep RAM/DB consistent.
This matches prior resource‑lifecycle patterns in
TaskSchedulerand should behave robustly under partial failures.Also applies to: 1399-1538
src/Craned/Supervisor/TaskManager.h (1)
217-223: Base and container-specific Suspend/Resume contracts are coherentAdding default
ITaskInstance::Suspend/Resume(SIGSTOP/SIGCONT on the process group) and overriding them inContainerInstanceto use OCI runtime pause/resume with a fallback to the base behavior is a clean extension of the existing abstraction. Header and cpp implementations are consistent.Also applies to: 252-257
src/Craned/Supervisor/TaskManager.cpp (1)
606-634: BaseITaskInstance::Suspend/Resumelook correct.Sending
SIGSTOP/SIGCONTto the process group viakill(-m_pid_, …)with clear success/error codes is consistent with existingKillsemantics and provides a clean default for non-container tasks.src/CraneCtld/RpcService/CranedKeeper.h (2)
36-42: CranedStub suspend/resume APIs now clearly job-centric.The new
SuspendJobs(const std::vector<job_id_t>&)andResumeJobs(const std::vector<job_id_t>&)methods, plus the consistent use ofjob_id_tin their signatures, match the requirement that suspend/resume operate on Job objects, not steps. Deleting copy/move onCranedStubalso keeps ownership clear.Also applies to: 74-88
121-139: Const-ref and callback signature cleanups look good.The switch to const references for
CranedId/RegToken, the guarded optional token field, and the updated callback types (std::function<void(CranedId, const RegToken&)>) improve clarity and avoid unnecessary copies without changing behavior.Also applies to: 144-145, 172-188, 190-191, 200-211, 217-239
src/Craned/Core/CranedServer.cpp (2)
92-205: Suspend/Resume job RPC handlers are well-aligned with Supervisor layer.
- You gate on
ReadyFor(CTLD)and returnUNAVAILABLEwith a reason, matching existing CTLD-facing RPCs.- For each requested job, you:
- look up allocated steps via
GetAllocatedJobSteps(),- ignore daemon steps (
step_id == 0),- fan out to per-step
SupervisorKeeper::GetStub(job_id, step_id)and invokeSuspendJob/ResumeJob,- aggregate per-step errors into
reasonsand setokonly when all steps succeed.This matches the “job-level” suspend/resume requirement while still operating per-step internally and avoids touching daemon steps that have no actual tasks.
46-65: Other RPC signature and loop updates are mechanical and safe.The pointer-style signature adjustments (
ServerContext* context, const request/response pointers) and the use of structured bindings / range-based loops inExecuteSteps,TerminateSteps,TerminateOrphanedStep,FreeSteps,QuerySshStepEnvVariables, andStepStatusChangeare all stylistic; they preserve existing logic and error-handling behavior.Also applies to: 67-90, 207-230, 232-340, 342-363, 364-393, 395-410, 412-436, 438-464, 624-638
src/Craned/Core/CranedServer.h (2)
53-78: CranedServiceImpl interface correctly exposes SuspendJobs/ResumeJobs.The new
SuspendJobsandResumeJobsdeclarations (withServerContext*and const request/response pointers) match the implementations inCranedServer.cppand integrate cleanly with the existing CTLD-oriented RPC set.
89-109: Signature and constructor tweaks are consistent with implementation.Switching RPCs like
AllocJobs/FreeJobs/AllocSteps/FreeSteps/ChangeJobTimeLimitto useServerContext* contextand takingCranedListenConfby const reference in the constructor simply aligns the header with the definitions; no behavior change, and the API remains clear.Also applies to: 126-129
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
272-299: CranedStub Suspend/Resume job RPC helpers are correct.Both helpers follow the existing RPC pattern:
- Build
SuspendJobsRequest/ResumeJobsRequestwithjob_id_listfrom the providedstd::vector<job_id_t>.- Apply a standard deadline, call the stub, handle non-OK
StatusasERR_RPC_FAILURE(withHandleGrpcErrorCode_+UpdateLastActiveTime()).- Interpret
reply.ok()vsreply.reason()as success vsERR_GENERIC_FAILURE.This matches the new Craned-side RPCs and the desired job-level suspend/resume semantics.
Also applies to: 301-328
422-443: CranedKeeper ctor/destructor/Shutdown changes are mechanical.Using
std::pmr::synchronized_pool_resource, allocatingm_tag_sync_allocator_, computingthread_numviastd::bit_ceil, and the updated joins/Shutdown loop are consistent with prior behavior; the only changes are pointer/loop style and don’t affect semantics.Also applies to: 445-474, 459-465
582-679: Channel state-machine helpers and timeout cleanup remain consistent.
InitCranedStateMachine_,EstablishedCranedStateMachine_, andCheckNodeTimeoutAndCleanall use the updatedCqTagfields (CranedStub* craned) and continue to manage connected/connecting sets, tokens, and callbacks as before. The timeout path now deletes the tag insideCheckNodeTimeoutAndClean, which is correctly accounted for by the caller’sif (CheckNodeTimeoutAndClean(tag)) continue;logic.Also applies to: 681-748, 750-775
782-813: Connection management const-ref and cleanup callback wiring look good.
IsCranedConnected/GetCranedStubnow takeconst CranedId&, and the connected callback usesstd::function<void(CranedId, const RegToken&)>, which reduces copying.ConnectCranedNode_takesCranedId const&andconst RegToken&, caches resolved IPs, creates the channel, and setsm_clean_up_cb_toCranedChannelConnFailNoLock_, which assumesm_connect_craned_mtx_is already held (as enforced by itsAssertHeld()).- The cleanup callback itself just decrements the channel count and erases the craned ID from the connecting set under the lock.
All of this matches the surrounding locking and lifecycle expectations.
Also applies to: 815-903, 905-911
df67cee to
0f3ca65
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (1)
480-496: Unsafe CQ tag dereference on TIMEOUT; move closed-queue cleanup under GOT_EVENTOn TIMEOUT, tag is unspecified. You dereference tag in the m_cq_closed_ branch before checking TIMEOUT, risking UB during shutdown. Initialize tag, check TIMEOUT first, and only touch tag inside GOT_EVENT. Example fix:
- bool ok; - CqTag* tag; + bool ok; + CqTag* tag = nullptr; grpc::CompletionQueue::NextStatus next_status; @@ - next_status = m_cq_vec_[thread_id].AsyncNext((void**)&tag, &ok, ddl); - - // If Shutdown() is called, return immediately. - if (next_status == grpc::CompletionQueue::SHUTDOWN) break; - - if (m_cq_closed_) { - if (tag->type == CqTag::kInitializingCraned) delete tag->craned; - continue; - } - - if (next_status == grpc::CompletionQueue::TIMEOUT) continue; - if (next_status == grpc::CompletionQueue::GOT_EVENT) { + next_status = m_cq_vec_[thread_id].AsyncNext((void**)&tag, &ok, ddl); + if (next_status == grpc::CompletionQueue::SHUTDOWN) break; + if (next_status == grpc::CompletionQueue::TIMEOUT) continue; + if (next_status == grpc::CompletionQueue::GOT_EVENT) { + if (m_cq_closed_) { + if (tag->type == CqTag::kInitializingCraned) delete tag->craned; + m_tag_sync_allocator_->delete_object(tag); + continue; + } // ... existing logic ...Also applies to: 503-506, 517-524
♻️ Duplicate comments (1)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)
640-662: Resume branch integrates cleanlyPer-task result handling and messages match ResumeSuspendedTasks semantics.
🧹 Nitpick comments (7)
src/CraneCtld/RpcService/CranedKeeper.cpp (2)
272-299: Map reply.reason to specific error codes (parity with SupervisorStub)Currently both helpers collapse declines to ERR_GENERIC_FAILURE. Improve fidelity by mapping known reasons (e.g., ERR_INVALID_PARAM, ERR_NON_EXISTENT) to their codes, matching SupervisorStub’s pattern. Minimal change:
- if (reply.ok()) return CraneErrCode::SUCCESS; - CRANE_WARN("SuspendJobs RPC for Node {} declined: {}", m_craned_id_, reply.reason()); - return CraneErrCode::ERR_GENERIC_FAILURE; + if (reply.ok()) return CraneErrCode::SUCCESS; + CRANE_WARN("SuspendJobs RPC for Node {} declined: {}", m_craned_id_, reply.reason()); + if (reply.reason() == CraneErrStr(CraneErrCode::ERR_INVALID_PARAM)) return CraneErrCode::ERR_INVALID_PARAM; + if (reply.reason() == CraneErrStr(CraneErrCode::ERR_NON_EXISTENT)) return CraneErrCode::ERR_NON_EXISTENT; + return CraneErrCode::ERR_GENERIC_FAILURE;Apply similarly to ResumeJobs.
Also applies to: 301-328
164-176: Rename FreeJobs param for clarityParameter is named
taskbut represents job IDs. Rename tojob_idsto avoid confusion.-CraneErrCode CranedStub::FreeJobs(const std::vector<task_id_t>& task) { +CraneErrCode CranedStub::FreeJobs(const std::vector<task_id_t>& job_ids) { @@ - request.mutable_job_id_list()->Assign(task.begin(), task.end()); + request.mutable_job_id_list()->Assign(job_ids.begin(), job_ids.end());src/CraneCtld/TaskScheduler.h (1)
762-766: Consider documenting job-only contract and result orderingAdd brief comments that inputs are Job IDs (not Steps) and results align 1:1 by index. Helps callers and future maintainers.
src/Craned/Supervisor/TaskManager.cpp (3)
2005-2021: EvCleanTaskSignalQueueCb_ correctly drains queue but lacks operation guardThe dispatcher correctly drains
m_task_signal_queue_, dispatches to suspend/resume, and fulfills the promise once per element. However, unlike termination/time-limit handlers, it does not checkg_runtime_status.CanStepOperate()or step completion before acting, so suspend/resume may silently no-op on already-completing/finished steps.You might align this with
EvCleanTerminateTaskQueueCb_/EvCleanChangeTaskTimeLimitQueueCb_by:
- Early-returning or re-enqueueing when
!g_runtime_status.CanStepOperate()orm_step_.AllTaskFinished(), and- Returning an appropriate
CraneErrCodeinstead ofSUCCESSin those cases.This keeps suspend/resume semantics consistent with other step-level operations.
2023-2050: Iterate over live task instances instead ofm_step_.task_idsin SuspendRunningTasks_
m_step_.task_idsrepresents the static task-id list, whilem_step_.GetTaskIds()iterates the keys ofm_task_map_(i.e., currently instantiated tasks). Usingtask_idshere may include tasks that were never started or already cleaned up; you guard withif (task)so it won’t crash, but it does extra lookups and diverges from how termination usesGetTaskIds().Consider:
- for (const auto& task_id : m_step_.task_ids) { + for (auto task_id : m_step_.GetTaskIds()) { auto* task = m_step_.GetTaskInstance(task_id);This keeps suspend behavior aligned with the termination path and guarantees you only touch live instances.
2052-2078: Mirror live-task iteration and error aggregation in ResumeSuspendedTasks_Same observation as
SuspendRunningTasks_: iteratingm_step_.task_idswill probe tasks that may not be instantiated anymore; theif (task)guard prevents crashes, but it’s less precise than iterating live tasks. Error aggregation strategy (resultholds last failure) is acceptable given no state flip occurs here.Apply the same pattern as suggested for
SuspendRunningTasks_:- for (const auto& task_id : m_step_.task_ids) { + for (auto task_id : m_step_.GetTaskIds()) { auto* task = m_step_.GetTaskInstance(task_id);This keeps resume behavior consistent with other lifecycle handlers.
src/Craned/Core/CranedServer.cpp (1)
92-148: SuspendJobs/ResumeJobs implement job-level control as intended; consider error semanticsThe new Suspend/Resume RPCs:
- Enforce
ReadyFor(RequestSource::CTLD)with UNAVAILABLE status, consistent with nearby RPCs.- Operate strictly at job granularity (
job_id_list()), derive local steps viaGetAllocatedJobSteps(), and fan out to per-step Supervisor stubs while skipping daemon stepstep_id == 0, matching the “job-only” requirement.- Accumulate human-readable reasons and expose aggregate
okplusreasonin the reply.Behavior is functionally correct, but note that per-project learnings generally prefer returning
Status::OKeven on errors and carrying failure information in the reply to avoid disconnecting clients; these methods reuse the existing UNAVAILABLE-on-not-ready pattern instead. Based on learnings.If you want to fully align with that convention, you could in a future change:
- Always return
Status::OK, and- For the “not ready” case, set
ok = falseand a descriptivereasoninstead of returning UNAVAILABLE.Also applies to: 150-205
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
protos/Crane.proto(3 hunks)protos/PublicDefs.proto(1 hunks)protos/Supervisor.proto(2 hunks)src/CraneCtld/RpcService/CranedKeeper.cpp(27 hunks)src/CraneCtld/RpcService/CranedKeeper.h(7 hunks)src/CraneCtld/RpcService/CtldGrpcServer.cpp(1 hunks)src/CraneCtld/TaskScheduler.cpp(1 hunks)src/CraneCtld/TaskScheduler.h(1 hunks)src/Craned/Core/CranedServer.cpp(14 hunks)src/Craned/Core/CranedServer.h(2 hunks)src/Craned/Core/JobManager.h(0 hunks)src/Craned/Supervisor/SupervisorServer.h(1 hunks)src/Craned/Supervisor/TaskManager.cpp(3 hunks)src/Craned/Supervisor/TaskManager.h(5 hunks)
💤 Files with no reviewable changes (1)
- src/Craned/Core/JobManager.h
🚧 Files skipped from review as they are similar to previous changes (2)
- protos/Supervisor.proto
- src/CraneCtld/TaskScheduler.cpp
🧰 Additional context used
🧠 Learnings (23)
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/Craned/Core/CranedServer.hsrc/Craned/Core/CranedServer.cppsrc/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/TaskScheduler.hsrc/CraneCtld/RpcService/CranedKeeper.hsrc/Craned/Supervisor/TaskManager.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Applied to files:
src/Craned/Core/CranedServer.hsrc/Craned/Core/CranedServer.cppsrc/Craned/Supervisor/SupervisorServer.hsrc/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.
Applied to files:
src/Craned/Core/CranedServer.hsrc/CraneCtld/RpcService/CranedKeeper.hsrc/Craned/Supervisor/TaskManager.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-27T11:52:31.017Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.
Applied to files:
src/Craned/Core/CranedServer.cppsrc/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/SupervisorServer.hsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/TaskScheduler.hsrc/CraneCtld/RpcService/CranedKeeper.hsrc/Craned/Supervisor/TaskManager.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/RpcService/CranedKeeper.h
📚 Learning: 2025-05-25T04:08:03.273Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/Supervisor.cpp:68-69
Timestamp: 2025-05-25T04:08:03.273Z
Learning: In the Crane supervisor component (src/Craned/Supervisor/), the supervisor process communicates with Craned through STDOUT using protobuf messages. The supervisor must not send any information to STDOUT before sending the "ready" message to Craned, as this would interfere with the inter-process communication protocol. Therefore, adding logging statements that might write to STDOUT before the ready message is sent could break the communication.
Applied to files:
src/Craned/Supervisor/SupervisorServer.h
📚 Learning: 2025-05-02T07:05:26.012Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:601-602
Timestamp: 2025-05-02T07:05:26.012Z
Learning: In the CraneCtld codebase, the variables m_disconnected_ and m_registered_ in CranedStub class are already defined as std::atomic_bool, making them thread-safe for concurrent access without additional synchronization.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-04-27T06:47:51.751Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.h
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.
Applied to files:
src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-04-02T09:30:13.014Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/JobManager.cpp:141-149
Timestamp: 2025-04-02T09:30:13.014Z
Learning: In JobManager, if a uid exists in m_uid_to_job_ids_map_, its corresponding task_ids set is guaranteed to be non-empty due to the invariant maintained in AllocJobs and FreeJobs methods.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-08-14T02:56:35.503Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 587
File: src/Craned/Supervisor/CforedClient.cpp:449-454
Timestamp: 2025-08-14T02:56:35.503Z
Learning: In CforedClient::AsyncSendRecvThread_(), the guard `if (state <= State::Registering) { continue; }` in the TIMEOUT branch only prevents premature cleanup when stopping before registration completes, but it doesn't block normal gRPC event processing. The completion queue will still deliver Prepare/Write/Read events that advance the state machine normally.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.cpp
🧬 Code graph analysis (7)
src/Craned/Core/CranedServer.h (2)
src/Craned/Core/CranedServer.cpp (5)
SuspendJobs(92-148)SuspendJobs(92-95)ResumeJobs(150-205)ResumeJobs(150-152)CranedServer(640-668)src/CraneCtld/RpcService/CranedKeeper.cpp (4)
SuspendJobs(272-299)SuspendJobs(272-272)ResumeJobs(301-328)ResumeJobs(301-301)
src/Craned/Core/CranedServer.cpp (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (5)
ExecuteSteps(211-212)SuspendJobs(272-299)SuspendJobs(272-272)ResumeJobs(301-328)ResumeJobs(301-301)
src/CraneCtld/TaskScheduler.h (1)
src/CraneCtld/TaskScheduler.cpp (4)
SuspendRunningTasks(1251-1397)SuspendRunningTasks(1251-1252)ResumeSuspendedTasks(1399-1538)ResumeSuspendedTasks(1399-1400)
src/Craned/Supervisor/SupervisorServer.h (2)
src/Craned/Core/SupervisorKeeper.cpp (4)
SuspendJob(93-111)SuspendJob(93-93)ResumeJob(113-131)ResumeJob(113-113)src/Craned/Supervisor/SupervisorServer.cpp (4)
SuspendJob(95-109)SuspendJob(95-98)ResumeJob(111-125)ResumeJob(111-114)
src/CraneCtld/RpcService/CranedKeeper.h (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (54)
CranedStub(29-35)Fini(37-39)Fini(37-37)l(460-460)ConfigureCraned(41-68)ConfigureCraned(41-42)AllocJobs(135-162)AllocJobs(135-136)FreeJobs(164-188)FreeJobs(164-164)AllocSteps(190-208)AllocSteps(190-191)job_id_t(210-244)ExecuteSteps(211-212)FreeSteps(246-270)FreeSteps(246-247)SuspendJobs(272-299)SuspendJobs(272-272)ResumeJobs(301-328)ResumeJobs(301-301)TerminateSteps(70-99)TerminateSteps(70-71)TerminateOrphanedSteps(101-133)TerminateOrphanedSteps(101-102)CranedKeeper(422-443)CranedKeeper(445-452)Shutdown(454-474)Shutdown(454-454)AvailableCranedCount(777-780)AvailableCranedCount(777-777)IsCranedConnected(782-785)IsCranedConnected(782-782)GetCranedStub(787-794)GetCranedStub(787-788)SetCranedConnectedCb(796-799)SetCranedConnectedCb(796-797)SetCranedDisconnectedCb(801-803)SetCranedDisconnectedCb(801-801)PutNodeIntoUnavailSet(805-813)PutNodeIntoUnavailSet(805-806)CranedChannelConnFailNoLock_(905-911)CranedChannelConnFailNoLock_(905-905)ConnectCranedNode_(815-903)ConnectCranedNode_(815-816)InitCranedStateMachine_(582-679)InitCranedStateMachine_(582-583)EstablishedCranedStateMachine_(681-748)EstablishedCranedStateMachine_(681-682)CheckNodeTimeoutAndClean(750-775)CheckNodeTimeoutAndClean(750-750)StateMonitorThreadFunc_(476-580)StateMonitorThreadFunc_(476-476)PeriodConnectCranedThreadFunc_(913-949)PeriodConnectCranedThreadFunc_(913-913)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (10)
SuspendJobAsync(1687-1699)SuspendJobAsync(1687-1687)ResumeJobAsync(1701-1713)ResumeJobAsync(1701-1701)EvCleanTaskSignalQueueCb_(2005-2021)EvCleanTaskSignalQueueCb_(2005-2005)SuspendRunningTasks_(2023-2050)SuspendRunningTasks_(2023-2023)ResumeSuspendedTasks_(2052-2078)ResumeSuspendedTasks_(2052-2052)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
src/Craned/Core/CranedServer.cpp (8)
FreeJobs(395-410)FreeJobs(395-397)FreeSteps(376-393)FreeSteps(376-378)SuspendJobs(92-148)SuspendJobs(92-95)ResumeJobs(150-205)ResumeJobs(150-152)src/Craned/Core/JobManager.cpp (4)
FreeJobs(240-275)FreeJobs(240-240)FreeSteps(305-332)FreeSteps(305-306)src/CraneCtld/RpcService/CranedKeeper.h (1)
UpdateLastActiveTime(108-111)src/Craned/Core/CtldClient.h (1)
UpdateLastActiveTime(193-195)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (17)
protos/PublicDefs.proto (1)
120-123: TaskStatus::Suspended added — OKEnum addition looks consistent with new flows. Ensure any UI/query filters include Suspended.
protos/Crane.proto (3)
143-161: Suspend/Resume messages are minimal and clearJob-granularity clarified in comments; fields match usage sites.
254-256: ModifyTask TargetAttributes extended — OKSuspend/Resume entries align with Ctld handler branches.
1092-1095: Craned service: new SuspendJobs/ResumeJobs RPCs — OKSurface matches CranedServer implementation and Ctld caller expectations.
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)
617-639: Suspend branch integrates cleanlyPer-task result handling and messages look correct relative to TaskScheduler::SuspendRunningTasks.
src/Craned/Supervisor/TaskManager.cpp (2)
1494-1499: Async handle for task-signal queue is consistent with existing uvw patternInitialization and callback wiring for
m_task_signal_async_handle_mirror other async handles in this class; no behavioral concerns.
1687-1713: SuspendJobAsync/ResumeJobAsync queuing pattern looks correctPromise/future creation, move into
TaskSignalQueueElem, enqueue, and async signal follow the same pattern as other async RPC paths; lifetime and ownership look sound.src/Craned/Supervisor/TaskManager.h (1)
454-457: Public SuspendJobAsync/ResumeJobAsync surface aligns with async patternThe new public methods return
std::future<CraneErrCode>and match how other async operations (ExecuteTaskAsync, ChangeTaskTimeLimitAsync) are exposed; signatures look appropriate for callers viaSupervisorServer.src/Craned/Core/CranedServer.cpp (2)
36-44: Signature/loop normalizations are cosmetic and consistentChanges here are limited to pointer spacing (
grpc::ServerContext*), use ofconst auto&in range-for loops, and minor formatting; they don’t alter behavior and keep style consistent across RPC handlers and server construction.Also applies to: 46-50, 67-75, 207-215, 232-240, 342-370, 376-382, 395-401, 412-420, 438-446, 624-638, 640-668
430-431: Range-for in QuerySshStepEnvVariables now uses const-reference pairUsing
const auto& [name, value]when copying the env map slightly reduces copying and is idiomatic; no functional concerns.src/Craned/Core/CranedServer.h (1)
53-60: Header RPC declarations match implementation and normalize styleAll RPC signatures now consistently take
grpc::ServerContext*and const pointer request/response types, and the newSuspendJobs/ResumeJobsdeclarations align with the implementations inCranedServer.cpp. CranedServer’s constructor signature tweak is purely stylistic.Also applies to: 61-69, 71-77, 79-83, 85-87, 89-103, 105-109, 110-124, 128-129
src/CraneCtld/RpcService/CranedKeeper.h (5)
38-45: CranedStub lifetime and token handling improvements look solidMarking the constructor
explicitand deleting copy/move operations matches the ownership expectations for stubs managed byCranedKeeper.SetRegTokennow takes a const reference andCheckTokenadds logging on mismatch while remaining mutex-guarded, which should help diagnose connection/token issues without changing semantics.Also applies to: 47-52
74-80: New CranedStub::SuspendJobs/ResumeJobs APIs align with RPC layerThe stub now exposes
SuspendJobsandResumeJobstakingconst std::vector<job_id_t>&, which matches the corresponding RPC implementations in the.cppand the CTLD-level job-centric semantics. This is consistent with existing Alloc/Free/Execute/Terminate methods.Also applies to: 86-88
121-145: Internal CranedStub members and cleanup callback remain consistent with keeper usage
m_craned_keeper_pointer,m_clean_up_cb_(now typed asstd::function<void(CranedStub*)>), and the atomic/disconnected flags align with howCranedKeeper’s state machine and cleanup paths use them in the .cpp snippets; no issues spotted.
160-167: CranedKeeper interface tightening (const refs, deleted copy/move) is appropriateDeleting copy/move for
CranedKeeperand takingconst CranedId&/const RegToken&in getters and callbacks (IsCranedConnected,GetCranedStub,SetCranedConnectedCb,PutNodeIntoUnavailSet) improves safety and avoids unnecessary copies. Signatures match the corresponding definitions inCranedKeeper.cpp.Also applies to: 170-172, 184-191
194-213: State-machine helper signatures and callbacks are consistent with implementation
CqTagnow stores aCranedStub*, and helpers likeCranedChannelConnectFail_,CranedChannelConnFailNoLock_,ConnectCranedNode_,InitCranedStateMachine_,EstablishedCranedStateMachine_, andCheckNodeTimeoutAndCleanall take strongly typedCranedStub*/CranedId/RegToken¶meters. This matches the .cpp snippets and clarifies the ownership model for channel-state transitions and connection lifecycle.Also applies to: 217-250
src/Craned/Supervisor/SupervisorServer.h (1)
63-71: LGTM! New suspend/resume RPC handlers follow established patterns.The method declarations are correctly structured and consistent with other gRPC service methods in this class. The implementations (visible in the relevant code snippets) properly return
Status::OKand delegate to the task manager's async methods, which aligns with the system's error handling approach.Note: The naming uses "Job" terminology (SuspendJob/ResumeJob) rather than "Task" (like TerminateTask), which aligns with the PR comment requirement that these operations should only support Job objects.
src/Craned/Supervisor/TaskManager.h
Outdated
| virtual CraneErrCode Suspend(); | ||
| virtual CraneErrCode Resume(); | ||
| virtual CraneErrCode Cleanup() = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Virtual Suspend/Resume declared but no visible definitions for base or ContainerInstance override
ITaskInstance now declares virtual CraneErrCode Suspend(); and Resume();, and ContainerInstance overrides them, but there are no corresponding definitions in this header or in TaskManager.cpp as provided. Since TaskManager::SuspendRunningTasks_() / ResumeSuspendedTasks_() invoke task->Suspend() / task->Resume(), this will lead to undefined references at link time unless these methods are defined in another translation unit.
To make this safe and explicit:
class ITaskInstance {
…
- virtual CraneErrCode Prepare() = 0;
- virtual CraneErrCode Spawn() = 0;
- virtual CraneErrCode Kill(int signum) = 0;
- virtual CraneErrCode Suspend();
- virtual CraneErrCode Resume();
- virtual CraneErrCode Cleanup() = 0;
+ virtual CraneErrCode Prepare() = 0;
+ virtual CraneErrCode Spawn() = 0;
+ virtual CraneErrCode Kill(int signum) = 0;
+ // Provide a concrete default or make these pure virtual and implement in all subclasses.
+ virtual CraneErrCode Suspend();
+ virtual CraneErrCode Resume();
+ virtual CraneErrCode Cleanup() = 0;and then add concrete implementations (e.g., in TaskManager.cpp) for:
CraneErrCode ITaskInstance::Suspend()/Resume()(default behavior or error), andCraneErrCode ContainerInstance::Suspend()/Resume()(container-specific behavior), plusProcInstance::Suspend()/Resume()if processes also support suspend/resume.
If these definitions already exist in another file, ensure they’re committed and that their signatures match these declarations exactly.
Also applies to: 255-257
🤖 Prompt for AI Agents
In src/Craned/Supervisor/TaskManager.h around lines 221-223 (and similarly
255-257), ITaskInstance now declares virtual CraneErrCode Suspend() and Resume()
and ContainerInstance/ProcInstance override them but no definitions are present,
causing potential undefined references; add concrete definitions in the
corresponding .cpp (e.g., TaskManager.cpp or a new TaskInstance.cpp): implement
ITaskInstance::Suspend() and ITaskInstance::Resume() with a safe default return
(e.g., a not-supported error or success), and implement
ContainerInstance::Suspend()/Resume() and ProcInstance::Suspend()/Resume() with
the container/process-specific suspend/resume logic; ensure the function
signatures exactly match the header, compile into the same target, and commit
the new/updated .cpp file so linkers find the symbols.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (2)
src/Craned/Supervisor/TaskManager.cpp (1)
2023-2050: Add status check before suspending tasks.The method should verify that the step is in a
Runningstate before attempting to suspend tasks. Without this check, the method might attempt to suspend tasks inPending,Completing, or other invalid states, leading to unexpected behavior.Apply this diff to add the status check:
CraneErrCode TaskManager::SuspendRunningTasks_() { CRANE_DEBUG("Suspending all running tasks"); + if (g_runtime_status.Status.load() != StepStatus::Running) { + CRANE_DEBUG("Step is not in Running state, cannot suspend."); + return CraneErrCode::ERR_INVALID_PARAM; + } + // Daemon steps don't have actual task instances to suspendBased on learnings.
src/CraneCtld/TaskScheduler.cpp (1)
1297-1313: Use a real step id (kDaemonStepId) instead of0inStepStatusChangeAsync.
handle_failurein both suspend and resume paths calls:
StepStatusChangeAsync(task_id, 0, craned_id, ...)
CleanTaskStatusChangeQueueCb_looks up a step bystep_id(daemon or common step) and drops unknown steps as “Ignoring unknown step…”, so step id0will never be matched. That means these synthetic failure notifications won’t drive job cleanup: resources remain allocated andm_running_task_map_may keep the job indefinitely if no other status change arrives. This is the same issue previously raised for the suspend/resume failure path.To align with existing failure injection (see
TerminateTasksOnCraned, which useskDaemonStepId), use the daemon step id here as well:- for (const auto& craned_id : executing_nodes) { - StepStatusChangeAsync( - task_id, 0, craned_id, crane::grpc::TaskStatus::Failed, - ExitCode::EC_RPC_ERR, "", - google::protobuf::util::TimeUtil::GetCurrentTime()); - } + for (const auto& craned_id : executing_nodes) { + StepStatusChangeAsync( + task_id, kDaemonStepId, craned_id, + crane::grpc::TaskStatus::Failed, + ExitCode::EC_RPC_ERR, "", + google::protobuf::util::TimeUtil::GetCurrentTime()); + }Apply the same change in both
SuspendRunningTasksandResumeSuspendedTasks.Based on learnings
Also applies to: 1440-1456
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/CraneCtld/TaskScheduler.cpp(1 hunks)src/Craned/Supervisor/TaskManager.cpp(3 hunks)src/Craned/Supervisor/TaskManager.h(6 hunks)
🧰 Additional context used
🧠 Learnings (12)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/TaskScheduler.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/TaskScheduler.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/TaskScheduler.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.
Applied to files:
src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.
Applied to files:
src/Craned/Supervisor/TaskManager.h
🧬 Code graph analysis (2)
src/CraneCtld/TaskScheduler.cpp (3)
src/CraneCtld/TaskScheduler.h (1)
TerminateRunningTask(849-857)src/Craned/Core/JobManager.cpp (2)
StepStatusChangeAsync(1231-1240)StepStatusChangeAsync(1231-1235)src/Craned/Core/CtldClient.cpp (2)
StepStatusChangeAsync(564-573)StepStatusChangeAsync(564-565)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (10)
SuspendJobAsync(1687-1699)SuspendJobAsync(1687-1687)ResumeJobAsync(1701-1713)ResumeJobAsync(1701-1701)EvCleanTaskSignalQueueCb_(2005-2021)EvCleanTaskSignalQueueCb_(2005-2005)SuspendRunningTasks_(2023-2050)SuspendRunningTasks_(2023-2023)ResumeSuspendedTasks_(2052-2078)ResumeSuspendedTasks_(2052-2052)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (7)
src/Craned/Supervisor/TaskManager.h (4)
255-257: Ensure ContainerInstance suspend/resume uses CRI pause/unpause APIs.Container suspend/resume should leverage CRI runtime pause/unpause capabilities rather than POSIX signals (SIGSTOP/SIGCONT). Verify that the implementations call the appropriate CRI client methods (e.g.,
PauseContainer/UnpauseContainer) instead of routing throughKill(), which performs destructive teardown operations.Based on learnings.
479-483: LGTM!The
TaskSignalQueueElemstructure is properly defined with theActionenum and promise field, addressing the previous concern about missing type definitions.
454-457: LGTM: Async API design follows existing patterns.The suspend/resume API design is consistent with the existing async patterns in TaskManager (e.g.,
ChangeTaskTimeLimitAsync). The use of promise/future, concurrent queue, and uvw async handle ensures thread-safe operation.Also applies to: 500-500, 505-506, 533-534
221-222: Based on my thorough investigation, I found a critical issue that requires your verification:ITaskInstance::Suspend() and Resume() lack visible implementations, yet code calls them polymorphically.
Key findings:
ITaskInstance base class (lines 221-222): Declares
virtual CraneErrCode Suspend();andvirtual CraneErrCode Resume();as non-pure virtual (no= 0), meaning they require implementations.ContainerInstance (lines 255-256): Declares
CraneErrCode Suspend() override;andCraneErrCode Resume() override;but no implementations found in the codebase.ProcInstance (lines 315-382): Does NOT declare or override Suspend/Resume, so it depends entirely on base class implementations.
Runtime calls (TaskManager.cpp:2038, 2067): Code calls
task->Suspend()andtask->Resume()polymorphically on ITaskInstance pointers.Despite exhaustive searches across all
.cppfiles insrc/Craned/Supervisor/, no implementations with opening braces were found for either the base class or derived classes.This should cause linker errors if implementations are genuinely missing. Since the repository apparently builds successfully, implementations likely exist but weren't detected. They may be:
- In a separate source file outside
src/Craned/Supervisor/- Defined through macro expansion
- Provided as inline empty stubs elsewhere
src/Craned/Supervisor/TaskManager.cpp (3)
1494-1498: LGTM!The async handle initialization follows the established pattern used for other async operations in the TaskManager constructor.
1687-1713: LGTM!Both
SuspendJobAsync()andResumeJobAsync()correctly implement the async pattern: enqueue a signal element with a promise, trigger the async handle, and return a future. The implementation is thread-safe and consistent with other async methods.
2005-2021: LGTM!The queue processing callback correctly dispatches to the appropriate suspend/resume helper and fulfills the promise with the result. The default case properly handles unexpected action values.
| if (iter == m_running_task_map_.end()) { | ||
| CRANE_TRACE("Task #{} not in Rn queue for suspend", task_id); | ||
| results.emplace_back(CraneErrCode::ERR_INVALID_PARAM); | ||
| continue; | ||
| } | ||
|
|
||
| TaskInCtld* task = iter->second.get(); | ||
| if (task->Status() == crane::grpc::Suspended) { | ||
| CRANE_TRACE("Task #{} already suspended", task_id); | ||
| results.emplace_back(CraneErrCode::ERR_INVALID_PARAM); | ||
| continue; | ||
| } | ||
| if (task->Status() != crane::grpc::Running) { | ||
| CRANE_TRACE("Task #{} is not running (status {}) for suspend", task_id, | ||
| static_cast<int>(task->Status())); | ||
| results.emplace_back(CraneErrCode::ERR_INVALID_PARAM); | ||
| continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Align error codes for “task not found” vs. “bad state” between suspend and resume.
In SuspendRunningTasks a missing task in m_running_task_map_ returns ERR_INVALID_PARAM, while ResumeSuspendedTasks returns ERR_NON_EXISTENT for the same condition. Both functions also use ERR_INVALID_PARAM for “wrong current status” (already suspended / not running).
For consistency with other APIs (e.g. ChangeTaskTimeLimit) and clearer client behavior, it would be better to:
- Use
ERR_NON_EXISTENTwhen the task id is not present. - Reserve
ERR_INVALID_PARAMfor “task exists but is in an incompatible state”.
That implies changing the suspend branch for iter == end() to return ERR_NON_EXISTENT.
Also applies to: 1409-1420
🤖 Prompt for AI Agents
In src/CraneCtld/TaskScheduler.cpp around lines 1261-1277 (and similarly for
lines 1409-1420), the branch that handles a missing task in m_running_task_map_
currently returns CraneErrCode::ERR_INVALID_PARAM; change that to
CraneErrCode::ERR_NON_EXISTENT to match ResumeSuspendedTasks and other APIs,
while leaving the branches that detect an existing task in an incompatible state
(already suspended / not running) returning ERR_INVALID_PARAM unchanged; apply
the same change in the second location (1409-1420).
| CraneErrCode TaskManager::ResumeSuspendedTasks_() { | ||
| CRANE_DEBUG("Resuming all suspended tasks"); | ||
|
|
||
| // Daemon steps don't have actual task instances to resume | ||
| // Note: This should not be reached as daemon steps are filtered at RPC level | ||
| if (m_step_.IsDaemonStep()) { | ||
| CRANE_DEBUG("Daemon step has no tasks to resume, skipping."); | ||
| return CraneErrCode::SUCCESS; | ||
| } | ||
|
|
||
| CraneErrCode result = CraneErrCode::SUCCESS; | ||
|
|
||
| for (const auto& task_id : m_step_.task_ids) { | ||
| auto* task = m_step_.GetTaskInstance(task_id); | ||
| if (task) { | ||
| CraneErrCode err = task->Resume(); | ||
| if (err != CraneErrCode::SUCCESS) { | ||
| CRANE_ERROR("Failed to resume task #{}: {}", task_id, CraneErrStr(err)); | ||
| result = err; | ||
| } else { | ||
| CRANE_DEBUG("Task #{} resumed successfully", task_id); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return result; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
Add status check before resuming tasks.
Similar to suspend, the resume operation should verify that the step is in an appropriate state (e.g., Suspended) before attempting to resume tasks. This prevents attempting to resume tasks that are not suspended or are in other invalid states.
Apply this diff to add the status check:
CraneErrCode TaskManager::ResumeSuspendedTasks_() {
CRANE_DEBUG("Resuming all suspended tasks");
+ if (g_runtime_status.Status.load() != StepStatus::Suspended) {
+ CRANE_DEBUG("Step is not in Suspended state, cannot resume.");
+ return CraneErrCode::ERR_INVALID_PARAM;
+ }
+
// Daemon steps don't have actual task instances to resumeBased on learnings.
🤖 Prompt for AI Agents
In src/Craned/Supervisor/TaskManager.cpp around lines 2052 to 2078, add a state
check before resuming tasks: verify the step is in the Suspended state (e.g.,
via m_step_.GetStatus() == StepStatus::Suspended or the project’s equivalent
accessor) and if not, log a debug/error and return an appropriate CraneErrCode
(e.g., CraneErrCode::INVALID_STATE or CraneErrCode::SUCCESS if you prefer a
no-op) instead of iterating over task_ids; only enter the loop and call
task->Resume() when the step is confirmed Suspended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/Craned/Supervisor/TaskManager.cpp (2)
2108-2134: Missing state validation before resume operation.Similar to suspend, this method should verify that the step is in
Suspendedstate before attempting to resume tasks.Consider adding a state check:
CraneErrCode TaskManager::ResumeSuspendedTasks_() { CRANE_DEBUG("Resuming all suspended tasks"); + if (g_runtime_status.Status.load() != StepStatus::Suspended) { + CRANE_DEBUG("Step is not in Suspended state (current: {}), cannot resume.", + g_runtime_status.Status.load()); + return CraneErrCode::ERR_INVALID_PARAM; + } + // Daemon steps don't have actual task instances to resumeBased on learnings.
2079-2106: Missing state validation before suspend operation.The method doesn't verify that the step is in
Runningstate before attempting to suspend tasks. While daemon steps are filtered, other invalid states (e.g.,Completing,Pending) should also be checked to prevent unexpected behavior.Consider adding a state check at the beginning:
CraneErrCode TaskManager::SuspendRunningTasks_() { CRANE_DEBUG("Suspending all running tasks"); + if (g_runtime_status.Status.load() != StepStatus::Running) { + CRANE_DEBUG("Step is not in Running state (current: {}), cannot suspend.", + g_runtime_status.Status.load()); + return CraneErrCode::ERR_INVALID_PARAM; + } + // Daemon steps don't have actual task instances to suspendBased on learnings.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/Craned/Supervisor/TaskManager.cpp(5 hunks)src/Craned/Supervisor/TaskManager.h(7 hunks)
🧰 Additional context used
🧠 Learnings (12)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/Craned/Supervisor/TaskManager.hsrc/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/Craned/Supervisor/TaskManager.hsrc/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/Craned/Supervisor/TaskManager.hsrc/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
Applied to files:
src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.
Applied to files:
src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.
Applied to files:
src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/Craned/Supervisor/TaskManager.hsrc/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
🧬 Code graph analysis (2)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (18)
Suspend(1393-1405)Suspend(1393-1393)Suspend(1432-1444)Suspend(1432-1432)Resume(1407-1419)Resume(1407-1407)Resume(1446-1458)Resume(1446-1446)SuspendJobAsync(1743-1755)SuspendJobAsync(1743-1743)ResumeJobAsync(1757-1769)ResumeJobAsync(1757-1757)EvCleanTaskSignalQueueCb_(2061-2077)EvCleanTaskSignalQueueCb_(2061-2061)SuspendRunningTasks_(2079-2106)SuspendRunningTasks_(2079-2079)ResumeSuspendedTasks_(2108-2134)ResumeSuspendedTasks_(2108-2108)
src/Craned/Supervisor/TaskManager.cpp (1)
src/Craned/Supervisor/TaskManager.h (3)
Suspend(255-255)m_pid_(380-380)Resume(256-256)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (8)
src/Craned/Supervisor/TaskManager.h (4)
221-222: LGTM!The pure virtual
Suspend()andResume()methods are correctly declared in the base interface, with appropriate overrides in derived classes (ContainerInstanceandProcInstance). Implementations are present in the.cppfile.
255-256: Container suspend/resume returns ERR_NOT_SUPPORTED - acceptable for this PR.Containers do not currently support suspend/resume operations. This is a reasonable limitation given the PR scope focuses on process-based job suspend/resume. Future enhancement could add CRI-level pause/unpause support.
456-458: LGTM!The asynchronous suspend/resume API follows the established pattern in
TaskManager(consistent withExecuteTaskAsync,ChangeTaskTimeLimitAsync, etc.) and properly returns futures for error handling.
481-485: LGTM!The
TaskSignalQueueElemstructure is now properly defined (addresses previous review concern). The design with an action enum and promise for async result delivery is clean and follows the existing queue element patterns in this class.src/Craned/Supervisor/TaskManager.cpp (4)
1393-1419: Process suspend/resume implementation is correct.The implementations correctly use:
kill(-m_pid_, SIGSTOP)andkill(-m_pid_, SIGCONT)to signal the entire process group (negative PID)- Appropriate error checking and return codes
- Guard against zero PID
1550-1554: LGTM!The async handle initialization for task signal processing follows the established pattern used for other async operations in
TaskManager.
1743-1769: LGTM!Both
SuspendJobAsync()andResumeJobAsync()correctly implement the async queue pattern:
- Create promise/future pair
- Enqueue request with promise
- Trigger async handler
- Return future to caller
2061-2077: LGTM!The queue callback correctly dispatches suspend/resume actions to the appropriate helper methods and fulfills the associated promises with results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/Craned/Supervisor/TaskManager.cpp (1)
1566-1571: Suspend/resume queue wiring is fine, but step state should be validated before actingThe async wiring—
m_task_signal_async_handle_drivingEvCleanTaskSignalQueueCb_(), which dispatchesSuspend/Resumeactions toSuspendRunningTasks_()/ResumeSuspendedTasks_()—is consistent with the rest of TaskManager’s design and looks correct from a concurrency perspective.However, unlike terminate/time-limit operations,
SuspendRunningTasks_()andResumeSuspendedTasks_()do not check the step’s runtime status (e.g., Running vs Suspended vs Completing) before sending signals. That means suspend/resume can be invoked while the step is in an incompatible state, which may conflict with the intended lifecycle gating enforced elsewhere viag_runtime_statusandCanStepOperate().Consider:
- For suspend: ensure the step is currently Running (and not Completing/Completed/Pending) before iterating tasks.
- For resume: ensure the step is currently Suspended before attempting to resume.
Aligning these helpers with the existing state checks used in terminate/time-limit handlers would make suspend/resume behavior more predictable and avoid surprising no-op or partial effects when the step is mid-transition.
Also applies to: 1759-1785, 2077-2150
🧹 Nitpick comments (4)
src/Craned/Supervisor/TaskManager.h (1)
456-458: Async suspend/resume plumbing fits existing queue/event patternThe new
TaskSignalQueueElem,m_task_signal_async_handle_, andSuspendJobAsync/ResumeJobAsyncmirror the existing async patterns (ChangeTaskTimeLimit, terminate queue, etc.) and look correct: promises are moved into the queue element and fulfilled in the event-loop callback.One thing to keep in mind is the semantic naming:
SuspendJobAsync/ResumeJobAsyncoperate at the step/supervisor level here. If you intend suspend/resume to be job-only at the external API, it may be worth clarifying in comments or docs that this is the per-step handler invoked for a job-level operation.Also applies to: 481-485, 502-503, 507-509, 535-537
src/Craned/Common/CgroupManager.cpp (2)
32-150: Freezer implementation and public Freeze/Thaw API look good, but rely on cg_path being a controller-agnostic pathThe new
ChangeFreezerStateByPath()plusFreezeCgroupByPath/ThawCgroupByPathcorrectly:
- Handle v2 via
cg_path / "cgroup.freeze"with a short-circuit when the state is already desired.- Handle v1 by mapping a generic cgroup path under
kSystemCgPathPrefixto thefreezerhierarchy and usingfreezer.state, again with a short-circuit when alreadyFROZEN/THAWED.This is a solid design, but it assumes that
cg_pathis a controller-agnostic path underkSystemCgPathPrefix(e.g.,/sys/fs/cgroup/crane/job_...), not an already controller-specific path (like/sys/fs/cgroup/freezer/...). Passing a controller-specific path would effectively prepend the controller twice.As long as all callers (e.g.,
StepInstance::cgroup_pathused in Supervisor suspend/resume) follow the generic-path convention, this is fine; it’s worth keeping that invariant in mind when adding future call sites.Also applies to: 894-900
832-892: OOM-counts-by-path helper aligns with freezer path assumptions
ReadOomCountsFromCgroupPath()now:
- For v2, directly uses the provided
cg_pathand readsmemory.events.- For v1, derives a memory-controller-specific path under
kSystemCgPathPrefixand readsmemory.oom_control.This matches the same “controller-agnostic
cg_path” assumption as the freezer helpers. The logic itself is straightforward and preserves the previous behavior of zeroing outputs and returningfalseon any failure. If you ever need to support callers passing controller-specific paths, you may want to factor this to reuseBuildControllerSpecificPath()and explicitly validate the relationship, but it’s not required given current usage.src/Craned/Common/CgroupManager.h (1)
113-123: New freezer and BPF constants are coherent with implementationThe additions for:
kFreezerStateFileV1/kCgroupFreezeFileV2, and- BPF paths/names (
kBpfObjectFilePath,kBpfDeviceMapFilePath,kBpfMapName,kBpfProgramName)match the usage patterns in CgroupManager.cpp. Hard-coding these paths is consistent with the existing design; if deployment environments diverge, configuration knobs might be a future enhancement, but nothing is blocking here.
Also applies to: 307-337
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
src/Craned/Common/CgroupManager.cpp(41 hunks)src/Craned/Common/CgroupManager.h(16 hunks)src/Craned/Supervisor/TaskManager.cpp(5 hunks)src/Craned/Supervisor/TaskManager.h(7 hunks)
🧰 Additional context used
🧠 Learnings (16)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/Craned/Supervisor/TaskManager.hsrc/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/Craned/Supervisor/TaskManager.hsrc/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/Craned/Supervisor/TaskManager.hsrc/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
Applied to files:
src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.
Applied to files:
src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.
Applied to files:
src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/Craned/Supervisor/TaskManager.hsrc/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-04-02T10:11:33.562Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/CgroupManager.cpp:685-685
Timestamp: 2025-04-02T10:11:33.562Z
Learning: In the CgroupManager's GetJobBpfMapCgroupsV2 method, the developer has confirmed that cg_ino_job_id_map will always contain the key->cgroup_id element, making the CRANE_ASSERT check appropriate rather than requiring additional error handling.
Applied to files:
src/Craned/Common/CgroupManager.cppsrc/Craned/Common/CgroupManager.h
📚 Learning: 2025-04-02T09:52:59.318Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/CgroupManager.cpp:697-701
Timestamp: 2025-04-02T09:52:59.318Z
Learning: When using bpf_map__get_next_key function, memory must be properly allocated (e.g., with std::make_unique<BpfKey>()) before passing the pointer to the function, as it writes the key to the provided memory address.
Applied to files:
src/Craned/Common/CgroupManager.cppsrc/Craned/Common/CgroupManager.h
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/Craned/Common/CgroupManager.cpp
📚 Learning: 2025-04-18T02:26:16.113Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched includes assertions in its operator overloads (particularly in operator-=) that verify resources being subtracted are less than or equal to available resources, ensuring no negative values can occur during resource allocation or deallocation operations.
Applied to files:
src/Craned/Common/CgroupManager.h
📚 Learning: 2025-04-18T02:26:16.113Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched contains assertions in its overloaded operators (like -=) that prevent negative values from occurring when resources are allocated or deallocated.
Applied to files:
src/Craned/Common/CgroupManager.h
🧬 Code graph analysis (4)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (18)
Suspend(786-806)Suspend(786-786)Suspend(1437-1449)Suspend(1437-1437)Resume(808-828)Resume(808-808)Resume(1451-1463)Resume(1451-1451)SuspendJobAsync(1759-1771)SuspendJobAsync(1759-1759)ResumeJobAsync(1773-1785)ResumeJobAsync(1773-1773)EvCleanTaskSignalQueueCb_(2077-2093)EvCleanTaskSignalQueueCb_(2077-2077)SuspendRunningTasks_(2095-2122)SuspendRunningTasks_(2095-2095)ResumeSuspendedTasks_(2124-2150)ResumeSuspendedTasks_(2124-2124)
src/Craned/Supervisor/TaskManager.cpp (2)
src/Craned/Supervisor/TaskManager.h (2)
GetParentStepInstance(210-210)m_pid_(380-380)src/Craned/Common/CgroupManager.cpp (4)
FreezeCgroupByPath(894-896)FreezeCgroupByPath(894-894)ThawCgroupByPath(898-900)ThawCgroupByPath(898-898)
src/Craned/Common/CgroupManager.cpp (1)
src/Craned/Common/CgroupManager.h (1)
handle(413-414)
src/Craned/Common/CgroupManager.h (1)
src/Craned/Common/CgroupManager.cpp (18)
ReadOomCountsFromCgroupPath(832-892)ReadOomCountsFromCgroupPath(832-834)FreezeCgroupByPath(894-896)FreezeCgroupByPath(894-894)ThawCgroupByPath(898-900)ThawCgroupByPath(898-898)CreateOrOpen_(437-584)CreateOrOpen_(437-439)GetJobIdsFromCgroupV2_(680-700)GetJobIdsFromCgroupV2_(680-681)task_id_t(735-775)GetJobBpfMapCgroupsV2_(736-737)ParseIdsFromCgroupStr_(410-427)ParseIdsFromCgroupStr_(410-411)InitializeController_(348-391)InitializeController_(348-351)GetCgJobIdMapCgroupV2_(702-731)GetCgJobIdMapCgroupV2_(702-703)
🔇 Additional comments (5)
src/Craned/Supervisor/TaskManager.h (1)
221-222: Suspend/Resume interface and overrides are consistent and safeMaking
Suspend()/Resume()pure virtual onITaskInstanceand overriding them in bothContainerInstanceandProcInstancematches the concrete implementations in TaskManager.cpp and keeps the abstraction clean. No issues from an interface or inheritance standpoint.Also applies to: 255-256, 330-331
src/Craned/Supervisor/TaskManager.cpp (2)
786-828: Container suspend/resume via cgroup freezer looks correctUsing
CgroupManager::FreezeCgroupByPath/ThawCgroupByPathonstep->cgroup_pathwith existence checks and clear logging is a reasonable container suspend/resume implementation for both cgroup v1 and v2. ReturningERR_NON_EXISTENTwhencgroup_pathis empty or missing should give callers a clear signal that suspend/resume isn’t available for this step.
1437-1463: Process-based suspend/resume via SIGSTOP/SIGCONT is straightforward
ProcInstance::Suspend()/Resume()sendSIGSTOP/SIGCONTto the process group (-m_pid_), handle errors, and distinguish between “no pid” and generic failure. This matches the expected semantics for pausing/resuming batch/interactive tasks without changing existing kill behavior.src/Craned/Common/CgroupManager.cpp (1)
653-775: Const-correctness and BPF/resource allocator adjustments are consistent with existing designThe various updates here:
- Using
std::string_view::data()for controller names passed into libcgroup (cgroup_walk_tree_*,cgroup_get_procs) and keeping the views backed by static arrays.- The BPF helpers (
GetJobBpfMapCgroupsV2_, BpfRuntimeInfo methods) now work withstd::make_unique<BpfKey>()for key storage and reuse, consistent with prior learnings.AllocatableResourceAllocator::AllocateandDedicatedResourceAllocator::Allocatenow take resources byconst&(both internal and gRPC types), with no change to semantics.All of this looks internally consistent and doesn’t introduce new behavioral risks.
Also applies to: 1150-1728
src/Craned/Common/CgroupManager.h (1)
340-377: Header declarations for cgroup classes and freezer API match the cpp and improve const-correctness
Cgroup,CgroupInterface,CgroupV1, andCgroupV2constructors now takeconst std::string&and explicitstruct cgroup*, matching the implementations.SetDeviceAccessnow takesconst std::unordered_set<SlotId>&, and allocator methods take resources byconst&, avoiding unnecessary copies.- The new
CgroupManager::FreezeCgroupByPath/ThawCgroupByPathdeclarations align with the definitions in the .cpp.Everything here is consistent with the implementation and preserves the existing public API semantics while making usage slightly cleaner.
Also applies to: 378-390, 497-503, 561-565, 586-590
Summary by CodeRabbit