-
Notifications
You must be signed in to change notification settings - Fork 32
feat: dependency #742
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
feat: dependency #742
Conversation
📝 WalkthroughWalkthroughAdds end-to-end job-dependency support: protobuf additions and error codes, per-task dependency state and dependents, scheduler wiring (DB lookup, dependency-event queue, lifecycle-triggered events), submission futures/promises now carry Changes
Sequence Diagram(s)sequenceDiagram
participant Client as gRPC Client
participant GrpcServer as Ctld gRPC Server
participant Scheduler as TaskScheduler
participant DB as MongoDB
participant Queue as DependencyEventQueue
participant Task as TaskInCtld
Client->>GrpcServer: Submit task (with `dependencies`)
GrpcServer->>Scheduler: SubmitTaskToScheduler(task)
Scheduler->>Task: SetDependency(grpc_deps)
Task-->>Scheduler: dependency state initialized
Scheduler->>Scheduler: build dependee→dependents map
alt missing dependees found
Scheduler->>DB: FetchJobStatus(dependee_ids)
DB-->>Scheduler: statuses & timestamps
Scheduler->>Queue: Enqueue computed dependency events
end
Scheduler->>Scheduler: Check Dependencies().is_met(now)
alt deps met
Scheduler->>Scheduler: schedule task (enqueue/run)
else deps not met
Scheduler-->>Task: remain pending (PendingReason: Dependency / DependencyNeverSatisfied)
end
Note over Task,Scheduler: lifecycle events trigger dependency events
Task->>Scheduler: start/finish/cancel (timestamp & status)
Scheduler->>Queue: Enqueue AFTER/AFTER_OK/AFTER_NOT_OK/AFTER_ANY events
Queue->>Task: UpdateDependency(dependee_id, event_time)
Task-->>Scheduler: dependency updated → may transition to schedulable
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (6)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)
409-421: Batch submit now cleanly propagates both sync and async errors; verify reply layoutThe single‑task and batch submit paths now both unwrap
CraneExpected<std::future<CraneExpected<task_id_t>>>and:
- on inner success: set
ok=true/append realtask_id,- on inner failure: set
ok=false/appendtask_id=0andtask_result.error(),- on outer failure: set
ok=false/appendtask_id=0andresult.error().This gives per‑task visibility into both submission‑time and scheduling‑time failures, which matches the new dependency/DB error codes.
One thing to double‑check on your side: clients’ expectations about
SubmitBatchTasksReply’stask_id_listandcode_list:
- Current logic appends a code only for failed submissions, so
task_id_list.size()may be larger thancode_list.size()when there is a mix of success and failure.- If the proto/clients treat these lists as parallel arrays by index, you may instead want to always append a code (e.g.
SUCCESSfor successful tasks) to keep them aligned.If the existing contract is “codes only for failed entries”, then this implementation is already consistent.
Also applies to: 443-469
src/CraneCtld/CtldPublicDefs.h (1)
414-430: Dependency tracking model inTaskInCtld/DependenciesInJobis coherent with the proto semanticsThe new dependency machinery hangs together well:
DependenciesInJob::is_met()andis_failed()combined withupdate()(min/max overevent_time + delay) correctly implement:
- AND semantics (all deps must fire; any failure makes the job permanently unschedulable),
- OR semantics (first satisfying dep wins; only “all failed” yields a failure state).
- Default state (
depsempty,is_or=false,ready_time=InfinitePast) makes tasks without dependencies immediately eligible.- The
dependents[DependencyType_ARRAYSIZE]fan‑out plusAddDependent/TriggerDependencyEventskeep the “who depends on me” relation local to the producer job and delegate actual wake‑ups throughg_task_scheduler->AddDependencyEvent(), which is a clean separation.- The special‑case in
AddDependentforAFTERwhen the job is alreadyRunning(immediately triggering withstart_time) matches “after job start” semantics and avoids storing unnecessary dependents.As a tiny readability nit, you could make
is_failed()compareready_time == absl::InfiniteFuture()instead of>=to make the intent explicit, but functionally it’s equivalent.Also applies to: 735-736, 748-748, 923-928
src/CraneCtld/CtldPublicDefs.cpp (2)
989-1004: Potential overwriting of multiple dependency conditions per job-idTwo small points to double‑check:
- In
SetDependency, you indexdependencies.depsonly bydep.job_id():dependencies.deps[dep.job_id()] = {dep.type(), dep.delay_seconds()};If the proto ever allows multiple dependency conditions on the same job ID (e.g., different
DependencyTypeentries for the same dependee), later entries silently overwrite earlier ones. If that’s not intended, consider either:
- Rejecting such input explicitly, or
- Extending the internal structure to support multiple conditions per dependee.
SetDependencydoesn’t cleardependencies.depsbefore repopulating. Today it’s only called on fresh or reconstructedTaskInCtldinstances, but if it’s ever reused, stale entries could leak. A defensive clear would make the API safer:void TaskInCtld::SetDependency(const crane::grpc::Dependencies& grpc_deps) { - if (grpc_deps.is_or()) { + dependencies.deps.clear(); + + if (grpc_deps.is_or()) { dependencies.is_or = true; dependencies.ready_time = absl::InfiniteFuture(); } else { dependencies.is_or = false; dependencies.ready_time = absl::InfinitePast(); } ... }Also,
SetFieldsOfTaskInfoemitsdependency_statuswheneverdepsis non‑empty orready_time != absl::InfinitePast(). This assumes the defaultready_timeis exactlyInfinitePast; otherwise tasks without dependencies might still show a dependency block. Worth ensuring the default matches that assumption.Also applies to: 1178-1199
1006-1022:AddDependent/TriggerDependencyEventssemantics and lifecycleThe overall design (dependents stored on the dependee; events fanned out via
TriggerDependencyEventsintoTaskScheduler::AddDependencyEvent) is sound, and the “already running AFTER” fast‑path usingstart_timecorrectly preserves delay semantics.Two minor observations:
- The comment typo “aready satisfied” in
AddDependentshould be fixed when you next touch this code.TriggerDependencyEventsnever clearsdependents[dep_type]. Today each(dep_type, dependee)is only triggered once, so it’s harmless, but if future changes made events fire multiple times per type,DependenciesInJob::updatewould start logging spurious “Dependency for job not found” errors. Clearing after use would be safer:void TaskInCtld::TriggerDependencyEvents( const crane::grpc::DependencyType& dep_type, absl::Time event_time) { - for (task_id_t dependent_id : dependents[dep_type]) { + auto& vec = dependents[dep_type]; + for (task_id_t dependent_id : vec) { g_task_scheduler->AddDependencyEvent(dependent_id, task_id, event_time); } + vec.clear(); }Not urgent, but would reduce risk of noisy logs if the triggering pattern evolves.
src/CraneCtld/TaskScheduler.cpp (2)
399-487: Recovery-time dependency reconstruction and DB backfillThe recovery block that:
- Builds
dependee_to_dependentsfromm_pending_task_map_,- Hooks dependents to in-memory pending/running jobs via
AddDependent, and- For missing dependees, queries MongoDB and enqueues
DependencyEvents with appropriate timestamps (orInfiniteFuture),is a solid way to restore dependency state after restart.
A couple of behavioral notes to be aware of:
- For
AFTERdependencies you fall back tocanceltime whentime_start == 0. That effectively treats a job cancelled before it ever ran as satisfyingAFTER. If your intent is “AFTER job leaves the queue” rather than “AFTER job started”, this is correct; otherwise it may surprise users.- For
AFTER_ANY, whentime_end == 0you useabsl::Now()and log an error. That’s a reasonable best-effort fallback, but the error log will be noisy if this can happen in normal operation.If the semantics above are intentional, the implementation looks good; otherwise it might be worth tightening the conditions or distinguishing “never started” from “ended”.
1521-1587: End-to-endCraneExpectedsubmission flow is reasonable
SubmitTaskToSchedulernow:
- Performs all synchronous checks (UID validity, account/partition permission, QoS enablement, partition allowlist, optional fields, QoS limits) and returns
std::unexpected(err)on failure.- On success, calls
TryMallocQosResource, logs and returnsstd::unexpected(res)if the QoS pool is exhausted.- Only then calls
g_task_scheduler->SubmitTaskAsync(std::move(task))and returns the resultingstd::future<CraneExpected<task_id_t>>.This separation of sync vs async errors is clear. One minor stylistic nit: since you’re inside
TaskScheduler, callingSubmitTaskAsync(std::move(task))directly would avoid the globalg_task_schedulerindirection, but functionally it’s equivalent.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
protos/PublicDefs.proto(4 hunks)src/CraneCtld/CtldPublicDefs.cpp(5 hunks)src/CraneCtld/CtldPublicDefs.h(4 hunks)src/CraneCtld/DbClient.cpp(1 hunks)src/CraneCtld/DbClient.h(1 hunks)src/CraneCtld/RpcService/CtldGrpcServer.cpp(4 hunks)src/CraneCtld/TaskScheduler.cpp(13 hunks)src/CraneCtld/TaskScheduler.h(5 hunks)
🧰 Additional context used
🧠 Learnings (17)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 625
File: src/CraneCtld/TaskScheduler.cpp:2811-2874
Timestamp: 2025-09-22T07:13:20.635Z
Learning: In the CraneSched codebase, there are latch mechanisms used to ensure variable lifetime safety when capturing references in detached thread pool tasks, allowing reference captures that might otherwise be unsafe.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/CtldPublicDefs.hsrc/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/CtldPublicDefs.hsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-04-02T09:30:13.014Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/JobManager.cpp:141-149
Timestamp: 2025-04-02T09:30:13.014Z
Learning: In JobManager, if a uid exists in m_uid_to_job_ids_map_, its corresponding task_ids set is guaranteed to be non-empty due to the invariant maintained in AllocJobs and FreeJobs methods.
Applied to files:
src/CraneCtld/DbClient.cppsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/CraneCtld/CtldPublicDefs.cpp
📚 Learning: 2025-12-08T08:11:40.323Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.323Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.
Applied to files:
src/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-23T02:32:43.952Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is handled/updated separately before calling SetAllocatedRes, so it does not need to be updated again within the method itself.
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-23T02:32:43.952Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is updated before calling SetAllocatedRes, so it doesn't need to be updated again within the method itself.
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
Applied to files:
src/CraneCtld/TaskScheduler.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.
Applied to files:
src/CraneCtld/TaskScheduler.h
🧬 Code graph analysis (4)
src/CraneCtld/DbClient.h (1)
src/CraneCtld/DbClient.cpp (2)
FetchJobStatus(498-548)FetchJobStatus(500-500)
src/CraneCtld/DbClient.cpp (2)
src/Craned/Supervisor/TaskManager.h (1)
job_id(51-51)src/Craned/Supervisor/SupervisorPublicDefs.h (1)
exit_code(37-37)
src/CraneCtld/CtldPublicDefs.h (1)
src/CraneCtld/CtldPublicDefs.cpp (10)
update(122-137)update(122-122)SetDependency(989-1000)SetDependency(989-989)UpdateDependency(1002-1004)UpdateDependency(1002-1002)AddDependent(1006-1015)AddDependent(1006-1007)TriggerDependencyEvents(1017-1022)TriggerDependencyEvents(1017-1018)
src/CraneCtld/TaskScheduler.cpp (1)
src/CraneCtld/TaskScheduler.h (1)
AddDependencyEvent(881-887)
🪛 Cppcheck (2.18.0)
src/CraneCtld/DbClient.cpp
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
src/CraneCtld/CtldPublicDefs.cpp
[information] 22-22: Include file
(missingIncludeSystem)
[information] 22-22: Include file
(missingInclude)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
src/CraneCtld/TaskScheduler.cpp
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
🔇 Additional comments (13)
src/CraneCtld/DbClient.h (1)
122-125: NewFetchJobStatusAPI signature looks consistentReturn type and parameters match the implementation and existing job-related methods; the comment clearly documents the tuple layout. No issues from the header side.
protos/PublicDefs.proto (1)
135-162: Dependency proto extensions are structurally sound and backward‑compatibleThe new
DependencyType/DependencyCondition/Dependencies/DependencyStatusmessages and the added fields onTaskToCtld(15) andTaskInfo(29) use fresh field numbers and match the dependency model implemented inCtldPublicDefs(OR/AND viais_orplus ready-time and infinite sentinels). New error codesERR_MISSING_DEPENDENCYandERR_DB_INSERT_FAILEDfit naturally intoErrCodewithout breaking existing values.No structural or compatibility issues spotted here.
Also applies to: 186-187, 482-485, 730-731
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)
301-315: NestedCraneExpectedhandling inCforedStreamlooks correctThe new unwrapping of
CraneExpected<std::future<CraneExpected<task_id_t>>>correctly distinguishes:
- outer submission errors (
submit_result.error()),- async scheduling errors (
task_result.error()),
and converts both to astd::expected<task_id_t, std::string>viaCraneErrStrforWriteTaskIdReply. That’s consistent with the rest of the error plumbing in this file.src/CraneCtld/CtldPublicDefs.cpp (4)
22-22: Include dependency onTaskScheduler.hlooks appropriateIncluding
TaskScheduler.hhere is reasonable now thatTaskInCtldcallsg_task_schedulerin its dependency helpers; no issues spotted.
122-137:DependenciesInJob::updatelogic is consistent with OR/AND semanticsThe implementation correctly:
- Finds the dependee, logs and returns if missing.
- Computes
dep_ready_time = event_time + delay_seconds.- Aggregates via
minfor OR andmaxfor AND, then erases the processed entry.This matches the usual “earliest satisfied OR / latest satisfied AND” pattern and prepares state for
is_met/is_failedwithout obvious edge-case bugs.
1083-1084: Initializing dependency state fromTaskToCtldis consistentWiring
SetDependency(val.dependencies())intoSetFieldsByTaskToCtldis the right place to initialize per‑task dependency state during both new submissions and recovery. This keeps all proto→in‑memory mapping centralized.
1178-1199:TaskInfo.dependency_statusserialization appears correctThe new
dependency_statusblock:
- Only emits when there are outstanding deps or a non‑default
ready_time.- Reports
is_or, a list of still‑pending conditions, and eitherready_timeor the infinite‑past/future sentinels.This matches the internal
DependenciesInJobmodel and should be sufficient for front‑end/CLI displays to distinguish “waiting for deps” vs “never satisfiable” (assumingis_failed()is implemented consistently).src/CraneCtld/TaskScheduler.h (1)
764-768: API and queue changes for dependency-aware, error-aware submission look coherent
- Changing
SubmitTaskAsyncto returnstd::future<CraneExpected<task_id_t>>andSubmitTaskToSchedulerto returnCraneExpected<std::future<CraneExpected<task_id_t>>>cleanly separates synchronous validation errors from async submission/DB failures.- Updating
m_submit_task_queue_to carrystd::promise<CraneExpected<task_id_t>>is consistent with the new return type, and all call sites in the cpp now set either an id or anstd::unexpectedvalue.- The
DependencyEventstruct plusm_dependency_event_queue_andAddDependencyEventinline helper provide a simple multi‑producer/single‑consumer channel fromTaskInCtldback into the scheduler loop. Usingabsl::Timehere matches the rest of the scheduling code.I don’t see interface or lifetime issues with these changes.
Also applies to: 779-781, 880-887, 1026-1028, 1068-1075
src/CraneCtld/TaskScheduler.cpp (5)
843-859: Dependency event consumption and scheduling gatingInside
ScheduleThread_:
- Dequeuing
DependencyEvents into a local vector and callingUpdateDependencywhile holdingm_pending_task_map_mtx_ensures that dependency state for pending jobs is updated atomically with respect to scheduling.- The subsequent check:
if (!job->Dependencies().is_met(now)) { if (job->Dependencies().is_failed()) job->pending_reason = "DependencyNeverSatisfied"; else job->pending_reason = "Dependency"; continue; }cleanly separates “still waiting” vs “logically impossible” dependencies at the scheduler level.
Assuming
DependenciesInJob::is_met()/is_failed()interpretready_time == InfiniteFuture()as “never satisfiable”, this should behave as expected.Also applies to: 872-879
1290-1292: TriggeringAFTERevents on start-time is consistentCalling:
job->TriggerDependencyEvents(crane::grpc::DependencyType::AFTER, job->StartTime());when jobs transition to running is the right place to satisfy
AFTERdependencies, and using the actual start time preserves correct behavior for delayeddelay_secondsconstraints (including jobs that were already started before a dependent was submitted and attached viaAddDependent).
1376-1385: NewSubmitTaskAsyncimplementation matches queue typeThe new
SubmitTaskAsynccorrectly:
- Constructs a
std::promise<CraneExpected<task_id_t>>,- Enqueues
{std::unique_ptr<TaskInCtld>, promise}intom_submit_task_queue_, and- Returns the associated
std::future<CraneExpected<task_id_t>>.No obvious ownership or lifetime issues; the promise is always either fulfilled in
CleanSubmitQueueCb_or in one of the error branches.
2528-2646: Runtime dependency wiring and missing-dependency rejection on submissionThe changes in
CleanSubmitQueueCb_do several important things:
- Switch
SubmitQueueElemto usestd::promise<CraneExpected<task_id_t>>and correctly fulfill it with eitheridorstd::unexpected(err).- For each accepted task, attach it as a dependent to any dependees found in
m_pending_task_map_orm_running_task_map_viaAddDependent, under the appropriate locks.- If any dependee is missing from both pending and running maps, log a warning, free QoS, fulfill the promise with
ERR_MISSING_DEPENDENCY, and record the job for purging from the embedded DB.This ensures:
- All dependency edges are hooked up before the job becomes visible in the pending queue.
- Submissions with obviously invalid dependency references are rejected quickly and cleaned up in both RAM and the embedded DB.
One behavioral nuance to be aware of: unlike the recovery path, new submissions do not consult MongoDB for already‑completed dependees; they are simply rejected as “missing dependency”. If you expect users to be able to submit “after X” jobs where X has already finished, this will currently fail. If that’s not supported by design, the behavior here is consistent, but it may be worth making that explicit in user‑facing docs.
2783-2796: Status-change drivenAFTER_*event triggeringIn
CleanTaskStatusChangeQueueCb_, when a job finishes you now:
- Compute an accurate
end_timefrom thegoogle::protobuf::Timestampprovided by Craned and calltask->SetEndTime(end_time).- Trigger:
task->TriggerDependencyEvents(AFTER_ANY, end_time); task->TriggerDependencyEvents(AFTER_OK, task_exit_code == 0 ? end_time : absl::InfiniteFuture()); task->TriggerDependencyEvents(AFTER_NOT_OK, task_exit_code != 0 ? end_time : absl::InfiniteFuture());This nicely maps completion semantics into the dependency model:
AFTER_ANY: always signaled at actual end time.AFTER_OK: only signaled for zero exit code; otherwise theInfiniteFuturetimestamp letsDependenciesInJobtreat it as never satisfiable.AFTER_NOT_OK: symmetric to the above.The only gap, as noted earlier, is that pending‑state cancellations never pass through this path, so
AFTER_NOT_OKon those jobs requires special handling if it’s meant to be satisfiable.
| std::unordered_map< | ||
| task_id_t, std::tuple<crane::grpc::TaskStatus, uint32_t, int64_t, int64_t>> | ||
| MongodbClient::FetchJobStatus(const std::unordered_set<task_id_t>& job_ids) { | ||
| std::unordered_map<task_id_t, std::tuple<crane::grpc::TaskStatus, uint32_t, | ||
| int64_t, int64_t>> | ||
| result; | ||
|
|
||
| if (job_ids.empty()) { | ||
| return result; | ||
| } | ||
|
|
||
| try { | ||
| document filter; | ||
| filter.append(kvp("task_id", [&job_ids](sub_document task_id_doc) { | ||
| array task_id_array; | ||
| for (const auto& job_id : job_ids) { | ||
| task_id_array.append(static_cast<std::int32_t>(job_id)); | ||
| } | ||
| task_id_doc.append(kvp("$in", task_id_array)); | ||
| })); | ||
|
|
||
| mongocxx::options::find options; | ||
| document projection; | ||
| projection.append(kvp("task_id", 1)); | ||
| projection.append(kvp("state", 1)); | ||
| projection.append(kvp("exit_code", 1)); | ||
| projection.append(kvp("time_end", 1)); | ||
| projection.append(kvp("time_start", 1)); | ||
| options.projection(projection.view()); | ||
|
|
||
| mongocxx::cursor cursor = | ||
| (*GetClient_())[m_db_name_][m_task_collection_name_].find(filter.view(), | ||
| options); | ||
|
|
||
| for (auto view : cursor) { | ||
| task_id_t task_id = view["task_id"].get_int32().value; | ||
| crane::grpc::TaskStatus status = | ||
| static_cast<crane::grpc::TaskStatus>(view["state"].get_int32().value); | ||
| uint32_t exit_code = view["exit_code"].get_int32().value; | ||
| int64_t time_end = view["time_end"].get_int64().value; | ||
| int64_t time_start = view["time_start"].get_int64().value; | ||
|
|
||
| result.emplace(task_id, | ||
| std::make_tuple(status, exit_code, time_end, time_start)); | ||
| } | ||
| } catch (const std::exception& e) { | ||
| CRANE_ERROR("Failed to fetch job status by IDs: {}", e.what()); | ||
| } | ||
|
|
||
| return result; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guard FetchJobStatus against step-only documents (missing has_job_info)
The query currently filters only on task_id, so it can match documents created by InsertSteps that contain just task_id and steps but lack state/exit_code/time_start/time_end. Accessing those fields will throw, hit the catch, and you’ll return an empty/partial map indistinguishable from “no rows”.
To align with FetchJobRecords() and avoid this silent drop, consider restricting to full job docs:
try {
- document filter;
- filter.append(kvp("task_id", [&job_ids](sub_document task_id_doc) {
+ document filter;
+ // Only query documents with complete job information.
+ filter.append(kvp("has_job_info", true));
+ filter.append(kvp("task_id", [&job_ids](sub_document task_id_doc) {
array task_id_array;
for (const auto& job_id : job_ids) {
task_id_array.append(static_cast<std::int32_t>(job_id));
}
task_id_doc.append(kvp("$in", task_id_array));
}));If you also need to distinguish “query failed” from “no such tasks”, you might later wrap the result in a CraneExpected<...> instead of returning a bare map, but that’s optional for this PR.
🧰 Tools
🪛 Cppcheck (2.18.0)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
🤖 Prompt for AI Agents
In src/CraneCtld/DbClient.cpp around lines 498-548, FetchJobStatus currently
queries only by task_id and can match step-only documents that lack job fields,
causing field access to throw and return an indistinguishable empty/partial map;
update the MongoDB filter to only select full job documents (e.g., add a
condition such as "has_job_info": true or require existence of
"state"/"time_start"/"time_end" using $exists/$type) so only documents
containing state/exit_code/time_start/time_end are returned, and keep the
existing projection and field reads unchanged.
c38d42e to
da432fb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/CraneCtld/DbClient.cpp (1)
498-548: GuardFetchJobStatusagainst step-only documentsThe query filters only on
task_idwithout ensuring the document has complete job info. Documents created byInsertStepsonly havetask_idandstepsarray but lackstate,exit_code,time_start,time_end. Accessing these fields on such documents will throw an exception, causing the function to return an incomplete map.Add
has_job_info: trueto the filter to match only complete job documents, consistent withFetchJobRecords():try { document filter; + filter.append(kvp("has_job_info", true)); filter.append(kvp("task_id", [&job_ids](sub_document task_id_doc) { array task_id_array; for (const auto& job_id : job_ids) { task_id_array.append(static_cast<std::int32_t>(job_id)); } task_id_doc.append(kvp("$in", task_id_array)); }));src/CraneCtld/TaskScheduler.cpp (1)
2486-2495: Cancelled pending tasks still missing AFTER_NOT_OK events and exit code.This is the same issue raised in previous reviews: when pending tasks are cancelled, they don't set a non-zero exit code or trigger
AFTER_NOT_OKdependency events, leaving dependents stuck.Compare to task completion (lines 2787-2795), which conditionally fires
AFTER_OKandAFTER_NOT_OKbased onexit_code. Apply the same logic here:
- Set
task->SetExitCode(ExitCode::EC_TERMINATED)(or an appropriate cancellation code) before line 2489.- Add
task->TriggerDependencyEvents(crane::grpc::DependencyType::AFTER_NOT_OK, end_time)after line 2494.
🧹 Nitpick comments (3)
src/CraneCtld/CtldPublicDefs.cpp (2)
22-22: Consider reducing coupling between CtldPublicDefs and TaskScheduler.The include creates a circular dependency between these headers (CtldPublicDefs.h ← TaskScheduler.h ← CtldPublicDefs.h). While this works due to include guards, it tightly couples the modules. Consider one of these refactors for future work:
- Inject the event queue or callback into TaskInCtld rather than accessing
g_task_schedulerdirectly.- Use an observer/listener pattern where TaskInCtld notifies registered listeners.
- Move dependency event triggering to TaskScheduler itself.
122-137: Consider improving error handling in dependency updates.If a dependency is not found in the
depsmap (Line 124), the method logs an error but continues without indicating failure to the caller. This could leave dependency state inconsistent—the task won't know a dependency update was missed.Consider returning a boolean success indicator or using CraneExpected to allow callers to detect and handle missing dependencies (e.g., by querying the database or failing the dependent task).
src/CraneCtld/TaskScheduler.cpp (1)
448-459: Verify AFTER dependency semantics for cancelled tasks without start time.For
AFTERdependencies, lines 452-454 handle cancelled tasks by usingtime_end, but the logic in lines 450-458 suggests that if a dependee was cancelled before it started (i.e.,time_start == 0), the event time falls through to the error case (line 455-457).Confirm whether
AFTERdependencies on a pending‑cancelled job (notime_start) should:
- Fire immediately using
time_end(current behavior forstatus == Cancelled),- Never satisfy (remain at
InfiniteFuture), or- Be treated as an error condition.
If the intent is that
AFTERfires when the dependee starts or is cancelled before starting, document this explicitly.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
docs/en/command/calloc.md(1 hunks)docs/en/command/cbatch.md(1 hunks)docs/en/command/crun.md(1 hunks)docs/zh/command/calloc.md(1 hunks)docs/zh/command/cbatch.md(1 hunks)docs/zh/command/crun.md(1 hunks)protos/PublicDefs.proto(4 hunks)src/CraneCtld/CtldPublicDefs.cpp(5 hunks)src/CraneCtld/CtldPublicDefs.h(4 hunks)src/CraneCtld/DbClient.cpp(1 hunks)src/CraneCtld/DbClient.h(1 hunks)src/CraneCtld/RpcService/CtldGrpcServer.cpp(4 hunks)src/CraneCtld/TaskScheduler.cpp(13 hunks)src/CraneCtld/TaskScheduler.h(5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/CraneCtld/RpcService/CtldGrpcServer.cpp
🧰 Additional context used
🧠 Learnings (16)
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/DbClient.cpp
📚 Learning: 2025-12-08T08:11:40.323Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.323Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.
Applied to files:
src/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cppsrc/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.
Applied to files:
src/CraneCtld/TaskScheduler.h
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-04-02T09:30:13.014Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/JobManager.cpp:141-149
Timestamp: 2025-04-02T09:30:13.014Z
Learning: In JobManager, if a uid exists in m_uid_to_job_ids_map_, its corresponding task_ids set is guaranteed to be non-empty due to the invariant maintained in AllocJobs and FreeJobs methods.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/CraneCtld/DbClient.cpp
📚 Learning: 2025-05-23T02:32:43.952Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is handled/updated separately before calling SetAllocatedRes, so it does not need to be updated again within the method itself.
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-23T02:32:43.952Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is updated before calling SetAllocatedRes, so it doesn't need to be updated again within the method itself.
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-04-02T10:11:33.562Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/CgroupManager.cpp:685-685
Timestamp: 2025-04-02T10:11:33.562Z
Learning: In the CgroupManager's GetJobBpfMapCgroupsV2 method, the developer has confirmed that cg_ino_job_id_map will always contain the key->cgroup_id element, making the CRANE_ASSERT check appropriate rather than requiring additional error handling.
Applied to files:
src/CraneCtld/DbClient.cpp
🧬 Code graph analysis (3)
src/CraneCtld/TaskScheduler.h (1)
src/CraneCtld/TaskScheduler.cpp (4)
SubmitTaskAsync(1376-1385)SubmitTaskAsync(1376-1377)SubmitTaskToScheduler(1521-1588)SubmitTaskToScheduler(1522-1522)
src/CraneCtld/TaskScheduler.cpp (1)
src/CraneCtld/TaskScheduler.h (1)
AddDependencyEvent(881-887)
src/CraneCtld/CtldPublicDefs.h (1)
src/CraneCtld/CtldPublicDefs.cpp (10)
update(122-137)update(122-122)SetDependency(989-1000)SetDependency(989-989)UpdateDependency(1002-1004)UpdateDependency(1002-1002)AddDependent(1006-1015)AddDependent(1006-1007)TriggerDependencyEvents(1017-1022)TriggerDependencyEvents(1017-1018)
🪛 Cppcheck (2.18.0)
src/CraneCtld/CtldPublicDefs.cpp
[information] 22-22: Include file
(missingIncludeSystem)
[information] 22-22: Include file
(missingInclude)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
src/CraneCtld/TaskScheduler.cpp
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
src/CraneCtld/DbClient.cpp
[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.
(normalCheckLevelMaxBranches)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (29)
docs/zh/command/calloc.md (1)
37-37: LGTM!The new
-d, --dependencyoption is well documented with clear format specifications and supported dependency types.docs/zh/command/crun.md (1)
15-15: LGTM!Documentation for the dependency option is consistent with the other command documentation in this PR.
docs/en/command/calloc.md (1)
37-37: LGTM!The dependency option documentation is clear and consistent with the Chinese documentation.
docs/en/command/crun.md (1)
15-15: LGTM!Documentation is consistent with other command documentation files.
docs/en/command/cbatch.md (1)
72-72: LGTM!The dependency option is appropriately placed under Scheduling Options and documented consistently with other commands.
docs/zh/command/cbatch.md (1)
72-72: LGTM!Chinese documentation is consistent with the English version.
src/CraneCtld/DbClient.h (1)
122-125: LGTM! Clear API for dependency resolution.The method signature follows existing patterns in MongodbClient and provides the necessary status information for the dependency subsystem to resolve missing dependee tasks.
protos/PublicDefs.proto (4)
135-162: Well-designed dependency modeling.The dependency types align with standard scheduler semantics (similar to Slurm's dependency options), and the use of
is_orto distinguish AND/OR logic is clear. Theready_timeoneof handling infinite future/past states elegantly represents pending/failed conditions without requiring sentinel values.
186-186: LGTM! Dependency field added to task submission.Field number and placement are appropriate.
485-485: LGTM! Exposes dependency status for client queries.The optional field allows tasks without dependencies to omit the status, reducing message size.
730-731: LGTM! Error codes support dependency handling.Sequential numbering and clear names for dependency-related error cases.
src/CraneCtld/CtldPublicDefs.h (4)
414-430: Dependency state logic is sound.The
is_met()andis_failed()methods correctly implement AND/OR semantics:
- AND dependencies: All must succeed (deps empty + ready_time reached), any failure causes task failure (ready_time = InfiniteFuture).
- OR dependencies: Any success satisfies (ready_time = earliest), all must fail to cause task failure (deps empty + InfiniteFuture).
The
update()method signature indicates it will process events and maintain consistency betweendepsmap andready_time.
735-735: LGTM! Efficient dependent tracking.Using an array indexed by
DependencyTypeprovides O(1) access when triggering dependency events for each condition type.
748-748: LGTM! Per-task dependency state.Stores the dependency conditions this task must satisfy before running.
923-928: LGTM! Complete dependency management API.The methods cover the full dependency lifecycle: initialization from submission data, event processing, dependent registration, and event propagation on status changes.
src/CraneCtld/CtldPublicDefs.cpp (4)
989-1000: LGTM! Correct dependency initialization.The initialization logic correctly sets
ready_timetoInfiniteFuturefor OR dependencies (will be reduced to earliest satisfied time) andInfinitePastfor AND dependencies (will be increased to latest satisfied time).
1002-1004: LGTM! Clean delegation.Delegates to
DependenciesInJob::update()as expected.
1017-1022: LGTM! Dependency event propagation.Enqueues events for all registered dependents. The use of
g_task_scheduler->AddDependencyEvent()relies on the global being initialized at startup, which is consistent with the system's initialization order.
1178-1199: LGTM! Efficient dependency status serialization.Only serializes dependency status when present, and correctly handles the three
ready_timestates (normal timestamp, infinite future for pending OR, infinite past for initial AND) using the protobuf oneof.src/CraneCtld/TaskScheduler.h (4)
764-767: LGTM! Improved error handling via CraneExpected.Wrapping the result in
CraneExpected<task_id_t>enables callers to distinguish between successful submission (with new task ID) and failures (e.g., missing dependencies, validation errors). This is essential for the dependency subsystem to provide meaningful feedback.
779-780: LGTM! Two-stage error handling for async submission.The nested
CraneExpectedstructure correctly models two error stages:
- Synchronous validation (outer): Authorization, resource checks, etc.
- Asynchronous processing (inner): DB insertion, dependency resolution, etc.
This aligns with the async submission workflow.
880-887: LGTM! Clean dependency event API.The inline implementation is straightforward, and the comment documenting the
InfiniteFutureconvention for failed dependencies is helpful for maintainability.
1068-1074: LGTM! Event queue for dependency propagation.The
DependencyEventstructure captures the necessary information (which task depends on which, when it was satisfied), andConcurrentQueueensures thread-safe async processing.src/CraneCtld/TaskScheduler.cpp (6)
843-859: LGTM! Dependency event processing is correct.The batch dequeue and update loop properly propagates dependency events to pending tasks.
872-879: LGTM! Dependency-aware scheduling guard is correct.Tasks with unsatisfied dependencies are correctly skipped with clear pending reasons.
1290-1292: LGTM! AFTER dependency event triggered at task start.Correctly fires
AFTERdependency events when a task begins execution.
1376-1385: LGTM! CraneExpected refactoring is consistent and correct.The signature changes properly wrap async submission results in
CraneExpectedfor improved error propagation. The nestedCraneExpected<std::future<CraneExpected<task_id_t>>>correctly separates immediate validation errors (outer) from async submission errors (inner).Also applies to: 1521-1588
2588-2646: Dependency validation and missing-dependency rejection logic is correct.The flow properly:
- Persists tasks to embedded DB (line 2575-2584)
- Validates dependencies against pending/running maps (lines 2595-2628)
- Rejects tasks with missing dependencies (lines 2620-2627) by setting
ERR_MISSING_DEPENDENCYand skipping pending-map insertion- Purges rejected tasks from embedded DB (lines 2638-2645)
Note: Tasks are written to embedded DB before dependency validation, so rejected tasks require cleanup. While this could be optimized by checking dependencies first, the current approach is acceptable given that missing dependencies should be rare.
2783-2795: LGTM! Task completion dependency event handling is the correct pattern.This properly triggers all dependency types based on exit code:
AFTER_ANYfires unconditionally withend_timeAFTER_OKfires only whenexit_code == 0AFTER_NOT_OKfires only whenexit_code != 0Using
InfiniteFuturefor unsatisfied conditions correctly prevents dependent tasks from proceeding. This is the pattern that should be applied to cancelled pending tasks (lines 2486-2495).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
docs/zh/reference/job_dependency.md (1)
27-29: Specify language identifiers on fenced code blocks.Code blocks at lines 27 and 164 are missing language specifications. Add
textidentifier for syntax format and output examples.- ``` + ```text <type>:<job_id>[+<delay>][:<job_id>[+<delay>]]... - ``` + ```textand
- ``` + ```text JobId=105 ... Dependency=PendingDependencies=afterok:100+01:00:00 Status=WaitForAll - ``` + ```textAlso applies to: 164-168
docs/zh/reference/pending_reason.md (1)
16-22: Specify language identifiers on fenced code blocks.Code blocks at lines 16 and 31 are missing language specifications. Add
textidentifier for command outputs.- ``` + ```text JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 101 CPU job1 user1 PD 0:00 2 (Priority) 102 CPU job2 user1 PD 0:00 4 (Resource) 103 GPU job3 user2 PD 0:00 1 (Dependency) 104 CPU job4 user1 PD 0:00 2 (Held) - ``` + ```textand
- ``` + ```text JobId=101 ... State=PENDING Reason=Priority - ``` + ```textAlso applies to: 31-36
docs/en/reference/pending_reason.md (1)
16-22: Specify language identifiers on fenced code blocks.Code blocks at lines 16 and 31 are missing language specifications. Add
textidentifier for command outputs.- ``` + ```text JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 101 CPU job1 user1 PD 0:00 2 (Priority) 102 CPU job2 user1 PD 0:00 4 (Resource) 103 GPU job3 user2 PD 0:00 1 (Dependency) 104 CPU job4 user1 PD 0:00 2 (Held) - ``` + ```textand
- ``` + ```text JobId=101 ... State=PENDING Reason=Priority - ``` + ```textAlso applies to: 31-36
docs/en/reference/job_dependency.md (1)
27-29: Specify language identifiers on fenced code blocks.Code blocks at lines 27 and 164 are missing language specifications. Add
textidentifier for syntax format and output examples.- ``` + ```text <type>:<job_id>[+<delay>][:<job_id>[+<delay>]]... - ``` + ```textand
- ``` + ```text JobId=105 ... Dependency=PendingDependencies=afterok:100+01:00:00 Status=WaitForAll - ``` + ```textAlso applies to: 164-168
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
docs/en/reference/job_dependency.md(1 hunks)docs/en/reference/pending_reason.md(1 hunks)docs/zh/reference/job_dependency.md(1 hunks)docs/zh/reference/pending_reason.md(1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/zh/reference/job_dependency.md
[uncategorized] ~69-~69: 数词与名词之间一般应存在量词,可能缺少量词。
Context: ...100 开始 并且 作业 101 成功完成。 #### OR 逻辑(任一条件满足) 使用 ? 分隔不同的依赖条件: ```bash afterok:...
(wa5)
🪛 markdownlint-cli2 (0.18.1)
docs/en/reference/pending_reason.md
16-16: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
31-31: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
docs/zh/reference/job_dependency.md
27-27: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
164-164: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
docs/en/reference/job_dependency.md
27-27: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
164-164: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
docs/zh/reference/pending_reason.md
16-16: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
31-31: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
ca1e707 to
a05a02c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/CraneCtld/TaskScheduler.cpp (1)
2486-2498: Cancelled pending tasks still never satisfyAFTER_NOT_OKdependenciesFor cancelled pending tasks you now:
- set
status = Cancelledand a commonend_time,- fire
AFTERandAFTER_ANYwithend_time, and- fire
AFTER_OKandAFTER_NOT_OKwithabsl::InfiniteFuture().Given how
DependenciesInJob::updateaggregatesready_time, this means a dependent that specifies onlyAFTER_NOT_OKon a never-started job will end up withready_time == InfiniteFutureand (assumingis_failed()looks for that) will be stuck in"DependencyNeverSatisfied"rather than ever running. That partially addresses the earlier issue (no longer silently stuck in plain"Dependency"), but it still treats a cancelled pending job as neither OK nor NOT_OK.If the intended semantics are “
AFTER_NOT_OKruns when the dependee finishes unsuccessfully, including cancellations”, you likely want to:
- assign a non-zero exit code (e.g.
ExitCode::EC_TERMINATED) to cancelled pending tasks, and- call
TriggerDependencyEvents(DependencyType::AFTER_NOT_OK, end_time)(withAFTER_OKstaying atInfiniteFuture()).Otherwise, please document clearly that
AFTER_NOT_OKonly applies to jobs that actually started execution so users don’t expect it to trigger after cancelling a queued job.
🧹 Nitpick comments (2)
src/CraneCtld/CtldPublicDefs.cpp (1)
989-1000: Cleardependenciesbefore re-populating from gRPC to avoid stale stateIf
TaskInCtld::SetDependency()is ever invoked more than once on the same instance (e.g. requeue/update), old entries remain for dependees that were dropped from the protobuf. Safer to reset before filling:void TaskInCtld::SetDependency(const crane::grpc::Dependencies& grpc_deps) { - if (grpc_deps.is_or()) { - dependencies.is_or = true; - dependencies.ready_time = absl::InfiniteFuture(); - } else { - dependencies.is_or = false; - dependencies.ready_time = absl::InfinitePast(); - } - for (const auto& dep : grpc_deps.deps()) { - dependencies.deps[dep.job_id()] = {dep.type(), dep.delay_seconds()}; - } + dependencies.deps.clear(); + dependencies.is_or = grpc_deps.is_or(); + dependencies.ready_time = grpc_deps.is_or() ? absl::InfiniteFuture() + : absl::InfinitePast(); + for (const auto& dep : grpc_deps.deps()) { + dependencies.deps[dep.job_id()] = {dep.type(), dep.delay_seconds()}; + } }src/CraneCtld/TaskScheduler.cpp (1)
1376-1385: CraneExpected-wrapped async submission path is wired correctly
SubmitTaskAsyncnow queues{std::unique_ptr<TaskInCtld>, std::promise<CraneExpected<task_id_t>>}and returns astd::future<CraneExpected<task_id_t>>, whileSubmitTaskToSchedulervalidates the task, does QoS accounting, and on success returns that future wrapped inCraneExpected. This matches the updatedCleanSubmitQueueCb_logic (which fulfills the promise with either a task id or an error). Minor style nit: insideSubmitTaskToScheduler, callingSubmitTaskAsync(std::move(task))onthisinstead of viag_task_schedulerwould avoid another hard dependency on the global.Also applies to: 1521-1585
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/CraneCtld/CtldPublicDefs.cpp(5 hunks)src/CraneCtld/TaskScheduler.cpp(13 hunks)
🧰 Additional context used
🧠 Learnings (12)
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/CraneCtld/CtldPublicDefs.cpp
📚 Learning: 2025-12-08T08:11:40.323Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.323Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.
Applied to files:
src/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-04-02T09:30:13.014Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/JobManager.cpp:141-149
Timestamp: 2025-04-02T09:30:13.014Z
Learning: In JobManager, if a uid exists in m_uid_to_job_ids_map_, its corresponding task_ids set is guaranteed to be non-empty due to the invariant maintained in AllocJobs and FreeJobs methods.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
🧬 Code graph analysis (1)
src/CraneCtld/TaskScheduler.cpp (1)
src/CraneCtld/TaskScheduler.h (1)
AddDependencyEvent(881-887)
🔇 Additional comments (7)
src/CraneCtld/CtldPublicDefs.cpp (3)
122-137: Dependency aggregation inDependenciesInJob::updatelooks correctUsing
event_time + delay_secondsand aggregating viamin/maxfor OR/AND, then erasing the satisfied entry, matches the model implied byis_met/is_failedand the scheduler’s dependency event queue. No change needed.
1006-1022:AddDependent/TriggerDependencyEventsintegrate cleanly with the scheduler queueShort‑circuiting
AFTERwhen the dependee is alreadyRunningby immediately enqueueing a dependency event, and otherwise recording dependents perDependencyType, aligns with howScheduleThread_drainsm_dependency_event_queue_viaUpdateDependency(). AssumingTaskInCtldlives only under a validg_task_scheduler, this is fine.
1178-1199:TaskInfo.dependency_statusserialization covers pending, finite, and ±∞ readinessConditioning emission on either non‑empty deps or a non‑
InfinitePastready_time, and then mappingready_timetoinfinite_future/infinite_pastflags or a concrete Unix seconds value, gives clients enough information to distinguish “no deps”, “waiting”, “ready at T”, and “never/always ready”. Looks good as-is.src/CraneCtld/TaskScheduler.cpp (4)
399-487: Init-time dependency reconstruction and MongoDB resolution look coherent; verifyis_failed()semanticsThe recovery block correctly:
- builds a dependee→dependents index from
m_pending_task_map_,- attaches dependents to recovered pending/running tasks, and
- for truly missing dependees, queries MongoDB and computes
event_timeperDependencyTypebefore enqueuingAddDependencyEvent(job_id, dep_id, event_time).This relies on
DependenciesInJob::is_failed()treating “all events delivered andready_time == absl::InfiniteFuture()” as permanently unsatisfiable; otherwise dependents of finished dependees with unsatisfiedAFTER_OK/AFTER_NOT_OKcould stay in"Dependency"rather than"DependencyNeverSatisfied". Please double-check that implementation matches this assumption.
843-859: Dependency event pump and gating inScheduleThread_are consistentDraining
m_dependency_event_queue_underm_pending_task_map_mtx_, applying each event viait->second->UpdateDependency(event.dependee_job_id, event.event_time), and then gating scheduling on!Dependencies().is_met(now)with aDependency/DependencyNeverSatisfiedpending reason uses the new model as intended. EmittingAFTERevents at job start (TriggerDependencyEvents(DependencyType::AFTER, StartTime())) plugs into the same pipeline cleanly.Also applies to: 872-879, 1290-1292
2532-2535: Submission-time dependency registration and missing-dep rejection behaviour
CleanSubmitQueueCb_now:
- stores promises as
std::promise<CraneExpected<task_id_t>>,- for each accepted task, wires its dependencies to existing pending dependees (under
m_pending_task_map_mtx_) and then to running dependees (underm_running_task_map_mtx_), and- if any dependee is neither pending nor running, rejects the task with
ERR_MISSING_DEPENDENCY, frees QoS, and schedules its DB entry for purge (tasks_to_purge).Functionally this is sound and avoids leaving such tasks in a permanent “dependency” state. Just note the behavioural change: jobs depending on already-finished or unknown jobs now fail fast at submission instead of entering the queue and later becoming
"DependencyNeverSatisfied". If that’s the intended UX for--dependency, it’s worth making explicit in the docs.Also applies to: 2585-2586, 2592-2649, 2664-2666
2787-2799: Final-state dependency events correctly differentiate ANY / OK / NOT_OKUsing the gRPC
timestampto computeend_time, then:
- always firing
AFTER_ANYatend_time,- firing
AFTER_OKatend_timeonly whenexit_code == 0(elseInfiniteFuture), and- firing
AFTER_NOT_OKatend_timeonly whenexit_code != 0(elseInfiniteFuture),aligns with the
DependenciesInJobaggregation logic and gives predictable behaviour for dependents of completed/failed jobs.
cbb7326 to
3ef3692
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)
442-468: Maintain alignment between task_id_list and code_list in SubmitBatchTasksThe success path appends only to
task_id_listwithout appending tocode_list, while error paths append to both. This breaks the 1:1 correspondence between the repeated fields defined inSubmitBatchTasksReply, making it impossible for clients to correlate task results with error codes when some or all tasks succeed.Always append a code entry for each task:
for (auto& res : results) { if (res.has_value()) { CraneExpected<task_id_t> task_result = res.value().get(); if (task_result.has_value()) { response->mutable_task_id_list()->Add(task_result.value()); + response->mutable_code_list()->Add(CraneErrCode::SUCCESS); } else { response->mutable_task_id_list()->Add(0); response->mutable_code_list()->Add(task_result.error()); } } else { response->mutable_task_id_list()->Add(0); response->mutable_code_list()->Add(res.error()); } }
♻️ Duplicate comments (1)
src/CraneCtld/TaskScheduler.cpp (1)
2515-2528: Cancelled pending tasks trigger incorrect dependency eventsWhen pending tasks are cancelled, the code triggers both
AFTER_OKandAFTER_NOT_OKwithabsl::InfiniteFuture():task->TriggerDependencyEvents(crane::grpc::AFTER_OK, absl::InfiniteFuture()); task->TriggerDependencyEvents(crane::grpc::AFTER_NOT_OK, absl::InfiniteFuture());This means dependents waiting on either
AFTER_OKorAFTER_NOT_OKwill never have their dependencies satisfied, causing them to remain stuck in"Dependency"state indefinitely.Semantic issue: A cancelled job should be treated as a failure. Therefore:
AFTER_NOT_OKshould fire withend_time(dependency satisfied at cancellation time)AFTER_OKshould fire withabsl::InfiniteFuture()(will never be satisfied)Additionally, no exit code is set before triggering events. Set a non-zero exit code (e.g.,
ExitCode::EC_TERMINATED) before triggering dependency events.Apply this diff:
task->SetStatus(crane::grpc::Cancelled); task->SetEndTime(end_time); g_account_meta_container->FreeQosResource(*task); + task->SetExitCode(ExitCode::EC_TERMINATED); task->TriggerDependencyEvents(crane::grpc::DependencyType::AFTER, end_time); task->TriggerDependencyEvents(crane::grpc::DependencyType::AFTER_ANY, end_time); task->TriggerDependencyEvents(crane::grpc::AFTER_OK, - absl::InfiniteFuture()); + absl::InfiniteFuture()); // Never satisfied task->TriggerDependencyEvents(crane::grpc::AFTER_NOT_OK, - absl::InfiniteFuture()); + end_time); // Satisfied at cancel time
🧹 Nitpick comments (8)
docs/en/reference/pending_reason.md (1)
16-22: Add languages to fenced code blocks to satisfy markdownlintThe two “Example output” blocks use bare fenced code; markdownlint (MD040) flags this. Consider marking them as plain text:
-Example output: -``` +Example output: +```text JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) ... -``` +``` ... -Example output: -``` +Example output: +```text JobId=101 ... State=PENDING Reason=Priority -``` +```This keeps the rendering while making markdownlint happy.
Based on static analysis hints.
Also applies to: 31-36
docs/zh/reference/job_dependency.md (1)
27-29: Specify languages on plain fenced blocks to satisfy markdownlintTwo fenced blocks are missing a language (MD040):
- Dependency syntax example
ccontrol show joboutput exampleYou can mark them as plain text:
-``` +```text <type>:<job_id>[+<delay>][:<job_id>[+<delay>]]... -``` +``` ... -``` +```text JobId=105 ... Dependency=PendingDependencies=afterok:100+01:00:00 Status=WaitForAll -``` +```This keeps rendering unchanged while clearing the lint warnings.
Based on static analysis hints.
Also applies to: 164-168
docs/en/reference/job_dependency.md (1)
27-29: Add languages to plain fenced blocks for markdownlintThe dependency syntax block and
ccontrol show joboutput use bare ``` fences, which triggers MD040. Consider marking them as text:-``` +```text <type>:<job_id>[+<delay>][:<job_id>[+<delay>]]... -``` +``` ... -``` +```text JobId=105 ... Dependency=PendingDependencies=afterok:100+01:00:00 Status=WaitForAll -``` +```This keeps the examples readable while satisfying markdownlint.
Based on static analysis hints.
Also applies to: 164-168
src/CraneCtld/DbClient.h (1)
122-125: Use a named type for the job-status tuple and confirm task_id_t widthThe new API is clear in the comment, but the raw tuple type:
std::tuple<crane::grpc::TaskStatus, uint32_t, int64_t, int64_t>is not self‑describing at call sites and is easy to misuse if the field order changes.
Consider introducing a named alias or small struct and using it in the map:
// Near other type aliases using JobStatusInfo = std::tuple<crane::grpc::TaskStatus, uint32_t, int64_t, int64_t>; // or a struct with named fields if you preferand then:
- // Fetch job status (state, exit_code, time_end, time_start) - std::unordered_map<task_id_t, std::tuple<crane::grpc::TaskStatus, uint32_t, - int64_t, int64_t>> + // Fetch job status (state, exit_code, time_end, time_start) + std::unordered_map<task_id_t, JobStatusInfo> FetchJobStatus(const std::unordered_set<task_id_t>& job_ids);Also, the implementation stores and queries
task_idasint32_t. Please double‑check thattask_id_tcannot exceed 32‑bit range; otherwise, you may want to store it as 64‑bit in Mongo to avoid silent truncation.src/CraneCtld/CtldPublicDefs.cpp (4)
21-23: Including TaskScheduler here is reasonablePulling in
TaskScheduler.hto support the newg_task_scheduler->AddDependencyEvent(...)calls fromTaskInCtldis appropriate. If include depth ever becomes an issue, those calls could be wrapped behind a small helper inTaskSchedulerto reduce coupling, but it’s not required now.
992-1007: Make SetDependency robust to repeated calls
SetDependencyinitializesis_orandready_timebased on the incomingDependenciesand then just appends/overwrites entries independencies.deps. If this method is ever called more than once on the sameTaskInCtldinstance (e.g., recovery/requeue paths), stale entries could remain in the map.You can defensively reset the struct:
void TaskInCtld::SetDependency(const crane::grpc::Dependencies& grpc_deps) { - if (grpc_deps.is_or()) { + dependencies.deps.clear(); + + if (grpc_deps.is_or()) { dependencies.is_or = true; dependencies.ready_time = absl::InfiniteFuture(); } else { dependencies.is_or = false; dependencies.ready_time = absl::InfinitePast(); } for (const auto& dep : grpc_deps.deps()) { dependencies.deps[dep.job_id()] = {dep.type(), dep.delay_seconds()}; } }This doesn’t change current behavior for first-time initialization but makes future reuse of
TaskInCtldsafer.
1020-1025: Consider clearing dependents after triggering dependency events
TriggerDependencyEventsiterates overdependents[dep_type]and callsAddDependencyEventfor each, but leaves the vector intact. If, for any reason, the same dependency event type is triggered multiple times for a job, dependents will receive redundant events.You may want to clear the list after firing:
void TaskInCtld::TriggerDependencyEvents( const crane::grpc::DependencyType& dep_type, absl::Time event_time) { - for (task_id_t dependent_id : dependents[dep_type]) { + auto& list = dependents[dep_type]; + for (task_id_t dependent_id : list) { g_task_scheduler->AddDependencyEvent(dependent_id, task_id, event_time); } + list.clear(); }This is not strictly required if the scheduler guarantees each event type is emitted at most once per job, but it makes the behavior more robust.
1181-1202: DependencyStatus serialization matches in-memory modelThe new block in
SetFieldsOfTaskInfo:
- emits
dependency_statusonly when there are deps or a non-InfinitePastready_time,- mirrors
dependencies.is_or,- serializes every remaining pending dependency, and
- encodes
ready_timeusing theready_time/infinite_future/infinite_pastoneof.This is consistent with how
DependenciesInJobandSetDependencyinitialize and updateready_time, and should give clients enough information to display both pending dependencies and their effective ready time.If you ever reuse a
TaskInfoinstance, consider clearingdependency_statusfirst to avoid duplicatependingentries, though current usage likely always constructs a fresh message.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (19)
docs/en/command/calloc.md(1 hunks)docs/en/command/cbatch.md(1 hunks)docs/en/command/crun.md(1 hunks)docs/en/reference/job_dependency.md(1 hunks)docs/en/reference/pending_reason.md(1 hunks)docs/zh/command/calloc.md(1 hunks)docs/zh/command/cbatch.md(1 hunks)docs/zh/command/crun.md(1 hunks)docs/zh/reference/job_dependency.md(1 hunks)docs/zh/reference/pending_reason.md(1 hunks)mkdocs.yaml(2 hunks)protos/PublicDefs.proto(4 hunks)src/CraneCtld/CtldPublicDefs.cpp(5 hunks)src/CraneCtld/CtldPublicDefs.h(4 hunks)src/CraneCtld/DbClient.cpp(1 hunks)src/CraneCtld/DbClient.h(1 hunks)src/CraneCtld/RpcService/CtldGrpcServer.cpp(4 hunks)src/CraneCtld/TaskScheduler.cpp(13 hunks)src/CraneCtld/TaskScheduler.h(5 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- docs/zh/command/calloc.md
- docs/en/command/crun.md
- docs/zh/reference/pending_reason.md
- src/CraneCtld/DbClient.cpp
- src/CraneCtld/CtldPublicDefs.h
- docs/en/command/cbatch.md
🧰 Additional context used
🧠 Learnings (14)
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/TaskScheduler.cppsrc/CraneCtld/TaskScheduler.h
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/TaskScheduler.cppsrc/CraneCtld/TaskScheduler.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/TaskScheduler.cppsrc/CraneCtld/TaskScheduler.h
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/TaskScheduler.cppsrc/CraneCtld/TaskScheduler.h
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/TaskScheduler.cppsrc/CraneCtld/TaskScheduler.h
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/CraneCtld/CtldPublicDefs.cpp
📚 Learning: 2025-12-08T08:11:40.323Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.323Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.
Applied to files:
src/CraneCtld/CtldPublicDefs.cppsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/CraneCtld/TaskScheduler.h
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/CraneCtld/TaskScheduler.h
📚 Learning: 2025-04-02T09:30:13.014Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/JobManager.cpp:141-149
Timestamp: 2025-04-02T09:30:13.014Z
Learning: In JobManager, if a uid exists in m_uid_to_job_ids_map_, its corresponding task_ids set is guaranteed to be non-empty due to the invariant maintained in AllocJobs and FreeJobs methods.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.
Applied to files:
src/CraneCtld/TaskScheduler.h
🧬 Code graph analysis (2)
src/CraneCtld/DbClient.h (1)
src/CraneCtld/DbClient.cpp (2)
FetchJobStatus(498-548)FetchJobStatus(500-500)
src/CraneCtld/TaskScheduler.cpp (1)
src/CraneCtld/TaskScheduler.h (1)
AddDependencyEvent(881-887)
🪛 LanguageTool
docs/zh/reference/job_dependency.md
[uncategorized] ~69-~69: 数词与名词之间一般应存在量词,可能缺少量词。
Context: ...100 开始 并且 作业 101 成功完成。 #### OR 逻辑(任一条件满足) 使用 ? 分隔不同的依赖条件: ```bash afterok:...
(wa5)
🪛 markdownlint-cli2 (0.18.1)
docs/en/reference/job_dependency.md
27-27: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
164-164: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
docs/en/reference/pending_reason.md
16-16: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
31-31: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
docs/zh/reference/job_dependency.md
16-16: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
31-31: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (11)
docs/zh/command/cbatch.md (1)
72-72: Dependency option description looks goodThe
-d/--dependencysyntax and listed types are consistent with the dedicated Job Dependency reference; no changes needed here.mkdocs.yaml (1)
85-86: Nav entries and translations for Job Dependency/Pending Reason look correctNew reference pages and their Chinese nav translations are wired consistently with existing entries; no further changes needed.
Also applies to: 124-125
docs/zh/command/crun.md (1)
15-15: Dependency option documentation is consistentThe
-d/--dependencydescription matches the cbatch docs and the Job Dependency reference page, so the CLI behavior is documented consistently.docs/en/command/calloc.md (1)
37-37: Dependency option text aligns with reference documentationThe
-d, --dependencyoption syntax and supported types align with the Job Dependency reference and other command docs; no changes needed.protos/PublicDefs.proto (4)
135-162: Dependency proto definitions are consistent and minimalThe new
DependencyType,DependencyCondition,Dependencies, andDependencyStatusmessages form a clean, orthogonal model that matches the usage in CtldPublicDefs (is_or + per-dependency delay + a single ready_time/∞ flag). No structural issues from a proto/generation standpoint.
186-187: Confirm that field number 15 was previously unused in TaskToCtldUsing
Dependencies dependencies = 15;is fine as long as tag 15 wasn’t populated by an older field in already-deployed clusters. If it was, old clients/servers would misinterpret the wire format.If you’re unsure, please double-check the git history of
TaskToCtld(or generated descriptors in a running deployment) to ensure 15 was never used before this change.
482-486: TaskInfo.dependency_status wiring matches implementationAdding
optional DependencyStatus dependency_status = 29;lines up withTaskInCtld::SetFieldsOfTaskInfo, which only populates the field when there are dependencies or a meaningful ready_time. This is backward compatible (higher tag, optional) and won’t affect older clients ignoring the new field.
730-731: New error codes integrate cleanly
ERR_MISSING_DEPENDENCY = 84andERR_DB_INSERT_FAILED = 85are appended at the end ofErrCode, preserving existing numeric assignments. This keeps wire compatibility while making dependency-specific failures distinguishable.src/CraneCtld/RpcService/CtldGrpcServer.cpp (2)
300-313: Nested CraneExpected unwrapping in CforedStream is correctThe conversion from
CraneExpected<std::future<CraneExpected<task_id_t>>>tostd::expected<task_id_t, std::string>correctly:
- unwraps the future,
- propagates the inner
CraneErrCodeviaCraneErrStr,- and falls back to the outer error if submission itself failed.
This preserves previous semantics while aligning with the new return type.
406-415: SubmitBatchTask now surfaces per-task CraneErrCode from async pathThe updated logic unwraps the inner
CraneExpected<task_id_t>and:
- sets
ok=trueandtask_idon success,- otherwise sets
ok=falseand assignsresponse->codefromtask_result.error(),
falling back to the outerresult.error()if scheduling failed before the future was created. This looks consistent with the new scheduler API.src/CraneCtld/CtldPublicDefs.cpp (1)
1086-1087: Wiring TaskToCtld.dependencies into TaskInCtld is appropriateCalling
SetDependency(val.dependencies());fromSetFieldsByTaskToCtldensures the in-memory dependency state is always derived from the protobuf submission payload. This ties the proto and internal representation together cleanly.
| void DependenciesInJob::update(task_id_t job_id, absl::Time event_time) { | ||
| auto it = deps.find(job_id); | ||
| if (it == deps.end()) { | ||
| CRANE_ERROR("Dependency for job {} not found", job_id); | ||
| return; | ||
| } | ||
| const auto& [dep_type, delay_seconds] = it->second; | ||
|
|
||
| absl::Time dep_ready_time = event_time + absl::Seconds(delay_seconds); | ||
| if (is_or) { | ||
| ready_time = std::min(ready_time, dep_ready_time); | ||
| } else { | ||
| ready_time = std::max(ready_time, dep_ready_time); | ||
| } | ||
| deps.erase(it); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused dep_type in DependenciesInJob::update
dep_type from it->second is never used, which will trigger -Wunused-variable warnings:
const auto& [dep_type, delay_seconds] = it->second;Consider dropping it or marking it intentionally unused, e.g.:
- const auto& [dep_type, delay_seconds] = it->second;
+ const auto& [/*dep_type*/, delay_seconds] = it->second;or by using std::get<1>(it->second) instead.
🤖 Prompt for AI Agents
In src/CraneCtld/CtldPublicDefs.cpp around lines 122 to 137, the structured
binding captures dep_type which is never used and triggers -Wunused-variable;
fix by omitting or marking it unused — either replace the binding with
extracting only the second element (e.g. use std::get<1>(it->second) to obtain
delay_seconds) or keep the structured binding but mark the first element
intentionally unused (e.g. [[maybe_unused]] or a named underscore) so only
delay_seconds is used when computing dep_ready_time.
| void TaskInCtld::AddDependent(crane::grpc::DependencyType dep_type, | ||
| task_id_t dep_job_id) { | ||
| if (dep_type == crane::grpc::DependencyType::AFTER && | ||
| status == crane::grpc::TaskStatus::Running) { | ||
| // already satisfied | ||
| g_task_scheduler->AddDependencyEvent(dep_job_id, task_id, start_time); | ||
| } else { | ||
| dependents[dep_type].push_back(dep_job_id); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
# First, let's locate and read the AddDependent method with surrounding context
find . -name "CtldPublicDefs.cpp" -type fRepository: PKUHPC/CraneSched
Length of output: 95
🏁 Script executed:
# Read the file around the specified lines to understand the full context
cat -n src/CraneCtld/CtldPublicDefs.cpp | sed -n '1000,1030p'Repository: PKUHPC/CraneSched
Length of output: 1361
🏁 Script executed:
# Search for TaskStatus enum to understand possible status values
rg "enum.*TaskStatus" --type cpp -A 20Repository: PKUHPC/CraneSched
Length of output: 43
🏁 Script executed:
# Look for TriggerDependencyEvents to understand when dependencies are released
rg "TriggerDependencyEvents" --type cpp -B 2 -A 5Repository: PKUHPC/CraneSched
Length of output: 3966
🏁 Script executed:
# Check how start_time is initialized in TaskInCtld
rg "start_time" --type cpp | grep -i "init\|=" | head -20Repository: PKUHPC/CraneSched
Length of output: 1711
🏁 Script executed:
# Search for HasStarted or similar methods in TaskInCtld
rg "HasStarted\|start_time\s*!=\s*0\|start_time.*valid" --type cpp -B 2 -A 2Repository: PKUHPC/CraneSched
Length of output: 43
🏁 Script executed:
# Look for TaskStatus enum definition
find . -name "*.proto" -type f | head -10Repository: PKUHPC/CraneSched
Length of output: 178
🏁 Script executed:
# Search for TaskStatus enum in proto files
rg "enum.*TaskStatus" --type proto -A 15Repository: PKUHPC/CraneSched
Length of output: 90
🏁 Script executed:
# Look for TaskInCtld constructor to see how start_time is initialized
rg "TaskInCtld::" --type cpp | grep -i constructRepository: PKUHPC/CraneSched
Length of output: 43
🏁 Script executed:
# Read TaskInCtld class definition to understand initialization
cat -n src/CraneCtld/CtldPublicDefs.h | head -100Repository: PKUHPC/CraneSched
Length of output: 4192
🏁 Script executed:
# Search for start_time initialization in TaskInCtld
rg "start_time\s*=" --type cpp | grep -i "ctldpublic\|taskinctld" -A 2 -B 2Repository: PKUHPC/CraneSched
Length of output: 658
🏁 Script executed:
# Look for how status transitions happen during job lifecycle
rg "status\s*=" --type cpp | grep -i "running\|completed\|failed" | head -15Repository: PKUHPC/CraneSched
Length of output: 718
🏁 Script executed:
# Check the complete context around where TriggerDependencyEvents(AFTER) is called at job start
rg "TriggerDependencyEvents.*AFTER" --type cpp -B 10 -A 3Repository: PKUHPC/CraneSched
Length of output: 3571
🏁 Script executed:
# Search for TaskStatus in proto files (use text search)
rg "TaskStatus" protos/PublicDefs.proto -A 20Repository: PKUHPC/CraneSched
Length of output: 3394
🏁 Script executed:
# Look for TaskInCtld class declaration to find member initialization
rg "class TaskInCtld" --type cpp -A 80Repository: PKUHPC/CraneSched
Length of output: 43
🏁 Script executed:
# Search for where start_time is first initialized in TaskInCtld
rg "start_time.*=" src/CraneCtld/CtldPublicDefs.cpp | head -20Repository: PKUHPC/CraneSched
Length of output: 256
🏁 Script executed:
# Look for the full TaskInCtld member declarations
cat -n src/CraneCtld/CtldPublicDefs.h | sed -n '1,300p' | tail -200Repository: PKUHPC/CraneSched
Length of output: 6430
🏁 Script executed:
# Check if there's any code that re-processes already-finished jobs for dependencies
rg "AddDependent" --type cpp -B 5 -A 5Repository: PKUHPC/CraneSched
Length of output: 4529
🏁 Script executed:
# Look for TaskInCtld class definition and member variables
rg "class TaskInCtld" --type cpp -A 100 | head -120Repository: PKUHPC/CraneSched
Length of output: 43
🏁 Script executed:
# Search for start_time initialization or default value
rg "absl::Time.*start_time" src/CraneCtld/CtldPublicDefs.h -A 2 -B 2Repository: PKUHPC/CraneSched
Length of output: 1018
🏁 Script executed:
# Look for HasStarted or similar method in TaskInCtld
rg "HasStarted|has.*started" --type cpp -iRepository: PKUHPC/CraneSched
Length of output: 43
🏁 Script executed:
# Find the constructor or initialization of TaskInCtld to see default start_time
rg "TaskInCtld::" src/CraneCtld/CtldPublicDefs.cpp | head -20Repository: PKUHPC/CraneSched
Length of output: 1078
🏁 Script executed:
# Look at the else clause handling missing dependees
rg "missing_dependee_ids" src/CraneCtld/TaskScheduler.cpp -B 10 -A 10Repository: PKUHPC/CraneSched
Length of output: 1997
🏁 Script executed:
# Check how missing dependencies are handled
rg "missing_deps\|missing_dependee" src/CraneCtld/TaskScheduler.cpp -B 5 -A 5Repository: PKUHPC/CraneSched
Length of output: 43
🏁 Script executed:
# Get more context on the MongoDB fetch path for handling finished dependencies
rg "missing_dependee_ids" src/CraneCtld/TaskScheduler.cpp -A 40 | head -70Repository: PKUHPC/CraneSched
Length of output: 2864
🏁 Script executed:
# Look at the complete switch statement for AFTER handling
rg "case crane::grpc::DependencyType::AFTER" src/CraneCtld/TaskScheduler.cpp -A 20Repository: PKUHPC/CraneSched
Length of output: 1718
🏁 Script executed:
# Check if start_time has a default/initialization value
cat -n src/CraneCtld/CtldPublicDefs.cpp | grep -A 5 "m_start_time_"Repository: PKUHPC/CraneSched
Length of output: 910
🏁 Script executed:
# Look for where start_time gets set to understand its default value
rg "SetStartTime|m_start_time_\s*=" src/CraneCtld/CtldPublicDefs.cpp | head -10Repository: PKUHPC/CraneSched
Length of output: 400
🏁 Script executed:
# Look for TaskInCtld constructor to see if start_time has a default value
rg "TaskInCtld\(" src/CraneCtld/CtldPublicDefs.cpp -A 20 | head -40Repository: PKUHPC/CraneSched
Length of output: 43
🏁 Script executed:
# Check how status transitions are set during job lifecycle
rg "SetStatus.*Running\|SetStatus.*Completed" src/CraneCtld/TaskScheduler.cpp -B 2 -A 2 | head -40Repository: PKUHPC/CraneSched
Length of output: 43
🏁 Script executed:
# Verify when TriggerDependencyEvents(AFTER) is actually called
rg "TriggerDependencyEvents.*AFTER" src/CraneCtld/TaskScheduler.cpp -B 5 -A 2Repository: PKUHPC/CraneSched
Length of output: 1356
AFTER dependencies on already-finished jobs may never be released
TaskInCtld::AddDependent treats an AFTER dependency as "already satisfied" only when status == Running. If a dependent job is created after the dependee has finished (Completed/Failed/Cancelled), status will no longer be Running, so the dependent ID is stored in dependents[AFTER]. Since TriggerDependencyEvents(AFTER, ...) is called at job start and completion (both before the dependent is added), the dependency may remain permanently unsatisfied.
This is realistic: users can submit --dependency=after:<finished_job_id> before the finished job is archived to the database.
The existing code in the MongoDB fetch path (for archived jobs) correctly handles this case by checking if (time_start != 0) instead of status, indicating that the correct fix is to check whether the job has ever started, not whether it is currently running. Consider applying the same logic in AddDependent:
void TaskInCtld::AddDependent(crane::grpc::DependencyType dep_type,
task_id_t dep_job_id) {
- if (dep_type == crane::grpc::DependencyType::AFTER &&
- status == crane::grpc::TaskStatus::Running) {
- // already satisfied
- g_task_scheduler->AddDependencyEvent(dep_job_id, task_id, start_time);
- } else {
- dependents[dep_type].push_back(dep_job_id);
- }
+ if (dep_type == crane::grpc::DependencyType::AFTER &&
+ start_time != absl::InfiniteFuture()) {
+ // already satisfied (job has started at some point)
+ g_task_scheduler->AddDependencyEvent(dep_job_id, task_id, start_time);
+ } else {
+ dependents[dep_type].push_back(dep_job_id);
+ }
}🤖 Prompt for AI Agents
In src/CraneCtld/CtldPublicDefs.cpp around lines 1009 to 1017, the code treats
AFTER dependencies as satisfied only when status == Running, which misses
dependents added after a job has already started and finished; change the
condition to check whether the job has ever started (start_time != 0) and, if
so, call g_task_scheduler->AddDependencyEvent(dep_job_id, task_id, start_time)
immediately; otherwise, append the dep_job_id to dependents[dep_type] as before.
3ef3692 to
a4e0183
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)
507-534: Fix alignment inconsistency betweentask_id_listandcode_listinSubmitBatchTasksThe current implementation treats
code_listinconsistently:
- Early error case (line 502–503): Adds entries to both
task_id_listandcode_list- Success case (line 525): Adds to
task_id_listonly- Error cases in loop (lines 528, 532): Adds 0 to
task_id_listand error code tocode_listThis creates misaligned lists. Either:
- Make lists always 1:1 aligned by appending
SUCCESS(or a success code) whenever a task succeeds, or- Keep
code_listsparse (errors only) and handle the early error case consistently with the loopThe
ErrCode::SUCCESS = 0enum value suggests alignment may be intended.
♻️ Duplicate comments (1)
src/CraneCtld/DbClient.cpp (1)
693-743: GuardFetchJobStatusagainst step-only documents by filtering onhas_job_infoThe query currently matches any document with a
task_idin the$inlist, including step-only documents created byInsertStepsthat lackstate/exit_code/time_start/time_end. Accessing those fields will throw, hit the catch, and you’ll return a map indistinguishable from “no such jobs”.To align with
FetchJobRecords()and avoid silently dropping those entries, reuse thehas_job_infomarker so only full job docs are selected:- try { - document filter; - filter.append(kvp("task_id", [&job_ids](sub_document task_id_doc) { + try { + document filter; + // Only query documents that contain full job information. + filter.append(kvp("has_job_info", true)); + filter.append(kvp("task_id", [&job_ids](sub_document task_id_doc) { array task_id_array; for (const auto& job_id : job_ids) { task_id_array.append(static_cast<std::int32_t>(job_id)); } task_id_doc.append(kvp("$in", task_id_array)); }));This keeps the projection and field reads unchanged but avoids exceptions and makes “no result” genuinely mean “no full job record”.
🧹 Nitpick comments (10)
docs/zh/command/calloc.md (1)
36-41: Clarify--dependencysyntax or reference job_dependency docThe brief grammar here is slightly simplified compared to
docs/zh/reference/job_dependency.md(e.g., delay on subsequent job_ids). Not wrong, but it may confuse users when they see more complete forms elsewhere.Consider either:
- Explicitly stating this is only a simplified overview, or
- Linking to “作业依赖 (Job Dependency)” for full syntax and examples instead of repeating partial grammar here.
docs/zh/command/crun.md (1)
13-21: Align--dependencyhelp text with central Job Dependency referenceThe inline grammar description is correct but abbreviated relative to
docs/zh/reference/job_dependency.md(which also covers delay placement, AND/OR semantics, error cases, etc.).To avoid divergence over time, consider trimming this to a short description and pointing readers to the “作业依赖 (Job Dependency)” reference page for full syntax and examples.
src/CraneCtld/CtldPublicDefs.h (2)
429-445: Dependency state modelling looks coherent; consider edge‑case semantics doc
DependenciesInJob::is_met/is_failedcombined withupdateandready_timegive clear AND/OR semantics and a way to mark “never satisfied” viaabsl::InfiniteFuture(). This matches howTaskSchedulernow usesInfiniteFuturefor failed/missing dependencies.Two follow‑ups I recommend:
- The
depsmap is currently public and is accessed directly fromTaskScheduler. Encapsulating it (e.g., exposing aForEachDepor getter returningconst&only) would make future invariants aroundready_timeandis_orsafer to maintain.- The interaction between “failed dependency” and
ready_time == InfiniteFuture()is subtle. A short comment nearis_failed()orupdate()clarifying thatInfiniteFuturemeans “dependency can never be satisfied” (and thus leads toDependencyNeverSatisfied) would help future maintainers reason about the design.
762-776: TaskInCtld dependency fields fit the designThe per‑type
dependents[DependencyType_ARRAYSIZE]and the embeddedDependenciesInJob dependencies;provide exactly whatTaskSchedulerneeds (both outgoing deps and reverse edges). Layout (in [2]/[3] persisted section) is consistent with other persisted runtime fields.No functional issues spotted here. You may later want a brief comment explaining that
dependentsis not persisted (only in‑RAM wiring), whiledependenciesis persisted viaTaskToCtld/ DB, but that’s optional.src/CraneCtld/TaskScheduler.cpp (5)
398-487: Startup dependency reconstruction is sound; consider logging for ambiguous AFTER casesThe recovery block that rebuilds dependents and seeds dependency events from MongoDB does the right things:
- Rewires
dependentsacross pending/running tasks.- Collects missing dependees into
missing_dependee_idsand bulk‑fetches their final(status, exit_code, time_end, time_start).- Emits per‑type events with sensible fallbacks:
AFTER: preferstime_start, falls back totime_endforCancelled, logs if neither exists.AFTER_ANY/AFTER_OK/AFTER_NOT_OK: usetime_endwith guards on status/exit_code.- Missing DB row ⇒ warn and leave
event_timeasInfiniteFuture(so the dependent will eventually be markedDependencyNeverSatisfied).Two minor improvements you might consider:
- For
AFTERon a completed (non‑cancelled) job withtime_start == 0buttime_end != 0, current code logs an error and never setsevent_time, effectively treating it as “never satisfied”. If this is expected, a more explicit log message (“treating dependency as unsatisfiable”) would be clearer; otherwise, falling back totime_endmight be more user‑friendly.- You already batch DB access per
missing_dependee_ids; good. If dependencies grow large, a brief debug trace summarizing how many jobs ended up withInfiniteFuturetimestamps could help debugging “DependencyNeverSatisfied” states.Overall logic is correct; suggestions are optional.
1678-1745: Two‑layer CraneExpected + future return type is a bit heavy but consistent
SubmitTaskToSchedulernow returnsCraneExpected<std::future<CraneExpected<task_id_t>>>, so:
- Outer
CraneExpectedcaptures synchronous validation failures (UID, account, partition, QOS, basic validity).- Inner
CraneExpected(inside the future) reports asynchronous persistence/queueing failures or delivers the allocatedtask_id.This separation is logically clean and keeps call‑sites explicit about what failed when. It is a bit verbose for callers but consistent with your error‑handling style; no functional issues.
You might eventually want a thin helper to unwrap this pattern at the RPC layer, but not required in this PR.
2882-2940: Dependency wiring on submission is mostly correct; edge cases handled; minor suggestionsThe new block in
CleanSubmitQueueCb_that wires dependents at submission time:
- For each accepted task:
- Iterates its
job->Dependencies().deps.- If dependee is in
m_pending_task_map_, callsAddDependent(dep_type, id).- Else pushes
(dep_type, dep_job_id)intorunning_deps.- Then under
m_running_task_map_mtx_:
- Adds dependents for any dependees found in
m_running_task_map_.- Flags
missing_depsif any dependency is neither pending nor running.- On
missing_deps:
- Logs a warning.
- Frees QoS resources for the new job.
- Fulfills its promise with
ERR_MISSING_DEPENDENCY.- Records
(task_id, TaskDbId)intasks_to_purge.- After the loop:
- Updates
m_pending_map_cached_size_and callsPurgeEndedTasks(tasks_to_purge).This is a solid design:
- Prevents dangling dependencies on non‑existent jobs.
- Cleans up embedded DB entries for rejected submissions.
- Avoids holding the running‑map lock longer than needed.
Two small suggestions:
- The name
ERR_BEYOND_TASK_IDused later for capacity rejection is slightly cryptic; a more explicit code likeERR_PENDING_QUEUE_FULLwould be clearer (if you’re free to add new error codes).- You might want to log the offending dependency IDs when a job is rejected for missing dependencies to help users debug typos in
--dependencyjob IDs.Functionally, this looks good.
2951-2956: Capacity‑rejection now returns structured errors correctlyFor tasks rejected because the pending queue is beyond capacity:
- QoS resources are freed.
- The promise is set to
std::unexpected(CraneErrCode::ERR_BEYOND_TASK_ID).This matches the new CraneExpected usage and prevents resource leaks. Only nit is the error code naming as noted above; semantics are correct.
3135-3158: End‑time normalization and dependency events on job completion are well‑designedOn finalization of a job in
CleanTaskStatusChangeQueueCb_you now:
- Derive
end_timefrom the reported timestamp.- Clamp it to
StartTime + time_limit+ 5 seconds tolerance to handle clock skew / delayed status.- Call
SetEndTime(end_time).- Emit dependency events:
AFTER_ANYatend_time.AFTER_OKatend_timeifexit_code == 0, else atInfiniteFuture().AFTER_NOT_OKatend_timeifexit_code != 0, else atInfiniteFuture().This integrates cleanly with
DependenciesInJob:
- Satisfied dependencies get a real timestamp.
- Unsatisfied ones get
InfiniteFuture, leading toDependencyNeverSatisfiedif/when all deps resolve that way.The only minor concern is relying solely on
exit_code(ignoringstatus) for OK vs NOT_OK; if you ever introduce special non‑0 exit codes that should still count as “OK”, you’ll need to adjust this mapping. For current semantics, it’s fine.src/CraneCtld/TaskScheduler.h (1)
782-783: SubmitTaskToScheduler’s return type is explicit and consistentDeclaring
CraneExpected<std::future<CraneExpected<task_id_t>>> SubmitTaskToScheduler(...)in the header matches the implementation in the.cpp. The signature is admittedly a bit nested, but it cleanly separates synchronous validation from asynchronous persistence/queuing failures.If you find this pattern leaks into many call‑sites, a small wrapper type (e.g.,
using SubmitTaskResult = CraneExpected<std::future<CraneExpected<task_id_t>>>;) could improve readability, but not required.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (19)
docs/en/command/calloc.md(1 hunks)docs/en/command/cbatch.md(1 hunks)docs/en/command/crun.md(1 hunks)docs/en/reference/job_dependency.md(1 hunks)docs/en/reference/pending_reason.md(1 hunks)docs/zh/command/calloc.md(1 hunks)docs/zh/command/cbatch.md(1 hunks)docs/zh/command/crun.md(1 hunks)docs/zh/reference/job_dependency.md(1 hunks)docs/zh/reference/pending_reason.md(1 hunks)mkdocs.yaml(2 hunks)protos/PublicDefs.proto(4 hunks)src/CraneCtld/CtldPublicDefs.cpp(5 hunks)src/CraneCtld/CtldPublicDefs.h(4 hunks)src/CraneCtld/DbClient.cpp(1 hunks)src/CraneCtld/DbClient.h(1 hunks)src/CraneCtld/RpcService/CtldGrpcServer.cpp(4 hunks)src/CraneCtld/TaskScheduler.cpp(13 hunks)src/CraneCtld/TaskScheduler.h(5 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- docs/en/command/crun.md
- docs/zh/command/cbatch.md
- docs/en/command/cbatch.md
- docs/en/command/calloc.md
- src/CraneCtld/CtldPublicDefs.cpp
- src/CraneCtld/DbClient.h
- mkdocs.yaml
- docs/en/reference/job_dependency.md
🧰 Additional context used
🧠 Learnings (18)
📓 Common learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/TaskScheduler.cppsrc/CraneCtld/TaskScheduler.h
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/TaskScheduler.cppsrc/CraneCtld/TaskScheduler.h
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/TaskScheduler.cppsrc/CraneCtld/TaskScheduler.h
📚 Learning: 2025-12-08T08:11:40.332Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.332Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/CtldPublicDefs.hsrc/CraneCtld/TaskScheduler.cppsrc/CraneCtld/TaskScheduler.h
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/CtldPublicDefs.hsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-23T02:32:43.952Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is handled/updated separately before calling SetAllocatedRes, so it does not need to be updated again within the method itself.
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-23T02:32:43.952Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is updated before calling SetAllocatedRes, so it doesn't need to be updated again within the method itself.
Applied to files:
src/CraneCtld/CtldPublicDefs.h
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/CraneCtld/TaskScheduler.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/CraneCtld/TaskScheduler.h
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/CraneCtld/TaskScheduler.h
📚 Learning: 2025-04-02T09:30:13.014Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/JobManager.cpp:141-149
Timestamp: 2025-04-02T09:30:13.014Z
Learning: In JobManager, if a uid exists in m_uid_to_job_ids_map_, its corresponding task_ids set is guaranteed to be non-empty due to the invariant maintained in AllocJobs and FreeJobs methods.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/CraneCtld/DbClient.cpp
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/CraneCtld/TaskScheduler.h
📚 Learning: 2025-04-02T10:11:33.562Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/CgroupManager.cpp:685-685
Timestamp: 2025-04-02T10:11:33.562Z
Learning: In the CgroupManager's GetJobBpfMapCgroupsV2 method, the developer has confirmed that cg_ino_job_id_map will always contain the key->cgroup_id element, making the CRANE_ASSERT check appropriate rather than requiring additional error handling.
Applied to files:
src/CraneCtld/DbClient.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/CraneCtld/DbClient.cpp
🧬 Code graph analysis (4)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)
src/Craned/Core/SupervisorKeeper.cpp (1)
pair(123-183)
src/CraneCtld/CtldPublicDefs.h (1)
src/CraneCtld/CtldPublicDefs.cpp (10)
update(122-137)update(122-122)SetDependency(1172-1183)SetDependency(1172-1172)UpdateDependency(1185-1187)UpdateDependency(1185-1185)AddDependent(1189-1198)AddDependent(1189-1190)TriggerDependencyEvents(1200-1205)TriggerDependencyEvents(1200-1201)
src/CraneCtld/TaskScheduler.cpp (1)
src/CraneCtld/TaskScheduler.h (1)
AddDependencyEvent(911-917)
src/CraneCtld/TaskScheduler.h (1)
src/CraneCtld/TaskScheduler.cpp (4)
SubmitTaskAsync(1523-1532)SubmitTaskAsync(1523-1524)SubmitTaskToScheduler(1679-1746)SubmitTaskToScheduler(1680-1680)
🪛 LanguageTool
docs/zh/reference/job_dependency.md
[uncategorized] ~69-~69: 数词与名词之间一般应存在量词,可能缺少量词。
Context: ...100 开始 并且 作业 101 成功完成。 #### OR 逻辑(任一条件满足) 使用 ? 分隔不同的依赖条件: ```bash afterok:...
(wa5)
🪛 markdownlint-cli2 (0.18.1)
docs/zh/reference/pending_reason.md
16-16: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
31-31: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
docs/zh/reference/job_dependency.md
27-27: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
164-164: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
docs/en/reference/pending_reason.md
16-16: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
31-31: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (14)
protos/PublicDefs.proto (2)
135-163: Dependency proto additions look structurally sound and wire‑compatibleThe new dependency modeling (DependencyType/Condition/Dependencies/DependencyStatus), plus wiring into
TaskToCtld.dependencies = 15andTaskInfo.dependency_status = 29, is consistent and backwards compatible:
- Field numbers 15 and 29 were previously unused/reserved, and the reserved comment is adjusted accordingly.
Dependencies.is_ormatches the documented,(AND) vs?(OR) semantics in the new job dependency docs.DependencyStatus’sready_time_typeoneof withready_time/infinite_future/infinite_pastis a reasonable encoding; the bool arms are effectively enum sentinels inside the oneof.No changes needed here from a proto/schema perspective.
Also applies to: 186-186, 448-485
724-730: New ErrCode values are appended safely and match usageAdding
ERR_MISSING_DEPENDENCY = 85;andERR_DB_INSERT_FAILED = 86;at the end of theErrCodeenum preserves existing numeric meanings and gives clear, specific codes for dependency resolution and DB failures.Assuming these are wired through to callers (as seen in scheduler/DB changes), this is a good refinement of error reporting.
src/CraneCtld/RpcService/CtldGrpcServer.cpp (2)
319-333: Correctly unwraps nestedCraneExpectedresult fromSubmitTaskToSchedulerin CforedStreamThe new logic:
- Checks the outer
CraneExpected<std::future<CraneExpected<task_id_t>>>,- Calls
.get()only whenhas_value(),- Then inspects the inner
CraneExpected<task_id_t>and converts it tostd::expected<std::pair<job_id_t, step_id_t>, std::string>usingCraneErrStr,is consistent with the updated TaskScheduler API and preserves the previous success/error behavior for interactive tasks.
471-484: SubmitBatchTask correctly propagates nested scheduler errorsFor single batch submission:
- Outer failure (
result.has_value() == false) setsok=falseandcode=result.error().- Inner failure (
task_result.has_value() == false) setsok=falseandcode=task_result.error().- Only on full success is
ok=trueandtask_idset.This matches the new
CraneExpected<std::future<CraneExpected<task_id_t>>>signature and cleanly separates early-submit vs execution‑time errors.src/CraneCtld/CtldPublicDefs.h (1)
1018-1023: TaskInCtld dependency API is minimal and appropriate
SetDependency,UpdateDependency,Dependencies() const,AddDependent, andTriggerDependencyEventsgive a clean surface for the scheduler and recovery paths. The implementation in CtldPublicDefs.cpp matches the intended semantics (AND/OR, eagerAFTERhandling, and event enqueuing).No changes needed from my side.
src/CraneCtld/TaskScheduler.cpp (5)
894-910: Dependency event draining in ScheduleThread_ is correct and cheapThe scheduler now drains
m_dependency_event_queue_once per loop while holdingm_pending_task_map_mtx_and invokesUpdateDependencyon matching pending jobs. Usingsize_approx()+try_dequeue_bulkavoids contention, and limiting updates to pending tasks is logically correct (dependents must be pending to care).No concurrency or correctness issues spotted.
923-930: Pending‑reason for dependency gating is well‑definedUsing
Dependencies().is_met(now)to decide schedulability, and setting:
"DependencyNeverSatisfied"whenis_failed()is true, else"Dependency"gives users clear feedback in
pending_reason. This lines up cleanly with theInfiniteFuturenotion of “unsatisfiable dependency”.Looks good.
1336-1337: Triggering AFTER events at job start is correctEmitting
DependencyType::AFTERwithjob->StartTime()when a job entersConfiguring/running ensures dependent jobs of typeAFTERsee the earliest start timestamp. This matches the intended semantics for “run after job starts”.No further changes needed.
1523-1532: SubmitTaskAsync’s new CraneExpected wrapping is consistentSwitching
SubmitTaskAsyncto returnstd::future<CraneExpected<task_id_t>>and propagating promises of the same type aligns withSubmitTaskToSchedulerand the broader CraneExpected adoption. The lifetime and queueing behavior remain the same.All good here.
2823-2859: Submit queue element type change and CraneExpected propagation are fine
SubmitQueueElemis nowstd::pair<std::unique_ptr<TaskInCtld>, std::promise<CraneExpected<task_id_t>>>, and accepted tasks have their promises fulfilled with either:
std::unexpected(ERR_DB_INSERT_FAILED)when appending to embedded DB fails, or- The allocated
task_idlater in the loop.The reverse‑order walking and
SetStatus(Pending)logic are unchanged, and this fits with the new submit API.Looks correct.
src/CraneCtld/TaskScheduler.h (4)
764-768: SubmitTaskAsync documentation matches CraneExpected semanticsThe comment clarifies that the returned future either contains an error code or a newly allocated task ID. This is exactly what the new
std::future<CraneExpected<task_id_t>>type does; no mismatch between declaration, comment, and implementation.
910-917: AddDependencyEvent inline helper aligns with failure semanticsThe inline
AddDependencyEventthat enqueues aDependencyEvent{dependent_job_id, dependee_job_id, timestamp}is straightforward and used consistently. The comment explicitly states that failed dependencies should useabsl::InfiniteFuture(), matching howDependenciesInJob::is_failedis implemented.No issues.
1070-1073: Submit queue type updated correctly to CraneExpected
m_submit_task_queue_is nowConcurrentQueue<std::pair<std::unique_ptr<TaskInCtld>, std::promise<CraneExpected<task_id_t>>>>, matching both theSubmitTaskAsyncsignature andCleanSubmitQueueCb_implementation.Thread safety and ownership semantics remain unchanged.
1125-1132: DependencyEvent struct and queue integrate cleanly with scheduler loopDefining
DependencyEvent { dependent_job_id, dependee_job_id, event_time }and backing it withConcurrentQueue<DependencyEvent> m_dependency_event_queue_provides a simple, lock‑free conduit between dependency producers (job lifecycle, DB recovery) and the scheduler thread.Given:
- Producers only ever enqueue.
- Consumer drains in
ScheduleThread_underm_pending_task_map_mtx_.this design avoids contention and keeps ordering “good enough” for dependency semantics (exact total order is not required).
Looks solid.
| Example output: | ||
| ``` | ||
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) | ||
| 101 CPU job1 user1 PD 0:00 2 (Priority) | ||
| 102 CPU job2 user1 PD 0:00 4 (Resource) | ||
| 103 GPU job3 user2 PD 0:00 1 (Dependency) | ||
| 104 CPU job4 user1 PD 0:00 2 (Held) | ||
| ``` | ||
|
|
||
| ### Using ccontrol show job | ||
|
|
||
| ```bash | ||
| ccontrol show job 101 | ||
| ``` | ||
|
|
||
| Example output: | ||
| ``` | ||
| JobId=101 | ||
| ... | ||
| State=PENDING | ||
| Reason=Priority | ||
| ``` | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add fenced-code languages for example outputs (markdownlint MD040)
The two example output blocks currently use bare code fences, which triggers MD040 and is inconsistent with the rest of the page.
You can fix this by tagging them as plain text, e.g.:
-```
+```text
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
...@@
- +text
JobId=101
...
State=PENDING
Reason=Priority
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
16-16: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
31-31: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In docs/en/reference/pending_reason.md around lines 15 to 37, the example output
code fences are missing language tags which triggers markdownlint MD040; update
both fenced code blocks to use a language tag (use "text") so they become
```text ... ``` for the JOBID table block and ```text ... ``` for the ccontrol
show job block, leaving the contents unchanged.
| ``` | ||
| <type>:<job_id>[+<delay>][:<job_id>[+<delay>]]... | ||
| ``` | ||
|
|
||
| ### 依赖类型 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tag generic fenced blocks with a language to satisfy markdownlint
markdownlint (MD040) flags the syntax and status example blocks that use bare code fences. The rest of the doc already uses bash where appropriate.
Recommend tagging these as plain text:
-```
+```text
<type>:<job_id>[+<delay>][:<job_id>[+<delay>]]...and:
-```
+```text
JobId=105
...
Dependency=PendingDependencies=afterok:100+01:00:00 Status=WaitForAllThis keeps semantics unchanged but silences MD040 and improves consistency.
Also applies to: 164-168
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
27-27: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In docs/zh/reference/job_dependency.md around lines 27 to 31 (and similarly
update the fenced blocks at lines 164 to 168), the code fences are untagged
which trips markdownlint MD040; update those triple-backtick fenced blocks to
include a language tag (use text) so they read ```text ... ```; apply the same
change to both the syntax example and the status example at the other location
to keep consistency and silence the linter.
| 输出示例: | ||
| ``` | ||
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) | ||
| 101 CPU job1 user1 PD 0:00 2 (Priority) | ||
| 102 CPU job2 user1 PD 0:00 4 (Resource) | ||
| 103 GPU job3 user2 PD 0:00 1 (Dependency) | ||
| 104 CPU job4 user1 PD 0:00 2 (Held) | ||
| ``` | ||
|
|
||
| ### 使用 ccontrol show job 查看 | ||
|
|
||
| ```bash | ||
| ccontrol show job 101 | ||
| ``` | ||
|
|
||
| 输出示例: | ||
| ``` | ||
| JobId=101 | ||
| ... | ||
| State=PENDING | ||
| Reason=Priority | ||
| ``` | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specify language for example output code fences (markdownlint MD040)
The cqueue and ccontrol output examples use bare fenced blocks, which markdownlint flags.
You can tag them as plain text:
-```
+```text
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
...and:
-```
+```text
JobId=101
...
State=PENDING
Reason=PriorityThis matches the English doc and keeps the content unchanged.
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
16-16: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
31-31: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In docs/zh/reference/pending_reason.md around lines 15 to 37, the example output
code fences are untagged and trigger markdownlint MD040; update both fenced
blocks (the cqueue and ccontrol examples) to use the plain-text language tag by
replacing the opening ``` with ```text so the examples are explicitly marked as
text and match the English doc.
| absl::Time end_time = absl::Now(); | ||
|
|
||
| if (!pending_task_ptr_vec.empty()) { | ||
| for (auto& task : pending_task_ptr_vec) { | ||
| task->SetStatus(crane::grpc::Cancelled); | ||
| task->SetEndTime(absl::Now()); | ||
| task->SetEndTime(end_time); | ||
| g_account_meta_container->FreeQosResource(*task); | ||
|
|
||
| task->TriggerDependencyEvents(crane::grpc::DependencyType::AFTER, | ||
| end_time); | ||
| task->TriggerDependencyEvents(crane::grpc::DependencyType::AFTER_ANY, | ||
| end_time); | ||
| task->TriggerDependencyEvents(crane::grpc::AFTER_OK, | ||
| absl::InfiniteFuture()); | ||
| task->TriggerDependencyEvents(crane::grpc::AFTER_NOT_OK, | ||
| absl::InfiniteFuture()); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cancel‑pending behavior for dependency events may surprise AFTER_NOT_OK users
For cancelled pending tasks you now:
- Set status
Cancelledandend_time = now. - Emit:
AFTERandAFTER_ANYatend_time.AFTER_OKandAFTER_NOT_OKatabsl::InfiniteFuture().
Given how DependenciesInJob interprets InfiniteFuture, any dependent job with AFTER_OK or AFTER_NOT_OK on a cancelled‑pending job will end up with DependencyNeverSatisfied and never run automatically.
If the intended semantics are “AFTER_NOT_OK includes cancelled jobs” (which is how many schedulers model it), then for cancelled jobs you likely want:
AFTER_OKatInfiniteFuture()AFTER_NOT_OKatend_timeand a non‑zero exit code.
If, instead, you explicitly want cancelled‑pending jobs to neither satisfy AFTER_OK nor AFTER_NOT_OK (i.e., dependents must be resubmitted), it would be good to:
- Document this in the dependency docs and CLI help, and
- Perhaps log a warning when a job with such a dependency transitions to
"DependencyNeverSatisfied"so users can see why.
I recommend at least clarifying/confirming the intended semantics; behavior is consistent with your current is_failed logic but might be surprising.
🤖 Prompt for AI Agents
In src/CraneCtld/TaskScheduler.cpp around lines 2733-2749, the current
cancel-path emits AFTER and AFTER_ANY at end_time but emits AFTER_OK and
AFTER_NOT_OK at absl::InfiniteFuture(), which causes dependents waiting on
AFTER_NOT_OK to be treated as "never satisfied". Change the cancel behavior so
that for cancelled pending tasks you: keep AFTER and AFTER_ANY at end_time, keep
AFTER_OK at absl::InfiniteFuture(), but emit AFTER_NOT_OK at end_time (not
InfiniteFuture()) and set a non-zero exit code on the task before triggering
AFTER_NOT_OK; ensure g_account_meta_container->FreeQosResource is still called,
and add a concise log message indicating the task was cancelled and will satisfy
AFTER_NOT_OK so dependency resolution is correct.
15b94a8 to
11d8672
Compare
#172, reimplement of #308
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
✏️ Tip: You can customize this high-level summary in your review settings.