-
Notifications
You must be signed in to change notification settings - Fork 32
Add job env #600
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Add job env #600
Conversation
📝 WalkthroughWalkthroughAdds protobuf fields for submit host/dir and total GPUs; enforces non-empty ClusterName in Ctld and Craned; exposes GPU-count API in public headers; populates additional job/step metadata and many per-node environment variables; tweaks memory-per-node string formatting. Changes
Sequence Diagram(s)sequenceDiagram
rect rgb(240,248,255)
participant User
participant Ctld
participant Craned
participant JobManager
participant CgroupMgr
note right of Ctld: Protobuf fields updated\n(TaskToCtld/StepToD)
end
User->>Ctld: submit job (includes submit_host/dir)
Ctld->>Ctld: validate ClusterName (exit if missing/empty)
Ctld->>Craned: send JobToD / StepToD (populate submit_hostname, submit_dir, total_gpus, account/qos/partition)
Craned->>JobManager: allocate nodes, compute per-node resources
JobManager->>JobManager: populate CRANE_* env vars (GPUS, MEM_PER_NODE, NTASKS_PER_NODE, etc.)
JobManager->>CgroupMgr: set cgroups (mem limits formatted as "NNM")
CgroupMgr-->>JobManager: ack
JobManager-->>Craned: launch step
Craned-->>Ctld: status updates
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro 📒 Files selected for processing (8)
🚧 Files skipped from review as they are similar to previous changes (6)
🧰 Additional context used🧠 Learnings (4)📚 Learning: 2025-05-09T01:56:53.142ZApplied to files:
📚 Learning: 2025-12-08T08:11:40.332ZApplied to files:
📚 Learning: 2025-06-30T08:43:44.470ZApplied to files:
📚 Learning: 2025-05-08T07:39:29.207ZApplied to files:
🧬 Code graph analysis (1)src/Craned/Core/JobManager.cpp (1)
🔇 Additional comments (4)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (9)
src/Craned/Core/Craned.cpp (1)
268-277: Clarify error messages and trim whitespace for ClusterName.Differentiate “missing” vs “empty” for better operator feedback, and strip surrounding whitespace before validation. This avoids rejecting a value like " cluster " and produces clearer logs.
Apply within this hunk:
- if (config["ClusterName"]) { - g_config.CraneClusterName = config["ClusterName"].as<std::string>(); - if (g_config.CraneClusterName.empty()) { - CRANE_ERROR("Cluster name is empty."); - std::exit(1); - } - } else { - CRANE_ERROR("Cluster name is empty."); - std::exit(1); - } + if (config["ClusterName"]) { + std::string cluster = config["ClusterName"].as<std::string>(); + g_config.CraneClusterName = + std::string(absl::StripAsciiWhitespace(cluster)); + if (g_config.CraneClusterName.empty()) { + CRANE_ERROR("ClusterName is present but empty after trimming."); + std::exit(1); + } + } else { + CRANE_ERROR("ClusterName is not configured."); + std::exit(1); + }src/Craned/Core/CgroupManager.cpp (1)
755-758: Potential backward-compat change: CRANE_MEM_PER_NODE now includes a unit suffix.Appending “M” changes the variable from a pure integer to a human-formatted string. If users or scripts parse it numerically, this will break them. Consider keeping the original numeric env and adding a new human-readable one.
Option A (preserve numeric; add human-readable):
- env_map.emplace( - "CRANE_MEM_PER_NODE", - std::format("{}M", - res_in_node.allocatable_res_in_node().memory_limit_bytes() / - (1024ULL * 1024ULL))); + const auto mem_mib = + res_in_node.allocatable_res_in_node().memory_limit_bytes() / + (1024ULL * 1024ULL); + env_map.emplace("CRANE_MEM_PER_NODE", std::to_string(mem_mib)); + env_map.emplace("CRANE_MEM_PER_NODE_M", std::format("{}M", mem_mib));Option B (if unit suffix is required, prefer unambiguous “MiB”):
- env_map.emplace( - "CRANE_MEM_PER_NODE", - std::format("{}M", + env_map.emplace( + "CRANE_MEM_PER_NODE", + std::format("{}MiB", res_in_node.allocatable_res_in_node().memory_limit_bytes() / (1024ULL * 1024ULL)));src/CraneCtld/CraneCtld.cpp (1)
126-136: Deduplicate ClusterName assignment and improve validation messages.Cluster name is read at Lines 97–99 and again here. Keep a single, validated assignment (here, after logger init) and distinguish “not configured” vs “empty”. Also trim whitespace.
Apply within this hunk:
- if (config["ClusterName"]) { - g_config.CraneClusterName = config["ClusterName"].as<std::string>(); - if (g_config.CraneClusterName.empty()) { - CRANE_ERROR("ClusterName is empty."); - std::exit(1); - } - } else { - CRANE_ERROR("ClusterName is empty."); - std::exit(1); - } + if (config["ClusterName"]) { + std::string cluster = config["ClusterName"].as<std::string>(); + g_config.CraneClusterName = + std::string(absl::StripAsciiWhitespace(cluster)); + if (g_config.CraneClusterName.empty()) { + CRANE_ERROR("ClusterName is present but empty after trimming."); + std::exit(1); + } + } else { + CRANE_ERROR("ClusterName is not configured."); + std::exit(1); + }Additionally remove the earlier pre-validation assignment to avoid double writes (outside this hunk):
// Remove Lines 97–99: // if (config["ClusterName"]) // g_config.CraneClusterName = config["ClusterName"].as<std::string>();src/CraneCtld/CtldPublicDefs.cpp (1)
358-360: Clamptotal_gpusto avoid narrowing and consider DRYing out TaskToD constructionWe confirmed that
ResourceView::GpuCount()returns auint64_t, while theTaskToD.total_gpusprotobuf field is auint32. Although overflowing 2³² GPUs is unlikely in practice, it’s safer to clamp rather than rely on an implicit narrowing conversion.• At
src/CraneCtld/CtldPublicDefs.cpp:359, wrap the assignment in a saturating cast:- task_to_d.set_total_gpus(this->allocated_res_view.GpuCount()); + { + const auto g = this->allocated_res_view.GpuCount(); + task_to_d.set_total_gpus(static_cast<uint32_t>( + std::min<uint64_t>(g, std::numeric_limits<uint32_t>::max()))); + }• The same pattern appears in
src/CraneCtld/RpcService/CranedKeeper.cpp:338; applying the clamp there maintains consistency.Optional refactor: extract a helper (e.g.
ClampToUint32(uint64_t)) or centralize theTaskToDbuilder logic so both CtldPublicDefs and CranedKeeper invoke the same code path, reducing duplication and drift over time.protos/PublicDefs.proto (1)
171-172: Add brief field docs and confirm presence semantics for submit_hostname.Adding
submit_hostnameis fine and backward-compatible. Please add a one-line comment describing whether this is the submission node’s short host or FQDN and whether empty string is a valid/expected value. If you need to distinguish “unset” from “empty”, consideroptional stringin proto3.src/Utilities/PublicHeader/PublicHeader.cpp (1)
709-721: Avoid iterating two different temporary views; simplify accumulation; include .Using
std::accumulate(std::views::values(type_map).begin(), std::views::values(type_map).end(), ...)constructs two different temporary views, which is brittle. Also, ensure<numeric>is included forstd::accumulateif you keep it. A simple loop is clearer and avoids view lifetime gotchas.Apply this diff to simplify the implementation:
-uint64_t ResourceView::GpuCount() const { - auto it = device_map.find(kResourceTypeGpu); - if (it == device_map.end()) return 0; - - const auto& [untyped_count, type_map] = it->second; - - uint64_t type_sum = std::accumulate(std::views::values(type_map).begin(), - std::views::values(type_map).end(), - static_cast<uint64_t>(0)); - - return untyped_count + type_sum; -} +uint64_t ResourceView::GpuCount() const { + auto it = device_map.find(kResourceTypeGpu); + if (it == device_map.end()) return 0; + + const auto& [untyped_count, type_map] = it->second; + uint64_t type_sum = 0; + for (const auto& [/*type*/, cnt] : type_map) type_sum += cnt; + return untyped_count + type_sum; +}Additionally, if
<numeric>is not already included transitively, add it near the top of this file:#include <numeric>src/Craned/Supervisor/TaskManager.cpp (3)
216-223: Duplicate env keys won’t override due to emplace semantics; use insert_or_assign.
CRANE_JOB_NODELISTandCRANE_EXCLUDESare already set in StepInstance::GetStepProcessEnv(). Usingemplacehere is a no-op if the keys exist. If the intent is “latest wins,” useinsert_or_assign.Apply this diff:
- env_map.emplace( + env_map.insert_or_assign( "CRANE_JOB_NODELIST", absl::StrJoin(this->m_parent_step_inst_->GetStep().allocated_nodes(), ";")); - env_map.emplace( + env_map.insert_or_assign( "CRANE_EXCLUDES", absl::StrJoin(this->m_parent_step_inst_->GetStep().excludes(), ";"));Optionally, do the same for the rest of the newly added CRANE_* keys in this block for consistent semantics.
237-237: Consider adding unit suffix to CRANE_MEM_PER_NODE for clarity.Elsewhere, memory values are often suffixed. If you want to expose MiB explicitly, append “M”.
Apply this diff:
- env_map.emplace("CRANE_MEM_PER_NODE", std::to_string(mem_in_node)); + env_map.emplace("CRANE_MEM_PER_NODE", fmt::format("{}M", mem_in_node));
234-253: Consistent formatting approach needed
We currently mix std::format and fmt::format across multiple files (e.g. AccountManager.cpp, CgroupManager.cpp), and TaskManager.cpp indirectly includes<format>via crane/PublicHeader.h. We should settle on one strategy:• If we standardize on std::format (C++20), confirm our minimum C++ standard/toolchain support it and leave these calls unchanged.
• If we prefer fmt::format (bundled via spdlog), perform a repo-wide migration: replace allstd::formatcalls withfmt::format, add the necessary#include <spdlog/fmt/bundled/format.h>where needed, and ensure build flags pick up the bundled fmt headers.For this PR, please choose one of the following:
Option A: Keep std::format (no change)
Option B: Switch TaskManager.cpp to fmt::format--- a/src/Craned/Supervisor/TaskManager.cpp +++ b/src/Craned/Supervisor/TaskManager.cpp @@ - env_map.emplace( - "CRANE_CPUS_PER_TASK", - std::format("{:.2f}", - this->m_parent_step_inst_->GetStep().cpus_per_task())); + env_map.emplace( + "CRANE_CPUS_PER_TASK", + fmt::format("{:.2f}", this->m_parent_step_inst_->GetStep().cpus_per_task())); @@ - env_map.emplace("CRANE_MEM_PER_CPU", std::format("{:.2f}", mem_per_cpu)); + env_map.emplace("CRANE_MEM_PER_CPU", fmt::format("{:.2f}", mem_per_cpu)); @@ - env_map.emplace("CRANE_CPUS_ON_NODE", std::format("{:.2f}", cpus_on_node)); + env_map.emplace("CRANE_CPUS_ON_NODE", fmt::format("{:.2f}", cpus_on_node));And at the top of
TaskManager.cpp(if choosing Option B):#include <spdlog/fmt/bundled/format.h>
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (13)
protos/PublicDefs.proto(2 hunks)protos/Supervisor.proto(1 hunks)src/CraneCtld/CraneCtld.cpp(1 hunks)src/CraneCtld/CtldPublicDefs.cpp(1 hunks)src/CraneCtld/RpcService/CranedKeeper.cpp(1 hunks)src/Craned/Core/CgroupManager.cpp(1 hunks)src/Craned/Core/Craned.cpp(1 hunks)src/Craned/Core/JobManager.cpp(1 hunks)src/Craned/Supervisor/Supervisor.cpp(1 hunks)src/Craned/Supervisor/SupervisorPublicDefs.h(1 hunks)src/Craned/Supervisor/TaskManager.cpp(2 hunks)src/Utilities/PublicHeader/PublicHeader.cpp(1 hunks)src/Utilities/PublicHeader/include/crane/PublicHeader.h(2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/CraneCtld/CraneCtld.cppsrc/Craned/Supervisor/Supervisor.cppsrc/Craned/Core/Craned.cppsrc/Craned/Supervisor/SupervisorPublicDefs.hsrc/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:08:03.273Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#399
File: src/Craned/Supervisor/Supervisor.cpp:68-69
Timestamp: 2025-05-25T04:08:03.273Z
Learning: In the Crane supervisor component (src/Craned/Supervisor/), the supervisor process communicates with Craned through STDOUT using protobuf messages. The supervisor must not send any information to STDOUT before sending the "ready" message to Craned, as this would interfere with the inter-process communication protocol. Therefore, adding logging statements that might write to STDOUT before the ready message is sent could break the communication.
Applied to files:
src/Craned/Supervisor/Supervisor.cpp
🧬 Code graph analysis (6)
src/CraneCtld/CraneCtld.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(148-148)
src/Craned/Supervisor/Supervisor.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(148-148)
src/Craned/Core/Craned.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(148-148)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)
src/Utilities/PublicHeader/PublicHeader.cpp (2)
GpuCount(709-720)GpuCount(709-709)
src/Craned/Core/JobManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(148-148)
src/Craned/Supervisor/TaskManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(148-148)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build
- GitHub Check: format
🔇 Additional comments (7)
src/Craned/Supervisor/SupervisorPublicDefs.h (1)
86-87: Add Config.CraneClusterName: looks good; please confirm init-before-use.Field addition is consistent with existing naming. Since tasks read CRANE_CLUSTER_NAME from g_config, please verify Supervisor::InitFromStdin sets CraneClusterName before any TaskManager environment population runs.
src/Utilities/PublicHeader/include/crane/PublicHeader.h (2)
91-92: Introduce kResourceTypeGpu constant: OK.Constant name and value align with usage in device_map. No concerns.
488-489: Add ResourceView::GpuCount(): OK; ensure headers in .cpp cover /.API surface looks consistent with CpuCount/MemoryBytes. Confirm PublicHeader.cpp includes the necessary headers for std::accumulate and std::views::values.
src/Craned/Core/JobManager.cpp (1)
477-477: Propagate cluster name to Supervisor – validation already in placeThe call to
init_req.set_crane_cluster_name(g_config.CraneClusterName);in JobManager.cpp correctly propagates the cluster name. Startup checks for a non-empty CraneClusterName already exist in both components:
- Craned (src/Craned/Core/Craned.cpp:269–273) logs an error and exits if
g_config.CraneClusterNameis empty.- CraneCtld (src/CraneCtld/CraneCtld.cpp:127–131) does the same validation in its startup sequence.
No additional checks are needed here.
src/Craned/Supervisor/Supervisor.cpp (1)
81-81: LGTM: Supervisor reads crane_cluster_name during init (pre-logger, pre-threads)The assignment is early in InitFromStdin and does not write to STDOUT before SupervisorReady, honoring the supervisor IPC contract. No issues.
protos/Supervisor.proto (1)
61-76: Confirmed C++ protobuf stubs include the new field
- In
src/Craned/Supervisor/Supervisor.pb.h, you’ll find the generated constant for field 12 (kCraneClusterNameFieldNumber = 12).- In
src/Craned/Supervisor/Supervisor.pb.cc, the accessor and serialization logic forcrane_cluster_name()is present (e.g. line 81) as expected.No further C++ stub files remain to update. If your service publishes additional language bindings (Go, Python, Java, etc.), please regenerate and commit those stubs as well to prevent runtime/schema mismatches.
src/CraneCtld/RpcService/CranedKeeper.cpp (1)
337-339: I've provided a script to locate and inspect the protobuf file definingTaskToDand itstotal_gpusfield so we can confirm its type. Let me know the results, and then I can finalize the guidance on saturating the cast and/or refactoring.
|
这个能不能再推进一下 |
5c57821 to
80d9630
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
protos/PublicDefs.proto(2 hunks)src/CraneCtld/CraneCtld.cpp(1 hunks)src/CraneCtld/CtldPublicDefs.cpp(2 hunks)src/Craned/Common/CgroupManager.cpp(1 hunks)src/Craned/Core/Craned.cpp(1 hunks)src/Craned/Core/JobManager.cpp(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- src/Craned/Common/CgroupManager.cpp
- src/CraneCtld/CtldPublicDefs.cpp
- protos/PublicDefs.proto
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/CraneCtld/CraneCtld.cppsrc/Craned/Core/Craned.cppsrc/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/Craned/Core/JobManager.cpp
🧬 Code graph analysis (3)
src/CraneCtld/CraneCtld.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(156-156)
src/Craned/Core/Craned.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(156-156)
src/Craned/Core/JobManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(156-156)
🔇 Additional comments (4)
src/CraneCtld/CraneCtld.cpp (1)
127-136: LGTM! Proper cluster name validation added.The validation logic correctly ensures ClusterName is present and non-empty before proceeding with configuration. This aligns with the corresponding validation in src/Craned/Core/Craned.cpp.
src/Craned/Core/Craned.cpp (1)
376-385: LGTM! Consistent cluster name validation.The validation logic mirrors the implementation in src/CraneCtld/CraneCtld.cpp, ensuring ClusterName is properly validated across both components.
src/Craned/Core/JobManager.cpp (2)
58-78: LGTM! Node ID calculation is properly implemented.The lambda correctly computes the node index by matching the hostname against the nodelist, with appropriate error handling that returns "-1" for invalid cases.
112-129: Environment variables correctly populated.The new environment variables provide comprehensive job metadata including submission context (CRANE_SUBMIT_HOST, CRANE_SUBMIT_DIR, CRANE_EXCLUDES), resource distribution (CRANE_GPUS_PER_NODE, CRANE_NTASKS_PER_NODE, CRANE_MEM_PER_NODE), and node-specific identifiers (CRANE_NODEID, CRANE_CLUSTER_NAME). The formatting and value sources are appropriate.
Note: CRANE_MEM_PER_CPU depends on the division operation flagged in the previous comment.
ac7226b to
b12aeb8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/Craned/Core/JobManager.cpp (1)
58-78: Consider logging when hostname lookup fails.The lambda correctly handles the case where the current hostname is not found in the nodelist by returning "-1". However, this could indicate a configuration issue or unexpected state. Consider adding a log warning when this occurs to aid debugging.
Example:
if (node_idx >= daemon_step_to_d.nodelist().size()) { + CRANE_WARN("Hostname '{}' not found in job nodelist", host_name_view); return std::to_string(-1); // invalid }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/CraneCtld/CtldPublicDefs.cpp(2 hunks)src/Craned/Core/JobManager.cpp(2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/Craned/Core/JobManager.cpp
🧬 Code graph analysis (1)
src/Craned/Core/JobManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(156-156)
🔇 Additional comments (3)
src/CraneCtld/CtldPublicDefs.cpp (1)
335-337: LGTM! Metadata propagation is consistent.The changes correctly populate account, qos, partition, submit_hostname, total_gpus, cwd, ntasks_per_node, cpus_per_task, and exclude_nodelist fields from the job object into JobToD and StepToD messages. The initialization and recovery paths are also updated consistently.
Also applies to: 376-378, 420-427, 536-538
src/Craned/Core/JobManager.cpp (2)
80-93: Resource calculations look correct.The division by zero protection on lines 91-93 properly addresses the past review comment by checking
std::abs(cpus_on_node) > 1e-8before dividing. The GPU per-node calculation on line 83 uses integer division, which will truncate if GPUs aren't evenly distributed across nodes (e.g., 5 GPUs / 2 nodes = 2 GPUs per node). This is likely acceptable behavior for a scheduling system.
114-131: LGTM! Environment variable expansion is comprehensive.The new environment variables provide useful job metadata including excludes, submit directory, per-task/per-node resource allocations, GPU counts, cluster name, node ID, and submit host. All values are computed from validated sources and properly formatted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/CraneCtld/TaskScheduler.cpp(1 hunks)
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: 2025-05-25T04:11:27.740Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:396-396
Timestamp: 2025-05-25T04:11:27.740Z
Learning: In TaskManager.cpp, GetCrunMeta() calls don't need null checks because they're only called in contexts where the task is guaranteed to be a CRUN task (e.g., SetupChildProcessCrunX11_ is only called when step->IsCrun() && x11() conditions are met), ensuring the metadata will always be valid.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
src/CraneCtld/TaskScheduler.cpp
Outdated
| if (!task->password_entry) { | ||
| CRANE_DEBUG("password_entry empty"); | ||
| return std::unexpected(CraneErrCode::ERR_INVALID_UID); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upgrade the log level and improve the log message.
A null password_entry at this point indicates either a programming error or a serious validation failure upstream. CRANE_DEBUG is too low for this condition—it should be at least CRANE_WARN or CRANE_ERROR to ensure visibility in production logs.
Additionally, the log message should include task context (e.g., task ID, user ID) to aid debugging.
Apply this diff to improve logging:
if (!task->password_entry) {
- CRANE_DEBUG("password_entry empty");
+ CRANE_ERROR("Task submission failed: password_entry is null for uid {}. Task ID: {}",
+ task->uid, task->TaskId());
return std::unexpected(CraneErrCode::ERR_INVALID_UID);
}Additionally, consider verifying that all upstream code paths properly initialize password_entry:
#!/bin/bash
# Description: Find where TaskInCtld objects are created and verify password_entry initialization
# Search for TaskInCtld constructor calls and password_entry assignments
ast-grep --pattern $'new TaskInCtld($$$)'
# Search for password_entry field assignments
rg -n "password_entry\s*=" --type=cpp -C 3🤖 Prompt for AI Agents
In src/CraneCtld/TaskScheduler.cpp around lines 1375 to 1378, the code logs a
missing password_entry at CRANE_DEBUG without context; change the log to
CRANE_ERROR (or at minimum CRANE_WARN) and include task context such as task->id
and task->user_id (or equivalent fields) in the message to make the failure
visible and actionable, return the same error after logging; additionally, run
an audit of TaskInCtld creation sites and any assignments to password_entry
(using the suggested rg/ast-grep commands) to ensure upstream code always
initializes password_entry and add defensive checks where needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
CMakeLists.txt(1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Build, Upload and Test
CMakeLists.txt
[error] 190-190: CRANE_ADDRESS_SANITIZER and CRANE_THREAD_SANITIZER cannot be enabled at the same time.
CMakeLists.txt
Outdated
| option(CRANE_ADDRESS_SANITIZER "Enable address sanitizer" ON) | ||
|
|
||
| option(CRANE_THREAD_SANITIZER "Enable thread sanitizer" OFF) | ||
| option(CRANE_THREAD_SANITIZER "Enable thread sanitizer" ON) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: Conflicting sanitizer defaults break the build.
ThreadSanitizer and AddressSanitizer require separate QA runs and cannot be enabled together. Setting both options to ON directly violates the constraint at line 189–191, causing an immediate fatal CMake error on any configuration.
Expected behavior: Only one sanitizer should be enabled at a time by default. The pipeline failure confirms this is a blocker.
Clarify the PR intent: Does this revert the sanitizer enablement (should both be OFF), or does it require a different approach to select one sanitizer?
If the intent is to enable sanitizers by default, either:
- Set one to
ONand the other toOFF, or - Add logic to auto-select based on build type or compiler
- option(CRANE_ADDRESS_SANITIZER "Enable address sanitizer" ON)
+ option(CRANE_ADDRESS_SANITIZER "Enable address sanitizer" OFF)
- option(CRANE_THREAD_SANITIZER "Enable thread sanitizer" ON)
+ option(CRANE_THREAD_SANITIZER "Enable thread sanitizer" OFF)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| option(CRANE_ADDRESS_SANITIZER "Enable address sanitizer" ON) | |
| option(CRANE_THREAD_SANITIZER "Enable thread sanitizer" OFF) | |
| option(CRANE_THREAD_SANITIZER "Enable thread sanitizer" ON) | |
| option(CRANE_ADDRESS_SANITIZER "Enable address sanitizer" OFF) | |
| option(CRANE_THREAD_SANITIZER "Enable thread sanitizer" OFF) |
a1da580 to
7bc4136
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/CraneCtld/TaskScheduler.cpp (1)
1375-1378: Improve logging for nullpassword_entryguardThe null-check itself is correct and avoids a crash, but a missing
password_entryat this point indicates a serious validation/programming issue. Logging onlyCRANE_DEBUG("password_entry empty")without any task context makes this very hard to diagnose in production.Consider upgrading the log level and including at least
task->uidandtask->TaskId():- if (!task->password_entry) { - CRANE_DEBUG("password_entry empty"); - return std::unexpected(CraneErrCode::ERR_INVALID_UID); - } + if (!task->password_entry) { + CRANE_ERROR( + "Task submission failed: password_entry is null for uid {} (task #{})", + task->uid, task->TaskId()); + return std::unexpected(CraneErrCode::ERR_INVALID_UID); + }
🧹 Nitpick comments (1)
src/CraneCtld/CtldPublicDefs.cpp (1)
420-428: Inconsistency confirmed: use member variables for alignment with CommonStepInCtld pattern.The verification confirms that both
ntasks_per_nodeandcpus_per_taskare member variables of theDaemonStepInCtldclass (declared in CtldPublicDefs.h lines 466–467 and initialized at lines 298–299 from job data).Lines 423–424 currently access these fields through the job pointer, while the similar
CommonStepInCtld::GetStepToD()method at lines 661–662 uses member variables directly. This inconsistency should be resolved by using the member variables, which are already properly initialized and maintained.All referenced APIs are confirmed to exist:
ResourceView::GpuCount()is declared and implementedsubmit_hostnamefield exists in protobuf messages (PublicDefs.proto)- Both classes have the necessary member variables
Apply this diff for consistency:
step_to_d.set_submit_hostname(job->TaskToCtld().submit_hostname()); step_to_d.set_total_gpus(this->requested_node_res_view.GpuCount()); step_to_d.set_cwd(this->job->cwd); - step_to_d.set_ntasks_per_node(this->job->ntasks_per_node); - step_to_d.set_cpus_per_task(this->job->TaskToCtld().cpus_per_task()); + step_to_d.set_ntasks_per_node(this->ntasks_per_node); + step_to_d.set_cpus_per_task(static_cast<double>(this->cpus_per_task));
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/CraneCtld/CtldPublicDefs.cpp(2 hunks)src/CraneCtld/TaskScheduler.cpp(1 hunks)src/Craned/Core/JobManager.cpp(2 hunks)
🧰 Additional context used
🧠 Learnings (8)
📚 Learning: 2025-05-25T04:11:27.740Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:396-396
Timestamp: 2025-05-25T04:11:27.740Z
Learning: In TaskManager.cpp, GetCrunMeta() calls don't need null checks because they're only called in contexts where the task is guaranteed to be a CRUN task (e.g., SetupChildProcessCrunX11_ is only called when step->IsCrun() && x11() conditions are met), ensuring the metadata will always be valid.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/Craned/Core/JobManager.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (2)
src/CraneCtld/CtldPublicDefs.cpp (1)
376-379: LGTM! Job metadata correctly propagated to JobToD.The account, qos, and partition fields are properly initialized in both
InitFromJobandRecoverFromDb, ensuring these values are correctly propagated to the daemon.src/Craned/Core/JobManager.cpp (1)
54-133: Per-node job env additions and submit-host env look correctThe new per-node resource calculations (including the guarded
mem_per_cpudivision) andCRANE_SUBMIT_HOSTderived fromsubmit_hostname()are consistent with the existing resource/env model and avoid division-by-zero or out-of-range issues. No functional problems spotted here.
7bc4136 to
161474b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)
501-501: Consider removing commented-out code.The commented-out
GpuCount()method declaration should either be removed entirely if it's no longer needed, or restored with proper implementation if GPU count tracking is required. Commented-out code without explanation is generally considered dead code and makes maintenance harder.If this method is definitively unused, apply this diff:
uint64_t MemoryBytes() const; - // uint64_t GpuCount() const; AllocatableResource& GetAllocatableRes() { return allocatable_res; }src/Craned/Core/JobManager.cpp (2)
58-78: Consider commenting out the unused lambda.The
node_id_to_strlambda is defined but never invoked—its only usage at Line 131 is commented out. Consider commenting out or removing this lambda definition to avoid confusion and reduce code clutter until the feature is ready to be enabled.
80-93: Consider commenting out unused calculations.The calculations for
gpus_per_node,mem_in_node,cpus_on_node, andmem_per_cpuare correct and include proper guards against division by zero (addressing the past review concern). However, these values are largely unused since the correspondingenv_map.emplacecalls at Lines 114-132 are commented out. Consider commenting out these calculations as well to avoid wasting CPU cycles until the environment variables are ready to be enabled.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
src/CraneCtld/CtldPublicDefs.cpp(2 hunks)src/Craned/Core/JobManager.cpp(2 hunks)src/Utilities/PublicHeader/PublicHeader.cpp(1 hunks)src/Utilities/PublicHeader/include/crane/PublicHeader.h(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/Utilities/PublicHeader/PublicHeader.cpp
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/Craned/Core/JobManager.cpp
🔇 Additional comments (3)
src/CraneCtld/CtldPublicDefs.cpp (2)
376-379: LGTM!The population of JobToD fields (account, qos, partition) from the job object is straightforward and correct.
420-428: Verify the commented-outtotal_gpusfield.The StepToD population looks correct. However, Line 421 has
set_total_gpuscommented out, which suggests this feature may not be fully implemented yet. If this field is intended to be used, please verify whether it should be enabled or if there are blockers preventing its activation.src/Craned/Core/JobManager.cpp (1)
114-132: Verify C++20 support before enabling commented code.The commented-out environment variable assignments use
std::format(e.g., Line 118, 125, 130), which requires C++20. Before uncommenting this code, ensure the project is compiled with C++20 or later. If C++20 is not available, consider usingfmt::formatfrom the {fmt} library orstd::to_stringwith manual precision control.
161474b to
92fc772
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)
95-96: Public constant added for incomplete GPU feature—already flagged in previous review.As identified in the previous review,
kResourceTypeGpuis only referenced in commented-out code (line 710 of PublicHeader.cpp). Adding this to the public API surface when the GPU counting feature is incomplete creates confusion and violates API stability principles.
🧹 Nitpick comments (2)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)
501-501: Remove commented-out code rather than keeping it in source.Commented-out declarations clutter the codebase. If
GpuCount()is not ready for use, remove the declaration entirely rather than commenting it out. Version control preserves the history if you need to restore it later.src/Utilities/PublicHeader/PublicHeader.cpp (1)
709-720: Remove commented-out implementation rather than keeping it in source.The entire
GpuCount()implementation is commented out. If this feature is incomplete or deferred, remove the code entirely rather than leaving it commented. Version control maintains the history if you need to restore it in the future. Keeping commented blocks creates maintenance overhead and can confuse readers about which code is actually active.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
src/CraneCtld/CtldPublicDefs.cpp(2 hunks)src/Craned/Core/JobManager.cpp(2 hunks)src/Utilities/PublicHeader/PublicHeader.cpp(1 hunks)src/Utilities/PublicHeader/include/crane/PublicHeader.h(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- src/Craned/Core/JobManager.cpp
- src/CraneCtld/CtldPublicDefs.cpp
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/CraneCtld/CtldGrpcServer.cpp:1662-1669
Timestamp: 2025-06-30T08:38:18.748Z
Learning: In CraneSched's CPU frequency feature, validation is split between controller and compute nodes: CtldGrpcServer only validates governor against configuration (g_config.CpuFreqGovernors), while detailed validation of low/high frequency values is deferred to TaskManager.cpp via CpuFreqValidateAndSet() where actual hardware capabilities can be checked.
f0ea766 to
cfd34b2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/Craned/Core/Craned.cpp (1)
376-385: Remove commented-out code or add a TODO comment explaining the deprecation.Verification confirms that
g_config.CraneClusterNameis actively used in CTLD (event serialization in CranedMetaContainer.cpp and CtldGrpcServer.cpp) and safely initialized there. In CRANED, both the initialization and usage are consistently commented out (JobManager.cpp:129), so no runtime errors occur. However, leaving large blocks of commented code without explanation creates maintenance burden. Consider either:
- Remove the commented block entirely if this is permanent deprecation in CRANED, or
- Add a TODO comment explaining why this code is intentionally disabled (e.g., "TODO: Remove after v1.0 release" or "Feature flag for future cluster-name support")
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
src/CraneCtld/CraneCtld.cpp(3 hunks)src/CraneCtld/CtldPublicDefs.cpp(2 hunks)src/Craned/Core/Craned.cpp(1 hunks)src/Craned/Core/JobManager.cpp(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/Craned/Core/JobManager.cpp
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/CraneCtld/CraneCtld.cppsrc/Craned/Core/Craned.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/CraneCtld/CraneCtld.cppsrc/Craned/Core/Craned.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/CraneCtld/CtldPublicDefs.cpp
🧬 Code graph analysis (1)
src/CraneCtld/CraneCtld.cpp (2)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(156-156)src/Utilities/PublicHeader/String.cpp (2)
CalcConfigCRC32(431-433)CalcConfigCRC32(431-431)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: prepare
🔇 Additional comments (3)
src/Craned/Core/Craned.cpp (1)
253-253: LGTM: Configuration CRC tracking added.The CRC32 calculation of the configuration is appropriately placed immediately after loading the YAML config, enabling configuration version tracking or change detection.
src/CraneCtld/CtldPublicDefs.cpp (1)
376-379: Uncomment and populate account, qos, and partition fields in JobToD message.These fields are actively used by Craned nodes to set environment variables (CRANE_JOB_ACCOUNT, CRANE_JOB_PARTITION, CRANE_JOB_QOS) for job execution. Keeping them commented out will cause these environment variables to be unset, breaking job accounting, QoS enforcement, and partition-specific configurations on execution nodes.
Restore the assignments at lines 376-379:
job_to_d.set_account(job->account); job_to_d.set_qos(job->qos); job_to_d.set_partition(job->partition_id);⛔ Skipped due to learnings
Learnt from: L-Xiafeng Repo: PKUHPC/CraneSched PR: 496 File: src/CraneCtld/CtldPublicDefs.h:756-763 Timestamp: 2025-05-09T01:56:53.142Z Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.src/CraneCtld/CraneCtld.cpp (1)
101-102: CRC calculation location is correct.The CRC is properly calculated immediately after loading the config file at line 101. Subsequent code only reads from the
configobject (lines 103+:config["ClusterName"],config["CraneBaseDir"], etc.) with no modifications. This design is consistent with the parallel implementation inCraned.cpp:254, where the same pattern applies. Calculating the CRC early ensures it captures the raw file state for integrity verification between services.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/CraneCtld/CraneCtld.cpp (1)
133-143: Commented‑outClusterNamevalidation weakens config safetyThis block previously enforced a non‑empty
ClusterNameand failed fast otherwise; nowCraneCtldwill start even ifClusterNameis missing/empty, leavingg_config.CraneClusterNameat its default or an empty string. That can break identification in multi‑cluster or event/metrics consumers that expect a real cluster name.If
ClusterNameis still conceptually required, consider restoring this validation (and potentially removing any now‑redundant simple assignment) so misconfigured clusters don’t start silently:- // if (config["ClusterName"]) { - // g_config.CraneClusterName = - // config["ClusterName"].as<std::string>(); if - // (g_config.CraneClusterName.empty()) { - // CRANE_ERROR("ClusterName is empty."); - // std::exit(1); - // } - // } else { - // CRANE_ERROR("ClusterName is empty."); - // std::exit(1); - // } + if (config["ClusterName"]) { + g_config.CraneClusterName = + config["ClusterName"].as<std::string>(); + if (g_config.CraneClusterName.empty()) { + CRANE_ERROR("ClusterName is empty."); + std::exit(1); + } + } else { + CRANE_ERROR("ClusterName is empty."); + std::exit(1); + }If the intent is to make
ClusterNameoptional, it would be good to document that and double‑check all downstream uses to ensure empty is handled correctly.
🧹 Nitpick comments (1)
src/CraneCtld/CraneCtld.cpp (1)
106-110: Remove duplicateCraneBaseDirassignment
g_config.CraneBaseDiris assigned twice in a row with the same expression; the second assignment is redundant noise.You can safely keep a single assignment:
- g_config.CraneBaseDir = - YamlValueOr(config["CraneBaseDir"], kDefaultCraneBaseDir); - - g_config.CraneBaseDir = - YamlValueOr(config["CraneBaseDir"], kDefaultCraneBaseDir); + g_config.CraneBaseDir = + YamlValueOr(config["CraneBaseDir"], kDefaultCraneBaseDir);
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/CraneCtld/CraneCtld.cpp(2 hunks)src/Utilities/PublicHeader/include/crane/PublicHeader.h(2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/CraneCtld/CraneCtld.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/CraneCtld/CraneCtld.cpp
🧬 Code graph analysis (1)
src/CraneCtld/CraneCtld.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(156-156)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (2)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (2)
95-95: Commenting outkResourceTypeGpucorrectly retracts unfinished public APIThis line is now purely a comment, so the GPU resource type constant is no longer part of the public header/API, which aligns with reverting the unfinished GPU feature without impacting existing consumers.
501-501: Commented-outGpuCount()keeps ResourceView interface stableBy commenting out
GpuCount()while its implementation is inactive, theResourceViewpublic interface stays consistent and avoids exposing a non-functional method during this revert.
4886991 to
8be9527
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)
95-95: Commented-out GPU API stubs are safe but could be cleaned upCommenting out
kResourceTypeGpuandResourceView::GpuCount()avoids exposing a half-finished GPU API in the public header, which matches the revert intent and addresses earlier concerns. If this functionality is not coming back soon, consider deleting these commented lines (or moving any future GPU helpers to an internal header) to keep the public surface minimal and avoid confusing dead code.Also applies to: 470-517
src/Utilities/PublicHeader/PublicHeader.cpp (1)
709-720: GpuCount implementation correctly disabled; consider full removal
ResourceView::GpuCount()is now entirely commented out, consistent with the header and with reverting unfinished GPU-counting behavior. From a runtime standpoint this is safe. To reduce noise, consider removing this block entirely (or reintroducing it only when the GPU feature is finalized) rather than keeping a large commented implementation.src/Craned/Core/JobManager.cpp (1)
58-93: Large commented env-var block: harmless but noisyThe added code in
GetJobEnvMapis fully commented out (helper lambda, GPU/CPU/memory calculations, and extraCRANE_*env entries), so current behavior is unchanged apart from reverting the extra envs. That’s safe for this revert.If the plan is to disable these env vars long-term, consider either:
- Removing the commented-out logic entirely, or
- Adding a brief TODO comment explaining that this block is a parked implementation tied to a future GPU/env feature or to the reverted commit ID.
This will keep the function easier to read for future maintainers.
Also applies to: 114-132
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
protos/PublicDefs.proto(2 hunks)src/CraneCtld/CraneCtld.cpp(2 hunks)src/CraneCtld/CtldPublicDefs.cpp(2 hunks)src/Craned/Core/Craned.cpp(1 hunks)src/Craned/Core/JobManager.cpp(2 hunks)src/Utilities/PublicHeader/PublicHeader.cpp(1 hunks)src/Utilities/PublicHeader/include/crane/PublicHeader.h(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- protos/PublicDefs.proto
- src/Craned/Core/Craned.cpp
- src/CraneCtld/CtldPublicDefs.cpp
- src/CraneCtld/CraneCtld.cpp
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/Craned/Core/JobManager.cpp
e17f6b6 to
e081879
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/Craned/Core/JobManager.cpp (1)
122-122: Duplicate issue:total_gpus()field access flagged above.This line also accesses
daemon_step_to_d.total_gpus(), which was flagged in the earlier review comment for lines 80-84.
🧹 Nitpick comments (1)
src/Craned/Core/JobManager.cpp (1)
58-78: Clarify or remove commented-out node ID code.Large blocks of commented-out code (the
node_id_to_strlambda and theCRANE_NODEIDenvironment variable) reduce readability. If this functionality is temporarily disabled or planned for future use, add a TODO comment explaining why. Otherwise, remove it to keep the codebase clean.Also applies to: 130-130
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/Craned/Core/JobManager.cpp(2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/Craned/Core/JobManager.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/Craned/Core/JobManager.cpp
🧬 Code graph analysis (1)
src/Craned/Core/JobManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(156-156)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (4)
src/Craned/Core/JobManager.cpp (4)
131-131: Original review comment is incorrect.The verification confirms that
submit_hostnameis defined in the protobuf schema (found at two locations in proto files). The code at line 131 in src/Craned/Core/JobManager.cpp correctly accesses this existing field viadaemon_step_to_d.submit_hostname(). No compilation error would result from this code.Likely an incorrect or invalid review comment.
80-84: No issues found. Thetotal_gpusfield exists in the protobuf definition and is properly accessible.The verification confirms that
total_gpusis defined in the proto file (line 333:uint32 total_gpus = 27;) and is successfully used in JobManager.cpp at lines 83 and 122. The original concern that the field was removed or inaccessible is incorrect. The code at lines 80-84 will compile and execute without errors.
128-128: ****The
CraneClusterNamefield is defined insrc/Craned/Core/CranedPublicDefs.hasstd::string CraneClusterName, and is initialized from config insrc/Craned/Core/Craned.cppduring startup. The usage at line 128 is valid and will not cause a compilation error. The concern raised in the review comment is not supported by the codebase.Likely an incorrect or invalid review comment.
114-115: The verification confirms thatexclude_nodelistdoes exist in the protobuf schema. The field is defined at line 334 asrepeated string exclude_nodelist = 28;, and the method call at line 115 in JobManager.cpp is valid.However, to complete the verification as originally requested by the review comment, I need to check the other potentially problematic fields mentioned in the AI summary. Let me verify those as well:
Now let me check the exact context and how
exclude_nodelist()is being used, since protobuf repeated string fields don't typically have afieldname()accessor:Perfect! The web search confirms that for a repeated string field, the generated accessor
foo() constreturns the underlyingRepeatedPtrField<string>&that stores the field's elements. This meansexclude_nodelist()is the correct way to access the field, and it will work withabsl::StrJoin.The verification confirms that the code at line 115 is correct—the
exclude_nodelist()method does exist in the protobuf and the usage withabsl::StrJoinis valid.Based on the script execution and web search results:
The review comment's concern about
exclude_nodelist()is unsubstantiated.The field is properly defined in the proto schema (line 334:
repeated string exclude_nodelist = 28;) and is being correctly accessed via the generated method. There are no compilation errors expected for this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/CraneCtld/CraneCtld.cpp (1)
101-102: Remove or comment lines 101-102, or restore validation at lines 130-135 for consistent ClusterName handling.Lines 101-102 actively read and assign
ClusterNamefrom config, while the validation that checks for empty values (lines 130-135) is commented out. This allows emptyClraneClusterNameto reach event emission calls atCtldGrpcServer.cpp:1771andCranedMetaContainer.cpp:780.If disabling ClusterName handling (as suggested by the commented validation), lines 101-102 should also be commented or removed. If making ClusterName optional, restore the validation to prevent empty values from being passed to events. The current state is semantically inconsistent.
♻️ Duplicate comments (1)
src/CraneCtld/CraneCtld.cpp (1)
130-140: Duplicate concern: ClusterName validation removal.This change was already flagged in a previous review as potentially allowing invalid configurations. The commented-out validation ensured ClusterName was non-empty, preventing issues in multi-cluster deployments.
If ClusterName is being intentionally disabled across the system (consistent with the broader PR changes), this should be clearly documented. Otherwise, the validation should be restored.
🧹 Nitpick comments (1)
protos/PublicDefs.proto (1)
177-177: Usereservedkeywords for proper protobuf field deprecation.The verification confirms that all usages of these fields are already commented out and no active code references them. However, the original recommendation to use protobuf's
reservedkeyword is sound: it prevents these field numbers and names from being accidentally reused in future development, which could cause wire-format compatibility issues.The suggested refactor remains appropriate:
message TaskToCtld { ... - // string submit_hostname = 40; + reserved 40; + reserved "submit_hostname"; } message StepToD { ... - // string submit_hostname = 26; - // uint32 total_gpus = 27; - // repeated string exclude_nodelist = 28; + reserved 26, 27, 28; + reserved "submit_hostname", "total_gpus", "exclude_nodelist"; }This follows protobuf deprecation best practices and prevents future issues with field reuse.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
protos/PublicDefs.proto(2 hunks)src/CraneCtld/CraneCtld.cpp(1 hunks)src/CraneCtld/CtldPublicDefs.cpp(2 hunks)src/Craned/Common/CgroupManager.cpp(1 hunks)src/Craned/Core/Craned.cpp(1 hunks)src/Craned/Core/JobManager.cpp(2 hunks)src/Utilities/PublicHeader/PublicHeader.cpp(1 hunks)src/Utilities/PublicHeader/include/crane/PublicHeader.h(2 hunks)
✅ Files skipped from review due to trivial changes (2)
- src/Craned/Core/JobManager.cpp
- src/CraneCtld/CtldPublicDefs.cpp
🚧 Files skipped from review as they are similar to previous changes (3)
- src/Craned/Core/Craned.cpp
- src/Utilities/PublicHeader/PublicHeader.cpp
- src/Utilities/PublicHeader/include/crane/PublicHeader.h
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/Craned/Common/CgroupManager.cppsrc/CraneCtld/CraneCtld.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/CraneCtld/CraneCtld.cpp
🔇 Additional comments (1)
src/Craned/Common/CgroupManager.cpp (1)
669-680: ****The format change is correct. The documentation explicitly defines
CRANE_MEM_PER_NODEas an integer value in MB without any unit suffix (e.g.,4096not4096M). The usage examples show users appending "MB" themselves when displaying the value. No internal code in the codebase consumes or parses this variable—it is passed through to job environments unchanged. The change aligns the implementation with the documented specification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/Craned/Core/JobManager.cpp (1)
58-78: Remove unnecessary capture ofthisin the lambda.The
node_id_to_strlambda capturesthisbut doesn't use any member variables or functions from the enclosing class. Onlynodelistis used, which is already captured.Apply this diff to remove the unnecessary capture:
- auto node_id_to_str = [nodelist, this]() -> std::string { + auto node_id_to_str = [nodelist]() -> std::string {
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/Craned/Core/JobManager.cpp(2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/Craned/Core/JobManager.cpp
🧬 Code graph analysis (1)
src/Craned/Core/JobManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(156-156)
🔇 Additional comments (2)
src/Craned/Core/JobManager.cpp (2)
91-93: LGTM! Division by zero guard is properly implemented.The
mem_per_cpucalculation correctly guards against division by zero usingstd::abs(cpus_on_node) > 1e-8, which addresses the previous review concern.
83-83: Proto fields are present and not commented out—review concern is invalid.Verification confirms all six protobuf fields accessed in JobManager.cpp are defined in the StepToD message (protos/PublicDefs.proto, lines 299–335) and are not commented out:
total_gpus(line 327)exclude_nodelist(line 328)cwd(line 318)cpus_per_task(line 325)ntasks_per_node(line 324)submit_hostname(line 326)The code will compile successfully; the original review concern was based on an inaccurate premise.
Likely an incorrect or invalid review comment.
95b2780 to
d3924d7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/CraneCtld/CtldPublicDefs.cpp (1)
376-386: Critical: Unresolved merge conflict markers in code.The code contains unresolved Git merge conflict markers that will cause compilation failures:
<<<<<<< HEAD ======= >>>>>>> 76168279 (rebase)This must be resolved before the code can be merged.
Apply this diff to remove the conflict markers:
job_to_d.set_account(job->account); job_to_d.set_qos(job->qos); job_to_d.set_partition(job->partition_id); -<<<<<<< HEAD -======= ->>>>>>> 76168279 (rebase) *job_to_d.mutable_res() = crane::grpc::ResourceInNode(m_allocated_res_.at(craned_id));
🧹 Nitpick comments (2)
src/Craned/Core/JobManager.cpp (2)
139-233: Clarify JobManager loop shutdown semantics between ctor/dtor andWait()The destructor sets
m_is_ending_now_ = trueand then joins the uvw thread, whileWait()only joins the thread without flipping the flag or stopping the loop. That means any caller usingWait()as the sole shutdown hook can block indefinitely unless something else has already setm_is_ending_now_/stopped the loop.If that’s not a strict invariant of your call sites, consider either:
- Setting
m_is_ending_now_ = true(or explicitly stopping the loop) insideWait(), or- Adding a brief comment documenting that
Wait()must only be called after an external shutdown trigger (e.g.,EvSigintCb_) has been invoked.Also applies to: 278-282, 514-516
1140-1238: Avoid holdingstep_map_mtxwhile calling potentially blockingTerminateTaskRPCsIn
EvCleanTerminateTaskQueueCb_you acquire both:
auto map_ptr = m_job_map_.GetMapExclusivePtr();absl::MutexLock lk(job_instance->step_map_mtx.get());and then, while still under
step_map_mtx, you loop overterminate_step_idsand callstub->TerminateTask(...)for each step.If
TerminateTaskis a synchronous RPC or otherwise blocking, this can:
- Serialize all other operations needing
step_map_mtxfor that job (status updates, further cleanup).- Risk deadlock if the supervisor’s callbacks re-enter JobManager and try to take the same mutex.
Safer pattern:
- Under the locks, determine
terminate_step_ids, whether to mark-as-orphaned, and collect the needed(job_id, step_id)pairs (and maybe stubs).- Release
step_map_mtx(and, if possible, the job-map exclusive pointer).- Then, outside the lock, iterate the collected list and call
TerminateTask.This keeps the critical section small and removes lock‑held RPC calls without changing observable semantics.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
protos/PublicDefs.proto(2 hunks)src/CraneCtld/CraneCtld.cpp(2 hunks)src/CraneCtld/CtldPublicDefs.cpp(2 hunks)src/Craned/Common/CgroupManager.cpp(1 hunks)src/Craned/Core/Craned.cpp(1 hunks)src/Craned/Core/JobManager.cpp(1 hunks)src/Utilities/PublicHeader/PublicHeader.cpp(1 hunks)src/Utilities/PublicHeader/include/crane/PublicHeader.h(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- src/Craned/Common/CgroupManager.cpp
- protos/PublicDefs.proto
🧰 Additional context used
🧠 Learnings (9)
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/CraneCtld/CtldPublicDefs.cppsrc/Craned/Core/JobManager.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/CraneCtld/CtldPublicDefs.cppsrc/Craned/Core/Craned.cppsrc/Craned/Core/JobManager.cppsrc/CraneCtld/CraneCtld.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/CraneCtld/CtldPublicDefs.cppsrc/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/Craned/Core/Craned.cppsrc/Craned/Core/JobManager.cppsrc/CraneCtld/CraneCtld.cpp
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.
Applied to files:
src/Craned/Core/JobManager.cpp
📚 Learning: 2025-04-02T09:30:13.014Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/JobManager.cpp:141-149
Timestamp: 2025-04-02T09:30:13.014Z
Learning: In JobManager, if a uid exists in m_uid_to_job_ids_map_, its corresponding task_ids set is guaranteed to be non-empty due to the invariant maintained in AllocJobs and FreeJobs methods.
Applied to files:
src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-25T04:08:03.273Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/Supervisor.cpp:68-69
Timestamp: 2025-05-25T04:08:03.273Z
Learning: In the Crane supervisor component (src/Craned/Supervisor/), the supervisor process communicates with Craned through STDOUT using protobuf messages. The supervisor must not send any information to STDOUT before sending the "ready" message to Craned, as this would interfere with the inter-process communication protocol. Therefore, adding logging statements that might write to STDOUT before the ready message is sent could break the communication.
Applied to files:
src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/Craned/Core/JobManager.cpp
📚 Learning: 2025-04-02T10:11:33.562Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/CgroupManager.cpp:685-685
Timestamp: 2025-04-02T10:11:33.562Z
Learning: In the CgroupManager's GetJobBpfMapCgroupsV2 method, the developer has confirmed that cg_ino_job_id_map will always contain the key->cgroup_id element, making the CRANE_ASSERT check appropriate rather than requiring additional error handling.
Applied to files:
src/Craned/Core/JobManager.cpp
🧬 Code graph analysis (2)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)
src/Utilities/PublicHeader/PublicHeader.cpp (2)
GpuCount(709-720)GpuCount(709-709)
src/Craned/Core/Craned.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(160-160)
🔇 Additional comments (6)
src/CraneCtld/CraneCtld.cpp (1)
119-128: First ClusterName validation block looks correct.The validation properly checks for the presence and non-emptiness of ClusterName, exiting early on failure.
src/Utilities/PublicHeader/PublicHeader.cpp (1)
709-720: GpuCount implementation looks correct.The implementation properly handles the case where no GPUs exist (returns 0), and correctly sums both untyped and typed GPU counts from the device map. The use of
std::views::valueswithstd::accumulateis clean and idiomatic.src/Utilities/PublicHeader/include/crane/PublicHeader.h (2)
105-106: GPU resource type constant is now properly used.The constant
kResourceTypeGpuis correctly defined and is actively used by theGpuCount()implementation in PublicHeader.cpp. This addresses the previous review concern about incomplete code.
520-520: GpuCount declaration is properly added.The declaration follows the existing pattern alongside
CpuCount()andMemoryBytes(), providing a complete public API for GPU resource counting.src/CraneCtld/CtldPublicDefs.cpp (1)
423-431: New StepToD field population looks correct.The additions properly populate the new protobuf fields:
submit_hostnamefrom job's TaskToCtldtotal_gpususing the newGpuCount()methodcwdfrom job's cwd (addresses previous review concern)ntasks_per_nodeandcpus_per_taskfrom jobexclude_nodelistiterating over job's excluded_nodessrc/Craned/Core/JobManager.cpp (1)
1122-1138:QueryTaskInfoOfUidcorrectly relies on uid→job_ids invariant
QueryTaskInfoOfUiduses:if (auto task_ids = this->m_uid_to_job_ids_map_[uid]; task_ids) { info.cgroup_exists = true; info.job_cnt = task_ids->size(); info.first_task_id = *task_ids->begin(); }Given the established invariant that if a uid exists in
m_uid_to_job_ids_map_then its task-id set is non-empty (maintained byAllocJobs/FreeJobs), dereferencingbegin()here is safe and matches the intended model of “first task id for this uid”. Implementation looks consistent with that contract.Based on learnings, this matches the maintained non‑empty set invariant for
m_uid_to_job_ids_map_.
e76117a to
db1b3fc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/Craned/Core/JobManager.cpp (1)
60-60: Remove unusedthiscapture from lambda.The
thispointer is captured but never used within the lambda body. This creates unnecessary coupling.- auto node_id_to_str = [nodelist, this]() -> std::string { + auto node_id_to_str = [nodelist]() -> std::string {
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/Craned/Core/JobManager.cpp(2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/Craned/Core/JobManager.cpp
🧬 Code graph analysis (1)
src/Craned/Core/JobManager.cpp (1)
src/Craned/Common/CgroupManager.cpp (2)
GetResourceEnvMapByResInNode(664-676)GetResourceEnvMapByResInNode(664-665)
🔇 Additional comments (2)
src/Craned/Core/JobManager.cpp (2)
54-56: LGTM on the logging addition.The initialization logging is helpful for debugging environment setup issues.
115-132: Environment variable additions look good overall.The new environment variables (
CRANE_EXCLUDES,CRANE_SUBMIT_DIR,CRANE_CPUS_PER_TASK,CRANE_NTASKS_PER_NODE,CRANE_GPUS,CRANE_GPUS_PER_NODE,CRANE_MEM_PER_CPU,CRANE_NTASKS,CRANE_CLUSTER_NAME,CRANE_CPUS_ON_NODE,CRANE_NODEID,CRANE_SUBMIT_HOST) provide comprehensive job environment context. The formatting withstd::formatfor floating-point values is consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/Craned/Core/JobManager.cpp (1)
120-120: Clarify intent: duplicate key with inconsistent format.
CRANE_MEM_PER_NODEis already set at line 56 byCgroupManager::GetResourceEnvMapByResInNode()with an "M" suffix (e.g.,"1024M") using the job-level resource (job_to_d.res()).Line 120 attempts to emplace the same key using the daemon step-level resource (
daemon_step_to_d.res()) without a unit suffix. Sinceemplacedoesn't overwrite existing keys, this line has no effect.If the daemon step's memory value should take precedence, use
insert_or_assignand add the "M" suffix for consistency:- env_map.emplace("CRANE_MEM_PER_NODE", std::to_string(mem_in_node)); + env_map.insert_or_assign("CRANE_MEM_PER_NODE", std::format("{}M", mem_in_node));If the job-level value is correct, remove this line.
Note: A similar issue was flagged in a previous review and marked as addressed, but the current code still exhibits this behavior.
🧹 Nitpick comments (1)
src/Craned/Core/JobManager.cpp (1)
59-80: LGTM! Optional: remove unusedthiscapture.The lambda correctly handles hostname lookup and edge cases. The nested lambda issue from the previous review has been resolved.
Minor optimization: the lambda captures
thisbut doesn't use any class members. Consider removing it from the capture list.Apply this diff to remove the unused capture:
- auto node_id_to_str = [nodelist, this]() -> std::string { + auto node_id_to_str = [nodelist]() -> std::string {
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/Craned/Core/JobManager.cpp(2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/Craned/Core/JobManager.cpp
🧬 Code graph analysis (1)
src/Craned/Core/JobManager.cpp (2)
src/Craned/Common/CgroupManager.cpp (2)
GetResourceEnvMapByResInNode(664-676)GetResourceEnvMapByResInNode(664-665)src/Craned/Core/CranedPublicDefs.h (1)
g_config(160-160)
🔇 Additional comments (4)
src/Craned/Core/JobManager.cpp (4)
55-55: LGTM!The initialization log provides useful debugging context.
82-86: Verify integer division forgpus_per_nodeis intentional.The calculation uses integer division, which truncates any remainder. For example, 5 GPUs across 2 nodes results in
gpus_per_node = 2, discarding the remainder. This might be the intended behavior (representing the "base" GPU count per node), but it could be misleading if the actual GPU distribution is uneven (e.g., 3 on one node, 2 on another).Confirm that this truncation aligns with the intended semantics of the
CRANE_GPUS_PER_NODEenvironment variable.
87-95: LGTM!The memory and CPU calculations correctly handle unit conversions and guard against near-zero divisions.
115-119: LGTM!The environment variable assignments correctly populate job, resource, and submission metadata. The formatting and calculations (except for line 120, addressed separately) are appropriate.
Also applies to: 121-132
c4871e6 to
d531ccd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/Craned/Core/JobManager.cpp (1)
115-130:CRANE_MEM_PER_NODEduplication and format inconsistency withCgroupManager
env_mapalready containsCRANE_MEM_PER_NODEfromCgroupManager::GetResourceEnvMapByResInNode()(seesrc/Craned/Common/CgroupManager.cpp:663-675), where the value is formatted as"{}M"(megabytes with anMsuffix). Here you:env_map.emplace("CRANE_MEM_PER_NODE", std::to_string(mem_in_node));Because
emplacedoes not overwrite existing keys, this line is a no-op whenCgroupManagerhas already populated the key, and when it does insert (if the key was absent) the unit format is different (noMsuffix). That creates inconsistent behavior depending on the call path.To make the daemon step’s view authoritative and keep units consistent, consider:
- env_map.emplace("CRANE_MEM_PER_NODE", std::to_string(mem_in_node)); + env_map.insert_or_assign("CRANE_MEM_PER_NODE", + std::format("{}M", mem_in_node));The rest of the new CRANE_* env vars (
CPUS_PER_TASK,NTASKS_PER_NODE,GPUS,GPUS_PER_NODE,MEM_PER_CPU,NTASKS,CLUSTER_NAME,CPUS_ON_NODE,NODEID,SUBMIT_HOST) look consistent and useful.Based on relevant snippet from
src/Craned/Common/CgroupManager.cpp.
🧹 Nitpick comments (5)
src/Craned/Core/Craned.cpp (1)
408-417: ClusterName mandatory check looks correct; consider slightly clearer loggingThe new
ClusterNamevalidation correctly enforces that the field exists and is non‑empty before proceeding, consistent with howControlMachineand other required config entries are handled. From a UX perspective, you might optionally distinguish the “missing” vs “empty” cases in the log messages, but the current behavior is functionally fine.Example (optional) tweak:
- if (config["ClusterName"]) { - g_config.CraneClusterName = config["ClusterName"].as<std::string>(); - if (g_config.CraneClusterName.empty()) { - CRANE_ERROR("Cluster name is empty."); - std::exit(1); - } - } else { - CRANE_ERROR("Cluster name is empty."); - std::exit(1); - } + if (config["ClusterName"]) { + g_config.CraneClusterName = config["ClusterName"].as<std::string>(); + if (g_config.CraneClusterName.empty()) { + CRANE_ERROR("ClusterName is configured but empty."); + std::exit(1); + } + } else { + CRANE_ERROR("ClusterName is not configured."); + std::exit(1); + }src/Utilities/PublicHeader/PublicHeader.cpp (1)
709-720: Simplify GPU count accumulation and verify<numeric>includeThe logic of
GpuCount()looks correct (untyped + per-type counts). The current use ofstd::views::values+std::accumulateis valid but a bit indirect and introduces a dependency on<numeric>if it’s not already included. A straightforward loop overtype_mapis easier to read and avoids that dependency.You could simplify Lines 715–717 as:
- uint64_t type_sum = std::accumulate(std::views::values(type_map).begin(), - std::views::values(type_map).end(), - static_cast<uint64_t>(0)); - - return untyped_count + type_sum; + uint64_t type_sum = 0; + for (const auto& [_, cnt] : type_map) { + type_sum += cnt; + } + + return untyped_count + type_sum;If you prefer to keep
std::accumulate, please double-check that<numeric>is included somewhere in the translation unit or headers.src/Craned/Core/JobManager.cpp (3)
55-55: Log level for env init may be too chatty
GetJobEnvMap()can be called for every step/job; logging"Init env data"at INFO may flood logs on busy systems. Consider DEBUG/TRACE or adding identifiers (job/step) if you intend to keep it at INFO.
59-80: Avoid copyingnodelistand capturingthisunnecessarily innode_id_to_strThe lambda logic looks correct now, but there are two small cleanups:
daemon_step_to_d.nodelist()is likely a protobuf repeated field;auto nodelist = ...copies it. A const reference is enough here.thisis captured but never used.You can simplify to:
- auto nodelist = daemon_step_to_d.nodelist(); - auto node_id_to_str = [nodelist, this]() -> std::string { + const auto& nodelist = daemon_step_to_d.nodelist(); + auto node_id_to_str = [&nodelist]() -> std::string { uint32_t node_idx = 0; std::array<char, HOST_NAME_MAX + 1> host_name{}; if (gethostname(host_name.data(), host_name.size()) != 0) { return std::to_string(-1); // invalid } std::string_view host_name_view(host_name.data()); for (const auto& node_name : nodelist) { if (node_name == host_name_view) { break; } node_idx++; } if (node_idx >= nodelist.size()) { return std::to_string(-1); // invalid } return std::to_string(node_idx); };This avoids an extra container copy and makes the capture list precise.
82-96:mem_per_cpuguard is correct but the condition could be clearerThe new guard avoids division by zero, which fixes the earlier critical issue. Given that
cpu_core_limit()is conceptually “CPUs on node” and is expected to be strictly positive when valid, the epsilon-based check is slightly harder to read than a direct non-zero/positive test.If
cpu_core_limit()is integral or otherwise guaranteed non-negative, you might prefer:- auto cpus_on_node = - daemon_step_to_d.res().allocatable_res_in_node().cpu_core_limit(); - auto mem_per_cpu = (std::abs(cpus_on_node) > 1e-8) - ? (static_cast<double>(mem_in_node) / cpus_on_node) - : 0.0; + auto cpus_on_node = + daemon_step_to_d.res().allocatable_res_in_node().cpu_core_limit(); + auto mem_per_cpu = (cpus_on_node != 0) + ? (static_cast<double>(mem_in_node) / cpus_on_node) + : 0.0;(or
cpus_on_node > 0if negatives should be treated as invalid).If
cpu_core_limit()is a fractional double and you explicitly want an epsilon check, keeping the current version is fine; consider adding a brief comment explaining the choice.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
protos/PublicDefs.proto(2 hunks)src/CraneCtld/CraneCtld.cpp(1 hunks)src/CraneCtld/CtldPublicDefs.cpp(2 hunks)src/Craned/Common/CgroupManager.cpp(1 hunks)src/Craned/Core/Craned.cpp(1 hunks)src/Craned/Core/JobManager.cpp(2 hunks)src/Utilities/PublicHeader/PublicHeader.cpp(1 hunks)src/Utilities/PublicHeader/include/crane/PublicHeader.h(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- src/Utilities/PublicHeader/include/crane/PublicHeader.h
- src/Craned/Common/CgroupManager.cpp
- src/CraneCtld/CraneCtld.cpp
- protos/PublicDefs.proto
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/CraneCtld/CtldPublicDefs.cppsrc/Craned/Core/JobManager.cpp
📚 Learning: 2025-12-08T08:11:40.323Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.323Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.
Applied to files:
src/CraneCtld/CtldPublicDefs.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/CraneCtld/CtldPublicDefs.cppsrc/Craned/Core/Craned.cppsrc/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/CraneCtld/CtldPublicDefs.cpp
🧬 Code graph analysis (2)
src/Craned/Core/Craned.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(160-160)
src/Craned/Core/JobManager.cpp (1)
src/Craned/Common/CgroupManager.cpp (2)
GetResourceEnvMapByResInNode(664-676)GetResourceEnvMapByResInNode(664-665)
🔇 Additional comments (1)
src/CraneCtld/CtldPublicDefs.cpp (1)
371-381: JobToD enrichment with account/qos/partition looks consistentPopulating
job_to_dwithjob->account,job->qos, andjob->partition_idlines up with how these fields are already maintained onTaskInCtldand gives Craned more context without changing control flow. Thejobpointer is set inInitFromJob/RecoverFromDb, so dereferences here follow existing invariants and do not introduce new failure modes.
084842e to
6a558b2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
protos/PublicDefs.proto (1)
340-341: Consider adding documentation fortotal_gpussemantics (from previous review feedback).The previous reviewer requested explicit documentation for
total_gpusto clarify its meaning, derivation, and handling. Since this field is being restored, apply the requested documentation:bool get_user_env = 25; - string submit_hostname = 26; - uint32 total_gpus = 27; + string submit_hostname = 26; + // Total number of GPUs allocated for this task, summed across all nodes. + // Derived from ResourceView::GpuCount() in CtldPublicDefs.cpp and CranedKeeper.cpp. + // Zero when running CPU-only jobs. + uint32 total_gpus = 27;This aligns with the producer/consumer analysis from the prior review and ensures clarity for downstream consumers (TaskManager.cpp:202, TaskManager.cpp:244).
6a558b2 to
6531f28
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
protos/PublicDefs.proto (1)
341-341: Add documentation comment fortotal_gpusfield.A previous review flagged that
total_gpusshould be documented to clarify its semantics (total across all nodes, not per-node), derivation source, and behavior for CPU-only jobs. Please add an inline comment similar to the one suggested previously.bool get_user_env = 25; string submit_hostname = 26; + // Total number of GPUs allocated for this task, summed across all nodes. + // Zero for CPU-only jobs. Derived from ResourceView::GpuCount(). uint32 total_gpus = 27;
🧹 Nitpick comments (2)
src/Craned/Core/JobManager.cpp (2)
55-55: Consider lowering log level or adding context.This log statement will execute for every step spawn. Either use
CRANE_TRACEfor less noise in production, or include thejob_idfor more useful debugging information.- CRANE_INFO("Init env data"); + CRANE_TRACE("Init env data for job #{}", job_id);
59-82: Remove unusedthiscapture and eliminate redundant call.
- Line 60:
thisis captured but never used inside the lambda.- Line 82: Calls
daemon_step_to_d.nodelist()again instead of reusing thenodelistvariable from line 59.- auto nodelist = daemon_step_to_d.nodelist(); - auto node_id_to_str = [nodelist, this]() -> std::string { + const auto& nodelist = daemon_step_to_d.nodelist(); + auto node_id_to_str = [&nodelist]() -> std::string { uint32_t node_idx = 0; // ... rest of lambda }; - auto alloc_node_num = daemon_step_to_d.nodelist().size(); + auto alloc_node_num = nodelist.size();
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
protos/PublicDefs.proto(2 hunks)src/CraneCtld/CtldPublicDefs.cpp(2 hunks)src/Craned/Core/JobManager.cpp(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/CraneCtld/CtldPublicDefs.cpp
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Applied to files:
src/Craned/Core/JobManager.cpp
🧬 Code graph analysis (1)
src/Craned/Core/JobManager.cpp (1)
src/Craned/Common/CgroupManager.cpp (2)
GetResourceEnvMapByResInNode(664-676)GetResourceEnvMapByResInNode(664-665)
🔇 Additional comments (2)
protos/PublicDefs.proto (1)
184-184: LGTM: Restoring submit_hostname fields as part of revert.Re-adding
submit_hostnameto bothTaskToCtld(line 184, field 42) andStepToD(line 340, field 26) correctly restores these fields. Field numbers appear appropriate and avoid conflicts.Also applies to: 340-340
src/Craned/Core/JobManager.cpp (1)
111-126: LGTM!The environment variable additions are well-structured. The
insert_or_assignforCRANE_MEM_PER_NODEcorrectly addresses the previous review concern about potential duplicate keys and ensures consistent formatting with the "M" suffix.
src/Craned/Core/JobManager.cpp
Outdated
| env_map.emplace("CRANE_JOB_PARTITION", job_to_d.partition()); | ||
| env_map.emplace("CRANE_JOB_QOS", job_to_d.qos()); | ||
|
|
||
| env_map.emplace("CRANE_SUBMIT_DIR", daemon_step_to_d.cwd()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个CWD是工作目录,不是提交的目录,可以用--chdir改
|
LGTM |
4e10d27 to
7ed0dbe
Compare
7ed0dbe to
e2eb875
Compare
Signed-off-by: junlinli <[email protected]>
This reverts commit 268d260.
Summary by CodeRabbit
Bug Fixes
New Features
Style
✏️ Tip: You can customize this high-level summary in your review settings.