Skip to content

Conversation

@1daidai1
Copy link
Collaborator

@1daidai1 1daidai1 commented Aug 26, 2025

This reverts commit 268d260.

Summary by CodeRabbit

  • Bug Fixes

    • Startup now requires a non-empty cluster name.
    • Job/step metadata now includes submit host, account, QoS and partition information.
    • Removed emission of the job end-time environment variable.
  • New Features

    • Runtime env exposes cluster name, submit host, submit directory, node ID, CPU/memory/GPU and task counts, and total GPUs.
    • Public API adds a GPU resource identifier and a GPU-count accessor.
  • Style

    • Memory-per-node values now include an "M" suffix.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Aug 26, 2025

📝 Walkthrough

Walkthrough

Adds protobuf fields for submit host/dir and total GPUs; enforces non-empty ClusterName in Ctld and Craned; exposes GPU-count API in public headers; populates additional job/step metadata and many per-node environment variables; tweaks memory-per-node string formatting.

Changes

Cohort / File(s) Change Summary
Protobuf additions
protos/PublicDefs.proto
Added: TaskToCtld.submit_hostname (string, 42), TaskToCtld.submit_dir (string, 43). Updated StepToCtld.container tag from 26→36. Added: StepToD.submit_hostname (string, 26), StepToD.total_gpus (uint32, 27), StepToD.submit_dir (string, 28).
Ctld config & proto population
src/CraneCtld/CraneCtld.cpp, src/CraneCtld/CtldPublicDefs.cpp
Enforce presence/non‑emptiness of ClusterName (exit on missing/empty) in Ctld; GetJobToD now sets account, qos, partition_id; GetStepToD populates submit_hostname, submit_dir, total_gpus, cwd, ntasks_per_node, cpus_per_task.
Craned config parsing
src/Craned/Core/Craned.cpp
Enforce presence/non‑emptiness of ClusterName in ParseConfig (exit on missing/empty).
JobManager env & resource calculations
src/Craned/Core/JobManager.cpp
Build per-node/resource context and add many CRANE_* env vars (e.g., CRANE_SUBMIT_DIR, CRANE_SUBMIT_HOST, CRANE_GPUS, CRANE_NTASKS_PER_NODE, CRANE_CPUS_PER_TASK, CRANE_MEM_PER_NODE, CRANE_MEM_PER_CPU, CRANE_CPUS_ON_NODE, CRANE_NODEID, CRANE_CLUSTER_NAME, etc.); removed emission of CRANE_JOB_END_TIME.
Public GPU API
src/Utilities/PublicHeader/include/crane/PublicHeader.h, src/Utilities/PublicHeader/PublicHeader.cpp
Add inline const char* const kResourceTypeGpu = "gpu"; and uint64_t ResourceView::GpuCount() const implementation.
Cgroup mem string formatting
src/Craned/Common/CgroupManager.cpp
Format CRANE_MEM_PER_NODE as string with trailing "M" via std::format("{}M", ...) instead of plain numeric string.

Sequence Diagram(s)

sequenceDiagram
    rect rgb(240,248,255)
    participant User
    participant Ctld
    participant Craned
    participant JobManager
    participant CgroupMgr
    note right of Ctld: Protobuf fields updated\n(TaskToCtld/StepToD)
    end
    User->>Ctld: submit job (includes submit_host/dir)
    Ctld->>Ctld: validate ClusterName (exit if missing/empty)
    Ctld->>Craned: send JobToD / StepToD (populate submit_hostname, submit_dir, total_gpus, account/qos/partition)
    Craned->>JobManager: allocate nodes, compute per-node resources
    JobManager->>JobManager: populate CRANE_* env vars (GPUS, MEM_PER_NODE, NTASKS_PER_NODE, etc.)
    JobManager->>CgroupMgr: set cgroups (mem limits formatted as "NNM")
    CgroupMgr-->>JobManager: ack
    JobManager-->>Craned: launch step
    Craned-->>Ctld: status updates
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • NamelessOIer
  • Nativu5

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title check ❓ Inconclusive The title 'Add job env' is vague and generic, using non-descriptive terms that don't clearly convey what job environment features or variables are being added or modified. Expand the title to be more specific about which job environment variables or features are being added/modified, e.g., 'Add GPU and resource environment variables to job context' or 'Add job submission environment variables'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch dev/revertenv

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ed0dbe and e237182.

📒 Files selected for processing (8)
  • protos/PublicDefs.proto
  • src/CraneCtld/CraneCtld.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Common/CgroupManager.cpp
  • src/Craned/Core/Craned.cpp
  • src/Craned/Core/JobManager.cpp
  • src/Utilities/PublicHeader/PublicHeader.cpp
  • src/Utilities/PublicHeader/include/crane/PublicHeader.h
🚧 Files skipped from review as they are similar to previous changes (6)
  • src/Utilities/PublicHeader/include/crane/PublicHeader.h
  • src/CraneCtld/CraneCtld.cpp
  • src/Craned/Common/CgroupManager.cpp
  • src/Utilities/PublicHeader/PublicHeader.cpp
  • src/Craned/Core/Craned.cpp
  • protos/PublicDefs.proto
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-12-08T08:11:40.332Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.332Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/CraneCtld/CtldPublicDefs.cpp
🧬 Code graph analysis (1)
src/Craned/Core/JobManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (160-160)
🔇 Additional comments (4)
src/CraneCtld/CtldPublicDefs.cpp (1)

420-422: LGTM!

The additions of account, qos, and partition fields to the JobToD message properly populate job metadata for downstream consumers.

src/Craned/Core/JobManager.cpp (3)

58-65: LGTM!

The node_id_to_str lambda is well-implemented. It correctly captures nodelist, uses std::ranges::find to locate the current hostname, and returns the node index or -1 if not found. The previous issue with nested lambdas and incorrect captures has been properly resolved.


67-76: LGTM!

The resource calculations are correct:

  • Memory conversion from bytes to MB is standard
  • Division-by-zero protection using epsilon check (std::abs(cpus_on_node) > 1e-8) properly guards the mem_per_cpu calculation

The previous critical issue regarding unguarded division has been properly resolved.


96-111: LGTM with dependency on first file fix.

The new environment variables properly expose job metadata and resource allocations to the supervisor:

  • Submit context (submit dir, submit host)
  • Resource breakdowns (CPUs per task, memory per node/CPU, GPUs)
  • Task counts (ntasks per node, total ntasks)
  • Node identification (node ID, cluster name, CPUs on node)

The formatting with std::format("{:.2f}", ...) for floating-point values is appropriate. The use of insert_or_assign for CRANE_MEM_PER_NODE correctly handles the potential duplicate key issue.

Note: CRANE_GPUS (line 103) depends on total_gpus being correctly set in src/CraneCtld/CtldPublicDefs.cpp:465. Please ensure that field represents the job-level total GPU count, not per-node count (see comment on the first file).


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (9)
src/Craned/Core/Craned.cpp (1)

268-277: Clarify error messages and trim whitespace for ClusterName.

Differentiate “missing” vs “empty” for better operator feedback, and strip surrounding whitespace before validation. This avoids rejecting a value like " cluster " and produces clearer logs.

Apply within this hunk:

-      if (config["ClusterName"]) {
-        g_config.CraneClusterName = config["ClusterName"].as<std::string>();
-        if (g_config.CraneClusterName.empty()) {
-          CRANE_ERROR("Cluster name is empty.");
-          std::exit(1);
-        }
-      } else {
-        CRANE_ERROR("Cluster name is empty.");
-        std::exit(1);
-      }
+      if (config["ClusterName"]) {
+        std::string cluster = config["ClusterName"].as<std::string>();
+        g_config.CraneClusterName =
+            std::string(absl::StripAsciiWhitespace(cluster));
+        if (g_config.CraneClusterName.empty()) {
+          CRANE_ERROR("ClusterName is present but empty after trimming.");
+          std::exit(1);
+        }
+      } else {
+        CRANE_ERROR("ClusterName is not configured.");
+        std::exit(1);
+      }
src/Craned/Core/CgroupManager.cpp (1)

755-758: Potential backward-compat change: CRANE_MEM_PER_NODE now includes a unit suffix.

Appending “M” changes the variable from a pure integer to a human-formatted string. If users or scripts parse it numerically, this will break them. Consider keeping the original numeric env and adding a new human-readable one.

Option A (preserve numeric; add human-readable):

-  env_map.emplace(
-      "CRANE_MEM_PER_NODE",
-      std::format("{}M",
-                  res_in_node.allocatable_res_in_node().memory_limit_bytes() /
-                      (1024ULL * 1024ULL)));
+  const auto mem_mib =
+      res_in_node.allocatable_res_in_node().memory_limit_bytes() /
+      (1024ULL * 1024ULL);
+  env_map.emplace("CRANE_MEM_PER_NODE", std::to_string(mem_mib));
+  env_map.emplace("CRANE_MEM_PER_NODE_M", std::format("{}M", mem_mib));

Option B (if unit suffix is required, prefer unambiguous “MiB”):

-  env_map.emplace(
-      "CRANE_MEM_PER_NODE",
-      std::format("{}M",
+  env_map.emplace(
+      "CRANE_MEM_PER_NODE",
+      std::format("{}MiB",
                   res_in_node.allocatable_res_in_node().memory_limit_bytes() /
                       (1024ULL * 1024ULL)));
src/CraneCtld/CraneCtld.cpp (1)

126-136: Deduplicate ClusterName assignment and improve validation messages.

Cluster name is read at Lines 97–99 and again here. Keep a single, validated assignment (here, after logger init) and distinguish “not configured” vs “empty”. Also trim whitespace.

Apply within this hunk:

-      if (config["ClusterName"]) {
-        g_config.CraneClusterName = config["ClusterName"].as<std::string>();
-        if (g_config.CraneClusterName.empty()) {
-          CRANE_ERROR("ClusterName is empty.");
-          std::exit(1);
-        }
-      } else {
-        CRANE_ERROR("ClusterName is empty.");
-        std::exit(1);
-      }
+      if (config["ClusterName"]) {
+        std::string cluster = config["ClusterName"].as<std::string>();
+        g_config.CraneClusterName =
+            std::string(absl::StripAsciiWhitespace(cluster));
+        if (g_config.CraneClusterName.empty()) {
+          CRANE_ERROR("ClusterName is present but empty after trimming.");
+          std::exit(1);
+        }
+      } else {
+        CRANE_ERROR("ClusterName is not configured.");
+        std::exit(1);
+      }

Additionally remove the earlier pre-validation assignment to avoid double writes (outside this hunk):

// Remove Lines 97–99:
// if (config["ClusterName"])
//   g_config.CraneClusterName = config["ClusterName"].as<std::string>();
src/CraneCtld/CtldPublicDefs.cpp (1)

358-360: Clamp total_gpus to avoid narrowing and consider DRYing out TaskToD construction

We confirmed that ResourceView::GpuCount() returns a uint64_t, while the TaskToD.total_gpus protobuf field is a uint32. Although overflowing 2³² GPUs is unlikely in practice, it’s safer to clamp rather than rely on an implicit narrowing conversion.

• At src/CraneCtld/CtldPublicDefs.cpp:359, wrap the assignment in a saturating cast:

-  task_to_d.set_total_gpus(this->allocated_res_view.GpuCount());
+  {
+    const auto g = this->allocated_res_view.GpuCount();
+    task_to_d.set_total_gpus(static_cast<uint32_t>(
+        std::min<uint64_t>(g, std::numeric_limits<uint32_t>::max())));
+  }

• The same pattern appears in src/CraneCtld/RpcService/CranedKeeper.cpp:338; applying the clamp there maintains consistency.

Optional refactor: extract a helper (e.g. ClampToUint32(uint64_t)) or centralize the TaskToD builder logic so both CtldPublicDefs and CranedKeeper invoke the same code path, reducing duplication and drift over time.

protos/PublicDefs.proto (1)

171-172: Add brief field docs and confirm presence semantics for submit_hostname.

Adding submit_hostname is fine and backward-compatible. Please add a one-line comment describing whether this is the submission node’s short host or FQDN and whether empty string is a valid/expected value. If you need to distinguish “unset” from “empty”, consider optional string in proto3.

src/Utilities/PublicHeader/PublicHeader.cpp (1)

709-721: Avoid iterating two different temporary views; simplify accumulation; include .

Using std::accumulate(std::views::values(type_map).begin(), std::views::values(type_map).end(), ...) constructs two different temporary views, which is brittle. Also, ensure <numeric> is included for std::accumulate if you keep it. A simple loop is clearer and avoids view lifetime gotchas.

Apply this diff to simplify the implementation:

-uint64_t ResourceView::GpuCount() const {
-  auto it = device_map.find(kResourceTypeGpu);
-  if (it == device_map.end()) return 0;
-
-  const auto& [untyped_count, type_map] = it->second;
-
-  uint64_t type_sum = std::accumulate(std::views::values(type_map).begin(),
-                                      std::views::values(type_map).end(),
-                                      static_cast<uint64_t>(0));
-
-  return untyped_count + type_sum;
-}
+uint64_t ResourceView::GpuCount() const {
+  auto it = device_map.find(kResourceTypeGpu);
+  if (it == device_map.end()) return 0;
+
+  const auto& [untyped_count, type_map] = it->second;
+  uint64_t type_sum = 0;
+  for (const auto& [/*type*/, cnt] : type_map) type_sum += cnt;
+  return untyped_count + type_sum;
+}

Additionally, if <numeric> is not already included transitively, add it near the top of this file:

#include <numeric>
src/Craned/Supervisor/TaskManager.cpp (3)

216-223: Duplicate env keys won’t override due to emplace semantics; use insert_or_assign.

CRANE_JOB_NODELIST and CRANE_EXCLUDES are already set in StepInstance::GetStepProcessEnv(). Using emplace here is a no-op if the keys exist. If the intent is “latest wins,” use insert_or_assign.

Apply this diff:

-  env_map.emplace(
+  env_map.insert_or_assign(
       "CRANE_JOB_NODELIST",
       absl::StrJoin(this->m_parent_step_inst_->GetStep().allocated_nodes(),
                     ";"));
-  env_map.emplace(
+  env_map.insert_or_assign(
       "CRANE_EXCLUDES",
       absl::StrJoin(this->m_parent_step_inst_->GetStep().excludes(), ";"));

Optionally, do the same for the rest of the newly added CRANE_* keys in this block for consistent semantics.


237-237: Consider adding unit suffix to CRANE_MEM_PER_NODE for clarity.

Elsewhere, memory values are often suffixed. If you want to expose MiB explicitly, append “M”.

Apply this diff:

-  env_map.emplace("CRANE_MEM_PER_NODE", std::to_string(mem_in_node));
+  env_map.emplace("CRANE_MEM_PER_NODE", fmt::format("{}M", mem_in_node));

234-253: Consistent formatting approach needed
We currently mix std::format and fmt::format across multiple files (e.g. AccountManager.cpp, CgroupManager.cpp), and TaskManager.cpp indirectly includes <format> via crane/PublicHeader.h. We should settle on one strategy:

• If we standardize on std::format (C++20), confirm our minimum C++ standard/toolchain support it and leave these calls unchanged.
• If we prefer fmt::format (bundled via spdlog), perform a repo-wide migration: replace all std::format calls with fmt::format, add the necessary #include <spdlog/fmt/bundled/format.h> where needed, and ensure build flags pick up the bundled fmt headers.

For this PR, please choose one of the following:

Option A: Keep std::format (no change)
Option B: Switch TaskManager.cpp to fmt::format

--- a/src/Craned/Supervisor/TaskManager.cpp
+++ b/src/Craned/Supervisor/TaskManager.cpp
@@
-  env_map.emplace(
-      "CRANE_CPUS_PER_TASK",
-      std::format("{:.2f}",
-                  this->m_parent_step_inst_->GetStep().cpus_per_task()));
+  env_map.emplace(
+      "CRANE_CPUS_PER_TASK",
+      fmt::format("{:.2f}", this->m_parent_step_inst_->GetStep().cpus_per_task()));
@@
-  env_map.emplace("CRANE_MEM_PER_CPU", std::format("{:.2f}", mem_per_cpu));
+  env_map.emplace("CRANE_MEM_PER_CPU", fmt::format("{:.2f}", mem_per_cpu));
@@
-  env_map.emplace("CRANE_CPUS_ON_NODE", std::format("{:.2f}", cpus_on_node));
+  env_map.emplace("CRANE_CPUS_ON_NODE", fmt::format("{:.2f}", cpus_on_node));

And at the top of TaskManager.cpp (if choosing Option B):

#include <spdlog/fmt/bundled/format.h>
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 92b3f18 and 26a5b01.

📒 Files selected for processing (13)
  • protos/PublicDefs.proto (2 hunks)
  • protos/Supervisor.proto (1 hunks)
  • src/CraneCtld/CraneCtld.cpp (1 hunks)
  • src/CraneCtld/CtldPublicDefs.cpp (1 hunks)
  • src/CraneCtld/RpcService/CranedKeeper.cpp (1 hunks)
  • src/Craned/Core/CgroupManager.cpp (1 hunks)
  • src/Craned/Core/Craned.cpp (1 hunks)
  • src/Craned/Core/JobManager.cpp (1 hunks)
  • src/Craned/Supervisor/Supervisor.cpp (1 hunks)
  • src/Craned/Supervisor/SupervisorPublicDefs.h (1 hunks)
  • src/Craned/Supervisor/TaskManager.cpp (2 hunks)
  • src/Utilities/PublicHeader/PublicHeader.cpp (1 hunks)
  • src/Utilities/PublicHeader/include/crane/PublicHeader.h (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
  • src/Craned/Supervisor/Supervisor.cpp
  • src/Craned/Core/Craned.cpp
  • src/Craned/Supervisor/SupervisorPublicDefs.h
  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:08:03.273Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#399
File: src/Craned/Supervisor/Supervisor.cpp:68-69
Timestamp: 2025-05-25T04:08:03.273Z
Learning: In the Crane supervisor component (src/Craned/Supervisor/), the supervisor process communicates with Craned through STDOUT using protobuf messages. The supervisor must not send any information to STDOUT before sending the "ready" message to Craned, as this would interfere with the inter-process communication protocol. Therefore, adding logging statements that might write to STDOUT before the ready message is sent could break the communication.

Applied to files:

  • src/Craned/Supervisor/Supervisor.cpp
🧬 Code graph analysis (6)
src/CraneCtld/CraneCtld.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (148-148)
src/Craned/Supervisor/Supervisor.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (148-148)
src/Craned/Core/Craned.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (148-148)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)
src/Utilities/PublicHeader/PublicHeader.cpp (2)
  • GpuCount (709-720)
  • GpuCount (709-709)
src/Craned/Core/JobManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (148-148)
src/Craned/Supervisor/TaskManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (148-148)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build
  • GitHub Check: format
🔇 Additional comments (7)
src/Craned/Supervisor/SupervisorPublicDefs.h (1)

86-87: Add Config.CraneClusterName: looks good; please confirm init-before-use.

Field addition is consistent with existing naming. Since tasks read CRANE_CLUSTER_NAME from g_config, please verify Supervisor::InitFromStdin sets CraneClusterName before any TaskManager environment population runs.

src/Utilities/PublicHeader/include/crane/PublicHeader.h (2)

91-92: Introduce kResourceTypeGpu constant: OK.

Constant name and value align with usage in device_map. No concerns.


488-489: Add ResourceView::GpuCount(): OK; ensure headers in .cpp cover /.

API surface looks consistent with CpuCount/MemoryBytes. Confirm PublicHeader.cpp includes the necessary headers for std::accumulate and std::views::values.

src/Craned/Core/JobManager.cpp (1)

477-477: Propagate cluster name to Supervisor – validation already in place

The call to

init_req.set_crane_cluster_name(g_config.CraneClusterName);

in JobManager.cpp correctly propagates the cluster name. Startup checks for a non-empty CraneClusterName already exist in both components:

  • Craned (src/Craned/Core/Craned.cpp:269–273) logs an error and exits if g_config.CraneClusterName is empty.
  • CraneCtld (src/CraneCtld/CraneCtld.cpp:127–131) does the same validation in its startup sequence.

No additional checks are needed here.

src/Craned/Supervisor/Supervisor.cpp (1)

81-81: LGTM: Supervisor reads crane_cluster_name during init (pre-logger, pre-threads)

The assignment is early in InitFromStdin and does not write to STDOUT before SupervisorReady, honoring the supervisor IPC contract. No issues.

protos/Supervisor.proto (1)

61-76: Confirmed C++ protobuf stubs include the new field

  • In src/Craned/Supervisor/Supervisor.pb.h, you’ll find the generated constant for field 12 (kCraneClusterNameFieldNumber = 12).
  • In src/Craned/Supervisor/Supervisor.pb.cc, the accessor and serialization logic for crane_cluster_name() is present (e.g. line 81) as expected.

No further C++ stub files remain to update. If your service publishes additional language bindings (Go, Python, Java, etc.), please regenerate and commit those stubs as well to prevent runtime/schema mismatches.

src/CraneCtld/RpcService/CranedKeeper.cpp (1)

337-339: I've provided a script to locate and inspect the protobuf file defining TaskToD and its total_gpus field so we can confirm its type. Let me know the results, and then I can finalize the guidance on saturating the cast and/or refactoring.

@1daidai1 1daidai1 closed this Aug 26, 2025
@1daidai1 1daidai1 reopened this Sep 5, 2025
@Nativu5
Copy link
Collaborator

Nativu5 commented Nov 12, 2025

这个能不能再推进一下

@1daidai1 1daidai1 force-pushed the dev/revertenv branch 2 times, most recently from 5c57821 to 80d9630 Compare November 14, 2025 03:13
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 26a5b01 and ac7226b.

📒 Files selected for processing (6)
  • protos/PublicDefs.proto (2 hunks)
  • src/CraneCtld/CraneCtld.cpp (1 hunks)
  • src/CraneCtld/CtldPublicDefs.cpp (2 hunks)
  • src/Craned/Common/CgroupManager.cpp (1 hunks)
  • src/Craned/Core/Craned.cpp (1 hunks)
  • src/Craned/Core/JobManager.cpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/Craned/Common/CgroupManager.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • protos/PublicDefs.proto
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
  • src/Craned/Core/Craned.cpp
  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/Craned/Core/JobManager.cpp
🧬 Code graph analysis (3)
src/CraneCtld/CraneCtld.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (156-156)
src/Craned/Core/Craned.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (156-156)
src/Craned/Core/JobManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (156-156)
🔇 Additional comments (4)
src/CraneCtld/CraneCtld.cpp (1)

127-136: LGTM! Proper cluster name validation added.

The validation logic correctly ensures ClusterName is present and non-empty before proceeding with configuration. This aligns with the corresponding validation in src/Craned/Core/Craned.cpp.

src/Craned/Core/Craned.cpp (1)

376-385: LGTM! Consistent cluster name validation.

The validation logic mirrors the implementation in src/CraneCtld/CraneCtld.cpp, ensuring ClusterName is properly validated across both components.

src/Craned/Core/JobManager.cpp (2)

58-78: LGTM! Node ID calculation is properly implemented.

The lambda correctly computes the node index by matching the hostname against the nodelist, with appropriate error handling that returns "-1" for invalid cases.


112-129: Environment variables correctly populated.

The new environment variables provide comprehensive job metadata including submission context (CRANE_SUBMIT_HOST, CRANE_SUBMIT_DIR, CRANE_EXCLUDES), resource distribution (CRANE_GPUS_PER_NODE, CRANE_NTASKS_PER_NODE, CRANE_MEM_PER_NODE), and node-specific identifiers (CRANE_NODEID, CRANE_CLUSTER_NAME). The formatting and value sources are appropriate.

Note: CRANE_MEM_PER_CPU depends on the division operation flagged in the previous comment.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/Craned/Core/JobManager.cpp (1)

58-78: Consider logging when hostname lookup fails.

The lambda correctly handles the case where the current hostname is not found in the nodelist by returning "-1". However, this could indicate a configuration issue or unexpected state. Consider adding a log warning when this occurs to aid debugging.

Example:

 if (node_idx >= daemon_step_to_d.nodelist().size()) {
+  CRANE_WARN("Hostname '{}' not found in job nodelist", host_name_view);
   return std::to_string(-1);  // invalid
 }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ac7226b and 376fa99.

📒 Files selected for processing (2)
  • src/CraneCtld/CtldPublicDefs.cpp (2 hunks)
  • src/Craned/Core/JobManager.cpp (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/Craned/Core/JobManager.cpp
🧬 Code graph analysis (1)
src/Craned/Core/JobManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (156-156)
🔇 Additional comments (3)
src/CraneCtld/CtldPublicDefs.cpp (1)

335-337: LGTM! Metadata propagation is consistent.

The changes correctly populate account, qos, partition, submit_hostname, total_gpus, cwd, ntasks_per_node, cpus_per_task, and exclude_nodelist fields from the job object into JobToD and StepToD messages. The initialization and recovery paths are also updated consistently.

Also applies to: 376-378, 420-427, 536-538

src/Craned/Core/JobManager.cpp (2)

80-93: Resource calculations look correct.

The division by zero protection on lines 91-93 properly addresses the past review comment by checking std::abs(cpus_on_node) > 1e-8 before dividing. The GPU per-node calculation on line 83 uses integer division, which will truncate if GPUs aren't evenly distributed across nodes (e.g., 5 GPUs / 2 nodes = 2 GPUs per node). This is likely acceptable behavior for a scheduling system.


114-131: LGTM! Environment variable expansion is comprehensive.

The new environment variables provide useful job metadata including excludes, submit directory, per-task/per-node resource allocations, GPU counts, cluster name, node ID, and submit host. All values are computed from validated sources and properly formatted.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 376fa99 and 9b8c287.

📒 Files selected for processing (1)
  • src/CraneCtld/TaskScheduler.cpp (1 hunks)
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: 2025-05-25T04:11:27.740Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:396-396
Timestamp: 2025-05-25T04:11:27.740Z
Learning: In TaskManager.cpp, GetCrunMeta() calls don't need null checks because they're only called in contexts where the task is guaranteed to be a CRUN task (e.g., SetupChildProcessCrunX11_ is only called when step->IsCrun() && x11() conditions are met), ensuring the metadata will always be valid.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp

Comment on lines 1375 to 1378
if (!task->password_entry) {
CRANE_DEBUG("password_entry empty");
return std::unexpected(CraneErrCode::ERR_INVALID_UID);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Upgrade the log level and improve the log message.

A null password_entry at this point indicates either a programming error or a serious validation failure upstream. CRANE_DEBUG is too low for this condition—it should be at least CRANE_WARN or CRANE_ERROR to ensure visibility in production logs.

Additionally, the log message should include task context (e.g., task ID, user ID) to aid debugging.

Apply this diff to improve logging:

   if (!task->password_entry) {
-    CRANE_DEBUG("password_entry empty");
+    CRANE_ERROR("Task submission failed: password_entry is null for uid {}. Task ID: {}", 
+                task->uid, task->TaskId());
     return std::unexpected(CraneErrCode::ERR_INVALID_UID);
   }

Additionally, consider verifying that all upstream code paths properly initialize password_entry:

#!/bin/bash
# Description: Find where TaskInCtld objects are created and verify password_entry initialization

# Search for TaskInCtld constructor calls and password_entry assignments
ast-grep --pattern $'new TaskInCtld($$$)'

# Search for password_entry field assignments
rg -n "password_entry\s*=" --type=cpp -C 3
🤖 Prompt for AI Agents
In src/CraneCtld/TaskScheduler.cpp around lines 1375 to 1378, the code logs a
missing password_entry at CRANE_DEBUG without context; change the log to
CRANE_ERROR (or at minimum CRANE_WARN) and include task context such as task->id
and task->user_id (or equivalent fields) in the message to make the failure
visible and actionable, return the same error after logging; additionally, run
an audit of TaskInCtld creation sites and any assignments to password_entry
(using the suggested rg/ast-grep commands) to ensure upstream code always
initializes password_entry and add defensive checks where needed.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b8c287 and dc3368c.

📒 Files selected for processing (1)
  • CMakeLists.txt (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Build, Upload and Test
CMakeLists.txt

[error] 190-190: CRANE_ADDRESS_SANITIZER and CRANE_THREAD_SANITIZER cannot be enabled at the same time.

CMakeLists.txt Outdated
Comment on lines 123 to 125
option(CRANE_ADDRESS_SANITIZER "Enable address sanitizer" ON)

option(CRANE_THREAD_SANITIZER "Enable thread sanitizer" OFF)
option(CRANE_THREAD_SANITIZER "Enable thread sanitizer" ON)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: Conflicting sanitizer defaults break the build.

ThreadSanitizer and AddressSanitizer require separate QA runs and cannot be enabled together. Setting both options to ON directly violates the constraint at line 189–191, causing an immediate fatal CMake error on any configuration.

Expected behavior: Only one sanitizer should be enabled at a time by default. The pipeline failure confirms this is a blocker.

Clarify the PR intent: Does this revert the sanitizer enablement (should both be OFF), or does it require a different approach to select one sanitizer?

If the intent is to enable sanitizers by default, either:

  • Set one to ON and the other to OFF, or
  • Add logic to auto-select based on build type or compiler
- option(CRANE_ADDRESS_SANITIZER "Enable address sanitizer" ON)
+ option(CRANE_ADDRESS_SANITIZER "Enable address sanitizer" OFF)
  
- option(CRANE_THREAD_SANITIZER "Enable thread sanitizer" ON)
+ option(CRANE_THREAD_SANITIZER "Enable thread sanitizer" OFF)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
option(CRANE_ADDRESS_SANITIZER "Enable address sanitizer" ON)
option(CRANE_THREAD_SANITIZER "Enable thread sanitizer" OFF)
option(CRANE_THREAD_SANITIZER "Enable thread sanitizer" ON)
option(CRANE_ADDRESS_SANITIZER "Enable address sanitizer" OFF)
option(CRANE_THREAD_SANITIZER "Enable thread sanitizer" OFF)

@1daidai1 1daidai1 force-pushed the dev/revertenv branch 3 times, most recently from a1da580 to 7bc4136 Compare November 20, 2025 07:36
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/CraneCtld/TaskScheduler.cpp (1)

1375-1378: Improve logging for null password_entry guard

The null-check itself is correct and avoids a crash, but a missing password_entry at this point indicates a serious validation/programming issue. Logging only CRANE_DEBUG("password_entry empty") without any task context makes this very hard to diagnose in production.

Consider upgrading the log level and including at least task->uid and task->TaskId():

-  if (!task->password_entry) {
-    CRANE_DEBUG("password_entry empty");
-    return std::unexpected(CraneErrCode::ERR_INVALID_UID);
-  }
+  if (!task->password_entry) {
+    CRANE_ERROR(
+        "Task submission failed: password_entry is null for uid {} (task #{})",
+        task->uid, task->TaskId());
+    return std::unexpected(CraneErrCode::ERR_INVALID_UID);
+  }
🧹 Nitpick comments (1)
src/CraneCtld/CtldPublicDefs.cpp (1)

420-428: Inconsistency confirmed: use member variables for alignment with CommonStepInCtld pattern.

The verification confirms that both ntasks_per_node and cpus_per_task are member variables of the DaemonStepInCtld class (declared in CtldPublicDefs.h lines 466–467 and initialized at lines 298–299 from job data).

Lines 423–424 currently access these fields through the job pointer, while the similar CommonStepInCtld::GetStepToD() method at lines 661–662 uses member variables directly. This inconsistency should be resolved by using the member variables, which are already properly initialized and maintained.

All referenced APIs are confirmed to exist:

  • ResourceView::GpuCount() is declared and implemented
  • submit_hostname field exists in protobuf messages (PublicDefs.proto)
  • Both classes have the necessary member variables

Apply this diff for consistency:

  step_to_d.set_submit_hostname(job->TaskToCtld().submit_hostname());
  step_to_d.set_total_gpus(this->requested_node_res_view.GpuCount());
  step_to_d.set_cwd(this->job->cwd);
- step_to_d.set_ntasks_per_node(this->job->ntasks_per_node);
- step_to_d.set_cpus_per_task(this->job->TaskToCtld().cpus_per_task());
+ step_to_d.set_ntasks_per_node(this->ntasks_per_node);
+ step_to_d.set_cpus_per_task(static_cast<double>(this->cpus_per_task));
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a1da580 and 7bc4136.

📒 Files selected for processing (3)
  • src/CraneCtld/CtldPublicDefs.cpp (2 hunks)
  • src/CraneCtld/TaskScheduler.cpp (1 hunks)
  • src/Craned/Core/JobManager.cpp (2 hunks)
🧰 Additional context used
🧠 Learnings (8)
📚 Learning: 2025-05-25T04:11:27.740Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:396-396
Timestamp: 2025-05-25T04:11:27.740Z
Learning: In TaskManager.cpp, GetCrunMeta() calls don't need null checks because they're only called in contexts where the task is guaranteed to be a CRUN task (e.g., SetupChildProcessCrunX11_ is only called when step->IsCrun() && x11() conditions are met), ensuring the metadata will always be valid.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/Craned/Core/JobManager.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (2)
src/CraneCtld/CtldPublicDefs.cpp (1)

376-379: LGTM! Job metadata correctly propagated to JobToD.

The account, qos, and partition fields are properly initialized in both InitFromJob and RecoverFromDb, ensuring these values are correctly propagated to the daemon.

src/Craned/Core/JobManager.cpp (1)

54-133: Per-node job env additions and submit-host env look correct

The new per-node resource calculations (including the guarded mem_per_cpu division) and CRANE_SUBMIT_HOST derived from submit_hostname() are consistent with the existing resource/env model and avoid division-by-zero or out-of-range issues. No functional problems spotted here.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)

501-501: Consider removing commented-out code.

The commented-out GpuCount() method declaration should either be removed entirely if it's no longer needed, or restored with proper implementation if GPU count tracking is required. Commented-out code without explanation is generally considered dead code and makes maintenance harder.

If this method is definitively unused, apply this diff:

   uint64_t MemoryBytes() const;
-  // uint64_t GpuCount() const;

   AllocatableResource& GetAllocatableRes() { return allocatable_res; }
src/Craned/Core/JobManager.cpp (2)

58-78: Consider commenting out the unused lambda.

The node_id_to_str lambda is defined but never invoked—its only usage at Line 131 is commented out. Consider commenting out or removing this lambda definition to avoid confusion and reduce code clutter until the feature is ready to be enabled.


80-93: Consider commenting out unused calculations.

The calculations for gpus_per_node, mem_in_node, cpus_on_node, and mem_per_cpu are correct and include proper guards against division by zero (addressing the past review concern). However, these values are largely unused since the corresponding env_map.emplace calls at Lines 114-132 are commented out. Consider commenting out these calculations as well to avoid wasting CPU cycles until the environment variables are ready to be enabled.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7bc4136 and 161474b.

📒 Files selected for processing (4)
  • src/CraneCtld/CtldPublicDefs.cpp (2 hunks)
  • src/Craned/Core/JobManager.cpp (2 hunks)
  • src/Utilities/PublicHeader/PublicHeader.cpp (1 hunks)
  • src/Utilities/PublicHeader/include/crane/PublicHeader.h (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/Utilities/PublicHeader/PublicHeader.cpp
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/Craned/Core/JobManager.cpp
🔇 Additional comments (3)
src/CraneCtld/CtldPublicDefs.cpp (2)

376-379: LGTM!

The population of JobToD fields (account, qos, partition) from the job object is straightforward and correct.


420-428: Verify the commented-out total_gpus field.

The StepToD population looks correct. However, Line 421 has set_total_gpus commented out, which suggests this feature may not be fully implemented yet. If this field is intended to be used, please verify whether it should be enabled or if there are blockers preventing its activation.

src/Craned/Core/JobManager.cpp (1)

114-132: Verify C++20 support before enabling commented code.

The commented-out environment variable assignments use std::format (e.g., Line 118, 125, 130), which requires C++20. Before uncommenting this code, ensure the project is compiled with C++20 or later. If C++20 is not available, consider using fmt::format from the {fmt} library or std::to_string with manual precision control.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)

95-96: Public constant added for incomplete GPU feature—already flagged in previous review.

As identified in the previous review, kResourceTypeGpu is only referenced in commented-out code (line 710 of PublicHeader.cpp). Adding this to the public API surface when the GPU counting feature is incomplete creates confusion and violates API stability principles.

🧹 Nitpick comments (2)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)

501-501: Remove commented-out code rather than keeping it in source.

Commented-out declarations clutter the codebase. If GpuCount() is not ready for use, remove the declaration entirely rather than commenting it out. Version control preserves the history if you need to restore it later.

src/Utilities/PublicHeader/PublicHeader.cpp (1)

709-720: Remove commented-out implementation rather than keeping it in source.

The entire GpuCount() implementation is commented out. If this feature is incomplete or deferred, remove the code entirely rather than leaving it commented. Version control maintains the history if you need to restore it in the future. Keeping commented blocks creates maintenance overhead and can confuse readers about which code is actually active.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 161474b and 92fc772.

📒 Files selected for processing (4)
  • src/CraneCtld/CtldPublicDefs.cpp (2 hunks)
  • src/Craned/Core/JobManager.cpp (2 hunks)
  • src/Utilities/PublicHeader/PublicHeader.cpp (1 hunks)
  • src/Utilities/PublicHeader/include/crane/PublicHeader.h (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/Craned/Core/JobManager.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/CraneCtld/CtldGrpcServer.cpp:1662-1669
Timestamp: 2025-06-30T08:38:18.748Z
Learning: In CraneSched's CPU frequency feature, validation is split between controller and compute nodes: CtldGrpcServer only validates governor against configuration (g_config.CpuFreqGovernors), while detailed validation of low/high frequency values is deferred to TaskManager.cpp via CpuFreqValidateAndSet() where actual hardware capabilities can be checked.

@1daidai1 1daidai1 force-pushed the dev/revertenv branch 2 times, most recently from f0ea766 to cfd34b2 Compare November 21, 2025 04:26
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/Craned/Core/Craned.cpp (1)

376-385: Remove commented-out code or add a TODO comment explaining the deprecation.

Verification confirms that g_config.CraneClusterName is actively used in CTLD (event serialization in CranedMetaContainer.cpp and CtldGrpcServer.cpp) and safely initialized there. In CRANED, both the initialization and usage are consistently commented out (JobManager.cpp:129), so no runtime errors occur. However, leaving large blocks of commented code without explanation creates maintenance burden. Consider either:

  1. Remove the commented block entirely if this is permanent deprecation in CRANED, or
  2. Add a TODO comment explaining why this code is intentionally disabled (e.g., "TODO: Remove after v1.0 release" or "Feature flag for future cluster-name support")
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0ea766 and cfd34b2.

📒 Files selected for processing (4)
  • src/CraneCtld/CraneCtld.cpp (3 hunks)
  • src/CraneCtld/CtldPublicDefs.cpp (2 hunks)
  • src/Craned/Core/Craned.cpp (1 hunks)
  • src/Craned/Core/JobManager.cpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/Craned/Core/JobManager.cpp
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
  • src/Craned/Core/Craned.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
  • src/Craned/Core/Craned.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.cpp
🧬 Code graph analysis (1)
src/CraneCtld/CraneCtld.cpp (2)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (156-156)
src/Utilities/PublicHeader/String.cpp (2)
  • CalcConfigCRC32 (431-433)
  • CalcConfigCRC32 (431-431)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: prepare
🔇 Additional comments (3)
src/Craned/Core/Craned.cpp (1)

253-253: LGTM: Configuration CRC tracking added.

The CRC32 calculation of the configuration is appropriately placed immediately after loading the YAML config, enabling configuration version tracking or change detection.

src/CraneCtld/CtldPublicDefs.cpp (1)

376-379: Uncomment and populate account, qos, and partition fields in JobToD message.

These fields are actively used by Craned nodes to set environment variables (CRANE_JOB_ACCOUNT, CRANE_JOB_PARTITION, CRANE_JOB_QOS) for job execution. Keeping them commented out will cause these environment variables to be unset, breaking job accounting, QoS enforcement, and partition-specific configurations on execution nodes.

Restore the assignments at lines 376-379:

  job_to_d.set_account(job->account);
  job_to_d.set_qos(job->qos);
  job_to_d.set_partition(job->partition_id);
⛔ Skipped due to learnings
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
src/CraneCtld/CraneCtld.cpp (1)

101-102: CRC calculation location is correct.

The CRC is properly calculated immediately after loading the config file at line 101. Subsequent code only reads from the config object (lines 103+: config["ClusterName"], config["CraneBaseDir"], etc.) with no modifications. This design is consistent with the parallel implementation in Craned.cpp:254, where the same pattern applies. Calculating the CRC early ensures it captures the raw file state for integrity verification between services.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/CraneCtld/CraneCtld.cpp (1)

133-143: Commented‑out ClusterName validation weakens config safety

This block previously enforced a non‑empty ClusterName and failed fast otherwise; now CraneCtld will start even if ClusterName is missing/empty, leaving g_config.CraneClusterName at its default or an empty string. That can break identification in multi‑cluster or event/metrics consumers that expect a real cluster name.

If ClusterName is still conceptually required, consider restoring this validation (and potentially removing any now‑redundant simple assignment) so misconfigured clusters don’t start silently:

-      // if (config["ClusterName"]) {
-      //   g_config.CraneClusterName =
-      //   config["ClusterName"].as<std::string>(); if
-      //   (g_config.CraneClusterName.empty()) {
-      //     CRANE_ERROR("ClusterName is empty.");
-      //     std::exit(1);
-      //   }
-      // } else {
-      //   CRANE_ERROR("ClusterName is empty.");
-      //   std::exit(1);
-      // }
+      if (config["ClusterName"]) {
+        g_config.CraneClusterName =
+            config["ClusterName"].as<std::string>();
+        if (g_config.CraneClusterName.empty()) {
+          CRANE_ERROR("ClusterName is empty.");
+          std::exit(1);
+        }
+      } else {
+        CRANE_ERROR("ClusterName is empty.");
+        std::exit(1);
+      }

If the intent is to make ClusterName optional, it would be good to document that and double‑check all downstream uses to ensure empty is handled correctly.

🧹 Nitpick comments (1)
src/CraneCtld/CraneCtld.cpp (1)

106-110: Remove duplicate CraneBaseDir assignment

g_config.CraneBaseDir is assigned twice in a row with the same expression; the second assignment is redundant noise.

You can safely keep a single assignment:

-      g_config.CraneBaseDir =
-          YamlValueOr(config["CraneBaseDir"], kDefaultCraneBaseDir);
-
-      g_config.CraneBaseDir =
-          YamlValueOr(config["CraneBaseDir"], kDefaultCraneBaseDir);
+      g_config.CraneBaseDir =
+          YamlValueOr(config["CraneBaseDir"], kDefaultCraneBaseDir);
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cfd34b2 and 15e69f8.

📒 Files selected for processing (2)
  • src/CraneCtld/CraneCtld.cpp (2 hunks)
  • src/Utilities/PublicHeader/include/crane/PublicHeader.h (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
🧬 Code graph analysis (1)
src/CraneCtld/CraneCtld.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (156-156)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (2)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (2)

95-95: Commenting out kResourceTypeGpu correctly retracts unfinished public API

This line is now purely a comment, so the GPU resource type constant is no longer part of the public header/API, which aligns with reverting the unfinished GPU feature without impacting existing consumers.


501-501: Commented-out GpuCount() keeps ResourceView interface stable

By commenting out GpuCount() while its implementation is inactive, the ResourceView public interface stays consistent and avoids exposing a non-functional method during this revert.

@1daidai1 1daidai1 force-pushed the dev/revertenv branch 2 times, most recently from 4886991 to 8be9527 Compare November 21, 2025 06:20
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)

95-95: Commented-out GPU API stubs are safe but could be cleaned up

Commenting out kResourceTypeGpu and ResourceView::GpuCount() avoids exposing a half-finished GPU API in the public header, which matches the revert intent and addresses earlier concerns. If this functionality is not coming back soon, consider deleting these commented lines (or moving any future GPU helpers to an internal header) to keep the public surface minimal and avoid confusing dead code.

Also applies to: 470-517

src/Utilities/PublicHeader/PublicHeader.cpp (1)

709-720: GpuCount implementation correctly disabled; consider full removal

ResourceView::GpuCount() is now entirely commented out, consistent with the header and with reverting unfinished GPU-counting behavior. From a runtime standpoint this is safe. To reduce noise, consider removing this block entirely (or reintroducing it only when the GPU feature is finalized) rather than keeping a large commented implementation.

src/Craned/Core/JobManager.cpp (1)

58-93: Large commented env-var block: harmless but noisy

The added code in GetJobEnvMap is fully commented out (helper lambda, GPU/CPU/memory calculations, and extra CRANE_* env entries), so current behavior is unchanged apart from reverting the extra envs. That’s safe for this revert.

If the plan is to disable these env vars long-term, consider either:

  • Removing the commented-out logic entirely, or
  • Adding a brief TODO comment explaining that this block is a parked implementation tied to a future GPU/env feature or to the reverted commit ID.

This will keep the function easier to read for future maintainers.

Also applies to: 114-132

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 15e69f8 and 4886991.

📒 Files selected for processing (7)
  • protos/PublicDefs.proto (2 hunks)
  • src/CraneCtld/CraneCtld.cpp (2 hunks)
  • src/CraneCtld/CtldPublicDefs.cpp (2 hunks)
  • src/Craned/Core/Craned.cpp (1 hunks)
  • src/Craned/Core/JobManager.cpp (2 hunks)
  • src/Utilities/PublicHeader/PublicHeader.cpp (1 hunks)
  • src/Utilities/PublicHeader/include/crane/PublicHeader.h (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • protos/PublicDefs.proto
  • src/Craned/Core/Craned.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/CraneCtld/CraneCtld.cpp
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/Craned/Core/JobManager.cpp

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/Craned/Core/JobManager.cpp (1)

122-122: Duplicate issue: total_gpus() field access flagged above.

This line also accesses daemon_step_to_d.total_gpus(), which was flagged in the earlier review comment for lines 80-84.

🧹 Nitpick comments (1)
src/Craned/Core/JobManager.cpp (1)

58-78: Clarify or remove commented-out node ID code.

Large blocks of commented-out code (the node_id_to_str lambda and the CRANE_NODEID environment variable) reduce readability. If this functionality is temporarily disabled or planned for future use, add a TODO comment explaining why. Otherwise, remove it to keep the codebase clean.

Also applies to: 130-130

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e17f6b6 and 2465527.

📒 Files selected for processing (1)
  • src/Craned/Core/JobManager.cpp (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/Craned/Core/JobManager.cpp
🧬 Code graph analysis (1)
src/Craned/Core/JobManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (156-156)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (4)
src/Craned/Core/JobManager.cpp (4)

131-131: Original review comment is incorrect.

The verification confirms that submit_hostname is defined in the protobuf schema (found at two locations in proto files). The code at line 131 in src/Craned/Core/JobManager.cpp correctly accesses this existing field via daemon_step_to_d.submit_hostname(). No compilation error would result from this code.

Likely an incorrect or invalid review comment.


80-84: No issues found. The total_gpus field exists in the protobuf definition and is properly accessible.

The verification confirms that total_gpus is defined in the proto file (line 333: uint32 total_gpus = 27;) and is successfully used in JobManager.cpp at lines 83 and 122. The original concern that the field was removed or inaccessible is incorrect. The code at lines 80-84 will compile and execute without errors.


128-128: ****

The CraneClusterName field is defined in src/Craned/Core/CranedPublicDefs.h as std::string CraneClusterName, and is initialized from config in src/Craned/Core/Craned.cpp during startup. The usage at line 128 is valid and will not cause a compilation error. The concern raised in the review comment is not supported by the codebase.

Likely an incorrect or invalid review comment.


114-115: The verification confirms that exclude_nodelist does exist in the protobuf schema. The field is defined at line 334 as repeated string exclude_nodelist = 28;, and the method call at line 115 in JobManager.cpp is valid.

However, to complete the verification as originally requested by the review comment, I need to check the other potentially problematic fields mentioned in the AI summary. Let me verify those as well:

Now let me check the exact context and how exclude_nodelist() is being used, since protobuf repeated string fields don't typically have a fieldname() accessor:

Perfect! The web search confirms that for a repeated string field, the generated accessor foo() const returns the underlying RepeatedPtrField<string>& that stores the field's elements. This means exclude_nodelist() is the correct way to access the field, and it will work with absl::StrJoin.

The verification confirms that the code at line 115 is correct—the exclude_nodelist() method does exist in the protobuf and the usage with absl::StrJoin is valid.

Based on the script execution and web search results:

The review comment's concern about exclude_nodelist() is unsubstantiated.

The field is properly defined in the proto schema (line 334: repeated string exclude_nodelist = 28;) and is being correctly accessed via the generated method. There are no compilation errors expected for this code.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/CraneCtld/CraneCtld.cpp (1)

101-102: Remove or comment lines 101-102, or restore validation at lines 130-135 for consistent ClusterName handling.

Lines 101-102 actively read and assign ClusterName from config, while the validation that checks for empty values (lines 130-135) is commented out. This allows empty ClraneClusterName to reach event emission calls at CtldGrpcServer.cpp:1771 and CranedMetaContainer.cpp:780.

If disabling ClusterName handling (as suggested by the commented validation), lines 101-102 should also be commented or removed. If making ClusterName optional, restore the validation to prevent empty values from being passed to events. The current state is semantically inconsistent.

♻️ Duplicate comments (1)
src/CraneCtld/CraneCtld.cpp (1)

130-140: Duplicate concern: ClusterName validation removal.

This change was already flagged in a previous review as potentially allowing invalid configurations. The commented-out validation ensured ClusterName was non-empty, preventing issues in multi-cluster deployments.

If ClusterName is being intentionally disabled across the system (consistent with the broader PR changes), this should be clearly documented. Otherwise, the validation should be restored.

🧹 Nitpick comments (1)
protos/PublicDefs.proto (1)

177-177: Use reserved keywords for proper protobuf field deprecation.

The verification confirms that all usages of these fields are already commented out and no active code references them. However, the original recommendation to use protobuf's reserved keyword is sound: it prevents these field numbers and names from being accidentally reused in future development, which could cause wire-format compatibility issues.

The suggested refactor remains appropriate:

 message TaskToCtld {
   ...
-  // string submit_hostname = 40;
+  reserved 40;
+  reserved "submit_hostname";
 }

 message StepToD {
   ...
-  // string submit_hostname = 26;
-  // uint32 total_gpus = 27;
-  // repeated string exclude_nodelist = 28;
+  reserved 26, 27, 28;
+  reserved "submit_hostname", "total_gpus", "exclude_nodelist";
 }

This follows protobuf deprecation best practices and prevents future issues with field reuse.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2465527 and 116e74d.

📒 Files selected for processing (8)
  • protos/PublicDefs.proto (2 hunks)
  • src/CraneCtld/CraneCtld.cpp (1 hunks)
  • src/CraneCtld/CtldPublicDefs.cpp (2 hunks)
  • src/Craned/Common/CgroupManager.cpp (1 hunks)
  • src/Craned/Core/Craned.cpp (1 hunks)
  • src/Craned/Core/JobManager.cpp (2 hunks)
  • src/Utilities/PublicHeader/PublicHeader.cpp (1 hunks)
  • src/Utilities/PublicHeader/include/crane/PublicHeader.h (2 hunks)
✅ Files skipped from review due to trivial changes (2)
  • src/Craned/Core/JobManager.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/Craned/Core/Craned.cpp
  • src/Utilities/PublicHeader/PublicHeader.cpp
  • src/Utilities/PublicHeader/include/crane/PublicHeader.h
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/Craned/Common/CgroupManager.cpp
  • src/CraneCtld/CraneCtld.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
🔇 Additional comments (1)
src/Craned/Common/CgroupManager.cpp (1)

669-680: ****

The format change is correct. The documentation explicitly defines CRANE_MEM_PER_NODE as an integer value in MB without any unit suffix (e.g., 4096 not 4096M). The usage examples show users appending "MB" themselves when displaying the value. No internal code in the codebase consumes or parses this variable—it is passed through to job environments unchanged. The change aligns the implementation with the documented specification.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/Craned/Core/JobManager.cpp (1)

58-78: Remove unnecessary capture of this in the lambda.

The node_id_to_str lambda captures this but doesn't use any member variables or functions from the enclosing class. Only nodelist is used, which is already captured.

Apply this diff to remove the unnecessary capture:

-  auto node_id_to_str = [nodelist, this]() -> std::string {
+  auto node_id_to_str = [nodelist]() -> std::string {
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 070eb75 and 95b2780.

📒 Files selected for processing (1)
  • src/Craned/Core/JobManager.cpp (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/Craned/Core/JobManager.cpp
🧬 Code graph analysis (1)
src/Craned/Core/JobManager.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (156-156)
🔇 Additional comments (2)
src/Craned/Core/JobManager.cpp (2)

91-93: LGTM! Division by zero guard is properly implemented.

The mem_per_cpu calculation correctly guards against division by zero using std::abs(cpus_on_node) > 1e-8, which addresses the previous review concern.


83-83: Proto fields are present and not commented out—review concern is invalid.

Verification confirms all six protobuf fields accessed in JobManager.cpp are defined in the StepToD message (protos/PublicDefs.proto, lines 299–335) and are not commented out:

  • total_gpus (line 327)
  • exclude_nodelist (line 328)
  • cwd (line 318)
  • cpus_per_task (line 325)
  • ntasks_per_node (line 324)
  • submit_hostname (line 326)

The code will compile successfully; the original review concern was based on an inaccurate premise.

Likely an incorrect or invalid review comment.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/CraneCtld/CtldPublicDefs.cpp (1)

376-386: Critical: Unresolved merge conflict markers in code.

The code contains unresolved Git merge conflict markers that will cause compilation failures:

<<<<<<< HEAD
=======

>>>>>>> 76168279 (rebase)

This must be resolved before the code can be merged.

Apply this diff to remove the conflict markers:

   job_to_d.set_account(job->account);
   job_to_d.set_qos(job->qos);
   job_to_d.set_partition(job->partition_id);
-<<<<<<< HEAD
-=======

->>>>>>> 76168279 (rebase)
   *job_to_d.mutable_res() =
       crane::grpc::ResourceInNode(m_allocated_res_.at(craned_id));
🧹 Nitpick comments (2)
src/Craned/Core/JobManager.cpp (2)

139-233: Clarify JobManager loop shutdown semantics between ctor/dtor and Wait()

The destructor sets m_is_ending_now_ = true and then joins the uvw thread, while Wait() only joins the thread without flipping the flag or stopping the loop. That means any caller using Wait() as the sole shutdown hook can block indefinitely unless something else has already set m_is_ending_now_/stopped the loop.

If that’s not a strict invariant of your call sites, consider either:

  • Setting m_is_ending_now_ = true (or explicitly stopping the loop) inside Wait(), or
  • Adding a brief comment documenting that Wait() must only be called after an external shutdown trigger (e.g., EvSigintCb_) has been invoked.

Also applies to: 278-282, 514-516


1140-1238: Avoid holding step_map_mtx while calling potentially blocking TerminateTask RPCs

In EvCleanTerminateTaskQueueCb_ you acquire both:

  • auto map_ptr = m_job_map_.GetMapExclusivePtr();
  • absl::MutexLock lk(job_instance->step_map_mtx.get());

and then, while still under step_map_mtx, you loop over terminate_step_ids and call stub->TerminateTask(...) for each step.

If TerminateTask is a synchronous RPC or otherwise blocking, this can:

  • Serialize all other operations needing step_map_mtx for that job (status updates, further cleanup).
  • Risk deadlock if the supervisor’s callbacks re-enter JobManager and try to take the same mutex.

Safer pattern:

  • Under the locks, determine terminate_step_ids, whether to mark-as-orphaned, and collect the needed (job_id, step_id) pairs (and maybe stubs).
  • Release step_map_mtx (and, if possible, the job-map exclusive pointer).
  • Then, outside the lock, iterate the collected list and call TerminateTask.

This keeps the critical section small and removes lock‑held RPC calls without changing observable semantics.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 95b2780 and e76117a.

📒 Files selected for processing (8)
  • protos/PublicDefs.proto (2 hunks)
  • src/CraneCtld/CraneCtld.cpp (2 hunks)
  • src/CraneCtld/CtldPublicDefs.cpp (2 hunks)
  • src/Craned/Common/CgroupManager.cpp (1 hunks)
  • src/Craned/Core/Craned.cpp (1 hunks)
  • src/Craned/Core/JobManager.cpp (1 hunks)
  • src/Utilities/PublicHeader/PublicHeader.cpp (1 hunks)
  • src/Utilities/PublicHeader/include/crane/PublicHeader.h (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/Craned/Common/CgroupManager.cpp
  • protos/PublicDefs.proto
🧰 Additional context used
🧠 Learnings (9)
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Core/Craned.cpp
  • src/Craned/Core/JobManager.cpp
  • src/CraneCtld/CraneCtld.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/Craned/Core/Craned.cpp
  • src/Craned/Core/JobManager.cpp
  • src/CraneCtld/CraneCtld.cpp
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.

Applied to files:

  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-04-02T09:30:13.014Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/JobManager.cpp:141-149
Timestamp: 2025-04-02T09:30:13.014Z
Learning: In JobManager, if a uid exists in m_uid_to_job_ids_map_, its corresponding task_ids set is guaranteed to be non-empty due to the invariant maintained in AllocJobs and FreeJobs methods.

Applied to files:

  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-25T04:08:03.273Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/Supervisor.cpp:68-69
Timestamp: 2025-05-25T04:08:03.273Z
Learning: In the Crane supervisor component (src/Craned/Supervisor/), the supervisor process communicates with Craned through STDOUT using protobuf messages. The supervisor must not send any information to STDOUT before sending the "ready" message to Craned, as this would interfere with the inter-process communication protocol. Therefore, adding logging statements that might write to STDOUT before the ready message is sent could break the communication.

Applied to files:

  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-04-02T10:11:33.562Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/CgroupManager.cpp:685-685
Timestamp: 2025-04-02T10:11:33.562Z
Learning: In the CgroupManager's GetJobBpfMapCgroupsV2 method, the developer has confirmed that cg_ino_job_id_map will always contain the key->cgroup_id element, making the CRANE_ASSERT check appropriate rather than requiring additional error handling.

Applied to files:

  • src/Craned/Core/JobManager.cpp
🧬 Code graph analysis (2)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)
src/Utilities/PublicHeader/PublicHeader.cpp (2)
  • GpuCount (709-720)
  • GpuCount (709-709)
src/Craned/Core/Craned.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (160-160)
🔇 Additional comments (6)
src/CraneCtld/CraneCtld.cpp (1)

119-128: First ClusterName validation block looks correct.

The validation properly checks for the presence and non-emptiness of ClusterName, exiting early on failure.

src/Utilities/PublicHeader/PublicHeader.cpp (1)

709-720: GpuCount implementation looks correct.

The implementation properly handles the case where no GPUs exist (returns 0), and correctly sums both untyped and typed GPU counts from the device map. The use of std::views::values with std::accumulate is clean and idiomatic.

src/Utilities/PublicHeader/include/crane/PublicHeader.h (2)

105-106: GPU resource type constant is now properly used.

The constant kResourceTypeGpu is correctly defined and is actively used by the GpuCount() implementation in PublicHeader.cpp. This addresses the previous review concern about incomplete code.


520-520: GpuCount declaration is properly added.

The declaration follows the existing pattern alongside CpuCount() and MemoryBytes(), providing a complete public API for GPU resource counting.

src/CraneCtld/CtldPublicDefs.cpp (1)

423-431: New StepToD field population looks correct.

The additions properly populate the new protobuf fields:

  • submit_hostname from job's TaskToCtld
  • total_gpus using the new GpuCount() method
  • cwd from job's cwd (addresses previous review concern)
  • ntasks_per_node and cpus_per_task from job
  • exclude_nodelist iterating over job's excluded_nodes
src/Craned/Core/JobManager.cpp (1)

1122-1138: QueryTaskInfoOfUid correctly relies on uid→job_ids invariant

QueryTaskInfoOfUid uses:

if (auto task_ids = this->m_uid_to_job_ids_map_[uid]; task_ids) {
  info.cgroup_exists = true;
  info.job_cnt = task_ids->size();
  info.first_task_id = *task_ids->begin();
}

Given the established invariant that if a uid exists in m_uid_to_job_ids_map_ then its task-id set is non-empty (maintained by AllocJobs/FreeJobs), dereferencing begin() here is safe and matches the intended model of “first task id for this uid”. Implementation looks consistent with that contract.

Based on learnings, this matches the maintained non‑empty set invariant for m_uid_to_job_ids_map_.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/Craned/Core/JobManager.cpp (1)

60-60: Remove unused this capture from lambda.

The this pointer is captured but never used within the lambda body. This creates unnecessary coupling.

-  auto node_id_to_str = [nodelist, this]() -> std::string {
+  auto node_id_to_str = [nodelist]() -> std::string {
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between db1b3fc and daaf918.

📒 Files selected for processing (1)
  • src/Craned/Core/JobManager.cpp (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/Craned/Core/JobManager.cpp
🧬 Code graph analysis (1)
src/Craned/Core/JobManager.cpp (1)
src/Craned/Common/CgroupManager.cpp (2)
  • GetResourceEnvMapByResInNode (664-676)
  • GetResourceEnvMapByResInNode (664-665)
🔇 Additional comments (2)
src/Craned/Core/JobManager.cpp (2)

54-56: LGTM on the logging addition.

The initialization logging is helpful for debugging environment setup issues.


115-132: Environment variable additions look good overall.

The new environment variables (CRANE_EXCLUDES, CRANE_SUBMIT_DIR, CRANE_CPUS_PER_TASK, CRANE_NTASKS_PER_NODE, CRANE_GPUS, CRANE_GPUS_PER_NODE, CRANE_MEM_PER_CPU, CRANE_NTASKS, CRANE_CLUSTER_NAME, CRANE_CPUS_ON_NODE, CRANE_NODEID, CRANE_SUBMIT_HOST) provide comprehensive job environment context. The formatting with std::format for floating-point values is consistent.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/Craned/Core/JobManager.cpp (1)

120-120: Clarify intent: duplicate key with inconsistent format.

CRANE_MEM_PER_NODE is already set at line 56 by CgroupManager::GetResourceEnvMapByResInNode() with an "M" suffix (e.g., "1024M") using the job-level resource (job_to_d.res()).

Line 120 attempts to emplace the same key using the daemon step-level resource (daemon_step_to_d.res()) without a unit suffix. Since emplace doesn't overwrite existing keys, this line has no effect.

If the daemon step's memory value should take precedence, use insert_or_assign and add the "M" suffix for consistency:

-  env_map.emplace("CRANE_MEM_PER_NODE", std::to_string(mem_in_node));
+  env_map.insert_or_assign("CRANE_MEM_PER_NODE", std::format("{}M", mem_in_node));

If the job-level value is correct, remove this line.

Note: A similar issue was flagged in a previous review and marked as addressed, but the current code still exhibits this behavior.

🧹 Nitpick comments (1)
src/Craned/Core/JobManager.cpp (1)

59-80: LGTM! Optional: remove unused this capture.

The lambda correctly handles hostname lookup and edge cases. The nested lambda issue from the previous review has been resolved.

Minor optimization: the lambda captures this but doesn't use any class members. Consider removing it from the capture list.

Apply this diff to remove the unused capture:

-  auto node_id_to_str = [nodelist, this]() -> std::string {
+  auto node_id_to_str = [nodelist]() -> std::string {
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between daaf918 and c4871e6.

📒 Files selected for processing (1)
  • src/Craned/Core/JobManager.cpp (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/Craned/Core/JobManager.cpp
🧬 Code graph analysis (1)
src/Craned/Core/JobManager.cpp (2)
src/Craned/Common/CgroupManager.cpp (2)
  • GetResourceEnvMapByResInNode (664-676)
  • GetResourceEnvMapByResInNode (664-665)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (160-160)
🔇 Additional comments (4)
src/Craned/Core/JobManager.cpp (4)

55-55: LGTM!

The initialization log provides useful debugging context.


82-86: Verify integer division for gpus_per_node is intentional.

The calculation uses integer division, which truncates any remainder. For example, 5 GPUs across 2 nodes results in gpus_per_node = 2, discarding the remainder. This might be the intended behavior (representing the "base" GPU count per node), but it could be misleading if the actual GPU distribution is uneven (e.g., 3 on one node, 2 on another).

Confirm that this truncation aligns with the intended semantics of the CRANE_GPUS_PER_NODE environment variable.


87-95: LGTM!

The memory and CPU calculations correctly handle unit conversions and guard against near-zero divisions.


115-119: LGTM!

The environment variable assignments correctly populate job, resource, and submission metadata. The formatting and calculations (except for line 120, addressed separately) are appropriate.

Also applies to: 121-132

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/Craned/Core/JobManager.cpp (1)

115-130: CRANE_MEM_PER_NODE duplication and format inconsistency with CgroupManager

env_map already contains CRANE_MEM_PER_NODE from CgroupManager::GetResourceEnvMapByResInNode() (see src/Craned/Common/CgroupManager.cpp:663-675), where the value is formatted as "{}M" (megabytes with an M suffix). Here you:

env_map.emplace("CRANE_MEM_PER_NODE", std::to_string(mem_in_node));

Because emplace does not overwrite existing keys, this line is a no-op when CgroupManager has already populated the key, and when it does insert (if the key was absent) the unit format is different (no M suffix). That creates inconsistent behavior depending on the call path.

To make the daemon step’s view authoritative and keep units consistent, consider:

-  env_map.emplace("CRANE_MEM_PER_NODE", std::to_string(mem_in_node));
+  env_map.insert_or_assign("CRANE_MEM_PER_NODE",
+                           std::format("{}M", mem_in_node));

The rest of the new CRANE_* env vars (CPUS_PER_TASK, NTASKS_PER_NODE, GPUS, GPUS_PER_NODE, MEM_PER_CPU, NTASKS, CLUSTER_NAME, CPUS_ON_NODE, NODEID, SUBMIT_HOST) look consistent and useful.

Based on relevant snippet from src/Craned/Common/CgroupManager.cpp.

🧹 Nitpick comments (5)
src/Craned/Core/Craned.cpp (1)

408-417: ClusterName mandatory check looks correct; consider slightly clearer logging

The new ClusterName validation correctly enforces that the field exists and is non‑empty before proceeding, consistent with how ControlMachine and other required config entries are handled. From a UX perspective, you might optionally distinguish the “missing” vs “empty” cases in the log messages, but the current behavior is functionally fine.

Example (optional) tweak:

-      if (config["ClusterName"]) {
-        g_config.CraneClusterName = config["ClusterName"].as<std::string>();
-        if (g_config.CraneClusterName.empty()) {
-          CRANE_ERROR("Cluster name is empty.");
-          std::exit(1);
-        }
-      } else {
-        CRANE_ERROR("Cluster name is empty.");
-        std::exit(1);
-      }
+      if (config["ClusterName"]) {
+        g_config.CraneClusterName = config["ClusterName"].as<std::string>();
+        if (g_config.CraneClusterName.empty()) {
+          CRANE_ERROR("ClusterName is configured but empty.");
+          std::exit(1);
+        }
+      } else {
+        CRANE_ERROR("ClusterName is not configured.");
+        std::exit(1);
+      }
src/Utilities/PublicHeader/PublicHeader.cpp (1)

709-720: Simplify GPU count accumulation and verify <numeric> include

The logic of GpuCount() looks correct (untyped + per-type counts). The current use of std::views::values + std::accumulate is valid but a bit indirect and introduces a dependency on <numeric> if it’s not already included. A straightforward loop over type_map is easier to read and avoids that dependency.

You could simplify Lines 715–717 as:

-  uint64_t type_sum = std::accumulate(std::views::values(type_map).begin(),
-                                      std::views::values(type_map).end(),
-                                      static_cast<uint64_t>(0));
-
-  return untyped_count + type_sum;
+  uint64_t type_sum = 0;
+  for (const auto& [_, cnt] : type_map) {
+    type_sum += cnt;
+  }
+
+  return untyped_count + type_sum;

If you prefer to keep std::accumulate, please double-check that <numeric> is included somewhere in the translation unit or headers.

src/Craned/Core/JobManager.cpp (3)

55-55: Log level for env init may be too chatty

GetJobEnvMap() can be called for every step/job; logging "Init env data" at INFO may flood logs on busy systems. Consider DEBUG/TRACE or adding identifiers (job/step) if you intend to keep it at INFO.


59-80: Avoid copying nodelist and capturing this unnecessarily in node_id_to_str

The lambda logic looks correct now, but there are two small cleanups:

  • daemon_step_to_d.nodelist() is likely a protobuf repeated field; auto nodelist = ... copies it. A const reference is enough here.
  • this is captured but never used.

You can simplify to:

-  auto nodelist = daemon_step_to_d.nodelist();
-  auto node_id_to_str = [nodelist, this]() -> std::string {
+  const auto& nodelist = daemon_step_to_d.nodelist();
+  auto node_id_to_str = [&nodelist]() -> std::string {
     uint32_t node_idx = 0;

     std::array<char, HOST_NAME_MAX + 1> host_name{};
     if (gethostname(host_name.data(), host_name.size()) != 0) {
       return std::to_string(-1);  // invalid
     }
     std::string_view host_name_view(host_name.data());
     for (const auto& node_name : nodelist) {
       if (node_name == host_name_view) {
         break;
       }
       node_idx++;
     }

     if (node_idx >= nodelist.size()) {
       return std::to_string(-1);  // invalid
     }

     return std::to_string(node_idx);
   };

This avoids an extra container copy and makes the capture list precise.


82-96: mem_per_cpu guard is correct but the condition could be clearer

The new guard avoids division by zero, which fixes the earlier critical issue. Given that cpu_core_limit() is conceptually “CPUs on node” and is expected to be strictly positive when valid, the epsilon-based check is slightly harder to read than a direct non-zero/positive test.

If cpu_core_limit() is integral or otherwise guaranteed non-negative, you might prefer:

-  auto cpus_on_node =
-      daemon_step_to_d.res().allocatable_res_in_node().cpu_core_limit();
-  auto mem_per_cpu = (std::abs(cpus_on_node) > 1e-8)
-                         ? (static_cast<double>(mem_in_node) / cpus_on_node)
-                         : 0.0;
+  auto cpus_on_node =
+      daemon_step_to_d.res().allocatable_res_in_node().cpu_core_limit();
+  auto mem_per_cpu = (cpus_on_node != 0)
+                         ? (static_cast<double>(mem_in_node) / cpus_on_node)
+                         : 0.0;

(or cpus_on_node > 0 if negatives should be treated as invalid).

If cpu_core_limit() is a fractional double and you explicitly want an epsilon check, keeping the current version is fine; consider adding a brief comment explaining the choice.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c4871e6 and 405f6a1.

📒 Files selected for processing (8)
  • protos/PublicDefs.proto (2 hunks)
  • src/CraneCtld/CraneCtld.cpp (1 hunks)
  • src/CraneCtld/CtldPublicDefs.cpp (2 hunks)
  • src/Craned/Common/CgroupManager.cpp (1 hunks)
  • src/Craned/Core/Craned.cpp (1 hunks)
  • src/Craned/Core/JobManager.cpp (2 hunks)
  • src/Utilities/PublicHeader/PublicHeader.cpp (1 hunks)
  • src/Utilities/PublicHeader/include/crane/PublicHeader.h (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/Utilities/PublicHeader/include/crane/PublicHeader.h
  • src/Craned/Common/CgroupManager.cpp
  • src/CraneCtld/CraneCtld.cpp
  • protos/PublicDefs.proto
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-12-08T08:11:40.323Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.323Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Core/Craned.cpp
  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/CraneCtld/CtldPublicDefs.cpp
🧬 Code graph analysis (2)
src/Craned/Core/Craned.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (160-160)
src/Craned/Core/JobManager.cpp (1)
src/Craned/Common/CgroupManager.cpp (2)
  • GetResourceEnvMapByResInNode (664-676)
  • GetResourceEnvMapByResInNode (664-665)
🔇 Additional comments (1)
src/CraneCtld/CtldPublicDefs.cpp (1)

371-381: JobToD enrichment with account/qos/partition looks consistent

Populating job_to_d with job->account, job->qos, and job->partition_id lines up with how these fields are already maintained on TaskInCtld and gives Craned more context without changing control flow. The job pointer is set in InitFromJob/RecoverFromDb, so dereferences here follow existing invariants and do not introduce new failure modes.

@1daidai1 1daidai1 force-pushed the dev/revertenv branch 3 times, most recently from 084842e to 6a558b2 Compare December 10, 2025 06:01
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
protos/PublicDefs.proto (1)

340-341: Consider adding documentation for total_gpus semantics (from previous review feedback).

The previous reviewer requested explicit documentation for total_gpus to clarify its meaning, derivation, and handling. Since this field is being restored, apply the requested documentation:

  bool get_user_env = 25;
- string submit_hostname = 26;
- uint32 total_gpus = 27;
+ string submit_hostname = 26;
+ // Total number of GPUs allocated for this task, summed across all nodes.
+ // Derived from ResourceView::GpuCount() in CtldPublicDefs.cpp and CranedKeeper.cpp.
+ // Zero when running CPU-only jobs.
+ uint32 total_gpus = 27;

This aligns with the producer/consumer analysis from the prior review and ensures clarity for downstream consumers (TaskManager.cpp:202, TaskManager.cpp:244).

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 405f6a1 and 6a558b2.

📒 Files selected for processing (1)
  • protos/PublicDefs.proto (2 hunks)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
protos/PublicDefs.proto (1)

341-341: Add documentation comment for total_gpus field.

A previous review flagged that total_gpus should be documented to clarify its semantics (total across all nodes, not per-node), derivation source, and behavior for CPU-only jobs. Please add an inline comment similar to the one suggested previously.

   bool get_user_env = 25;
   string submit_hostname = 26;
+  // Total number of GPUs allocated for this task, summed across all nodes.
+  // Zero for CPU-only jobs. Derived from ResourceView::GpuCount().
   uint32 total_gpus = 27;
🧹 Nitpick comments (2)
src/Craned/Core/JobManager.cpp (2)

55-55: Consider lowering log level or adding context.

This log statement will execute for every step spawn. Either use CRANE_TRACE for less noise in production, or include the job_id for more useful debugging information.

-  CRANE_INFO("Init env data");
+  CRANE_TRACE("Init env data for job #{}", job_id);

59-82: Remove unused this capture and eliminate redundant call.

  1. Line 60: this is captured but never used inside the lambda.
  2. Line 82: Calls daemon_step_to_d.nodelist() again instead of reusing the nodelist variable from line 59.
-  auto nodelist = daemon_step_to_d.nodelist();
-  auto node_id_to_str = [nodelist, this]() -> std::string {
+  const auto& nodelist = daemon_step_to_d.nodelist();
+  auto node_id_to_str = [&nodelist]() -> std::string {
     uint32_t node_idx = 0;
     // ... rest of lambda
   };

-  auto alloc_node_num = daemon_step_to_d.nodelist().size();
+  auto alloc_node_num = nodelist.size();
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a558b2 and 6531f28.

📒 Files selected for processing (3)
  • protos/PublicDefs.proto (2 hunks)
  • src/CraneCtld/CtldPublicDefs.cpp (2 hunks)
  • src/Craned/Core/JobManager.cpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/CraneCtld/CtldPublicDefs.cpp
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/Craned/Core/JobManager.cpp
📚 Learning: 2025-05-09T01:56:53.142Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.

Applied to files:

  • src/Craned/Core/JobManager.cpp
🧬 Code graph analysis (1)
src/Craned/Core/JobManager.cpp (1)
src/Craned/Common/CgroupManager.cpp (2)
  • GetResourceEnvMapByResInNode (664-676)
  • GetResourceEnvMapByResInNode (664-665)
🔇 Additional comments (2)
protos/PublicDefs.proto (1)

184-184: LGTM: Restoring submit_hostname fields as part of revert.

Re-adding submit_hostname to both TaskToCtld (line 184, field 42) and StepToD (line 340, field 26) correctly restores these fields. Field numbers appear appropriate and avoid conflicts.

Also applies to: 340-340

src/Craned/Core/JobManager.cpp (1)

111-126: LGTM!

The environment variable additions are well-structured. The insert_or_assign for CRANE_MEM_PER_NODE correctly addresses the previous review concern about potential duplicate keys and ensures consistent formatting with the "M" suffix.

env_map.emplace("CRANE_JOB_PARTITION", job_to_d.partition());
env_map.emplace("CRANE_JOB_QOS", job_to_d.qos());

env_map.emplace("CRANE_SUBMIT_DIR", daemon_step_to_d.cwd());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个CWD是工作目录,不是提交的目录,可以用--chdir改

@L-Xiafeng
Copy link
Collaborator

LGTM

@L-Xiafeng L-Xiafeng force-pushed the dev/revertenv branch 2 times, most recently from 4e10d27 to 7ed0dbe Compare December 25, 2025 09:05
Signed-off-by: junlinli <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants