feat: Default node health check #655

edragain2nd · 2025-10-26T16:35:08Z

Summary by CodeRabbit

New Features
- Nodes run periodic health/config checks (CPU, memory, device paths) and send health reports; controller will toggle or clear draining state based on reported health and reason.
- Internal endpoint added to receive node health reports.
Configuration
- New NodeHealthCheckInterval setting (default 0 — disabled) to control check frequency.
- Supervisor gains DebugLevel and LogDir options.
Other
- Added small memory-tolerance constant and standardized failure-reason label.

coderabbitai · 2025-10-26T16:35:18Z

Walkthrough

Adds a Craned->Ctld health-report RPC and handler. Craned schedules periodic node health checks and reports results. Ctld updates node drain state from health reports. Protobuf, Ctld gRPC handler, meta-container state update helper, client timer/check/reporting, config, and public constants added.

Changes

Cohort / File(s)	Summary
Protocol Buffer `protos/Crane.proto`	Added message `CranedReportHealthRequest` (`string craned_id = 1; bool healthy = 2; string reason = 3;`) and RPC `rpc CranedReportHealth(CranedReportHealthRequest) returns (google.protobuf.Empty);` in service `CraneCtldForInternal`.
Ctld meta container `src/CraneCtld/CranedMetaContainer.h`, `src/CraneCtld/CranedMetaContainer.cpp`	Added `UpdateNodeStateWithNodeHealthCheck(const CranedId&, bool, const std::string&)`: validates craned_id, toggles/sets drain state and `state_reason` according to health and reason; tightened partition-existence validation in `CheckIfAccountIsAllowedInPartition`.
Ctld gRPC service `src/CraneCtld/RpcService/CtldGrpcServer.h`, `src/CraneCtld/RpcService/CtldGrpcServer.cpp`	Added `CtldForInternalServiceImpl::CranedReportHealth` handler that accepts `CranedReportHealthRequest`, invokes `g_meta_container->UpdateNodeStateWithNodeHealthCheck(...)`, and returns OK.
Craned client (CtldClient) `src/Craned/Core/CtldClient.h`, `src/Craned/Core/CtldClient.cpp`	Added `m_node_health_check_timer_`, methods `CranedReportHealth_(bool is_healthy, std::string reason)` and `NodeHealthCheck_()`. Timer schedules periodic checks that evaluate CPU/memory/device files and report health to Ctld.
Public constants `src/Utilities/PublicHeader/include/crane/PublicHeader.h`	Added `inline constexpr double kMemoryToleranceGB = 0.01;` and `inline const char* const NodeHealthCheckFailedReason = "Node failed health check";`.
Config and defs `etc/config.yaml`, `src/Craned/Core/CranedPublicDefs.h`, `src/Craned/Core/Craned.cpp`	Added top-level `NodeHealthCheckInterval` (default 0) to config, added `uint32_t NodeHealthCheckInterval` to Craned config struct, and parse/populate logic in `ParseCranedConfig`.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Craned as Craned (CtldClient)
    participant Ctld as CraneCtld (CtldGrpcServer)
    participant Meta as MetaContainer

    loop every NodeHealthCheckInterval
        Craned->>Craned: NodeHealthCheck_() — check CPU/memory, device files
        alt healthy
            Craned->>Ctld: CranedReportHealthRequest(healthy=true, reason="")
        else unhealthy
            Craned->>Ctld: CranedReportHealthRequest(healthy=false, reason="...reason...")
        end
        Ctld->>Meta: UpdateNodeStateWithNodeHealthCheck(craned_id, healthy, reason)
        Note right of Meta: set/clear drain state and update `state_reason`
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pay attention to concurrency and state-transition correctness in UpdateNodeStateWithNodeHealthCheck (drain toggles and matching reason).
Verify CtldClient timer lifecycle, stop conditions, and gRPC stub null-checks.
Ensure protobuf stubs are regenerated and gRPC signatures compile across client/server.

Possibly related PRs

refactor: ctld grpc internal server to ctld server #581 — touches Ctld internal gRPC handlers and CtldGrpcServer; overlaps with adding CranedReportHealth.
refactor grpc #457 — refactored Ctld-for-internal RPCs; directly related to extending the internal service.
refactor:refactor CgroupManager to the common folder #597 — changes to DeviceManager/includes and config types that may interact with CtldClient device checks.

Suggested reviewers

L-Xiafeng
RileyWen
Nativu5

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The pull request title "feat: Default node health check" accurately reflects the primary changes in the changeset. The modifications across all files implement a comprehensive node health check feature, including the addition of new gRPC messages and handlers for health reporting, helper methods to process health status in the metadata container, configuration parameters to control the health check interval, and client-side methods to periodically check and report node health. The title is clear, specific, and sufficiently descriptive for a developer scanning the repository history to understand that this PR introduces node health checking functionality with default configuration settings.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f454033 and cba19d1.

📒 Files selected for processing (1)

src/Craned/Core/CranedPublicDefs.h (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

src/Craned/Core/CranedPublicDefs.h

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: build

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)
70-70: Consider increasing the memory tolerance value.

A tolerance of 0.01 GB (10 MB) may be too strict for production environments. OS kernel buffers, caches, and normal memory usage variations can easily exceed 10 MB, potentially causing false positives that incorrectly drain healthy nodes. Consider increasing this to at least 0.5-1.0 GB to account for typical system memory fluctuations.
-inline constexpr double kMemoryToleranceGB = 0.01;
+inline constexpr double kMemoryToleranceGB = 0.5;
src/Craned/Core/CtldClient.cpp (1)
469-476: Add thread naming for better debugging.

The new thread should be named using util::SetCurrentThreadName() for easier identification during debugging and monitoring, following the pattern used in other threads in this file (e.g., line 358).
 m_mem_config_check_thread_ = std::thread([this] {
+  util::SetCurrentThreadName("MemConfigCheck");
   do {
     std::this_thread::sleep_for(std::chrono::seconds(5));
     if (m_stopping_ || !m_stub_) return;

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2d8e018 and c61c8a6.

📒 Files selected for processing (8)

protos/Crane.proto (2 hunks)
src/CraneCtld/CranedMetaContainer.cpp (1 hunks)
src/CraneCtld/CranedMetaContainer.h (1 hunks)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1 hunks)
src/CraneCtld/RpcService/CtldGrpcServer.h (1 hunks)
src/Craned/Core/CtldClient.cpp (3 hunks)
src/Craned/Core/CtldClient.h (2 hunks)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-06-07T10:47:59.071Z

Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

src/Craned/Core/CtldClient.h

🧬 Code graph analysis (4)

src/CraneCtld/CranedMetaContainer.h (1)

src/CraneCtld/CranedMetaContainer.cpp (2)

UpdateNodeStateWithMemConfigCheck_ (856-877)

UpdateNodeStateWithMemConfigCheck_ (856-857)

src/CraneCtld/RpcService/CtldGrpcServer.h (1)

src/CraneCtld/RpcService/CtldGrpcServer.cpp (2)

SendMemConfigCheckResult (172-180)

SendMemConfigCheckResult (172-175)

src/Craned/Core/CtldClient.h (1)

src/Craned/Core/CtldClient.cpp (4)

SendMemConfigCheckResult_ (745-759)

SendMemConfigCheckResult_ (745-745)

MemConfigCheck_ (761-793)

MemConfigCheck_ (761-761)

src/Craned/Core/CtldClient.cpp (2)

src/Craned/Core/CranedPublicDefs.h (1)

g_config (154-154)

src/Utilities/PublicHeader/OS.cpp (2)

GetNodeInfo (33-62)

GetNodeInfo (33-33)

🔇 Additional comments (9)

src/Craned/Core/CtldClient.h (1)

212-214: LGTM!

The method declarations and thread member follow the existing patterns in the class and are appropriately placed.

Also applies to: 244-244

src/CraneCtld/RpcService/CtldGrpcServer.h (1)

201-204: LGTM!

The gRPC handler declaration follows the standard pattern and is consistent with other handlers in the service.

src/CraneCtld/CranedMetaContainer.h (1)

92-93: LGTM!

The method declaration is clear and appropriately positioned in the public section of the class.

protos/Crane.proto (1)

941-944: LGTM!

The protobuf message and RPC definitions are clean and follow the existing patterns in the service definition.

Also applies to: 1015-1015

src/Craned/Core/CtldClient.cpp (2)

473-473: Address the TODO comment.

The TODO comment suggests that state checking might be needed before running MemConfigCheck_. Should this check whether the client state machine is in the READY state before performing the memory configuration check?

745-759: LGTM!

The SendMemConfigCheckResult_ implementation correctly sends the result via gRPC with appropriate guards for stopping state and stub validity.

src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)

172-180: LGTM!

The gRPC handler implementation correctly delegates to the meta container and follows the pattern of other simple handlers in this file. Error handling for unknown nodes is appropriately managed within UpdateNodeStateWithMemConfigCheck_.

src/CraneCtld/CranedMetaContainer.cpp (2)

885-891: LGTM!

The added early return for invalid partition improves validation and provides clearer error messages. This strengthens the partition access control logic.

867-874: No changes needed; code logic correctly guards drain state transitions.

The code already prevents the scenario described. When an admin manually drains a node with a custom reason (line 801: state_reason = request.reason()), the MemConfig check logic cannot overwrite it because:

Line 868 checks: state_reason == MemConfigCheckFailedReason before any mem-config-triggered drain modification

Line 871 only sets the mem-config reason if the node was not already drained (!craned_meta->drain)

If admin drains first with reason "hardware maintenance", a subsequent mem-config failure will not enter either condition, preserving the manual reason. When mem-config later passes, it similarly won't clear a manually-set drain because the state_reason won't match. The design correctly isolates mem-config drain management from manual drains.

src/Craned/Core/CtldClient.cpp

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

src/Craned/Core/CtldClient.cpp (1)
781-784: Fix incorrect CRANE_DEBUG format string.

This issue was already flagged in the past review comments. The format string has no placeholders but arguments are provided, and they won't be logged.

Apply this diff:
 if (std::abs(node.memory_gb - config_cpu_mem_gb) <= kMemoryToleranceGB) {
-  CRANE_DEBUG("MemConfigCheck success.", config_cpu_mem_gb, node.memory_gb);
+  CRANE_DEBUG("MemConfigCheck success. config_mem: {:.3f} GB, real_mem: {:.3f} GB",
+              config_cpu_mem_gb, node.memory_gb);
   SendMemConfigCheckResult_(true);
   return;

🧹 Nitpick comments (4)

src/Craned/Core/CtldClient.cpp (4)
469-476: Set thread name and address the TODO comment.

The thread should have a descriptive name for debugging purposes, consistent with other threads in this file.

Apply this diff to set the thread name:
 m_mem_config_check_thread_ = std::thread([this] {
+  util::SetCurrentThreadName("MemConfigChkThr");
   do {
     std::this_thread::sleep_for(std::chrono::seconds(5));
     if (m_stopping_ || !m_stub_) return;
-    // TODO: should check state?
     MemConfigCheck_();
   } while (true);
 });
Regarding the TODO comment: Consider whether the check should verify g_ctld_client_sm->IsReadyNow() before calling MemConfigCheck_(), similar to the pattern used in AsyncSendThread_() at line 646. The current implementation already checks g_server->ReadyFor(RequestSource::CTLD) inside MemConfigCheck_() at line 762, which may be sufficient.

769-772: Clarify or remove the CPU check.

The silent return when node.cpu == 0 is unclear. If this is a valid edge case, add a log message explaining why the check is skipped. If it's defensive coding against bad data, consider logging a warning.

Consider adding a log message:
 // TODO add gres support
 if (node.cpu == 0) {
+  CRANE_TRACE("Skipping MemConfigCheck: node.cpu is 0");
   return;
 }
785-790: Remove unnecessary return statements.

The return statements at lines 789 and 790 are at the end of their respective branches and are unnecessary.

Apply this diff:
 } else {
   SendMemConfigCheckResult_(false);
   CRANE_DEBUG("MemConfigCheck fail. config_mem : {:.3f}, real_mem : {:.3f}",
               config_cpu_mem_gb, node.memory_gb);
-  return;
 }
769-769: Fix typo in comment.

"surpport" should be "support".
-// TODO add gres surpport
+// TODO add gres support

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c61c8a6 and 1c1c9b6.

📒 Files selected for processing (5)

src/CraneCtld/CranedMetaContainer.h (1 hunks)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1 hunks)
src/CraneCtld/RpcService/CtldGrpcServer.h (1 hunks)
src/Craned/Core/CtldClient.cpp (3 hunks)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (4)

src/CraneCtld/RpcService/CtldGrpcServer.h
src/Utilities/PublicHeader/include/crane/PublicHeader.h
src/CraneCtld/RpcService/CtldGrpcServer.cpp
src/CraneCtld/CranedMetaContainer.h

🧰 Additional context used

🧬 Code graph analysis (1)

src/Craned/Core/CtldClient.cpp (2)

src/Craned/Core/CranedPublicDefs.h (1)

g_config (154-154)

src/Utilities/PublicHeader/OS.cpp (2)

GetNodeInfo (33-62)

GetNodeInfo (33-33)

🔇 Additional comments (1)

src/Craned/Core/CtldClient.cpp (1)

386-386: LGTM!

Properly joins the memory config check thread on destruction, consistent with other thread cleanup in the destructor.

src/Craned/Core/CtldClient.cpp

protos/Crane.proto

src/Craned/Core/CtldClient.cpp

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

etc/config.yaml (1)
145-145: Add inline documentation for MemCheckInterval.

The new configuration parameter lacks documentation explaining its purpose and units. Consider adding a comment to help users understand what this controls.

Apply this diff to add documentation:
+# Interval in seconds for craned to check if node memory matches configured value
 MemCheckInterval: 30
src/Craned/Core/CranedPublicDefs.h (1)
152-153: Add documentation comment for MemCheckInterval field.

The new field lacks a comment explaining its purpose and units. Other configuration fields in this struct have documentation.

Apply this diff to add documentation:
 
+  // Interval in seconds for memory configuration health checks
   uint64_t MemCheckInterval;
 };
protos/Crane.proto (1)
941-944: Add documentation for CranedReportHealthRequest message.

The new message lacks documentation explaining the purpose of the matched field. Without context, it's unclear what condition is being reported.

Apply this diff to add documentation:
+// Request from Craned to report memory configuration health status
 message CranedReportHealthRequest {
   string craned_id = 1;
+  // Whether node's actual memory matches configured value within tolerance
   bool matched = 2;
 }

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c1c9b6 and 4c9e1fb.

📒 Files selected for processing (8)

etc/config.yaml (1 hunks)
protos/Crane.proto (2 hunks)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1 hunks)
src/CraneCtld/RpcService/CtldGrpcServer.h (1 hunks)
src/Craned/Core/Craned.cpp (1 hunks)
src/Craned/Core/CranedPublicDefs.h (1 hunks)
src/Craned/Core/CtldClient.cpp (2 hunks)
src/Craned/Core/CtldClient.h (2 hunks)

✅ Files skipped from review due to trivial changes (1)

src/Craned/Core/Craned.cpp

🚧 Files skipped from review as they are similar to previous changes (3)

src/CraneCtld/RpcService/CtldGrpcServer.h
src/Craned/Core/CtldClient.h
src/Craned/Core/CtldClient.cpp

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-05-02T07:06:36.103Z

Learnt from: RileyWen
PR: PKUHPC/CraneSched#475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

src/CraneCtld/RpcService/CtldGrpcServer.cpp

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: format

🔇 Additional comments (1)

protos/Crane.proto (1)

1015-1015: LGTM!

The RPC definition follows the established pattern for internal service RPCs and is appropriately placed in the CraneCtldForInternal service.

src/CraneCtld/RpcService/CtldGrpcServer.cpp

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

src/Craned/Core/CtldClient.cpp (2)
775-780: Use .find() to avoid potential exception from .at().

While the .contains() check at line 775 prevents most issues, using .at() at line 780 still risks std::out_of_range in edge cases. A previous review comment suggested using .find() for safer access.

Apply this diff:
-  if (!g_config.CranedRes.contains(g_config.Hostname)) {
+  auto it = g_config.CranedRes.find(g_config.Hostname);
+  if (it == g_config.CranedRes.end()) {
     CRANE_ERROR("Failed to get node config info.");
     return;
   }
 
-  auto node_config = g_config.CranedRes.at(g_config.Hostname);
+  auto node_config = it->second;
805-805: Remove unnecessary return statement.

The return at the end of a void function is unnecessary.

Apply this diff:
   CRANE_DEBUG("ConfigMatchCheck success.");
   CranedReportHealth_(true);
-  return;
 }

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4c9e1fb and 04334ab.

📒 Files selected for processing (5)

etc/config.yaml (1 hunks)
src/Craned/Core/Craned.cpp (1 hunks)
src/Craned/Core/CranedPublicDefs.h (1 hunks)
src/Craned/Core/CtldClient.cpp (2 hunks)
src/Craned/Core/CtldClient.h (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

src/Craned/Core/Craned.cpp
etc/config.yaml

🧰 Additional context used

🧬 Code graph analysis (2)

src/Craned/Core/CtldClient.cpp (2)

src/Craned/Core/CranedPublicDefs.h (1)

g_config (156-156)

src/Utilities/PublicHeader/OS.cpp (2)

GetNodeInfo (33-62)

GetNodeInfo (33-33)

src/Craned/Core/CtldClient.h (1)

src/Craned/Core/CtldClient.cpp (4)

CranedReportHealth_ (747-763)

CranedReportHealth_ (747-747)

ConfigMatchCheck_ (765-806)

ConfigMatchCheck_ (765-765)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: build

🔇 Additional comments (4)

src/Craned/Core/CranedPublicDefs.h (1)

153-153: LGTM!

The addition of ConfigCheckInterval to the Config struct is clean and follows the existing pattern.

src/Craned/Core/CtldClient.h (1)

212-214: LGTM!

The new method declarations and timer member follow existing patterns in the class and are appropriately placed in the private section.

Also applies to: 245-245

src/Craned/Core/CtldClient.cpp (2)

765-806: Good implementation of config health check with helpful debug logging.

The implementation correctly compares CPU and memory resources against configured values and reports health status. The debug messages at lines 796 and 803 provide useful diagnostics.

381-391: I need to verify the actual implementation of the m_ping_handle_ callback to confirm whether it returns bool or void. Let me check that:

Now let me search for the actual m_ping_handle callback implementation to verify the reviewer's claim:

The review comment's assertion about callback return semantics is incorrect.

Based on verification of the codebase and uvw/libuv documentation:

Timer callbacks in libuv return void, not bool. All timer callbacks found in the codebase (TaskScheduler.cpp, JobManager.cpp, TaskManager.cpp) consistently return void.

The mechanism to stop a timer is not by returning a value, but by calling handle.stop() directly within the callback, as demonstrated in JobManager.cpp line 80: if (EvCheckSupervisorRunning_()) handle.stop();

The config_check_timer callback at lines 381-391 correctly returns void and is consistent with all other timer callbacks in the codebase.

The TODO comment at line 385 may warrant clarification, but it is unrelated to callback return types. The code implementation is correct as-is.

Likely an incorrect or invalid review comment.

src/Craned/Core/CtldClient.cpp

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)

172-180: Add readiness, allowlist, and online checks before updating node state.

Mirror StepStatusChange/CranedPing preconditions: server ready; craned ID allowed; craned online. Return OK for unauthorized/offline to avoid dropping channel; use UNAVAILABLE only when server not ready. Based on learnings.

Apply within this block:

 grpc::Status CtldForInternalServiceImpl::CranedReportHealth(
     grpc::ServerContext *context,
     const crane::grpc::CranedReportHealthRequest *request,
     google::protobuf::Empty *response) {
-  g_meta_container->UpdateNodeStateWithConfigCheck_(request->craned_id(),
-                                                    request->matched());
-
-  return grpc::Status::OK;
+  if (!g_runtime_status.srv_ready.load(std::memory_order_acquire))
+    return grpc::Status{grpc::StatusCode::UNAVAILABLE,
+                        "CraneCtld Server is not ready"};
+
+  const auto &id = request->craned_id();
+  if (!g_meta_container->CheckCranedAllowed(id)) {
+    CRANE_WARN("Reject health report from unknown node {}", id);
+    return grpc::Status::OK;
+  }
+  if (!g_meta_container->CheckCranedOnline(id)) {
+    CRANE_WARN("Reject health report from offline node {}", id);
+    return grpc::Status::OK;
+  }
+
+  g_meta_container->UpdateNodeStateWithConfigCheck_(id, request->matched());
+  return grpc::Status::OK;
 }

src/Craned/Core/CtldClient.cpp (1)

760-763: Include gRPC error details in log.

Currently logs only is_matched. Add error_code and error_message (and debug_error_string) for diagnostics. This was previously requested and still appears unresolved in this hunk.

Apply:
-  if (!result.ok()) {
-    CRANE_ERROR("CranedReportHealth_ failed: is_matched={}", is_match);
-  }
+  if (!result.ok()) {
+    CRANE_ERROR("CranedReportHealth_ failed: is_matched={}, code={}, msg={}, ctx={}",
+                is_match, static_cast<int>(result.error_code()),
+                result.error_message(), context.debug_error_string());
+  }

🧹 Nitpick comments (3)

src/CraneCtld/CranedMetaContainer.h (1)

92-93: Public API naming: drop trailing underscore or add a non-underscore wrapper.

Exposed public methods shouldn’t use the trailing “” private-helper convention. Either rename to UpdateNodeStateWithConfigCheck or add a non-underscore inline wrapper and keep the “” variant private.
src/Craned/Core/CtldClient.cpp (1)
755-756: Unify RPC deadlines.

Use kCranedRpcTimeoutSeconds for CranedReportHealth_ to match the rest of the client and ease tuning.

Apply:
-  context.set_deadline(std::chrono::system_clock::now() +
-                       std::chrono::seconds(1));
+  context.set_deadline(std::chrono::system_clock::now() +
+                       std::chrono::seconds(kCranedRpcTimeoutSeconds));
Also applies to: 724-726
src/CraneCtld/CranedMetaContainer.cpp (1)
867-876: Avoid “changed” trace when state didn’t change.

Log is emitted unconditionally; add a changed flag to avoid misleading traces when drain remains the same.

Apply:
-  if (craned_meta->drain &&
+  bool changed = false;
+  if (craned_meta->drain &&
       craned_meta->state_reason == ConfigMatchCheckFailedReason) {
-    craned_meta->drain = !is_match;
-    if (is_match) craned_meta->state_reason.clear();
+    bool new_drain = !is_match;
+    changed = (new_drain != craned_meta->drain) || (is_match && !craned_meta->state_reason.empty());
+    craned_meta->drain = new_drain;
+    if (is_match) craned_meta->state_reason.clear();
   } else if (!craned_meta->drain && !is_match) {
-    craned_meta->drain = true;
-    craned_meta->state_reason = ConfigMatchCheckFailedReason;
+    craned_meta->drain = true;
+    craned_meta->state_reason = ConfigMatchCheckFailedReason;
+    changed = true;
   }
-  CRANE_TRACE("ConfigMatch check: craned_id '{}' drain state changed to {}.",
-              craned_id, craned_meta->drain);
+  if (changed)
+    CRANE_TRACE("ConfigMatch check: craned_id '{}' drain state -> {}.",
+                craned_id, craned_meta->drain);

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fc1043d and 8369e1b.

📒 Files selected for processing (5)

src/CraneCtld/CranedMetaContainer.cpp (1 hunks)
src/CraneCtld/CranedMetaContainer.h (1 hunks)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1 hunks)
src/Craned/Core/CtldClient.cpp (3 hunks)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

src/Utilities/PublicHeader/include/crane/PublicHeader.h

🧰 Additional context used

🧠 Learnings (3)

📚 Learning: 2025-05-02T07:06:36.103Z

Learnt from: RileyWen
PR: PKUHPC/CraneSched#475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

src/CraneCtld/RpcService/CtldGrpcServer.cpp

📚 Learning: 2025-06-23T07:53:30.513Z

Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.

Applied to files:

src/Craned/Core/CtldClient.cpp

📚 Learning: 2025-05-09T01:54:21.256Z

Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

src/Craned/Core/CtldClient.cpp

🧬 Code graph analysis (2)

src/CraneCtld/CranedMetaContainer.h (1)

src/CraneCtld/CranedMetaContainer.cpp (2)

UpdateNodeStateWithConfigCheck_ (856-877)

UpdateNodeStateWithConfigCheck_ (856-857)

src/Craned/Core/CtldClient.cpp (2)

src/Craned/Core/CranedPublicDefs.h (1)

g_config (156-156)

src/Utilities/PublicHeader/OS.cpp (2)

GetNodeInfo (33-62)

GetNodeInfo (33-33)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: build

🔇 Additional comments (1)

src/Craned/Core/CtldClient.cpp (1)
382-392: Timer callback consistency and lifecycle.

Other uvw timer handlers in this file return a boolean; this one doesn’t return anything. Align the signature/return to avoid handler mismatch and subtle behavior differences.

Apply:
   m_config_check_timer_->on<uvw::timer_event>(
       [this](const uvw::timer_event&, uvw::timer_handle& h) {
         if (m_stopping_ || !m_stub_) return;
         // TODO: should check state?
         ConfigMatchCheck_();
-      });
+        return true;
+      });
If uvw expects a void-returning listener here, confirm compilation; otherwise adapt all timer callbacks to a single consistent signature.

src/Craned/Core/CtldClient.cpp

src/CraneCtld/CranedMetaContainer.cpp

protos/Crane.proto

src/CraneCtld/CranedMetaContainer.cpp

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (2)

src/Craned/Core/CtldClient.cpp (1)
777-782: Use CranedIdOfThisNode to index CranedRes consistently.

The code uses g_config.Hostname to look up node configuration, but this was flagged in past reviews and should use g_config.CranedIdOfThisNode instead for consistency with other parts of the codebase (e.g., CranedRegister at line 558).

Apply this diff:
-  if (!g_config.CranedRes.contains(g_config.Hostname)) {
+  if (!g_config.CranedRes.contains(g_config.CranedIdOfThisNode)) {
     CRANE_ERROR("Failed to get node config info.");
     return;
   }
 
-  auto node_config = g_config.CranedRes.at(g_config.Hostname);
+  auto node_config = g_config.CranedRes.at(g_config.CranedIdOfThisNode);
Based on learnings.
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)
172-180: Add validation checks to prevent unauthorized health reports.

The handler lacks critical validation checks present in similar internal RPC handlers. This could allow unauthorized or offline nodes to modify drain state.

Apply this diff to add validation:
 grpc::Status CtldForInternalServiceImpl::CranedReportHealth(
     grpc::ServerContext *context,
     const crane::grpc::CranedReportHealthRequest *request,
     google::protobuf::Empty *response) {
+  if (!g_runtime_status.srv_ready.load(std::memory_order_acquire))
+    return grpc::Status{grpc::StatusCode::UNAVAILABLE,
+                        "CraneCtld Server is not ready"};
+
+  if (!g_meta_container->CheckCranedAllowed(request->craned_id())) {
+    CRANE_WARN("Reject health report from unknown node {}",
+               request->craned_id());
+    return grpc::Status::OK;
+  }
+
+  if (!g_meta_container->CheckCranedOnline(request->craned_id())) {
+    CRANE_WARN("Reject health report from offline node {}",
+               request->craned_id());
+    return grpc::Status::OK;
+  }
+
   g_meta_container->UpdateNodeStateWithHealthCheck_(
       request->craned_id(), request->healthy(), request->reason());
 
   return grpc::Status::OK;
 }
Based on learnings.

🧹 Nitpick comments (2)

src/Craned/Core/CtldClient.cpp (2)
381-392: Consider removing redundant TODO comment.

The TODO at Line 386 asks whether to check state, but ConfigMatchCheck_() already checks server readiness at Line 768. The current approach is correct—checking inside the function keeps the logic centralized.

If you prefer to avoid the function call overhead when not ready, check the state in the timer callback:
   m_config_check_timer_->on<uvw::timer_event>(
       [this](const uvw::timer_event&, uvw::timer_handle& h) {
-        if (m_stopping_ || !m_stub_) return;
-        // TODO: should check state?
+        if (m_stopping_ || !m_stub_ || !g_ctld_client_sm->IsReadyNow()) return;
         ConfigMatchCheck_();
       });
789-806: Standardize the order of logging and health reporting.

The CPU check (lines 790-794) logs before calling CranedReportHealth_, while the memory check (lines 801-805) calls CranedReportHealth_ first then logs. Standardizing the order improves maintainability.

Apply this diff to make the memory check consistent with the CPU check:
   if (std::abs(node_real.memory_gb - mem_gb_config) > kMemoryToleranceGB) {
-    CranedReportHealth_(false, reason);
     CRANE_WARN("ConfigMatchCheck fail. config_mem : {:.3f}, real_mem : {:.3f}",
                mem_gb_config, node_real.memory_gb);
+    CranedReportHealth_(false, reason);
     return;
   }

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8369e1b and 7327317.

📒 Files selected for processing (6)

protos/Crane.proto (2 hunks)
src/CraneCtld/CranedMetaContainer.cpp (1 hunks)
src/CraneCtld/CranedMetaContainer.h (1 hunks)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1 hunks)
src/Craned/Core/CtldClient.cpp (3 hunks)
src/Craned/Core/CtldClient.h (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

protos/Crane.proto
src/CraneCtld/CranedMetaContainer.h

🧰 Additional context used

🧠 Learnings (4)

📚 Learning: 2025-05-02T07:06:36.103Z

Learnt from: RileyWen
PR: PKUHPC/CraneSched#475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

src/CraneCtld/RpcService/CtldGrpcServer.cpp

📚 Learning: 2025-06-30T08:43:44.470Z

Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

src/Craned/Core/CtldClient.cpp

📚 Learning: 2025-06-23T07:53:30.513Z

Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.

Applied to files:

src/Craned/Core/CtldClient.cpp

📚 Learning: 2025-05-09T01:54:21.256Z

Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

src/Craned/Core/CtldClient.cpp

🧬 Code graph analysis (2)

src/Craned/Core/CtldClient.h (1)

src/Craned/Core/CtldClient.cpp (4)

CranedReportHealth_ (748-765)

CranedReportHealth_ (748-748)

ConfigMatchCheck_ (767-823)

ConfigMatchCheck_ (767-767)

src/Craned/Core/CtldClient.cpp (2)

src/Craned/Core/CranedPublicDefs.h (1)

g_config (156-156)

src/Utilities/PublicHeader/OS.cpp (2)

GetNodeInfo (33-62)

GetNodeInfo (33-33)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: build

src/CraneCtld/CranedMetaContainer.cpp

src/Craned/Core/CtldClient.cpp

L-Xiafeng

所有的命名都统一下

etc/config.yaml

src/CraneCtld/CranedMetaContainer.cpp

src/Craned/Core/Craned.cpp

src/Craned/Core/CtldClient.cpp

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)
72-73: Rename to follow the naming convention.

All other constants in this file use a k prefix (e.g., kDefaultHost, kUnlimitedQosName, kHostFilePath). This constant should be renamed to kNodeHealthCheckFailedReason for consistency.

Apply this diff to fix the naming:
-inline const char* const NodeHealthCheckFailedReason =
+inline const char* const kNodeHealthCheckFailedReason =
     "Node failed health check";

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d4b32e7 and 1c983fa.

📒 Files selected for processing (9)

etc/config.yaml (1 hunks)
src/CraneCtld/CranedMetaContainer.cpp (1 hunks)
src/CraneCtld/CranedMetaContainer.h (1 hunks)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1 hunks)
src/Craned/Core/Craned.cpp (1 hunks)
src/Craned/Core/CranedPublicDefs.h (1 hunks)
src/Craned/Core/CtldClient.cpp (3 hunks)
src/Craned/Core/CtldClient.h (2 hunks)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (5)

src/CraneCtld/RpcService/CtldGrpcServer.cpp
src/Craned/Core/Craned.cpp
src/Craned/Core/CtldClient.h
src/Craned/Core/CtldClient.cpp
etc/config.yaml

🧰 Additional context used

🧬 Code graph analysis (1)

src/CraneCtld/CranedMetaContainer.h (1)

src/CraneCtld/CranedMetaContainer.cpp (2)

UpdateNodeStateWithNodeHealthCheck (856-878)

UpdateNodeStateWithNodeHealthCheck (856-857)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: format

🔇 Additional comments (3)

src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)

70-70: LGTM!

The constant follows the naming convention and the tolerance value of 0.01 GB (~10 MB) is reasonable for node health checks.

src/CraneCtld/CranedMetaContainer.h (1)

92-94: LGTM!

The method declaration is well-formed with appropriate parameter types and passing conventions.

src/Craned/Core/CranedPublicDefs.h (1)

153-153: No changes required; zero-value handling is already implemented gracefully.

The code explicitly checks if (g_config.NodeHealthCheckInterval > 0) before starting the health-check timer in CtldClient.cpp. If the value is zero (either by explicit config or any other means), the timer simply does not start—effectively disabling the feature without error. Configuration parsing in Craned.cpp provides a sensible default of 30 seconds, and explicit zero values are handled safely.

src/CraneCtld/CranedMetaContainer.cpp

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

src/Craned/Core/CtldClient.h (2)
214-215: Consider adding documentation for the health check method.

While the method signature is clear, a brief comment explaining that this performs periodic node resource validation (CPU, memory, devices) would improve maintainability.
+  // Performs periodic health checks on node resources (CPU, memory, devices).
   void NodeHealthCheck_();
244-245: Design pattern looks good; consider adding documentation.

The timer member follows the established pattern with m_ping_handle_ and is appropriately placed. Adding a brief comment would clarify its purpose.
+  // Timer for scheduling periodic node health checks.
   std::shared_ptr<uvw::timer_handle> m_node_health_check_timer_;

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c983fa and 3ba658a.

📒 Files selected for processing (2)

src/Craned/Core/CtldClient.cpp (3 hunks)
src/Craned/Core/CtldClient.h (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

src/Craned/Core/CtldClient.cpp

🧰 Additional context used

🧬 Code graph analysis (1)

src/Craned/Core/CtldClient.h (1)

src/Craned/Core/CtldClient.cpp (4)

CranedReportHealth_ (748-765)

CranedReportHealth_ (748-748)

NodeHealthCheck_ (767-823)

NodeHealthCheck_ (767-767)

src/Craned/Core/CtldClient.h

src/Craned/Core/CtldClient.cpp

src/Craned/Core/Craned.cpp

etc/config.yaml

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

src/Craned/Core/CtldClient.cpp (2)
749-766: [Unresolved from past review] Improve error logging and optimize parameter passing.

Past review comments on these lines remain unaddressed:

Line 749: The reason parameter should be passed by const std::string& to avoid unnecessary copies.

Line 764: The error log lacks diagnostic details—include result.error_code(), result.error_message(), and context.debug_error_string().

Apply this diff:
-void CtldClient::CranedReportHealth_(bool is_healthy, std::string reason) {
+void CtldClient::CranedReportHealth_(bool is_healthy, const std::string& reason) {
   if (m_stopping_ || !m_stub_) return;
 
   grpc::ClientContext context;
   crane::grpc::CranedReportHealthRequest request;
   google::protobuf::Empty reply;
 
   context.set_deadline(std::chrono::system_clock::now() +
                        std::chrono::seconds(kCranedRpcTimeoutSeconds));
   request.set_craned_id(g_config.CranedIdOfThisNode);
   request.set_healthy(is_healthy);
   request.set_reason(reason);
 
   auto result = m_stub_->CranedReportHealth(&context, request, &reply);
   if (!result.ok()) {
-    CRANE_ERROR("CranedReportHealth failed: is_healthy={}", is_healthy);
+    CRANE_ERROR("CranedReportHealth failed: is_healthy={}, reason: {} | {}, code: {}",
+                is_healthy, result.error_message(), context.debug_error_string(),
+                static_cast<int>(result.error_code()));
   }
 }
778-783: [Unresolved from past review] Use CranedIdOfThisNode to index CranedRes.

Past review flagged this as a major issue: using g_config.Hostname instead of g_config.CranedIdOfThisNode can cause lookup mismatches if the names differ. Based on retrieved learnings, g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Apply this diff:
-  if (!g_config.CranedRes.contains(g_config.Hostname)) {
+  if (!g_config.CranedRes.contains(g_config.CranedIdOfThisNode)) {
     CRANE_ERROR("Failed to get node config info.");
     return;
   }
 
-  auto node_config = g_config.CranedRes.at(g_config.Hostname);
+  auto node_config = g_config.CranedRes.at(g_config.CranedIdOfThisNode);

🧹 Nitpick comments (1)

src/Craned/Core/CtldClient.cpp (1)

787-822: Consider customizing the reason string for each failure type.

Currently, all health check failures report the same constant reason string (NodeHealthCheckFailedReason), regardless of whether the CPU check, memory check, or device file check failed. This reduces diagnostic value for operators receiving the health report.

Consider customizing the reason string at each failure point:

-  std::string reason = NodeHealthCheckFailedReason;
   int64_t cpu_count =
       static_cast<int64_t>(node_config->allocatable_res.cpu_count);
   if (node_real.cpu < cpu_count) {
     CRANE_WARN(
         "Node health check fail. config cpu_count: {}, real cpu_count: {}",
         cpu_count, node_real.cpu);
-    CranedReportHealth_(false, reason);
+    CranedReportHealth_(false, "CPU count mismatch");
     return;
   }
 
   uint64_t mem_bytes_config = node_config->allocatable_res.memory_bytes;
   double mem_gb_config =
       static_cast<double>(mem_bytes_config) / (1024 * 1024 * 1024);
 
   if (std::abs(node_real.memory_gb - mem_gb_config) > kMemoryToleranceGB) {
-    CranedReportHealth_(false, reason);
+    CranedReportHealth_(false, "Memory size mismatch");
     CRANE_WARN("Node health check fail. config_mem : {:.3f}, real_mem : {:.3f}",
                mem_gb_config, node_real.memory_gb);
     return;
   }
 
   for (const auto& device_pair : Common::g_this_node_device) {
     const auto& device = device_pair.second;
     for (const auto& file_meta : device->device_file_metas) {
       if (!std::filesystem::exists(file_meta.path)) {
         CRANE_WARN("Node health check fail. Device file {} not found.",
                    file_meta.path);
-        CranedReportHealth_(false, reason);
+        CranedReportHealth_(false, fmt::format("Device file {} not found", file_meta.path));
         return;
       }
     }
   }
 
   CRANE_DEBUG("Node health check success.");
-  CranedReportHealth_(true, reason);
+  CranedReportHealth_(true, "Health check passed");
   return;
 }

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 41ae069 and f454033.

📒 Files selected for processing (4)

etc/config.yaml (2 hunks)
src/Craned/Core/Craned.cpp (1 hunks)
src/Craned/Core/CranedPublicDefs.h (2 hunks)
src/Craned/Core/CtldClient.cpp (3 hunks)

✅ Files skipped from review due to trivial changes (1)

src/Craned/Core/CranedPublicDefs.h

🚧 Files skipped from review as they are similar to previous changes (2)

src/Craned/Core/Craned.cpp
etc/config.yaml

🧰 Additional context used

🧠 Learnings (10)

📓 Common learnings

Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.

Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

📚 Learning: 2025-06-23T07:53:30.513Z

Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.

Applied to files:

src/Craned/Core/CtldClient.cpp

📚 Learning: 2025-05-09T01:54:21.256Z

Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

src/Craned/Core/CtldClient.cpp

📚 Learning: 2025-08-14T02:56:35.503Z

Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#587
File: src/Craned/Supervisor/CforedClient.cpp:449-454
Timestamp: 2025-08-14T02:56:35.503Z
Learning: In CforedClient::AsyncSendRecvThread_(), the guard `if (state <= State::Registering) { continue; }` in the TIMEOUT branch only prevents premature cleanup when stopping before registration completes, but it doesn't block normal gRPC event processing. The completion queue will still deliver Prepare/Write/Read events that advance the state machine normally.

Applied to files:

src/Craned/Core/CtldClient.cpp

📚 Learning: 2025-06-30T08:43:44.470Z

Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

src/Craned/Core/CtldClient.cpp

📚 Learning: 2025-06-23T07:55:19.564Z

Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/Craned/CtldClient.cpp:32-38
Timestamp: 2025-06-23T07:55:19.564Z
Learning: 在uvw库的timer_handle回调中，返回true会保持定时器继续运行，返回false会停止定时器。在src/Craned/CtldClient.cpp中，定时器回调有意返回false来停止定时器，然后通过idle处理器根据m_check_reg_timeout_条件重新启动，这是正确的设计模式。

Applied to files:

src/Craned/Core/CtldClient.cpp

📚 Learning: 2025-04-27T11:52:31.017Z

Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.

Applied to files:

src/Craned/Core/CtldClient.cpp

📚 Learning: 2025-06-07T10:47:59.071Z

Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

src/Craned/Core/CtldClient.cpp

📚 Learning: 2025-04-27T06:47:51.751Z

Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.

Applied to files:

src/Craned/Core/CtldClient.cpp

📚 Learning: 2025-05-02T07:06:36.103Z

Learnt from: RileyWen
PR: PKUHPC/CraneSched#475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

src/Craned/Core/CtldClient.cpp

🧬 Code graph analysis (1)

src/Craned/Core/CtldClient.cpp (2)

src/Craned/Core/CranedPublicDefs.h (1)

g_config (156-156)

src/Utilities/PublicHeader/OS.cpp (2)

GetNodeInfo (33-62)

GetNodeInfo (33-33)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: build

src/Craned/Core/CtldClient.cpp

feat: add MemConfigMatchCheck

c61c8a6

coderabbitai bot reviewed Oct 26, 2025

View reviewed changes

src/Craned/Core/CtldClient.cpp Outdated Show resolved Hide resolved

format

1c1c9b6

coderabbitai bot reviewed Oct 26, 2025

View reviewed changes

src/Craned/Core/CtldClient.cpp Outdated Show resolved Hide resolved

src/Craned/Core/CtldClient.cpp Outdated Show resolved Hide resolved

L-Xiafeng requested changes Oct 27, 2025

View reviewed changes

src/Craned/Core/CtldClient.cpp Show resolved Hide resolved

protos/Crane.proto Outdated Show resolved Hide resolved

src/Craned/Core/CtldClient.cpp Outdated Show resolved Hide resolved

src/Craned/Core/CtldClient.cpp Outdated Show resolved Hide resolved

thread to uvw::timer

4c9e1fb

coderabbitai bot reviewed Oct 27, 2025

View reviewed changes

src/CraneCtld/RpcService/CtldGrpcServer.cpp Show resolved Hide resolved

adjust config check

04334ab

coderabbitai bot reviewed Oct 27, 2025

View reviewed changes

src/Craned/Core/CtldClient.cpp Outdated Show resolved Hide resolved

edragain2nd added 2 commits October 27, 2025 23:43

add partical gres surpport

fc1043d

fix(1)

8369e1b

coderabbitai bot reviewed Oct 27, 2025

View reviewed changes

src/Craned/Core/CtldClient.cpp Outdated Show resolved Hide resolved

L-Xiafeng reviewed Oct 28, 2025

View reviewed changes

src/Craned/Core/CtldClient.cpp Outdated Show resolved Hide resolved

src/Craned/Core/CtldClient.cpp Outdated Show resolved Hide resolved

src/CraneCtld/CranedMetaContainer.cpp Outdated Show resolved Hide resolved

L-Xiafeng reviewed Oct 28, 2025

View reviewed changes

protos/Crane.proto Outdated Show resolved Hide resolved

L-Xiafeng reviewed Oct 28, 2025

View reviewed changes

src/CraneCtld/CranedMetaContainer.cpp Outdated Show resolved Hide resolved

add param reason

7327317

coderabbitai bot reviewed Oct 28, 2025

View reviewed changes

src/CraneCtld/CranedMetaContainer.cpp Outdated Show resolved Hide resolved

src/Craned/Core/CtldClient.cpp Show resolved Hide resolved

add &

d4b32e7

L-Xiafeng changed the title ~~feat: add MemConfigMatchCheck~~ feat: Default node health check Oct 29, 2025

L-Xiafeng added the enhancement New feature or request label Oct 29, 2025

L-Xiafeng requested changes Oct 29, 2025

View reviewed changes

unified naming

1c983fa

coderabbitai bot reviewed Oct 29, 2025

View reviewed changes

src/CraneCtld/CranedMetaContainer.cpp Show resolved Hide resolved

rename timer

3ba658a

coderabbitai bot reviewed Oct 29, 2025

View reviewed changes

src/Craned/Core/CtldClient.h Show resolved Hide resolved

L-Xiafeng reviewed Oct 29, 2025

View reviewed changes

src/Craned/Core/CtldClient.cpp Outdated Show resolved Hide resolved

modify deadline

c548ea3

L-Xiafeng reviewed Oct 29, 2025

View reviewed changes

src/Craned/Core/Craned.cpp Outdated Show resolved Hide resolved

etc/config.yaml Outdated Show resolved Hide resolved

modify the defult value of NodeHealthCheckInterval

41ae069

L-Xiafeng reviewed Oct 30, 2025

View reviewed changes

etc/config.yaml Outdated Show resolved Hide resolved

modify yaml

f454033

coderabbitai bot reviewed Oct 30, 2025

View reviewed changes

src/Craned/Core/CtldClient.cpp Show resolved Hide resolved

formate

cba19d1

github-actions bot added the test-passed Build and test success label Oct 30, 2025

feat: Default node health check #655

Are you sure you want to change the base?

feat: Default node health check #655

Conversation

edragain2nd commented Oct 26, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

L-Xiafeng left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

edragain2nd commented Oct 26, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 26, 2025 •

edited

Loading