-
Notifications
You must be signed in to change notification settings - Fork 32
feat: Default node health check #655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
WalkthroughAdds a Craned->Ctld health-report RPC and handler. Craned schedules periodic node health checks and reports results. Ctld updates node drain state from health reports. Protobuf, Ctld gRPC handler, meta-container state update helper, client timer/check/reporting, config, and public constants added. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Craned as Craned (CtldClient)
participant Ctld as CraneCtld (CtldGrpcServer)
participant Meta as MetaContainer
loop every NodeHealthCheckInterval
Craned->>Craned: NodeHealthCheck_() — check CPU/memory, device files
alt healthy
Craned->>Ctld: CranedReportHealthRequest(healthy=true, reason="")
else unhealthy
Craned->>Ctld: CranedReportHealthRequest(healthy=false, reason="...reason...")
end
Ctld->>Meta: UpdateNodeStateWithNodeHealthCheck(craned_id, healthy, reason)
Note right of Meta: set/clear drain state and update `state_reason`
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)
70-70: Consider increasing the memory tolerance value.A tolerance of 0.01 GB (10 MB) may be too strict for production environments. OS kernel buffers, caches, and normal memory usage variations can easily exceed 10 MB, potentially causing false positives that incorrectly drain healthy nodes. Consider increasing this to at least 0.5-1.0 GB to account for typical system memory fluctuations.
-inline constexpr double kMemoryToleranceGB = 0.01; +inline constexpr double kMemoryToleranceGB = 0.5;src/Craned/Core/CtldClient.cpp (1)
469-476: Add thread naming for better debugging.The new thread should be named using
util::SetCurrentThreadName()for easier identification during debugging and monitoring, following the pattern used in other threads in this file (e.g., line 358).m_mem_config_check_thread_ = std::thread([this] { + util::SetCurrentThreadName("MemConfigCheck"); do { std::this_thread::sleep_for(std::chrono::seconds(5)); if (m_stopping_ || !m_stub_) return;
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
protos/Crane.proto(2 hunks)src/CraneCtld/CranedMetaContainer.cpp(1 hunks)src/CraneCtld/CranedMetaContainer.h(1 hunks)src/CraneCtld/RpcService/CtldGrpcServer.cpp(1 hunks)src/CraneCtld/RpcService/CtldGrpcServer.h(1 hunks)src/Craned/Core/CtldClient.cpp(3 hunks)src/Craned/Core/CtldClient.h(2 hunks)src/Utilities/PublicHeader/include/crane/PublicHeader.h(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.
Applied to files:
src/Craned/Core/CtldClient.h
🧬 Code graph analysis (4)
src/CraneCtld/CranedMetaContainer.h (1)
src/CraneCtld/CranedMetaContainer.cpp (2)
UpdateNodeStateWithMemConfigCheck_(856-877)UpdateNodeStateWithMemConfigCheck_(856-857)
src/CraneCtld/RpcService/CtldGrpcServer.h (1)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (2)
SendMemConfigCheckResult(172-180)SendMemConfigCheckResult(172-175)
src/Craned/Core/CtldClient.h (1)
src/Craned/Core/CtldClient.cpp (4)
SendMemConfigCheckResult_(745-759)SendMemConfigCheckResult_(745-745)MemConfigCheck_(761-793)MemConfigCheck_(761-761)
src/Craned/Core/CtldClient.cpp (2)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(154-154)src/Utilities/PublicHeader/OS.cpp (2)
GetNodeInfo(33-62)GetNodeInfo(33-33)
🔇 Additional comments (9)
src/Craned/Core/CtldClient.h (1)
212-214: LGTM!The method declarations and thread member follow the existing patterns in the class and are appropriately placed.
Also applies to: 244-244
src/CraneCtld/RpcService/CtldGrpcServer.h (1)
201-204: LGTM!The gRPC handler declaration follows the standard pattern and is consistent with other handlers in the service.
src/CraneCtld/CranedMetaContainer.h (1)
92-93: LGTM!The method declaration is clear and appropriately positioned in the public section of the class.
protos/Crane.proto (1)
941-944: LGTM!The protobuf message and RPC definitions are clean and follow the existing patterns in the service definition.
Also applies to: 1015-1015
src/Craned/Core/CtldClient.cpp (2)
473-473: Address the TODO comment.The TODO comment suggests that state checking might be needed before running MemConfigCheck_. Should this check whether the client state machine is in the READY state before performing the memory configuration check?
745-759: LGTM!The SendMemConfigCheckResult_ implementation correctly sends the result via gRPC with appropriate guards for stopping state and stub validity.
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)
172-180: LGTM!The gRPC handler implementation correctly delegates to the meta container and follows the pattern of other simple handlers in this file. Error handling for unknown nodes is appropriately managed within UpdateNodeStateWithMemConfigCheck_.
src/CraneCtld/CranedMetaContainer.cpp (2)
885-891: LGTM!The added early return for invalid partition improves validation and provides clearer error messages. This strengthens the partition access control logic.
867-874: No changes needed; code logic correctly guards drain state transitions.The code already prevents the scenario described. When an admin manually drains a node with a custom reason (line 801:
state_reason = request.reason()), the MemConfig check logic cannot overwrite it because:
- Line 868 checks:
state_reason == MemConfigCheckFailedReasonbefore any mem-config-triggered drain modification- Line 871 only sets the mem-config reason if the node was not already drained (
!craned_meta->drain)If admin drains first with reason "hardware maintenance", a subsequent mem-config failure will not enter either condition, preserving the manual reason. When mem-config later passes, it similarly won't clear a manually-set drain because the state_reason won't match. The design correctly isolates mem-config drain management from manual drains.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/Craned/Core/CtldClient.cpp (1)
781-784: Fix incorrect CRANE_DEBUG format string.This issue was already flagged in the past review comments. The format string has no placeholders but arguments are provided, and they won't be logged.
Apply this diff:
if (std::abs(node.memory_gb - config_cpu_mem_gb) <= kMemoryToleranceGB) { - CRANE_DEBUG("MemConfigCheck success.", config_cpu_mem_gb, node.memory_gb); + CRANE_DEBUG("MemConfigCheck success. config_mem: {:.3f} GB, real_mem: {:.3f} GB", + config_cpu_mem_gb, node.memory_gb); SendMemConfigCheckResult_(true); return;
🧹 Nitpick comments (4)
src/Craned/Core/CtldClient.cpp (4)
469-476: Set thread name and address the TODO comment.The thread should have a descriptive name for debugging purposes, consistent with other threads in this file.
Apply this diff to set the thread name:
m_mem_config_check_thread_ = std::thread([this] { + util::SetCurrentThreadName("MemConfigChkThr"); do { std::this_thread::sleep_for(std::chrono::seconds(5)); if (m_stopping_ || !m_stub_) return; - // TODO: should check state? MemConfigCheck_(); } while (true); });Regarding the TODO comment: Consider whether the check should verify
g_ctld_client_sm->IsReadyNow()before callingMemConfigCheck_(), similar to the pattern used inAsyncSendThread_()at line 646. The current implementation already checksg_server->ReadyFor(RequestSource::CTLD)insideMemConfigCheck_()at line 762, which may be sufficient.
769-772: Clarify or remove the CPU check.The silent return when
node.cpu == 0is unclear. If this is a valid edge case, add a log message explaining why the check is skipped. If it's defensive coding against bad data, consider logging a warning.Consider adding a log message:
// TODO add gres support if (node.cpu == 0) { + CRANE_TRACE("Skipping MemConfigCheck: node.cpu is 0"); return; }
785-790: Remove unnecessary return statements.The
returnstatements at lines 789 and 790 are at the end of their respective branches and are unnecessary.Apply this diff:
} else { SendMemConfigCheckResult_(false); CRANE_DEBUG("MemConfigCheck fail. config_mem : {:.3f}, real_mem : {:.3f}", config_cpu_mem_gb, node.memory_gb); - return; }
769-769: Fix typo in comment."surpport" should be "support".
-// TODO add gres surpport +// TODO add gres support
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
src/CraneCtld/CranedMetaContainer.h(1 hunks)src/CraneCtld/RpcService/CtldGrpcServer.cpp(1 hunks)src/CraneCtld/RpcService/CtldGrpcServer.h(1 hunks)src/Craned/Core/CtldClient.cpp(3 hunks)src/Utilities/PublicHeader/include/crane/PublicHeader.h(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- src/CraneCtld/RpcService/CtldGrpcServer.h
- src/Utilities/PublicHeader/include/crane/PublicHeader.h
- src/CraneCtld/RpcService/CtldGrpcServer.cpp
- src/CraneCtld/CranedMetaContainer.h
🧰 Additional context used
🧬 Code graph analysis (1)
src/Craned/Core/CtldClient.cpp (2)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(154-154)src/Utilities/PublicHeader/OS.cpp (2)
GetNodeInfo(33-62)GetNodeInfo(33-33)
🔇 Additional comments (1)
src/Craned/Core/CtldClient.cpp (1)
386-386: LGTM!Properly joins the memory config check thread on destruction, consistent with other thread cleanup in the destructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
etc/config.yaml (1)
145-145: Add inline documentation for MemCheckInterval.The new configuration parameter lacks documentation explaining its purpose and units. Consider adding a comment to help users understand what this controls.
Apply this diff to add documentation:
+# Interval in seconds for craned to check if node memory matches configured value MemCheckInterval: 30src/Craned/Core/CranedPublicDefs.h (1)
152-153: Add documentation comment for MemCheckInterval field.The new field lacks a comment explaining its purpose and units. Other configuration fields in this struct have documentation.
Apply this diff to add documentation:
+ // Interval in seconds for memory configuration health checks uint64_t MemCheckInterval; };protos/Crane.proto (1)
941-944: Add documentation for CranedReportHealthRequest message.The new message lacks documentation explaining the purpose of the
matchedfield. Without context, it's unclear what condition is being reported.Apply this diff to add documentation:
+// Request from Craned to report memory configuration health status message CranedReportHealthRequest { string craned_id = 1; + // Whether node's actual memory matches configured value within tolerance bool matched = 2; }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
etc/config.yaml(1 hunks)protos/Crane.proto(2 hunks)src/CraneCtld/RpcService/CtldGrpcServer.cpp(1 hunks)src/CraneCtld/RpcService/CtldGrpcServer.h(1 hunks)src/Craned/Core/Craned.cpp(1 hunks)src/Craned/Core/CranedPublicDefs.h(1 hunks)src/Craned/Core/CtldClient.cpp(2 hunks)src/Craned/Core/CtldClient.h(2 hunks)
✅ Files skipped from review due to trivial changes (1)
- src/Craned/Core/Craned.cpp
🚧 Files skipped from review as they are similar to previous changes (3)
- src/CraneCtld/RpcService/CtldGrpcServer.h
- src/Craned/Core/CtldClient.h
- src/Craned/Core/CtldClient.cpp
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
PR: PKUHPC/CraneSched#475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: format
🔇 Additional comments (1)
protos/Crane.proto (1)
1015-1015: LGTM!The RPC definition follows the established pattern for internal service RPCs and is appropriately placed in the
CraneCtldForInternalservice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/Craned/Core/CtldClient.cpp (2)
775-780: Use.find()to avoid potential exception from.at().While the
.contains()check at line 775 prevents most issues, using.at()at line 780 still risksstd::out_of_rangein edge cases. A previous review comment suggested using.find()for safer access.Apply this diff:
- if (!g_config.CranedRes.contains(g_config.Hostname)) { + auto it = g_config.CranedRes.find(g_config.Hostname); + if (it == g_config.CranedRes.end()) { CRANE_ERROR("Failed to get node config info."); return; } - auto node_config = g_config.CranedRes.at(g_config.Hostname); + auto node_config = it->second;
805-805: Remove unnecessary return statement.The
returnat the end of a void function is unnecessary.Apply this diff:
CRANE_DEBUG("ConfigMatchCheck success."); CranedReportHealth_(true); - return; }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
etc/config.yaml(1 hunks)src/Craned/Core/Craned.cpp(1 hunks)src/Craned/Core/CranedPublicDefs.h(1 hunks)src/Craned/Core/CtldClient.cpp(2 hunks)src/Craned/Core/CtldClient.h(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- src/Craned/Core/Craned.cpp
- etc/config.yaml
🧰 Additional context used
🧬 Code graph analysis (2)
src/Craned/Core/CtldClient.cpp (2)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(156-156)src/Utilities/PublicHeader/OS.cpp (2)
GetNodeInfo(33-62)GetNodeInfo(33-33)
src/Craned/Core/CtldClient.h (1)
src/Craned/Core/CtldClient.cpp (4)
CranedReportHealth_(747-763)CranedReportHealth_(747-747)ConfigMatchCheck_(765-806)ConfigMatchCheck_(765-765)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (4)
src/Craned/Core/CranedPublicDefs.h (1)
153-153: LGTM!The addition of
ConfigCheckIntervalto the Config struct is clean and follows the existing pattern.src/Craned/Core/CtldClient.h (1)
212-214: LGTM!The new method declarations and timer member follow existing patterns in the class and are appropriately placed in the private section.
Also applies to: 245-245
src/Craned/Core/CtldClient.cpp (2)
765-806: Good implementation of config health check with helpful debug logging.The implementation correctly compares CPU and memory resources against configured values and reports health status. The debug messages at lines 796 and 803 provide useful diagnostics.
381-391: I need to verify the actual implementation of them_ping_handle_callback to confirm whether it returnsboolorvoid. Let me check that:Now let me search for the actual m_ping_handle callback implementation to verify the reviewer's claim:
The review comment's assertion about callback return semantics is incorrect.
Based on verification of the codebase and uvw/libuv documentation:
Timer callbacks in libuv return void, not
bool. All timer callbacks found in the codebase (TaskScheduler.cpp, JobManager.cpp, TaskManager.cpp) consistently returnvoid.The mechanism to stop a timer is not by returning a value, but by calling
handle.stop()directly within the callback, as demonstrated in JobManager.cpp line 80:if (EvCheckSupervisorRunning_()) handle.stop();The config_check_timer callback at lines 381-391 correctly returns
voidand is consistent with all other timer callbacks in the codebase.The TODO comment at line 385 may warrant clarification, but it is unrelated to callback return types. The code implementation is correct as-is.
Likely an incorrect or invalid review comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)
172-180: Add readiness, allowlist, and online checks before updating node state.Mirror StepStatusChange/CranedPing preconditions: server ready; craned ID allowed; craned online. Return OK for unauthorized/offline to avoid dropping channel; use UNAVAILABLE only when server not ready. Based on learnings.
Apply within this block:
grpc::Status CtldForInternalServiceImpl::CranedReportHealth( grpc::ServerContext *context, const crane::grpc::CranedReportHealthRequest *request, google::protobuf::Empty *response) { - g_meta_container->UpdateNodeStateWithConfigCheck_(request->craned_id(), - request->matched()); - - return grpc::Status::OK; + if (!g_runtime_status.srv_ready.load(std::memory_order_acquire)) + return grpc::Status{grpc::StatusCode::UNAVAILABLE, + "CraneCtld Server is not ready"}; + + const auto &id = request->craned_id(); + if (!g_meta_container->CheckCranedAllowed(id)) { + CRANE_WARN("Reject health report from unknown node {}", id); + return grpc::Status::OK; + } + if (!g_meta_container->CheckCranedOnline(id)) { + CRANE_WARN("Reject health report from offline node {}", id); + return grpc::Status::OK; + } + + g_meta_container->UpdateNodeStateWithConfigCheck_(id, request->matched()); + return grpc::Status::OK; }src/Craned/Core/CtldClient.cpp (1)
760-763: Include gRPC error details in log.Currently logs only is_matched. Add error_code and error_message (and debug_error_string) for diagnostics. This was previously requested and still appears unresolved in this hunk.
Apply:
- if (!result.ok()) { - CRANE_ERROR("CranedReportHealth_ failed: is_matched={}", is_match); - } + if (!result.ok()) { + CRANE_ERROR("CranedReportHealth_ failed: is_matched={}, code={}, msg={}, ctx={}", + is_match, static_cast<int>(result.error_code()), + result.error_message(), context.debug_error_string()); + }
🧹 Nitpick comments (3)
src/CraneCtld/CranedMetaContainer.h (1)
92-93: Public API naming: drop trailing underscore or add a non-underscore wrapper.Exposed public methods shouldn’t use the trailing “” private-helper convention. Either rename to UpdateNodeStateWithConfigCheck or add a non-underscore inline wrapper and keep the “” variant private.
src/Craned/Core/CtldClient.cpp (1)
755-756: Unify RPC deadlines.Use kCranedRpcTimeoutSeconds for CranedReportHealth_ to match the rest of the client and ease tuning.
Apply:
- context.set_deadline(std::chrono::system_clock::now() + - std::chrono::seconds(1)); + context.set_deadline(std::chrono::system_clock::now() + + std::chrono::seconds(kCranedRpcTimeoutSeconds));Also applies to: 724-726
src/CraneCtld/CranedMetaContainer.cpp (1)
867-876: Avoid “changed” trace when state didn’t change.Log is emitted unconditionally; add a changed flag to avoid misleading traces when drain remains the same.
Apply:
- if (craned_meta->drain && + bool changed = false; + if (craned_meta->drain && craned_meta->state_reason == ConfigMatchCheckFailedReason) { - craned_meta->drain = !is_match; - if (is_match) craned_meta->state_reason.clear(); + bool new_drain = !is_match; + changed = (new_drain != craned_meta->drain) || (is_match && !craned_meta->state_reason.empty()); + craned_meta->drain = new_drain; + if (is_match) craned_meta->state_reason.clear(); } else if (!craned_meta->drain && !is_match) { - craned_meta->drain = true; - craned_meta->state_reason = ConfigMatchCheckFailedReason; + craned_meta->drain = true; + craned_meta->state_reason = ConfigMatchCheckFailedReason; + changed = true; } - CRANE_TRACE("ConfigMatch check: craned_id '{}' drain state changed to {}.", - craned_id, craned_meta->drain); + if (changed) + CRANE_TRACE("ConfigMatch check: craned_id '{}' drain state -> {}.", + craned_id, craned_meta->drain);
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
src/CraneCtld/CranedMetaContainer.cpp(1 hunks)src/CraneCtld/CranedMetaContainer.h(1 hunks)src/CraneCtld/RpcService/CtldGrpcServer.cpp(1 hunks)src/Craned/Core/CtldClient.cpp(3 hunks)src/Utilities/PublicHeader/include/crane/PublicHeader.h(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/Utilities/PublicHeader/include/crane/PublicHeader.h
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
PR: PKUHPC/CraneSched#475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.
Applied to files:
src/Craned/Core/CtldClient.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/Craned/Core/CtldClient.cpp
🧬 Code graph analysis (2)
src/CraneCtld/CranedMetaContainer.h (1)
src/CraneCtld/CranedMetaContainer.cpp (2)
UpdateNodeStateWithConfigCheck_(856-877)UpdateNodeStateWithConfigCheck_(856-857)
src/Craned/Core/CtldClient.cpp (2)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(156-156)src/Utilities/PublicHeader/OS.cpp (2)
GetNodeInfo(33-62)GetNodeInfo(33-33)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (1)
src/Craned/Core/CtldClient.cpp (1)
382-392: Timer callback consistency and lifecycle.Other uvw timer handlers in this file return a boolean; this one doesn’t return anything. Align the signature/return to avoid handler mismatch and subtle behavior differences.
Apply:
m_config_check_timer_->on<uvw::timer_event>( [this](const uvw::timer_event&, uvw::timer_handle& h) { if (m_stopping_ || !m_stub_) return; // TODO: should check state? ConfigMatchCheck_(); - }); + return true; + });If uvw expects a void-returning listener here, confirm compilation; otherwise adapt all timer callbacks to a single consistent signature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (2)
src/Craned/Core/CtldClient.cpp (1)
777-782: Use CranedIdOfThisNode to index CranedRes consistently.The code uses
g_config.Hostnameto look up node configuration, but this was flagged in past reviews and should useg_config.CranedIdOfThisNodeinstead for consistency with other parts of the codebase (e.g., CranedRegister at line 558).Apply this diff:
- if (!g_config.CranedRes.contains(g_config.Hostname)) { + if (!g_config.CranedRes.contains(g_config.CranedIdOfThisNode)) { CRANE_ERROR("Failed to get node config info."); return; } - auto node_config = g_config.CranedRes.at(g_config.Hostname); + auto node_config = g_config.CranedRes.at(g_config.CranedIdOfThisNode);Based on learnings.
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)
172-180: Add validation checks to prevent unauthorized health reports.The handler lacks critical validation checks present in similar internal RPC handlers. This could allow unauthorized or offline nodes to modify drain state.
Apply this diff to add validation:
grpc::Status CtldForInternalServiceImpl::CranedReportHealth( grpc::ServerContext *context, const crane::grpc::CranedReportHealthRequest *request, google::protobuf::Empty *response) { + if (!g_runtime_status.srv_ready.load(std::memory_order_acquire)) + return grpc::Status{grpc::StatusCode::UNAVAILABLE, + "CraneCtld Server is not ready"}; + + if (!g_meta_container->CheckCranedAllowed(request->craned_id())) { + CRANE_WARN("Reject health report from unknown node {}", + request->craned_id()); + return grpc::Status::OK; + } + + if (!g_meta_container->CheckCranedOnline(request->craned_id())) { + CRANE_WARN("Reject health report from offline node {}", + request->craned_id()); + return grpc::Status::OK; + } + g_meta_container->UpdateNodeStateWithHealthCheck_( request->craned_id(), request->healthy(), request->reason()); return grpc::Status::OK; }Based on learnings.
🧹 Nitpick comments (2)
src/Craned/Core/CtldClient.cpp (2)
381-392: Consider removing redundant TODO comment.The TODO at Line 386 asks whether to check state, but
ConfigMatchCheck_()already checks server readiness at Line 768. The current approach is correct—checking inside the function keeps the logic centralized.If you prefer to avoid the function call overhead when not ready, check the state in the timer callback:
m_config_check_timer_->on<uvw::timer_event>( [this](const uvw::timer_event&, uvw::timer_handle& h) { - if (m_stopping_ || !m_stub_) return; - // TODO: should check state? + if (m_stopping_ || !m_stub_ || !g_ctld_client_sm->IsReadyNow()) return; ConfigMatchCheck_(); });
789-806: Standardize the order of logging and health reporting.The CPU check (lines 790-794) logs before calling
CranedReportHealth_, while the memory check (lines 801-805) callsCranedReportHealth_first then logs. Standardizing the order improves maintainability.Apply this diff to make the memory check consistent with the CPU check:
if (std::abs(node_real.memory_gb - mem_gb_config) > kMemoryToleranceGB) { - CranedReportHealth_(false, reason); CRANE_WARN("ConfigMatchCheck fail. config_mem : {:.3f}, real_mem : {:.3f}", mem_gb_config, node_real.memory_gb); + CranedReportHealth_(false, reason); return; }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
protos/Crane.proto(2 hunks)src/CraneCtld/CranedMetaContainer.cpp(1 hunks)src/CraneCtld/CranedMetaContainer.h(1 hunks)src/CraneCtld/RpcService/CtldGrpcServer.cpp(1 hunks)src/Craned/Core/CtldClient.cpp(3 hunks)src/Craned/Core/CtldClient.h(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- protos/Crane.proto
- src/CraneCtld/CranedMetaContainer.h
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
PR: PKUHPC/CraneSched#475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/Craned/Core/CtldClient.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.
Applied to files:
src/Craned/Core/CtldClient.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/Craned/Core/CtldClient.cpp
🧬 Code graph analysis (2)
src/Craned/Core/CtldClient.h (1)
src/Craned/Core/CtldClient.cpp (4)
CranedReportHealth_(748-765)CranedReportHealth_(748-748)ConfigMatchCheck_(767-823)ConfigMatchCheck_(767-767)
src/Craned/Core/CtldClient.cpp (2)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(156-156)src/Utilities/PublicHeader/OS.cpp (2)
GetNodeInfo(33-62)GetNodeInfo(33-33)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
L-Xiafeng
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
所有的命名都统一下
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)
72-73: Rename to follow the naming convention.All other constants in this file use a
kprefix (e.g.,kDefaultHost,kUnlimitedQosName,kHostFilePath). This constant should be renamed tokNodeHealthCheckFailedReasonfor consistency.Apply this diff to fix the naming:
-inline const char* const NodeHealthCheckFailedReason = +inline const char* const kNodeHealthCheckFailedReason = "Node failed health check";
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
etc/config.yaml(1 hunks)src/CraneCtld/CranedMetaContainer.cpp(1 hunks)src/CraneCtld/CranedMetaContainer.h(1 hunks)src/CraneCtld/RpcService/CtldGrpcServer.cpp(1 hunks)src/Craned/Core/Craned.cpp(1 hunks)src/Craned/Core/CranedPublicDefs.h(1 hunks)src/Craned/Core/CtldClient.cpp(3 hunks)src/Craned/Core/CtldClient.h(2 hunks)src/Utilities/PublicHeader/include/crane/PublicHeader.h(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- src/CraneCtld/RpcService/CtldGrpcServer.cpp
- src/Craned/Core/Craned.cpp
- src/Craned/Core/CtldClient.h
- src/Craned/Core/CtldClient.cpp
- etc/config.yaml
🧰 Additional context used
🧬 Code graph analysis (1)
src/CraneCtld/CranedMetaContainer.h (1)
src/CraneCtld/CranedMetaContainer.cpp (2)
UpdateNodeStateWithNodeHealthCheck(856-878)UpdateNodeStateWithNodeHealthCheck(856-857)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: format
🔇 Additional comments (3)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)
70-70: LGTM!The constant follows the naming convention and the tolerance value of 0.01 GB (~10 MB) is reasonable for node health checks.
src/CraneCtld/CranedMetaContainer.h (1)
92-94: LGTM!The method declaration is well-formed with appropriate parameter types and passing conventions.
src/Craned/Core/CranedPublicDefs.h (1)
153-153: No changes required; zero-value handling is already implemented gracefully.The code explicitly checks
if (g_config.NodeHealthCheckInterval > 0)before starting the health-check timer in CtldClient.cpp. If the value is zero (either by explicit config or any other means), the timer simply does not start—effectively disabling the feature without error. Configuration parsing in Craned.cpp provides a sensible default of 30 seconds, and explicit zero values are handled safely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/Craned/Core/CtldClient.h (2)
214-215: Consider adding documentation for the health check method.While the method signature is clear, a brief comment explaining that this performs periodic node resource validation (CPU, memory, devices) would improve maintainability.
+ // Performs periodic health checks on node resources (CPU, memory, devices). void NodeHealthCheck_();
244-245: Design pattern looks good; consider adding documentation.The timer member follows the established pattern with
m_ping_handle_and is appropriately placed. Adding a brief comment would clarify its purpose.+ // Timer for scheduling periodic node health checks. std::shared_ptr<uvw::timer_handle> m_node_health_check_timer_;
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/Craned/Core/CtldClient.cpp(3 hunks)src/Craned/Core/CtldClient.h(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/Craned/Core/CtldClient.cpp
🧰 Additional context used
🧬 Code graph analysis (1)
src/Craned/Core/CtldClient.h (1)
src/Craned/Core/CtldClient.cpp (4)
CranedReportHealth_(748-765)CranedReportHealth_(748-748)NodeHealthCheck_(767-823)NodeHealthCheck_(767-767)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/Craned/Core/CtldClient.cpp (2)
749-766: [Unresolved from past review] Improve error logging and optimize parameter passing.Past review comments on these lines remain unaddressed:
- Line 749: The
reasonparameter should be passed byconst std::string&to avoid unnecessary copies.- Line 764: The error log lacks diagnostic details—include
result.error_code(),result.error_message(), andcontext.debug_error_string().Apply this diff:
-void CtldClient::CranedReportHealth_(bool is_healthy, std::string reason) { +void CtldClient::CranedReportHealth_(bool is_healthy, const std::string& reason) { if (m_stopping_ || !m_stub_) return; grpc::ClientContext context; crane::grpc::CranedReportHealthRequest request; google::protobuf::Empty reply; context.set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(kCranedRpcTimeoutSeconds)); request.set_craned_id(g_config.CranedIdOfThisNode); request.set_healthy(is_healthy); request.set_reason(reason); auto result = m_stub_->CranedReportHealth(&context, request, &reply); if (!result.ok()) { - CRANE_ERROR("CranedReportHealth failed: is_healthy={}", is_healthy); + CRANE_ERROR("CranedReportHealth failed: is_healthy={}, reason: {} | {}, code: {}", + is_healthy, result.error_message(), context.debug_error_string(), + static_cast<int>(result.error_code())); } }
778-783: [Unresolved from past review] Use CranedIdOfThisNode to index CranedRes.Past review flagged this as a major issue: using
g_config.Hostnameinstead ofg_config.CranedIdOfThisNodecan cause lookup mismatches if the names differ. Based on retrieved learnings,g_config.CranedRes[g_config.CranedIdOfThisNode]is guaranteed to be valid.Apply this diff:
- if (!g_config.CranedRes.contains(g_config.Hostname)) { + if (!g_config.CranedRes.contains(g_config.CranedIdOfThisNode)) { CRANE_ERROR("Failed to get node config info."); return; } - auto node_config = g_config.CranedRes.at(g_config.Hostname); + auto node_config = g_config.CranedRes.at(g_config.CranedIdOfThisNode);
🧹 Nitpick comments (1)
src/Craned/Core/CtldClient.cpp (1)
787-822: Consider customizing the reason string for each failure type.Currently, all health check failures report the same constant reason string (
NodeHealthCheckFailedReason), regardless of whether the CPU check, memory check, or device file check failed. This reduces diagnostic value for operators receiving the health report.Consider customizing the reason string at each failure point:
- std::string reason = NodeHealthCheckFailedReason; int64_t cpu_count = static_cast<int64_t>(node_config->allocatable_res.cpu_count); if (node_real.cpu < cpu_count) { CRANE_WARN( "Node health check fail. config cpu_count: {}, real cpu_count: {}", cpu_count, node_real.cpu); - CranedReportHealth_(false, reason); + CranedReportHealth_(false, "CPU count mismatch"); return; } uint64_t mem_bytes_config = node_config->allocatable_res.memory_bytes; double mem_gb_config = static_cast<double>(mem_bytes_config) / (1024 * 1024 * 1024); if (std::abs(node_real.memory_gb - mem_gb_config) > kMemoryToleranceGB) { - CranedReportHealth_(false, reason); + CranedReportHealth_(false, "Memory size mismatch"); CRANE_WARN("Node health check fail. config_mem : {:.3f}, real_mem : {:.3f}", mem_gb_config, node_real.memory_gb); return; } for (const auto& device_pair : Common::g_this_node_device) { const auto& device = device_pair.second; for (const auto& file_meta : device->device_file_metas) { if (!std::filesystem::exists(file_meta.path)) { CRANE_WARN("Node health check fail. Device file {} not found.", file_meta.path); - CranedReportHealth_(false, reason); + CranedReportHealth_(false, fmt::format("Device file {} not found", file_meta.path)); return; } } } CRANE_DEBUG("Node health check success."); - CranedReportHealth_(true, reason); + CranedReportHealth_(true, "Health check passed"); return; }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
etc/config.yaml(2 hunks)src/Craned/Core/Craned.cpp(1 hunks)src/Craned/Core/CranedPublicDefs.h(2 hunks)src/Craned/Core/CtldClient.cpp(3 hunks)
✅ Files skipped from review due to trivial changes (1)
- src/Craned/Core/CranedPublicDefs.h
🚧 Files skipped from review as they are similar to previous changes (2)
- src/Craned/Core/Craned.cpp
- etc/config.yaml
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.
Applied to files:
src/Craned/Core/CtldClient.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/Craned/Core/CtldClient.cpp
📚 Learning: 2025-08-14T02:56:35.503Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#587
File: src/Craned/Supervisor/CforedClient.cpp:449-454
Timestamp: 2025-08-14T02:56:35.503Z
Learning: In CforedClient::AsyncSendRecvThread_(), the guard `if (state <= State::Registering) { continue; }` in the TIMEOUT branch only prevents premature cleanup when stopping before registration completes, but it doesn't block normal gRPC event processing. The completion queue will still deliver Prepare/Write/Read events that advance the state machine normally.
Applied to files:
src/Craned/Core/CtldClient.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/Craned/Core/CtldClient.cpp
📚 Learning: 2025-06-23T07:55:19.564Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/Craned/CtldClient.cpp:32-38
Timestamp: 2025-06-23T07:55:19.564Z
Learning: 在uvw库的timer_handle回调中,返回true会保持定时器继续运行,返回false会停止定时器。在src/Craned/CtldClient.cpp中,定时器回调有意返回false来停止定时器,然后通过idle处理器根据m_check_reg_timeout_条件重新启动,这是正确的设计模式。
Applied to files:
src/Craned/Core/CtldClient.cpp
📚 Learning: 2025-04-27T11:52:31.017Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.
Applied to files:
src/Craned/Core/CtldClient.cpp
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.
Applied to files:
src/Craned/Core/CtldClient.cpp
📚 Learning: 2025-04-27T06:47:51.751Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.
Applied to files:
src/Craned/Core/CtldClient.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
PR: PKUHPC/CraneSched#475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Applied to files:
src/Craned/Core/CtldClient.cpp
🧬 Code graph analysis (1)
src/Craned/Core/CtldClient.cpp (2)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(156-156)src/Utilities/PublicHeader/OS.cpp (2)
GetNodeInfo(33-62)GetNodeInfo(33-33)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
Summary by CodeRabbit
New Features
Configuration
Other