Skip to content

Conversation

@1daidai1
Copy link
Collaborator

@1daidai1 1daidai1 commented Nov 10, 2025

Summary by CodeRabbit

  • New Features

    • Streaming job-summary and job-size query endpoints with multi-field filters, grouping, batched streaming replies, and a manual aggregation trigger that reports result.
    • New summary/result types for job summaries and job-size summaries.
  • Chores

    • Background hourly/day/month rollups with lifecycle management, index initialization, startup DB initialization and graceful thread shutdown.
    • Data wipe/import script extended with new modes and bulk-import helpers.
  • Documentation

    • Comprehensive creport CLI docs and updated navigation/See Also links across command pages.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 10, 2025

Walkthrough

Adds protobuf messages and streaming RPCs for job summaries and manual aggregation; implements MongoDB hourly/day/month rollups with a background thread and startup DB init; extends task documents with nodename_list; adds time-range splitting utilities, script import/wipe modes, and creport documentation and nav updates.

Changes

Cohort / File(s) Summary
Protos & PublicDefs
protos/Crane.proto, protos/PublicDefs.proto
Add messages: QueryJobSummaryRequest/Reply, QueryJobSizeSummaryRequest/Reply, ActiveAggregationManuallyRequest/Reply, JobSummaryItem, JobSizeSummaryItem; add streaming RPCs QueryJobSummary, QueryJobSizeSummary; replace Empty with typed request/reply for ActiveAggregationManually.
DB client (header & impl)
src/CraneCtld/DbClient.h, src/CraneCtld/DbClient.cpp
Add RollupType enum, JobSizeSummaryKey/JobSizeSummaryResult, Init/InitTableIndexes/CreateCollectionIndex, aggregation pipelines (hour/day/month), AppendUnionWithRanges template, QueryJobSummary/QueryJobSizeSummary (streaming) and QueryJobSizeSummaryByJobIds, ClusterRollupUsage/RollupSummary, background rollup thread and lifecycle, add nodename_list to task document construction, helpers/constants, destructor.
gRPC service & server init
src/CraneCtld/RpcService/CtldGrpcServer.h, src/CraneCtld/RpcService/CtldGrpcServer.cpp, src/CraneCtld/CraneCtld.cpp, src/CraneCtld/CtldPreCompiledHeader.h
Declare/implement RPC handlers QueryJobSummary, QueryJobSizeSummary, ActiveAggregationManually; call g_db_client->Init() at startup (exit on failure); add <grpc++/server.h> include.
Time utilities
src/Utilities/PublicHeader/include/crane/String.h, src/Utilities/PublicHeader/String.cpp
Add TimeRange struct, FormatTime, TryPushHours/Days/Months, and EfficientSplitTimeRange for splitting ranges into hour/day/month segments.
Maintenance script
scripts/wipe_data.sh
Expand usage/help with new modes (7,8), add batch_insert_collection and batch_insert_collection_process, support wiping/inserting hour/day/month/summary_time tables and bulk-import mode.
Docs & navigation
docs/**, mkdocs.yaml
Add comprehensive creport docs (EN/ZH), update related command pages to reference creport, and add navigation entries in mkdocs.yaml.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Ctld_gRPC as Ctld gRPC
  participant DBClient as MongodbClient
  participant MongoDB
  participant RollupTh as RollupThread

  Client->>Ctld_gRPC: QueryJobSummary(request)
  Ctld_gRPC->>DBClient: QueryJobSummary(request, stream)
  loop per time-range chunk
    DBClient->>MongoDB: Aggregate (union of hour/day/month collections)
    MongoDB-->>DBClient: cursor/results
    DBClient->>Ctld_gRPC: stream.Write(QueryJobSummaryReply batch)
  end
  Ctld_gRPC-->>Client: grpc::Status OK

  Note over RollupTh,DBClient: Periodic background rollup
  RollupTh->>DBClient: ClusterRollupUsage()
  DBClient->>MongoDB: Aggregate task_table -> upsert hour_job_summary_table
  DBClient->>MongoDB: Aggregate hour->day and day->month as scheduled
  DBClient->>MongoDB: Update summary_time_table
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Extra attention:
    • src/CraneCtld/DbClient.{h,cpp}: aggregation pipelines, AppendUnionWithRanges template, upsert/merge correctness, streaming/batching logic.
    • Threading/lifecycle: MongoDbSumaryTh_, m_thread_stop_, destructor cleanup, startup g_db_client->Init() behavior.
    • Time utilities: EfficientSplitTimeRange correctness across boundaries and DST/UTC considerations.
    • Schema compatibility: adding nodename_list to task documents and downstream consumers.
    • RPC handlers: correct grpc::ServerWriter usage and backpressure handling.

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • L-Xiafeng
  • Nativu5

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 9.80% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title check ❓ Inconclusive The title 'Add creport cmd' is vague and generic, using abbreviated language ('cmd' instead of 'command') that obscures the actual scope and nature of the changes. Consider a more descriptive title such as 'Add creport command for job statistics queries and manual aggregation' or 'Implement creport CLI tool with job summary and aggregation features' to better convey the substantial changes across protobuf definitions, database client, RPC handlers, and documentation.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch dev/mongodbtest

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

🧹 Nitpick comments (11)
scripts/wipe_data.sh (2)

33-47: Add error handling to batch import functions.

The new batch_insert_collection() and batch_insert_collection_process() functions execute mongoimport without checking for errors or validating that the JSON file exists. This can silently fail or produce confusing output. Add explicit error handling and validation.

 function batch_insert_collection() {
   collection=$1
   json_file=$2
+  if [ ! -f "$json_file" ]; then
+    echo "Error: JSON file not found: $json_file" >&2
+    return 1
+  fi
   mongoimport --username "$username" --password "$password" --host "$host" --port "$port" \
     --authenticationDatabase admin --db "$dbname" --collection "$collection" \
     --file "$json_file" --jsonArray
+  if [ $? -ne 0 ]; then
+    echo "Error: mongoimport failed for $collection from $json_file" >&2
+    return 1
+  fi
 }
 
 function batch_insert_collection_process() {
   collection=$1
   json_file=$2
+  if [ ! -f "$json_file" ]; then
+    echo "Error: JSON file not found: $json_file" >&2
+    return 1
+  fi
   mongoimport --username "$username" --password "$password" --host "$host" --port "$port" \
     --authenticationDatabase admin --db "$dbname" --collection "$collection" \
     --file "$json_file"
+  if [ $? -ne 0 ]; then
+    echo "Error: mongoimport failed for $collection from $json_file" >&2
+    return 1
+  fi
 }

82-85: Use absolute path instead of relative path for batch import.

Line 83 uses a relative path ./tasks_bulk.json which is fragile if the script is invoked from a different directory. Consider using an absolute path or resolving the path relative to the script's location.

 if [ "$mode" -eq 10 ]; then
-  batch_insert_collection_process "task_table" "./tasks_bulk.json"
+  script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+  batch_insert_collection_process "task_table" "$script_dir/tasks_bulk.json"
   exit 0
 fi
src/Utilities/PublicHeader/include/crane/String.h (1)

42-46: Use an enum for TimeRange granularity instead of string literals.

Strings ("hour"/"day"/"month") are error‑prone and add allocation/compare cost. Replace with a small enum class and convert at boundaries that need strings.

Example:

-struct TimeRange {
-  std::time_t start;
-  std::time_t end;
-  std::string type;  // "month" / "day" / "hour"
-};
+enum class TimeRangeType : uint8_t { Hour, Day, Month };
+struct TimeRange {
+  std::time_t start{};
+  std::time_t end{};
+  TimeRangeType type{};
+};
src/Utilities/PublicHeader/String.cpp (1)

527-533: Minor: remove redundant inline and mark noexcept.

This function lives in a .cpp; inline is unnecessary. Also consider noexcept.

-inline std::string FormatTime(std::time_t t) {
+std::string FormatTime(std::time_t t) noexcept {
src/CraneCtld/CtldPreCompiledHeader.h (1)

88-89: Include hygiene (optional).

grpc++/grpc++.h already pulls in server APIs on most versions. Consider keeping to one style (grpcpp vs grpc++), preferably grpcpp, across the codebase to avoid subtle include path mismatches.

src/CraneCtld/RpcService/CtldGrpcServer.h (1)

379-389: Consider documenting auth/visibility for new reporting RPCs.

Most Ctld RPCs enforce readiness and sometimes auth. Clarify intended access for QueryJobSummary/QueryJobSizeSummary (admin‑only vs public) in comments to guide clients and reviewers.

protos/PublicDefs.proto (2)

814-823: Field semantics: add units and aggregation notes (optional).

Document whether total_cpu_time is seconds, hours, or core‑seconds, and what total_count counts (jobs? steps?) to avoid downstream misinterpretation.


824-831: Bucket key typing (optional).

cpus_alloc as uint32 is fine; if buckets are used, consider naming cpus_alloc_bucket for clarity in clients.

protos/Crane.proto (3)

977-989: Define time‑range semantics and required invariants.

Please document whether filter_end_time is exclusive or inclusive and require start < end to avoid ambiguous queries. Consider rejecting invalid ranges server‑side.


990-1006: Consistency nits: field naming and filters.

  • filter_qoss: consider filter_qos or filter_qos_list for consistency with other protos.
  • filter_nodename_list vs other filter_* naming: consider filter_nodes for symmetry.

1065-1067: Streaming RPCs: add brief comments about batching size and backpressure.

Since replies are batched server‑side, noting batch limits in proto helps clients tune read loops.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6391e9e and c471b59.

📒 Files selected for processing (11)
  • protos/Crane.proto (2 hunks)
  • protos/PublicDefs.proto (1 hunks)
  • scripts/wipe_data.sh (3 hunks)
  • src/CraneCtld/CraneCtld.cpp (1 hunks)
  • src/CraneCtld/CtldPreCompiledHeader.h (1 hunks)
  • src/CraneCtld/DbClient.cpp (11 hunks)
  • src/CraneCtld/DbClient.h (5 hunks)
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp (1 hunks)
  • src/CraneCtld/RpcService/CtldGrpcServer.h (1 hunks)
  • src/Utilities/PublicHeader/String.cpp (1 hunks)
  • src/Utilities/PublicHeader/include/crane/String.h (2 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
  • src/CraneCtld/CtldPreCompiledHeader.h
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/DbClient.h
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
  • src/CraneCtld/CtldPreCompiledHeader.h
  • src/CraneCtld/RpcService/CtldGrpcServer.h
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/DbClient.h
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.h
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.

Applied to files:

  • src/Utilities/PublicHeader/String.cpp
  • src/Utilities/PublicHeader/include/crane/String.h
  • src/CraneCtld/DbClient.h
📚 Learning: 2025-05-09T01:54:39.465Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1509-1517
Timestamp: 2025-05-09T01:54:39.465Z
Learning: The CraneSched project uses C++23, which supports Class Template Argument Deduction (CTAD) for standard containers like std::set and includes ranges support, making std::ranges::views::keys valid without additional headers.

Applied to files:

  • src/CraneCtld/DbClient.h
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/CraneCtld/DbClient.h
🧬 Code graph analysis (6)
src/CraneCtld/RpcService/CtldGrpcServer.h (2)
src/CraneCtld/DbClient.cpp (4)
  • QueryJobSummary (2680-2784)
  • QueryJobSummary (2680-2682)
  • QueryJobSizeSummary (2786-2911)
  • QueryJobSizeSummary (2786-2788)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (4)
  • QueryJobSummary (2046-2052)
  • QueryJobSummary (2046-2049)
  • QueryJobSizeSummary (2054-2064)
  • QueryJobSizeSummary (2054-2057)
scripts/wipe_data.sh (1)
scripts/wipe_data.py (1)
  • wipe_collection (49-57)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)
src/CraneCtld/DbClient.cpp (4)
  • QueryJobSummary (2680-2784)
  • QueryJobSummary (2680-2682)
  • QueryJobSizeSummary (2786-2911)
  • QueryJobSizeSummary (2786-2788)
src/CraneCtld/DbClient.cpp (5)
src/Utilities/PublicHeader/PublicHeader.cpp (1)
  • result (772-772)
src/Utilities/PublicHeader/String.cpp (6)
  • FormatTime (527-533)
  • FormatTime (527-527)
  • EfficientSplitTimeRange (618-636)
  • EfficientSplitTimeRange (618-619)
  • SetCurrentThreadName (354-362)
  • SetCurrentThreadName (354-354)
src/CraneCtld/DbClient.h (4)
  • string (71-73)
  • string (375-419)
  • string (377-393)
  • MongodbClient (93-221)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (4)
  • QueryJobSummary (2046-2052)
  • QueryJobSummary (2046-2049)
  • QueryJobSizeSummary (2054-2064)
  • QueryJobSizeSummary (2054-2057)
src/CraneCtld/EmbeddedDbClient.cpp (6)
  • Init (25-50)
  • Init (25-25)
  • Init (237-296)
  • Init (237-237)
  • Init (503-572)
  • Init (503-503)
src/Utilities/PublicHeader/include/crane/String.h (1)
src/Utilities/PublicHeader/String.cpp (10)
  • FormatTime (527-533)
  • FormatTime (527-527)
  • TryPushHours (598-615)
  • TryPushHours (598-599)
  • TryPushMonths (538-565)
  • TryPushMonths (538-539)
  • TryPushDays (568-595)
  • TryPushDays (568-569)
  • EfficientSplitTimeRange (618-636)
  • EfficientSplitTimeRange (618-619)
src/CraneCtld/DbClient.h (2)
src/CraneCtld/DbClient.cpp (48)
  • MongodbClient (3025-3049)
  • MongodbClient (3051-3054)
  • Init (2977-3000)
  • Init (2977-2977)
  • InitTableIndexes (2940-2975)
  • InitTableIndexes (2940-2940)
  • CreateCollectionIndex (2046-2068)
  • CreateCollectionIndex (2046-2047)
  • RollupTypeToString (1866-1877)
  • RollupTypeToString (1866-1866)
  • UpdateSummaryLastSuccessTimeSec (1879-1913)
  • UpdateSummaryLastSuccessTimeSec (1879-1880)
  • GetSummaryLastSuccessTimeTm (1915-1954)
  • GetSummaryLastSuccessTimeTm (1915-1916)
  • NeedRollup (1956-1973)
  • NeedRollup (1956-1957)
  • RollupSummary (1975-2012)
  • RollupSummary (1975-1975)
  • ClusterRollupUsage (2014-2044)
  • ClusterRollupUsage (2014-2014)
  • GetCpusAlloc (2625-2634)
  • GetCpusAlloc (2625-2625)
  • GetTotalCpuTime (2636-2645)
  • GetTotalCpuTime (2636-2636)
  • GetTotalCount (2647-2654)
  • GetTotalCount (2647-2647)
  • WriteReply (2656-2678)
  • WriteReply (2656-2659)
  • QueryJobSizeSummaryByJobIds (2335-2488)
  • QueryJobSizeSummaryByJobIds (2335-2337)
  • JobQueryMatch (2490-2534)
  • JobQueryMatch (2490-2492)
  • QueryJobSummary (2680-2784)
  • QueryJobSummary (2680-2682)
  • JobSizeQueryMatch (2536-2600)
  • JobSizeQueryMatch (2536-2538)
  • QueryJobSizeSummary (2786-2911)
  • QueryJobSizeSummary (2786-2788)
  • AppendUnionWithRanges (2603-2623)
  • AppendUnionWithRanges (2603-2606)
  • AggregateJobSummaryByHour (2070-2191)
  • AggregateJobSummaryByHour (2070-2072)
  • AggregateJobSummaryByDayOrMonth (2193-2333)
  • AggregateJobSummaryByDayOrMonth (2193-2196)
  • MongoDbSumaryTh_ (3002-3023)
  • MongoDbSumaryTh_ (3002-3003)
  • MillisecondsToNextHour (2913-2938)
  • MillisecondsToNextHour (2913-2913)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (4)
  • QueryJobSummary (2046-2052)
  • QueryJobSummary (2046-2049)
  • QueryJobSizeSummary (2054-2064)
  • QueryJobSizeSummary (2054-2057)
🔇 Additional comments (2)
src/Utilities/PublicHeader/String.cpp (1)

617-636: Edge-case: ensure fallback interval stays “hour” only for sub-hour tail.

Current fallback emits {cur, end, "hour"} even when cur is aligned to day/month start but end < next boundary (rare). With the alignment fixes above this remains consistent; please re-run unit tests for boundaries (e.g., 00:00:30 → 01:00:00).

Would you like a quick gtest snippet covering DST boundaries and minute/second offsets?

src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)

2046-2052: Guard with srv_ready like other RPCs.

Align with existing pattern to return UNAVAILABLE until server ready.

 grpc::Status CraneCtldServiceImpl::QueryJobSummary(
     ::grpc::ServerContext *context,
     const ::crane::grpc::QueryJobSummaryRequest *request,
     ::grpc::ServerWriter<::crane::grpc::QueryJobSummaryReply> *writer) {
-  g_db_client->QueryJobSummary(request, writer);
+  if (!g_runtime_status.srv_ready.load(std::memory_order_acquire))
+    return {grpc::StatusCode::UNAVAILABLE, "CraneCtld Server is not ready"};
+  g_db_client->QueryJobSummary(request, writer);
   return grpc::Status::OK;
 }
⛔ Skipped due to learnings
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 587
File: src/Craned/Supervisor/CforedClient.cpp:449-454
Timestamp: 2025-08-14T02:56:35.503Z
Learning: In CforedClient::AsyncSendRecvThread_(), the guard `if (state <= State::Registering) { continue; }` in the TIMEOUT branch only prevents premature cleanup when stopping before registration completes, but it doesn't block normal gRPC event processing. The completion queue will still deliver Prepare/Write/Read events that advance the state machine normally.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (3)
src/Utilities/PublicHeader/String.cpp (3)

538-567: Critical granularity bug + missing mktime error handling (previously flagged).

Two correctness issues remain unfixed from prior review:

  1. Line 542: Checking only tm_hour == 0 is insufficient—a timestamp like 2024-01-01 00:30:00 would incorrectly qualify as a month boundary. Must also verify tm_min == 0 && tm_sec == 0.
  2. Line 553: mktime returns -1 on failure (e.g., invalid DST transitions), but the code doesn't check for this, which could corrupt the time-range split.

Apply this diff to fix both issues:

   std::tm tm_cur{};
   localtime_r(&cur, &tm_cur);
-  if (tm_cur.tm_mday == 1 && tm_cur.tm_hour == 0) {
+  if (tm_cur.tm_mday == 1 && tm_cur.tm_hour == 0 && tm_cur.tm_min == 0 && tm_cur.tm_sec == 0) {
     std::time_t month_start = cur;
     std::time_t month_cur = cur;
     while (true) {
       std::tm tm_next_month{};
       localtime_r(&month_cur, &tm_next_month);
       tm_next_month.tm_mon += 1;
       tm_next_month.tm_mday = 1;
       tm_next_month.tm_hour = 0;
       tm_next_month.tm_min = 0;
       tm_next_month.tm_sec = 0;
       std::time_t next_month_first = mktime(&tm_next_month);
+      if (next_month_first == -1) break;
       if (next_month_first <= end) {
         month_cur = next_month_first;
       } else {
         break;
       }
     }

Based on learnings


570-600: Critical granularity bug + missing mktime error handling (previously flagged).

Same issues as TryPushMonths:

  1. Line 574: Only checking tm_hour == 0 allows partial-day timestamps (e.g., 12:30:00) to be bucketed as full days, causing query mismatches with day-level aggregates.
  2. Lines 582, 590: No error checks for mktime returning -1.

Apply this diff:

   std::tm tm_cur{};
   localtime_r(&cur, &tm_cur);
-  if (tm_cur.tm_hour == 0) {
+  if (tm_cur.tm_hour == 0 && tm_cur.tm_min == 0 && tm_cur.tm_sec == 0) {
     std::time_t day_start = cur;
     std::tm tm_next_month = tm_cur;
     tm_next_month.tm_mon += 1;
     tm_next_month.tm_mday = 1;
     tm_next_month.tm_hour = 0;
     tm_next_month.tm_min = 0;
     tm_next_month.tm_sec = 0;
     std::time_t next_month_first = mktime(&tm_next_month);
+    if (next_month_first == -1) return false;

     std::tm tm_end{};
     localtime_r(&end, &tm_end);

     tm_end.tm_hour = 0;
     tm_end.tm_min = 0;
     tm_end.tm_sec = 0;
     std::time_t end_trunc_to_day = mktime(&tm_end);
+    if (end_trunc_to_day == -1) return false;

Based on learnings


603-621: Missing mktime error handling (previously flagged).

Line 613: The mktime call can return -1 on failure, but there's no error check.

Apply this diff:

     std::tm tm_next_day = tm_cur;
     tm_next_day.tm_mday += 1;
     tm_next_day.tm_hour = 0;
     tm_next_day.tm_min = 0;
     tm_next_day.tm_sec = 0;
     std::time_t next_day = mktime(&tm_next_day);
+    if (next_day == -1) return false;

     std::time_t hour_end = std::min(next_day, end);

Based on learnings

🧹 Nitpick comments (2)
scripts/wipe_data.sh (2)

3-3: Fix usage message formatting for consistency.

The usage string still has inconsistent spacing around pipe separators. After mode 6 and before mode 8, spacing is uneven ("6:...| 7:" vs "|8:"). Standardize the separators throughout for readability.

-echo "Usage: $0 mode(1:acct_table | 2:qos_table | 3:task_table | 4:user_table | 5:all | 6:acct_table+qos_table+user_table| 7:hour_table+day_table+month_table |8: summary_time_table)"
+echo "Usage: $0 mode(1:acct_table | 2:qos_table | 3:task_table | 4:user_table | 5:all | 6:acct_table+qos_table+user_table | 7:hour_table+day_table+month_table | 8:summary_time_table)"

33-47: Add error handling and input validation to batch import functions.

The new batch_insert_collection and batch_insert_collection_process functions invoke mongoimport without checking for failures. If the import fails, the script silently continues. Add error handling to match the safety pattern used in wipe_collection, and validate that the JSON file exists before attempting import.

 function batch_insert_collection() {
   collection=$1
   json_file=$2
+  if [ ! -f "$json_file" ]; then
+    echo "Error: JSON file not found: $json_file"
+    return 1
+  fi
   mongoimport --username "$username" --password "$password" --host "$host" --port "$port" \
     --authenticationDatabase admin --db "$dbname" --collection "$collection" \
-    --file "$json_file" --jsonArray
+    --file "$json_file" --jsonArray || return 1
 }
 
 function batch_insert_collection_process() {
   collection=$1
   json_file=$2
+  if [ ! -f "$json_file" ]; then
+    echo "Error: JSON file not found: $json_file"
+    return 1
+  fi
   mongoimport --username "$username" --password "$password" --host "$host" --port "$port" \
     --authenticationDatabase admin --db "$dbname" --collection "$collection" \
-    --file "$json_file"
+    --file "$json_file" || return 1
 }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c471b59 and b89c440.

📒 Files selected for processing (6)
  • scripts/wipe_data.sh (3 hunks)
  • src/CraneCtld/CraneCtld.cpp (1 hunks)
  • src/CraneCtld/DbClient.cpp (11 hunks)
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp (1 hunks)
  • src/Utilities/PublicHeader/String.cpp (1 hunks)
  • src/Utilities/PublicHeader/include/crane/String.h (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/CraneCtld/CraneCtld.cpp
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.

Applied to files:

  • src/Utilities/PublicHeader/String.cpp
  • src/Utilities/PublicHeader/include/crane/String.h
📚 Learning: 2025-04-02T10:11:33.562Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/CgroupManager.cpp:685-685
Timestamp: 2025-04-02T10:11:33.562Z
Learning: In the CgroupManager's GetJobBpfMapCgroupsV2 method, the developer has confirmed that cg_ino_job_id_map will always contain the key->cgroup_id element, making the CRANE_ASSERT check appropriate rather than requiring additional error handling.

Applied to files:

  • src/CraneCtld/DbClient.cpp
🧬 Code graph analysis (4)
scripts/wipe_data.sh (1)
scripts/wipe_data.py (1)
  • wipe_collection (49-57)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (2)
src/CraneCtld/DbClient.cpp (4)
  • QueryJobSummary (2679-2783)
  • QueryJobSummary (2679-2681)
  • QueryJobSizeSummary (2785-2910)
  • QueryJobSizeSummary (2785-2787)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_runtime_status (162-162)
src/CraneCtld/DbClient.cpp (3)
src/Utilities/PublicHeader/PublicHeader.cpp (1)
  • result (772-772)
src/Utilities/PublicHeader/String.cpp (6)
  • FormatTime (527-533)
  • FormatTime (527-527)
  • EfficientSplitTimeRange (624-642)
  • EfficientSplitTimeRange (624-625)
  • SetCurrentThreadName (354-362)
  • SetCurrentThreadName (354-354)
src/CraneCtld/DbClient.h (4)
  • string (71-73)
  • string (375-419)
  • string (377-393)
  • MongodbClient (93-221)
src/Utilities/PublicHeader/include/crane/String.h (1)
src/Utilities/PublicHeader/String.cpp (10)
  • FormatTime (527-533)
  • FormatTime (527-527)
  • TryPushHours (603-621)
  • TryPushHours (603-604)
  • TryPushMonths (538-567)
  • TryPushMonths (538-539)
  • TryPushDays (570-600)
  • TryPushDays (570-571)
  • EfficientSplitTimeRange (624-642)
  • EfficientSplitTimeRange (624-625)
🔇 Additional comments (16)
scripts/wipe_data.sh (2)

72-80: ✓ Mode 3 correctly excluded from summary table wiping.

The conditions now properly restrict summary table wiping to modes 5, 7, and 8, excluding mode 3 (task_table). This aligns with the documented behavior.


82-85: Document mode 10 in usage message and validate input file.

Mode 10 is not listed in the usage message on line 3, making the script's full capability undiscoverable. Additionally, the hardcoded path ./tasks_bulk.json has no validation; if the file doesn't exist, mongoimport will fail. Consider either updating the usage message and adding file existence checks, or making the path configurable via an environment variable or argument.

[suggest_minor_issue]

-echo "Usage: $0 mode(1:acct_table | 2:qos_table | 3:task_table | 4:user_table | 5:all | 6:acct_table+qos_table+user_table | 7:hour_table+day_table+month_table | 8:summary_time_table)"
+echo "Usage: $0 mode(1:acct_table | 2:qos_table | 3:task_table | 4:user_table | 5:all | 6:acct_table+qos_table+user_table | 7:hour_table+day_table+month_table | 8:summary_time_table | 10:import_task_bulk)"

And in the mode 10 block, add a file existence check:

 if [ "$mode" -eq 10 ]; then
+  if [ ! -f "./tasks_bulk.json" ]; then
+    echo "Error: ./tasks_bulk.json not found"
+    exit 1
+  fi
   batch_insert_collection_process "task_table" "./tasks_bulk.json"
   exit 0
 fi
src/Utilities/PublicHeader/include/crane/String.h (2)

42-46: LGTM: Clean TimeRange struct definition.

The struct is well-defined with clear field names and inline documentation of valid type values.


136-144: LGTM: Function declarations are properly structured.

All time-range utility functions are correctly declared in the header for implementation in the .cpp file.

src/Utilities/PublicHeader/String.cpp (2)

527-533: LGTM: Thread-safe time formatting.

Correctly uses localtime_r for thread safety and produces a standard timestamp format.


624-642: LGTM: Well-structured time-range splitting logic.

The greedy strategy (months → days → hours) is sound, and edge cases (empty range, partial final hour) are handled correctly.

src/CraneCtld/RpcService/CtldGrpcServer.cpp (2)

2046-2054: LGTM - readiness check in place.

The server readiness check is correctly implemented, and the handler properly delegates to the database client for streaming query results.


2056-2068: LGTM - readiness check and conditional routing look good.

The handler correctly checks server readiness and appropriately routes to different query methods based on whether job IDs are specified.

src/CraneCtld/DbClient.cpp (8)

1535-1539: LGTM - nodename_list field properly integrated.

The nodename_list array is correctly constructed from the craned IDs in both TaskInEmbeddedDbToDocument_ and TaskInCtldToDocument_, and the field count has been properly updated to 40.

Also applies to: 1637-1641


1879-2044: Rollup orchestration is well-structured.

The rollup logic properly:

  • Uses $max operator to ensure timestamps only advance
  • Protects concurrent rollup operations with a mutex
  • Includes timing instrumentation for performance monitoring
  • Handles errors by updating the last success timestamp to the last completed interval

2070-2191: Hourly aggregation pipeline is correct.

The multiplication of nodes_alloc * cpus_alloc at lines 2100-2107 is intentional and correct (per previous discussion - cpus_alloc in the raw table represents per-node allocation). The pipeline stages properly compute totals, group by dimensions, and merge results into the summary collection.


2452-2462: Bucketing logic correctly fixed.

The bucketing logic now properly handles jobs with cpus_alloc smaller than the first threshold by assigning them to grouping_list[0] rather than dropping them. This addresses the concern raised in previous reviews.

Also applies to: 2892-2897


2624-2653: Type-safe helper functions for BSON field extraction.

The helper functions correctly handle multiple BSON numeric types (k_int32, k_int64, k_double) when extracting values, preventing type-related crashes.


2679-2910: Efficient multi-collection query with streaming responses.

Both QueryJobSummary and QueryJobSizeSummary implement sophisticated multi-collection aggregation by:

  • Classifying time ranges into hour/day/month segments
  • Using $unionWith to query appropriate summary tables
  • Streaming results in batches to handle large datasets
  • Proper exception handling

2939-2999: Proper index creation and timer initialization.

The initialization correctly:

  • Creates indexes on time fields and composite keys for efficient querying
  • Calculates the delay until the next hour boundary for the first rollup
  • Sets up a repeating timer at hourly intervals
  • Returns false if index creation fails

3001-3053: Thread lifecycle properly managed.

The background thread and destructor correctly:

  • Set a meaningful thread name for debugging
  • Check the stop flag periodically via idle handler
  • Walk and close all uvw handles before stopping the loop
  • Join the thread in the destructor with joinable check

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
src/Utilities/PublicHeader/String.cpp (1)

537-620: Require full-day/month alignment before widening time ranges

TryPushMonths/TryPushDays still accept timestamps where only the hour is zero; if the caller passes 00:30, these helpers treat the interval as an entire day/month, so downstream aggregations double-count or mislabel partial ranges. We previously discussed this exact bug—please only bucket when hour, minute, and second are all zero (and consider using a shared safe ToLocalTm helper).

-  if (tm_cur.tm_mday == 1 && tm_cur.tm_hour == 0) {
+  if (tm_cur.tm_mday == 1 && tm_cur.tm_hour == 0 && tm_cur.tm_min == 0 &&
+      tm_cur.tm_sec == 0) {
@@
-  if (tm_cur.tm_hour == 0) {
+  if (tm_cur.tm_hour == 0 && tm_cur.tm_min == 0 && tm_cur.tm_sec == 0) {
🧹 Nitpick comments (1)
scripts/wipe_data.sh (1)

3-3: Fix usage message formatting for consistency.

The usage string still has inconsistent spacing around mode separators. For example, there's a space before 7: but the separator before 8: is missing a space, and mode descriptions lack uniform punctuation.

Apply this diff to improve consistency:

-echo "Usage: $0 mode(1:acct_table | 2:qos_table | 3:task_table | 4:user_table | 5:all | 6:acct_table+qos_table+user_table| 7:hour_table+day_table+month_table |8: summary_time_table)"
+echo "Usage: $0 mode(1:acct_table | 2:qos_table | 3:task_table | 4:user_table | 5:all | 6:acct_table+qos_table+user_table | 7:hour_table+day_table+month_table | 8:summary_time_table | 10:batch_insert_tasks)"

Note: Added mode 10 to the usage message since the script now supports it (line 82).

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b89c440 and c3d6614.

📒 Files selected for processing (7)
  • scripts/wipe_data.sh (3 hunks)
  • src/CraneCtld/CraneCtld.cpp (1 hunks)
  • src/CraneCtld/DbClient.cpp (11 hunks)
  • src/CraneCtld/DbClient.h (5 hunks)
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp (1 hunks)
  • src/Utilities/PublicHeader/String.cpp (1 hunks)
  • src/Utilities/PublicHeader/include/crane/String.h (2 hunks)
🧰 Additional context used
🧠 Learnings (10)
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/DbClient.h
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
📚 Learning: 2025-04-27T06:47:51.751Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/DbClient.h
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.

Applied to files:

  • src/Utilities/PublicHeader/include/crane/String.h
  • src/Utilities/PublicHeader/String.cpp
  • src/CraneCtld/DbClient.h
📚 Learning: 2025-05-09T01:54:39.465Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1509-1517
Timestamp: 2025-05-09T01:54:39.465Z
Learning: The CraneSched project uses C++23, which supports Class Template Argument Deduction (CTAD) for standard containers like std::set and includes ranges support, making std::ranges::views::keys valid without additional headers.

Applied to files:

  • src/CraneCtld/DbClient.h
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/CraneCtld/DbClient.h
📚 Learning: 2025-04-02T10:11:33.562Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/CgroupManager.cpp:685-685
Timestamp: 2025-04-02T10:11:33.562Z
Learning: In the CgroupManager's GetJobBpfMapCgroupsV2 method, the developer has confirmed that cg_ino_job_id_map will always contain the key->cgroup_id element, making the CRANE_ASSERT check appropriate rather than requiring additional error handling.

Applied to files:

  • src/CraneCtld/DbClient.cpp
🧬 Code graph analysis (5)
scripts/wipe_data.sh (1)
scripts/wipe_data.py (1)
  • wipe_collection (49-57)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (2)
src/CraneCtld/DbClient.cpp (4)
  • QueryJobSummary (2684-2788)
  • QueryJobSummary (2684-2686)
  • QueryJobSizeSummary (2790-2915)
  • QueryJobSizeSummary (2790-2792)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_runtime_status (162-162)
src/Utilities/PublicHeader/include/crane/String.h (1)
src/Utilities/PublicHeader/String.cpp (10)
  • FormatTime (527-533)
  • FormatTime (527-527)
  • TryPushHours (603-621)
  • TryPushHours (603-604)
  • TryPushMonths (538-567)
  • TryPushMonths (538-539)
  • TryPushDays (570-600)
  • TryPushDays (570-571)
  • EfficientSplitTimeRange (624-642)
  • EfficientSplitTimeRange (624-625)
src/CraneCtld/DbClient.h (1)
src/CraneCtld/DbClient.cpp (48)
  • MongodbClient (3029-3053)
  • MongodbClient (3055-3058)
  • Init (2981-3004)
  • Init (2981-2981)
  • InitTableIndexes (2944-2979)
  • InitTableIndexes (2944-2944)
  • CreateCollectionIndex (2051-2073)
  • CreateCollectionIndex (2051-2052)
  • RollupTypeToString (1866-1877)
  • RollupTypeToString (1866-1866)
  • UpdateSummaryLastSuccessTimeSec (1879-1913)
  • UpdateSummaryLastSuccessTimeSec (1879-1880)
  • GetSummaryLastSuccessTimeTm (1915-1962)
  • GetSummaryLastSuccessTimeTm (1915-1916)
  • NeedRollup (1964-1981)
  • NeedRollup (1964-1965)
  • RollupSummary (1983-2017)
  • RollupSummary (1983-1983)
  • ClusterRollupUsage (2019-2049)
  • ClusterRollupUsage (2019-2019)
  • GetCpusAlloc (2629-2638)
  • GetCpusAlloc (2629-2629)
  • GetTotalCpuTime (2640-2649)
  • GetTotalCpuTime (2640-2640)
  • GetTotalCount (2651-2658)
  • GetTotalCount (2651-2651)
  • WriteReply (2660-2682)
  • WriteReply (2660-2663)
  • QueryJobSizeSummaryByJobIds (2340-2492)
  • QueryJobSizeSummaryByJobIds (2340-2342)
  • JobQueryMatch (2494-2538)
  • JobQueryMatch (2494-2496)
  • QueryJobSummary (2684-2788)
  • QueryJobSummary (2684-2686)
  • JobSizeQueryMatch (2540-2604)
  • JobSizeQueryMatch (2540-2542)
  • QueryJobSizeSummary (2790-2915)
  • QueryJobSizeSummary (2790-2792)
  • AppendUnionWithRanges (2607-2627)
  • AppendUnionWithRanges (2607-2610)
  • AggregateJobSummaryByHour (2075-2196)
  • AggregateJobSummaryByHour (2075-2077)
  • AggregateJobSummaryByDayOrMonth (2198-2338)
  • AggregateJobSummaryByDayOrMonth (2198-2201)
  • MongoDbSumaryTh_ (3006-3027)
  • MongoDbSumaryTh_ (3006-3007)
  • MillisecondsToNextHour (2917-2942)
  • MillisecondsToNextHour (2917-2917)
src/CraneCtld/DbClient.cpp (4)
src/Utilities/PublicHeader/PublicHeader.cpp (1)
  • result (772-772)
src/Utilities/PublicHeader/String.cpp (6)
  • FormatTime (527-533)
  • FormatTime (527-527)
  • EfficientSplitTimeRange (624-642)
  • EfficientSplitTimeRange (624-625)
  • SetCurrentThreadName (354-362)
  • SetCurrentThreadName (354-354)
src/CraneCtld/DbClient.h (3)
  • string (71-73)
  • string (377-419)
  • MongodbClient (93-221)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (4)
  • QueryJobSummary (2046-2054)
  • QueryJobSummary (2046-2049)
  • QueryJobSizeSummary (2056-2068)
  • QueryJobSizeSummary (2056-2059)
🔇 Additional comments (8)
scripts/wipe_data.sh (3)

33-47: New import functions look good.

The batch_insert_collection function correctly uses --jsonArray for importing multiple documents within a single JSON array, while batch_insert_collection_process handles standard newline-delimited JSON format. Both functions properly inherit authentication and database parameters from the script context.


72-80: ✓ Critical issue resolved: Mode 3 correctly excluded from summary table wipes.

The logic now correctly isolates summary table wiping to modes 5, 7, and 8. Mode 3 (task_table) no longer triggers unintended deletion of derived summary data, fixing the critical inconsistency flagged in the prior review.


82-85: Verify relative path usage for mode 10 import.

The hardcoded relative path ./tasks_bulk.json assumes the script is executed from a specific working directory. If invoked from elsewhere (e.g., via cron, systemd, or absolute path call), the import will fail silently or with an unclear error.

Consider one of the following approaches:

  1. Use an absolute path (e.g., $base_dir/tasks_bulk.json if tasks are colocated with configs)
  2. Make it configurable via an environment variable or command-line parameter
  3. Document the required working directory in a comment if the current relative path is intentional

Verify the intended usage pattern and update accordingly.

src/CraneCtld/DbClient.h (5)

22-22: LGTM!

The include is necessary for the new gRPC streaming method signatures added to the MongodbClient class.


107-127: LGTM!

The RollupType enum and summary data structures are well-designed:

  • Explicit underlying type for the enum ensures predictable size
  • JobSizeSummaryKey properly implements AbslHashValue and operator== for use in hash maps
  • JobSizeSummaryResult is a straightforward aggregate

129-131: LGTM!

The lifecycle methods are appropriately declared. The separation of Init() from the constructor is reasonable for complex initialization involving thread startup.


372-379: LGTM!

The collection name constants follow the existing naming convention, and the batch size constant is appropriately defined as a compile-time constant with a clear explanatory comment.


388-392: LGTM!

The thread synchronization members are properly designed:

  • cluster_rollup_mutex_ protects the rollup critical section
  • m_mongodb_sum_thread_ manages the background rollup thread lifecycle
  • m_thread_stop_ uses std::atomic_bool for safe cross-thread signaling

The destructor properly coordinates thread shutdown by setting the flag and joining the thread.

@1daidai1 1daidai1 force-pushed the dev/mongodbtest branch 2 times, most recently from fbe22d2 to 8b946f0 Compare November 11, 2025 09:24
@1daidai1 1daidai1 changed the title Add Creport Cmd Add creport cmd Nov 12, 2025
@1daidai1 1daidai1 marked this pull request as draft November 12, 2025 02:39
@1daidai1 1daidai1 force-pushed the dev/mongodbtest branch 3 times, most recently from 04e0fa1 to 3b8661e Compare November 12, 2025 08:34
@1daidai1 1daidai1 marked this pull request as ready for review November 12, 2025 08:42
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/CraneCtld/CraneCtld.cpp (1)

776-789: Shut down Mongo rollup thread cleanly (reset g_db_client in Destroy).

g_db_client owns a background uvw loop/thread; it’s never reset here, so the thread can outlive shutdown. Reset it to join the thread and close UV handles deterministically.

Apply:

 void DestroyCtldGlobalVariables() {
   using namespace Ctld;

+  // Stop Mongo rollup thread and close UV handles before tearing down others.
+  g_db_client.reset();
+
   g_task_scheduler.reset();
   g_craned_keeper.reset();

   // In case that spdlog is destructed before g_embedded_db_client->Close()
   // in which log function is called.
   g_embedded_db_client.reset();

   g_thread_pool->wait();
   g_thread_pool.reset();
   g_plugin_client.reset();
 }
src/CraneCtld/DbClient.cpp (1)

460-468: Bug: wrong field name in CheckTaskDbIdExisted

Filter uses "job_db_inx", but documents store "task_db_id". This check will never match.

Apply:

-bool MongodbClient::CheckTaskDbIdExisted(int64_t task_db_id) {
-  document doc;
-  doc.append(kvp("job_db_inx", task_db_id));
+bool MongodbClient::CheckTaskDbIdExisted(int64_t task_db_id) {
+  document doc;
+  doc.append(kvp("task_db_id", task_db_id));
♻️ Duplicate comments (5)
scripts/wipe_data.sh (1)

3-3: Fix usage message formatting inconsistency.

The usage string still has uneven spacing around pipes: user_table| 7: lacks a space before |, while month_table |8: has a space but inconsistent placement. Normalize to use consistent formatting throughout.

-echo "Usage: $0 mode(1:acct_table | 2:qos_table | 3:task_table | 4:user_table | 5:all | 6:acct_table+qos_table+user_table| 7:hour_table+day_table+month_table |8: summary_time_table)"
+echo "Usage: $0 mode(1:acct_table | 2:qos_table | 3:task_table | 4:user_table | 5:all | 6:acct_table+qos_table+user_table | 7:hour_table+day_table+month_table | 8:summary_time_table)"
src/Utilities/PublicHeader/String.cpp (3)

538-567: Critical granularity bug and missing error checks remain unaddressed.

A previous review flagged these issues, but they have not been fully resolved:

  1. Granularity bug (Line 542): Checking only tm_hour == 0 treats times like 00:30:00 as month boundaries, causing incorrect aggregation into month collections.
  2. Missing error check (Line 553): mktime can return -1 on error, but this is not checked.

Apply the fixes from the previous review:

-  std::tm tm_cur{};
-  localtime_r(&cur, &tm_cur);
-  if (tm_cur.tm_mday == 1 && tm_cur.tm_hour == 0) {
+  std::tm tm_cur{};
+  localtime_r(&cur, &tm_cur);
+  if (tm_cur.tm_mday == 1 && tm_cur.tm_hour == 0 && tm_cur.tm_min == 0 && tm_cur.tm_sec == 0) {
     std::time_t month_start = cur;
     std::time_t month_cur = cur;
     while (true) {
       std::tm tm_next_month{};
       localtime_r(&month_cur, &tm_next_month);
       tm_next_month.tm_mon += 1;
       tm_next_month.tm_mday = 1;
       tm_next_month.tm_hour = 0;
       tm_next_month.tm_min = 0;
       tm_next_month.tm_sec = 0;
       std::time_t next_month_first = mktime(&tm_next_month);
+      if (next_month_first == -1) break;
       if (next_month_first <= end) {

570-598: Critical granularity bug and missing error checks remain unaddressed.

The same issues flagged in the previous review apply here:

  1. Granularity bug (Line 574): Checking only tm_hour == 0 treats times like 00:30:00 as day boundaries, causing incorrect day aggregation.
  2. Missing error checks (Lines 582, 589): mktime can return -1 on error, but this is not checked.

Apply the fixes:

-  std::tm tm_cur{};
-  localtime_r(&cur, &tm_cur);
-  if (tm_cur.tm_hour == 0) {
+  std::tm tm_cur{};
+  localtime_r(&cur, &tm_cur);
+  if (tm_cur.tm_hour == 0 && tm_cur.tm_min == 0 && tm_cur.tm_sec == 0) {
     std::time_t day_start = cur;
     std::tm tm_next_month = tm_cur;
     tm_next_month.tm_mon += 1;
     tm_next_month.tm_mday = 1;
     tm_next_month.tm_hour = 0;
     tm_next_month.tm_min = 0;
     tm_next_month.tm_sec = 0;
     std::time_t next_month_first = mktime(&tm_next_month);
+    if (next_month_first == -1) return false;

     std::tm tm_end{};
     localtime_r(&end, &tm_end);
     tm_end.tm_hour = 0;
     tm_end.tm_min = 0;
     tm_end.tm_sec = 0;
     std::time_t end_trunc_to_day = mktime(&tm_end);
+    if (end_trunc_to_day == -1) return false;

601-618: Missing error check for mktime.

Line 611: mktime can return -1 on error, but this is not checked, as flagged in the previous review.

Apply the fix:

     tm_next_day.tm_mday += 1;
     tm_next_day.tm_hour = 0;
     tm_next_day.tm_min = 0;
     tm_next_day.tm_sec = 0;
     std::time_t next_day = mktime(&tm_next_day);
+    if (next_day == -1) return false;
     std::time_t hour_end = std::min(next_day, end);
src/CraneCtld/DbClient.h (1)

158-203: Address previous review feedback.

The issues flagged in the previous review are still present:

  1. Typo on line 202: MongoDbSumaryTh_ should be MongoDbSummaryTh_ (missing 'm' in "Summary").

  2. Visibility concerns: Many methods in this section (lines 158-203) expose implementation details and should be moved to the private section. Only QueryJobSummary (lines 183-185) and QueryJobSizeSummary (lines 189-191) are called from CtldGrpcServer and need to remain public.

🧹 Nitpick comments (4)
protos/Crane.proto (1)

977-1006: Clarify time-bound semantics in doc comments

In summaries, end time is treated as exclusive ([start, end)); in per-ID path it’s inclusive ($lte). Document the intended contract to avoid boundary mismatches for clients.

Would you like me to add brief proto comments indicating whether filter_end_time is inclusive or exclusive for each RPC?

src/CraneCtld/DbClient.cpp (2)

2942-2977: Add raw-table indexes for new filters (perf)

QueryJobSizeSummaryByJobIds now filters on account/user/qos/partition/wckey/nodename_list/time_start/time_end without supporting indexes on the raw job table. Add them to avoid collection scans at scale.

Apply:

 bool MongodbClient::InitTableIndexes() {
   try {
     // Create index for the raw task table
     auto raw_table = (*GetClient_())[m_db_name_][m_task_collection_name_];
-    CreateCollectionIndex(raw_table, {"time_end"});
+    CreateCollectionIndex(raw_table, {"time_end"});
+    CreateCollectionIndex(raw_table, {"time_start"});
+    CreateCollectionIndex(raw_table, {"account"});
+    CreateCollectionIndex(raw_table, {"username"});
+    CreateCollectionIndex(raw_table, {"qos"});
+    CreateCollectionIndex(raw_table, {"partition_name"});
+    CreateCollectionIndex(raw_table, {"wckey"});
+    // nodename_list is an array -> multi-key index for $in on node names
+    CreateCollectionIndex(raw_table, {"nodename_list"});

2758-2762: Bound Mongo aggregation duration with maxTimeMS

Long-running aggregations can block CQ threads. Set a reasonable max_time to fail fast under load spikes.

Apply:

-    mongocxx::options::aggregate agg_opts;
+    mongocxx::options::aggregate agg_opts;
     agg_opts.allow_disk_use(true);
+    agg_opts.max_time(std::chrono::minutes(2));

Do this in both QueryJobSummary and QueryJobSizeSummary aggregate paths.

Also applies to: 2875-2878

src/CraneCtld/DbClient.h (1)

388-393: Consider removing redundant access specifier.

Line 391 contains a redundant private: label, as the private section already began at line 317. While not incorrect, removing it would improve clarity.

Apply this diff to remove the redundancy:

   std::mutex cluster_rollup_mutex_;
   std::thread m_mongodb_sum_thread_;
 
- private:
   std::atomic_bool m_thread_stop_{false};
   std::shared_ptr<uvw::timer_handle> mongodb_task_timer_handle;
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c3d6614 and 3b8661e.

📒 Files selected for processing (9)
  • protos/Crane.proto (2 hunks)
  • scripts/wipe_data.sh (3 hunks)
  • src/CraneCtld/CraneCtld.cpp (1 hunks)
  • src/CraneCtld/DbClient.cpp (11 hunks)
  • src/CraneCtld/DbClient.h (5 hunks)
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp (1 hunks)
  • src/CraneCtld/RpcService/CtldGrpcServer.h (1 hunks)
  • src/Utilities/PublicHeader/String.cpp (1 hunks)
  • src/Utilities/PublicHeader/include/crane/String.h (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/Utilities/PublicHeader/include/crane/String.h
  • src/CraneCtld/RpcService/CtldGrpcServer.h
🧰 Additional context used
🧠 Learnings (11)
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.

Applied to files:

  • src/Utilities/PublicHeader/String.cpp
  • src/CraneCtld/DbClient.h
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/DbClient.h
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
📚 Learning: 2025-04-27T06:47:51.751Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/DbClient.h
📚 Learning: 2025-04-02T10:11:33.562Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/CgroupManager.cpp:685-685
Timestamp: 2025-04-02T10:11:33.562Z
Learning: In the CgroupManager's GetJobBpfMapCgroupsV2 method, the developer has confirmed that cg_ino_job_id_map will always contain the key->cgroup_id element, making the CRANE_ASSERT check appropriate rather than requiring additional error handling.

Applied to files:

  • src/CraneCtld/DbClient.cpp
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.

Applied to files:

  • src/CraneCtld/DbClient.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-09T01:54:39.465Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1509-1517
Timestamp: 2025-05-09T01:54:39.465Z
Learning: The CraneSched project uses C++23, which supports Class Template Argument Deduction (CTAD) for standard containers like std::set and includes ranges support, making std::ranges::views::keys valid without additional headers.

Applied to files:

  • src/CraneCtld/DbClient.h
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/CraneCtld/DbClient.h
🧬 Code graph analysis (4)
scripts/wipe_data.sh (1)
scripts/wipe_data.py (1)
  • wipe_collection (49-57)
src/CraneCtld/DbClient.cpp (3)
src/Utilities/PublicHeader/String.cpp (4)
  • FormatTime (527-533)
  • FormatTime (527-527)
  • EfficientSplitTimeRange (621-639)
  • EfficientSplitTimeRange (621-622)
src/CraneCtld/DbClient.h (3)
  • string (71-73)
  • string (377-420)
  • MongodbClient (93-221)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (4)
  • QueryJobSummary (2046-2054)
  • QueryJobSummary (2046-2049)
  • QueryJobSizeSummary (2056-2068)
  • QueryJobSizeSummary (2056-2059)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (2)
src/CraneCtld/DbClient.cpp (4)
  • QueryJobSummary (2682-2786)
  • QueryJobSummary (2682-2684)
  • QueryJobSizeSummary (2788-2913)
  • QueryJobSizeSummary (2788-2790)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_runtime_status (162-162)
src/CraneCtld/DbClient.h (2)
src/CraneCtld/DbClient.cpp (48)
  • MongodbClient (3027-3051)
  • MongodbClient (3053-3056)
  • Init (2979-3002)
  • Init (2979-2979)
  • InitTableIndexes (2942-2977)
  • InitTableIndexes (2942-2942)
  • CreateCollectionIndex (2051-2073)
  • CreateCollectionIndex (2051-2052)
  • RollupTypeToString (1866-1877)
  • RollupTypeToString (1866-1866)
  • UpdateSummaryLastSuccessTimeSec (1879-1913)
  • UpdateSummaryLastSuccessTimeSec (1879-1880)
  • GetSummaryLastSuccessTimeTm (1915-1962)
  • GetSummaryLastSuccessTimeTm (1915-1916)
  • NeedRollup (1964-1981)
  • NeedRollup (1964-1965)
  • RollupSummary (1983-2017)
  • RollupSummary (1983-1983)
  • ClusterRollupUsage (2019-2049)
  • ClusterRollupUsage (2019-2019)
  • GetCpusAlloc (2627-2636)
  • GetCpusAlloc (2627-2627)
  • GetTotalCpuTime (2638-2647)
  • GetTotalCpuTime (2638-2638)
  • GetTotalCount (2649-2656)
  • GetTotalCount (2649-2649)
  • WriteReply (2658-2680)
  • WriteReply (2658-2661)
  • QueryJobSizeSummaryByJobIds (2338-2490)
  • QueryJobSizeSummaryByJobIds (2338-2340)
  • JobQueryMatch (2492-2536)
  • JobQueryMatch (2492-2494)
  • QueryJobSummary (2682-2786)
  • QueryJobSummary (2682-2684)
  • JobSizeQueryMatch (2538-2602)
  • JobSizeQueryMatch (2538-2540)
  • QueryJobSizeSummary (2788-2913)
  • QueryJobSizeSummary (2788-2790)
  • AppendUnionWithRanges (2605-2625)
  • AppendUnionWithRanges (2605-2608)
  • AggregateJobSummaryByHour (2075-2195)
  • AggregateJobSummaryByHour (2075-2077)
  • AggregateJobSummaryByDayOrMonth (2197-2336)
  • AggregateJobSummaryByDayOrMonth (2197-2200)
  • MongoDbSumaryTh_ (3004-3025)
  • MongoDbSumaryTh_ (3004-3005)
  • MillisecondsToNextHour (2915-2940)
  • MillisecondsToNextHour (2915-2915)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (4)
  • QueryJobSummary (2046-2054)
  • QueryJobSummary (2046-2049)
  • QueryJobSizeSummary (2056-2068)
  • QueryJobSizeSummary (2056-2059)
🔇 Additional comments (12)
scripts/wipe_data.sh (3)

33-47: New batch insertion functions look good.

The two new helper functions cleanly separate array-based imports (--jsonArray) from single-document imports. Clear naming and consistent parameter handling.


72-76: Mode 3 correctly removed from summary table wiping.

The summary table wiping conditions now only trigger for modes 5 and 7, aligning with documented behavior.


78-80: Mode 3 correctly removed from summary_time_table wiping.

The summary_time_table wiping now only triggers for modes 5 and 8, as intended.

src/Utilities/PublicHeader/String.cpp (2)

527-533: LGTM!

The function correctly uses localtime_r (thread-safe on POSIX) and std::strftime for formatting.


621-639: Logic is sound, assuming helper functions are fixed.

The orchestration logic correctly tries months, then days, then hours, with a fallback for partial intervals. Once the critical bugs in TryPushMonths, TryPushDays, and TryPushHours are addressed, this function should work correctly.

src/CraneCtld/CraneCtld.cpp (1)

884-889: Good: gate readiness on DB Init()

Init() result is now checked before setting srv_ready. This prevents RPCs against a half-initialized DB client.

src/CraneCtld/DbClient.cpp (3)

1535-1567: Persist nodename_list in job docs

Capturing craned_ids into nodename_list and carrying it into summaries looks correct and enables node-based filters.

Also applies to: 1578-1615


2051-2185: Note on summary time window and fields

The hour rollup matches jobs whose time_end ∈ [hour, hour+3600) and groups on cpus_alloc, partition_name, and nodename_list. This is consistent; just ensure clients expect “bucketed by end time”.

Do you want me to add a short comment in code to lock in this invariant?


2984-3001: Timer lifetime fix looks good

Storing uvw timer in a member keeps the handle alive past Init(); destructor joins the thread — safe cleanup path.

src/CraneCtld/DbClient.h (3)

107-127: LGTM!

The RollupType enum and data structures are well-defined. The JobSizeSummaryKey struct correctly implements both hash function (via AbslHashValue) and equality operator for use in hash-based containers, and JobSizeSummaryResult properly initializes its members.


129-131: LGTM!

Constructor, destructor, and Init() method declarations are appropriately placed in the public interface.


372-379: LGTM!

The collection name constants and MaxJobSummaryBatchSize are properly defined and follow the existing naming conventions.

@github-actions github-actions bot added the test-passed Build and test success label Nov 12, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
docs/zh/command/creport.md (1)

30-30: Minor: Chinese character consistency in account terminology.

Lines 30 both use "帐户" (variant for "account"), while standard modern Chinese documentation typically uses "账户" consistently. Consider standardizing to "账户" throughout for consistency with other documentation files.

This is a minor style preference issue. Both forms are technically valid, but "账户" is the more commonly standardized form in modern Chinese technical documentation.

Also applies to: 30-30

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b8661e and 0e4afaa.

⛔ Files ignored due to path filters (10)
  • docs/images/creport/accountutilizationbyqos.png is excluded by !**/*.png
  • docs/images/creport/accountutilizationbyuser.png is excluded by !**/*.png
  • docs/images/creport/sizesbyaccount.png is excluded by !**/*.png
  • docs/images/creport/sizesbyaccountandwckey.png is excluded by !**/*.png
  • docs/images/creport/sizesbywckey.png is excluded by !**/*.png
  • docs/images/creport/topusr.png is excluded by !**/*.png
  • docs/images/creport/userutilizationbyaccount.png is excluded by !**/*.png
  • docs/images/creport/userutilizationbywckey.png is excluded by !**/*.png
  • docs/images/creport/utilization.png is excluded by !**/*.png
  • docs/images/creport/wckeyutilizationbyuser.png is excluded by !**/*.png
📒 Files selected for processing (13)
  • docs/en/command/cacct.md (1 hunks)
  • docs/en/command/cbatch.md (1 hunks)
  • docs/en/command/ccancel.md (1 hunks)
  • docs/en/command/ceff.md (1 hunks)
  • docs/en/command/cqueue.md (1 hunks)
  • docs/en/command/creport.md (1 hunks)
  • docs/zh/command/cacct.md (1 hunks)
  • docs/zh/command/cbatch.md (1 hunks)
  • docs/zh/command/ccancel.md (1 hunks)
  • docs/zh/command/ceff.md (1 hunks)
  • docs/zh/command/cqueue.md (1 hunks)
  • docs/zh/command/creport.md (1 hunks)
  • mkdocs.yaml (1 hunks)
✅ Files skipped from review due to trivial changes (4)
  • docs/en/command/cqueue.md
  • docs/en/command/cbatch.md
  • docs/zh/command/cacct.md
  • docs/zh/command/cqueue.md
🧰 Additional context used
🪛 LanguageTool
docs/en/command/creport.md

[grammar] ~20-~20: Ensure spelling is correct
Context: ...ER**: Cluster name - LOGIN: Username - PORPER NAME: Full Linux system name - **ACCO...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~54-~54: Ensure spelling is correct
Context: ...NT**: Account name - LOGIN: Username - PORPER NAME: Full Linux system name - **USED...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~85-~85: Ensure spelling is correct
Context: ...ER**: Cluster name - LOGIN: Username - PORPER NAME: Full Linux system name - **ACCO...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~117-~117: Ensure spelling is correct
Context: ...ER**: Cluster name - LOGIN: Username - PORPER NAME: Full Linux system name - **WCKE...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~148-~148: Ensure spelling is correct
Context: ...WCKEY**:wckey name - LOGIN: Username - PORPER NAME: Full Linux system name - **USED...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

docs/zh/command/creport.md

[uncategorized] ~30-~30: 您的意思是“"账"户”?
Context: ...定多个用户时用逗号隔开 - --group bool: 将每个用户的所有帐户分组在一起(默认为每个用户和帐户引用的单独条目) - **-t, --time...

(ZHANG7_ZHANG8)


[uncategorized] ~30-~30: 您的意思是“"账"户”?
Context: ...--group bool**: 将每个用户的所有帐户分组在一起(默认为每个用户和帐户引用的单独条目) - -t, --time string: 指定输出作...

(ZHANG7_ZHANG8)

🔇 Additional comments (8)
docs/en/command/ccancel.md (1)

212-212: LGTM!

The addition of the creport reference is consistent with the "See Also" section structure and maintains documentation parity with other command references.

docs/zh/command/ceff.md (1)

95-95: LGTM!

The addition of the creport reference is consistent with the "相关命令" (Related Commands) section and maintains documentation parity with other entries.

docs/en/command/cacct.md (1)

277-277: LGTM!

The addition of the creport reference is consistent with the "See Also" section and properly positioned alongside other command references.

docs/zh/command/ccancel.md (1)

212-212: LGTM!

The addition of the creport reference is consistent with the "另请参阅" (See Also) section structure and maintains documentation parity with other command references.

mkdocs.yaml (1)

167-167: LGTM!

The creport navigation entry is properly formatted, correctly placed under the "Job Monitoring" section, and follows YAML syntax conventions established in the file.

docs/zh/command/cbatch.md (1)

338-338: LGTM!

The addition of the creport reference is consistent with the "另请参阅" (See Also) section and maintains documentation parity with other related command references.

docs/en/command/ceff.md (1)

95-95: LGTM!

The addition of the creport reference is consistent with the "Related Commands" section and properly positioned with other command references.

docs/zh/command/creport.md (1)

1-361: Comprehensive documentation structure approved.

The new creport documentation is well-organized, covering multiple subcommands with consistent output fields, options, and usage examples. Documentation structure aligns with existing command docs.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0e4afaa and d26ab58.

⛔ Files ignored due to path filters (10)
  • docs/images/creport/accountutilizationbyqos.png is excluded by !**/*.png
  • docs/images/creport/accountutilizationbyuser.png is excluded by !**/*.png
  • docs/images/creport/sizesbyaccount.png is excluded by !**/*.png
  • docs/images/creport/sizesbyaccountandwckey.png is excluded by !**/*.png
  • docs/images/creport/sizesbywckey.png is excluded by !**/*.png
  • docs/images/creport/topusr.png is excluded by !**/*.png
  • docs/images/creport/userutilizationbyaccount.png is excluded by !**/*.png
  • docs/images/creport/userutilizationbywckey.png is excluded by !**/*.png
  • docs/images/creport/utilization.png is excluded by !**/*.png
  • docs/images/creport/wckeyutilizationbyuser.png is excluded by !**/*.png
📒 Files selected for processing (13)
  • docs/en/command/cacct.md (1 hunks)
  • docs/en/command/cbatch.md (1 hunks)
  • docs/en/command/ccancel.md (1 hunks)
  • docs/en/command/ceff.md (1 hunks)
  • docs/en/command/cqueue.md (1 hunks)
  • docs/en/command/creport.md (1 hunks)
  • docs/zh/command/cacct.md (1 hunks)
  • docs/zh/command/cbatch.md (1 hunks)
  • docs/zh/command/ccancel.md (1 hunks)
  • docs/zh/command/ceff.md (1 hunks)
  • docs/zh/command/cqueue.md (1 hunks)
  • docs/zh/command/creport.md (1 hunks)
  • mkdocs.yaml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • docs/en/command/cqueue.md
🚧 Files skipped from review as they are similar to previous changes (6)
  • docs/en/command/cacct.md
  • docs/zh/command/cacct.md
  • docs/en/command/creport.md
  • docs/en/command/cbatch.md
  • docs/zh/command/cqueue.md
  • docs/zh/command/ccancel.md
🧰 Additional context used
🪛 LanguageTool
docs/zh/command/creport.md

[uncategorized] ~30-~30: 您的意思是“"账"户”?
Context: ...定多个用户时用逗号隔开 - --group bool: 将每个用户的所有帐户分组在一起(默认为每个用户和帐户引用的单独条目) - **-t, --time...

(ZHANG7_ZHANG8)


[uncategorized] ~30-~30: 您的意思是“"账"户”?
Context: ...--group bool**: 将每个用户的所有帐户分组在一起(默认为每个用户和帐户引用的单独条目) - -t, --time string: 指定输出作...

(ZHANG7_ZHANG8)

🪛 markdownlint-cli2 (0.18.1)
docs/zh/command/creport.md

13-13: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


42-42: Multiple headings with the same content

(MD024, no-duplicate-heading)


46-46: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


78-78: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


110-110: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


140-140: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


170-170: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


201-201: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


230-230: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


270-270: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


310-310: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


349-349: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🔇 Additional comments (5)
docs/en/command/ccancel.md (1)

212-212: LGTM!

The cross-reference to creport is properly placed in the "See Also" section and follows the existing documentation pattern.

docs/zh/command/ceff.md (1)

95-95: LGTM!

The creport reference is properly added to the "相关命令" section, maintaining consistency with the English documentation and existing link patterns.

docs/en/command/ceff.md (1)

95-95: LGTM!

The creport reference is properly added to the "Related Commands" section, maintaining consistency with the documentation structure and link patterns.

mkdocs.yaml (1)

167-167: LGTM!

The creport navigation entry is correctly placed under the "Job Monitoring" section with proper YAML syntax and indentation. The file path is consistent with the documentation structure.

docs/zh/command/cbatch.md (1)

338-338: LGTM!

The creport reference is properly added to the "另请参阅" section, maintaining consistency with the cbatch documentation structure and link patterns.

@1daidai1 1daidai1 force-pushed the dev/mongodbtest branch 2 times, most recently from cebb70e to 14bd0f3 Compare November 18, 2025 02:49
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
scripts/wipe_data.sh (1)

3-3: Fix usage message formatting (formatting issue persists from previous review).

The usage string still has inconsistent spacing and formatting despite being marked as addressed in commit fbe22d2:

  • After mode 6, missing space before pipe: user_table| 7:user_table | 7:
  • Before mode 8, incorrect spacing: |8: summary| 8:summary

Apply this diff to achieve consistent formatting across all modes:

-echo "Usage: $0 mode(1:acct_table | 2:qos_table | 3:task_table | 4:user_table | 5:all | 6:acct_table+qos_table+user_table| 7:hour_table+day_table+month_table |8: summary_time_table)"
+echo "Usage: $0 mode(1:acct_table | 2:qos_table | 3:task_table | 4:user_table | 5:all | 6:acct_table+qos_table+user_table | 7:hour_table+day_table+month_table | 8:summary_time_table)"
🧹 Nitpick comments (1)
scripts/wipe_data.sh (1)

33-39: Remove unused batch_insert_collection() function.

This function is defined but never called in the script. Line 83 uses batch_insert_collection_process() instead. If this function is intended for future use, document it; otherwise, remove it to avoid dead code.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d26ab58 and 14bd0f3.

⛔ Files ignored due to path filters (10)
  • docs/images/creport/accountutilizationbyqos.png is excluded by !**/*.png
  • docs/images/creport/accountutilizationbyuser.png is excluded by !**/*.png
  • docs/images/creport/sizesbyaccount.png is excluded by !**/*.png
  • docs/images/creport/sizesbyaccountandwckey.png is excluded by !**/*.png
  • docs/images/creport/sizesbywckey.png is excluded by !**/*.png
  • docs/images/creport/topusr.png is excluded by !**/*.png
  • docs/images/creport/userutilizationbyaccount.png is excluded by !**/*.png
  • docs/images/creport/userutilizationbywckey.png is excluded by !**/*.png
  • docs/images/creport/utilization.png is excluded by !**/*.png
  • docs/images/creport/wckeyutilizationbyuser.png is excluded by !**/*.png
📒 Files selected for processing (14)
  • docs/en/command/cacct.md (1 hunks)
  • docs/en/command/cbatch.md (1 hunks)
  • docs/en/command/ccancel.md (1 hunks)
  • docs/en/command/ceff.md (1 hunks)
  • docs/en/command/cqueue.md (1 hunks)
  • docs/en/command/creport.md (1 hunks)
  • docs/zh/command/cacct.md (1 hunks)
  • docs/zh/command/cbatch.md (1 hunks)
  • docs/zh/command/ccancel.md (1 hunks)
  • docs/zh/command/ceff.md (1 hunks)
  • docs/zh/command/cqueue.md (1 hunks)
  • docs/zh/command/creport.md (1 hunks)
  • mkdocs.yaml (1 hunks)
  • scripts/wipe_data.sh (3 hunks)
✅ Files skipped from review due to trivial changes (1)
  • mkdocs.yaml
🚧 Files skipped from review as they are similar to previous changes (8)
  • docs/en/command/cacct.md
  • docs/en/command/ccancel.md
  • docs/en/command/ceff.md
  • docs/zh/command/cbatch.md
  • docs/zh/command/ceff.md
  • docs/zh/command/ccancel.md
  • docs/en/command/cbatch.md
  • docs/en/command/creport.md
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/wipe_data.sh (1)
scripts/wipe_data.py (1)
  • wipe_collection (49-57)
🪛 markdownlint-cli2 (0.18.1)
docs/zh/command/creport.md

349-349: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🔇 Additional comments (7)
scripts/wipe_data.sh (4)

41-47: batch_insert_collection_process() implementation looks correct.

The function properly imports newline-delimited JSON (omits --jsonArray), distinct from the unused array variant above. The credential passing and database selection are correct.


72-76: Mode 7 summary table wiping logic is correct.

Conditions properly restrict to modes 5 and 7 only, eliminating the unintended wipe of summary tables when mode 3 (task_table) is selected. This correctly addresses the previous critical issue.


78-80: Mode 8 summary_time_table wiping logic is correct.

Conditions properly restrict to modes 5 and 8 only. The fix from the previous review that removed mode 3 is correctly in place.


82-89: Mode 10 batch insert with error handling is properly implemented.

The script now correctly captures the exit status of batch_insert_collection_process, logs an error on failure, and exits with status 1. This addresses the previous critical issue where import failures were silently masked.

docs/en/command/cqueue.md (1)

213-213: LGTM! The new creport link follows the established pattern and correctly references the related command documentation.

docs/zh/command/cqueue.md (1)

213-213: LGTM! The new creport link follows the established pattern and correctly references the related command documentation.

docs/zh/command/cacct.md (1)

277-277: LGTM! The new creport link follows the established pattern and correctly references the related command documentation.

@1daidai1 1daidai1 force-pushed the dev/mongodbtest branch 2 times, most recently from 534afe7 to 656b876 Compare November 18, 2025 07:30
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
scripts/wipe_data.sh (1)

33-47: Clarify or consolidate nearly-identical batch insert functions.

Both functions are duplicates except for the --jsonArray flag. The naming distinction (_process suffix) doesn't clearly indicate that one expects an array JSON file and the other expects line-delimited JSON (JSONL). Additionally, batch_insert_collection (lines 33–39) appears unused—only batch_insert_collection_process is called in mode 10.

Consider renaming for clarity (e.g., batch_insert_json_array vs. batch_insert_jsonl), or consolidate them into a single function with an optional parameter if only one format is needed. If batch_insert_collection is genuinely unused, remove it to avoid maintenance overhead.

docs/zh/command/creport.md (1)

289-289: Add missing space after colon in list items.

Lines 289 and 330 lack a space between the colon and description text, breaking the formatting pattern established throughout the file. Compare with line 329 which correctly has the space:

-- **-w --wckeys string**:指定查询的wckeys,指定多个wckey时用逗号隔开
+- **-w --wckeys string**: 指定查询的wckeys,指定多个wckey时用逗号隔开

Apply the same fix to both line 289 and 330.

Also applies to: 330-330

docs/en/command/creport.md (1)

289-289: Add missing space after colon in list items.

Lines 289 and 330 lack a space between the colon and description text, breaking the formatting pattern established throughout the file. Compare with surrounding lines which correctly have the space:

-- **-w --wckeys string**:Specify WCKEYs to query, use commas to separate multiple WCKEYs
+- **-w --wckeys string**: Specify WCKEYs to query, use commas to separate multiple WCKEYs

Apply the same fix to both line 289 and 330 for consistency.

Also applies to: 330-330

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14bd0f3 and 656b876.

⛔ Files ignored due to path filters (10)
  • docs/images/creport/accountutilizationbyqos.png is excluded by !**/*.png
  • docs/images/creport/accountutilizationbyuser.png is excluded by !**/*.png
  • docs/images/creport/sizesbyaccount.png is excluded by !**/*.png
  • docs/images/creport/sizesbyaccountandwckey.png is excluded by !**/*.png
  • docs/images/creport/sizesbywckey.png is excluded by !**/*.png
  • docs/images/creport/topusr.png is excluded by !**/*.png
  • docs/images/creport/userutilizationbyaccount.png is excluded by !**/*.png
  • docs/images/creport/userutilizationbywckey.png is excluded by !**/*.png
  • docs/images/creport/utilization.png is excluded by !**/*.png
  • docs/images/creport/wckeyutilizationbyuser.png is excluded by !**/*.png
📒 Files selected for processing (14)
  • docs/en/command/cacct.md (1 hunks)
  • docs/en/command/cbatch.md (1 hunks)
  • docs/en/command/ccancel.md (1 hunks)
  • docs/en/command/ceff.md (1 hunks)
  • docs/en/command/cqueue.md (1 hunks)
  • docs/en/command/creport.md (1 hunks)
  • docs/zh/command/cacct.md (1 hunks)
  • docs/zh/command/cbatch.md (1 hunks)
  • docs/zh/command/ccancel.md (1 hunks)
  • docs/zh/command/ceff.md (1 hunks)
  • docs/zh/command/cqueue.md (1 hunks)
  • docs/zh/command/creport.md (1 hunks)
  • mkdocs.yaml (1 hunks)
  • scripts/wipe_data.sh (3 hunks)
✅ Files skipped from review due to trivial changes (1)
  • docs/zh/command/cbatch.md
🚧 Files skipped from review as they are similar to previous changes (5)
  • docs/en/command/cacct.md
  • docs/zh/command/ccancel.md
  • mkdocs.yaml
  • docs/zh/command/cqueue.md
  • docs/en/command/cqueue.md
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/wipe_data.sh (1)
scripts/wipe_data.py (1)
  • wipe_collection (49-57)
🪛 markdownlint-cli2 (0.18.1)
docs/zh/command/creport.md

349-349: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🔇 Additional comments (7)
scripts/wipe_data.sh (2)

72-80: Correct application of mode-specific summary table wiping logic.

The removal of mode 3 from the summary table conditions (lines 72 and 78) correctly aligns the implementation with the documented behavior in the usage message. Mode 3 now exclusively wipes task_table as documented, and modes 7 and 8 properly isolate their respective summary table subsets.


82-89: Error handling for mode 10 import is properly implemented.

The script now correctly captures the exit status of batch_insert_collection_process, logs an informative error message on failure, and exits with a non-zero status. This properly propagates import failures to callers and CI systems, resolving the earlier flagged issue.

docs/en/command/ccancel.md (1)

212-212: LGTM!

Cross-reference addition is correctly formatted and placed.

docs/en/command/ceff.md (1)

95-95: LGTM!

Cross-reference addition is correctly formatted and placed.

docs/zh/command/cacct.md (1)

277-277: LGTM!

Cross-reference addition is correctly formatted and placed.

docs/zh/command/ceff.md (1)

95-95: LGTM!

Cross-reference addition is correctly formatted and placed.

docs/en/command/cbatch.md (1)

338-338: LGTM!

Cross-reference addition is correctly formatted and placed.

@1daidai1 1daidai1 force-pushed the dev/mongodbtest branch 5 times, most recently from 2ea5ffb to 5a0f4e7 Compare November 18, 2025 08:14
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (2)
scripts/wipe_data.sh (2)

33-47: Consolidate duplicate batch insert functions.

batch_insert_collection() and batch_insert_collection_process() are nearly identical, differing only by the --jsonArray flag. Consolidate into one function with a parameter to control this behavior:

-function batch_insert_collection() {
-  collection=$1
-  json_file=$2
-  mongoimport --username "$username" --password "$password" --host "$host" --port "$port" \
-    --authenticationDatabase admin --db "$dbname" --collection "$collection" \
-    --file "$json_file" --jsonArray
-}
-
-function batch_insert_collection_process() {
-  collection=$1
-  json_file=$2
-  mongoimport --username "$username" --password "$password" --host "$host" --port "$port" \
-    --authenticationDatabase admin --db "$dbname" --collection "$collection" \
-    --file "$json_file"
-}
+function batch_insert_collection() {
+  collection=$1
+  json_file=$2
+  is_array_format=${3:-0}  # Default to 0 (JSONL format)
+  
+  local extra_args=""
+  if [ "$is_array_format" -eq 1 ]; then
+    extra_args="--jsonArray"
+  fi
+  
+  mongoimport --username "$username" --password "$password" --host "$host" --port "$port" \
+    --authenticationDatabase admin --db "$dbname" --collection "$collection" \
+    --file "$json_file" $extra_args
+}

Then update line 83 to batch_insert_collection "task_table" "./tasks_bulk.json" 0.


82-89: Verify hardcoded path portability and consider making it configurable.

Line 83 uses a hardcoded path ./tasks_bulk.json. This may not work reliably depending on the working directory from which the script is invoked. Consider making it a parameter or deriving it from a configuration variable (e.g., $base_dir):

 if [ "$mode" -eq 10 ]; then
-  batch_insert_collection_process "task_table" "./tasks_bulk.json"
+  tasks_file="${1:-.}/tasks_bulk.json"  # Allow override via second arg, default to ./
+  batch_insert_collection_process "task_table" "$tasks_file"

Alternatively, document the expected working directory or file location in the usage message.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 656b876 and 5a0f4e7.

⛔ Files ignored due to path filters (10)
  • docs/images/creport/accountutilizationbyqos.png is excluded by !**/*.png
  • docs/images/creport/accountutilizationbyuser.png is excluded by !**/*.png
  • docs/images/creport/sizesbyaccount.png is excluded by !**/*.png
  • docs/images/creport/sizesbyaccountandwckey.png is excluded by !**/*.png
  • docs/images/creport/sizesbywckey.png is excluded by !**/*.png
  • docs/images/creport/topusr.png is excluded by !**/*.png
  • docs/images/creport/userutilizationbyaccount.png is excluded by !**/*.png
  • docs/images/creport/userutilizationbywckey.png is excluded by !**/*.png
  • docs/images/creport/utilization.png is excluded by !**/*.png
  • docs/images/creport/wckeyutilizationbyuser.png is excluded by !**/*.png
📒 Files selected for processing (14)
  • docs/en/command/cacct.md (1 hunks)
  • docs/en/command/cbatch.md (1 hunks)
  • docs/en/command/ccancel.md (1 hunks)
  • docs/en/command/ceff.md (1 hunks)
  • docs/en/command/cqueue.md (1 hunks)
  • docs/en/command/creport.md (1 hunks)
  • docs/zh/command/cacct.md (1 hunks)
  • docs/zh/command/cbatch.md (1 hunks)
  • docs/zh/command/ccancel.md (1 hunks)
  • docs/zh/command/ceff.md (1 hunks)
  • docs/zh/command/cqueue.md (1 hunks)
  • docs/zh/command/creport.md (1 hunks)
  • mkdocs.yaml (1 hunks)
  • scripts/wipe_data.sh (3 hunks)
✅ Files skipped from review due to trivial changes (1)
  • docs/en/command/ceff.md
🚧 Files skipped from review as they are similar to previous changes (6)
  • docs/en/command/cacct.md
  • docs/en/command/cqueue.md
  • docs/en/command/ccancel.md
  • docs/zh/command/ccancel.md
  • docs/zh/command/cqueue.md
  • docs/en/command/cbatch.md
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/wipe_data.sh (1)
scripts/wipe_data.py (1)
  • wipe_collection (49-57)
🪛 markdownlint-cli2 (0.18.1)
docs/zh/command/creport.md

349-349: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🔇 Additional comments (5)
docs/zh/command/cacct.md (1)

277-277: LGTM!

Cross-reference link is properly formatted and contextually appropriate within the "另请参阅" section.

docs/zh/command/ceff.md (1)

95-95: LGTM!

Cross-reference link is properly formatted and contextually appropriate within the "相关命令" section.

docs/zh/command/cbatch.md (1)

338-338: LGTM!

Cross-reference link is properly formatted and contextually appropriate within the "另请参阅" section.

mkdocs.yaml (1)

167-167: LGTM!

Navigation entry correctly placed under "Job Monitoring" section with consistent formatting and file path structure.

docs/zh/command/creport.md (1)

1-361: LGTM!

The Chinese documentation is comprehensive and well-structured. All major formatting issues from previous reviews (MD036 headings, terminology consistency, markdown list formatting) have been properly addressed. The content clearly documents all creport subcommands with appropriate sections for usage, options, and examples.

#!/bin/bash

echo "Usage: $0 mode(1:acct_table | 2:qos_table | 3:task_table | 4:user_table | 5:all | 6:acct_table+qos_table+user_table)"
echo "Usage: $0 mode(1:acct_table | 2:qos_table | 3:task_table | 4:user_table | 5:all | 6:acct_table+qos_table+user_table| 7:hour_table+day_table+month_table |8: summary_time_table)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Update usage message: fix formatting and document mode 10.

The usage message has two issues:

  1. Inconsistent spacing around pipes and colons (compare | 6:...| 7: vs. |8: summary_time_table)
  2. Mode 10 (task import) is implemented but undocumented in the usage string
-echo "Usage: $0 mode(1:acct_table | 2:qos_table | 3:task_table | 4:user_table | 5:all | 6:acct_table+qos_table+user_table| 7:hour_table+day_table+month_table |8: summary_time_table)"
+echo "Usage: $0 mode(1:acct_table | 2:qos_table | 3:task_table | 4:user_table | 5:all | 6:acct_table+qos_table+user_table | 7:hour_table+day_table+month_table | 8:summary_time_table | 10:import_tasks)"
🤖 Prompt for AI Agents
In scripts/wipe_data.sh around line 3, the usage message has inconsistent
spacing around pipes and colons and omits the implemented mode 10; update the
echo string to use consistent spacing (space before and after each pipe and
after each colon) and add a clear entry for "10: task_import" (or the exact mode
name used in the script) so the usage lists modes 1 through 10 consistently and
readably.

@github-actions github-actions bot added docs and removed test-passed Build and test success labels Nov 27, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
scripts/wipe_data.sh (1)

3-3: Document mode 10 in the usage message.

The usage message lists modes 1-8, but mode 10 (task import) is implemented at lines 82-89 and is not documented here. Users won't know this mode exists.

-echo "Usage: $0 mode(1:acct_table | 2:qos_table | 3:task_table | 4:user_table | 5:all | 6:acct_table+qos_table+user_table| 7:hour_table+day_table+month_table |8: summary_time_table)"
+echo "Usage: $0 mode(1:acct_table | 2:qos_table | 3:task_table | 4:user_table | 5:all | 6:acct_table+qos_table+user_table | 7:hour_table+day_table+month_table | 8:summary_time_table | 10:import_tasks)"
src/CraneCtld/DbClient.h (1)

202-202: Typo in method name.

MongoDbSumaryTh_ should be MongoDbSummaryTh_ (missing 'm' in "Summary").

Apply this diff:

-  void MongoDbSumaryTh_(const std::shared_ptr<uvw::loop>& uvw_loop);
+  void MongoDbSummaryTh_(const std::shared_ptr<uvw::loop>& uvw_loop);

Remember to update the implementation in DbClient.cpp as well.

src/Utilities/PublicHeader/String.cpp (1)

544-656: Granularity alignment for month/day ranges is still off at 00:mm:ss starts.

The switch to localtime_r is good for thread‑safety, but the earlier “granularity alignment” issue remains: you still treat any time at tm_hour == 0 (and tm_mday == 1 for months) as a full day/month, regardless of minutes/seconds.

Concretely, for ranges like:

  • start = 2025‑05‑01 00:05, end = 2025‑07‑03 12:34
    • TryPushMonths will currently accept this as a "month" segment starting at 00:05 if end is far enough ahead.
  • start = 2025‑05‑01 00:05, end = 2025‑05‑03 00:00
    • TryPushDays will produce a "day" segment from 00:05 on the 1st through 00:00 on the 3rd.

In both cases, the first partial day/hour is labeled as a "day"/"month" interval even though it doesn’t start exactly at midnight. This is risky if you’re using these labels to decide which MongoDB rollup collection (month/day/hour) to query; you can end up querying a coarser collection that assumes [00:00, 00:00) alignment and miscounting the leading minutes/seconds.

Suggested fix (same spirit as the previous review):

  • Tighten the preconditions before merging months/days, e.g.:
-  if (tm_cur.tm_mday == 1 && tm_cur.tm_hour == 0) {
+  if (tm_cur.tm_mday == 1 && tm_cur.tm_hour == 0 &&
+      tm_cur.tm_min == 0 && tm_cur.tm_sec == 0) {
-  if (tm_cur.tm_hour == 0) {
+  if (tm_cur.tm_hour == 0 && tm_cur.tm_min == 0 && tm_cur.tm_sec == 0) {
  • Optionally also check return values from localtime_r/mktime and early‑return false if they fail, to avoid undefined behavior on out‑of‑range timestamps or pathological DST transitions.

This keeps the “month”/“day” buckets strictly aligned to exact local midnights, and any non‑aligned range start will naturally fall back to "hour" segments.

🧹 Nitpick comments (1)
src/Utilities/PublicHeader/include/crane/String.h (1)

42-47: Narrow public API surface for time ranges; optionally strengthen type.

TimeRange and EfficientSplitTimeRange make sense as public utilities, but the TryPushMonths/Days/Hours helpers look like purely internal orchestration details that are only used in String.cpp. Exposing them in the header grows the public API surface and makes it harder to change the splitting strategy later.

Consider:

  • Keeping only TimeRange and EfficientSplitTimeRange in the header, and moving TryPushMonths/Days/Hours to an anonymous namespace in String.cpp (or making them static), unless you explicitly intend them to be part of the stable API.
  • Optionally replacing the std::string type discriminator ("month"/"day"/"hour") with an enum class (and converting to string only at the boundary where you serialize or log) to avoid magic strings and improve type safety.
  • Optionally including <vector> and <ctime> here so the header does not rely on transitive includes for std::vector and std::time_t.

Also applies to: 138-146

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e1b19b6 and e164202.

⛔ Files ignored due to path filters (10)
  • docs/images/creport/accountutilizationbyqos.png is excluded by !**/*.png
  • docs/images/creport/accountutilizationbyuser.png is excluded by !**/*.png
  • docs/images/creport/sizesbyaccount.png is excluded by !**/*.png
  • docs/images/creport/sizesbyaccountandwckey.png is excluded by !**/*.png
  • docs/images/creport/sizesbywckey.png is excluded by !**/*.png
  • docs/images/creport/topusr.png is excluded by !**/*.png
  • docs/images/creport/userutilizationbyaccount.png is excluded by !**/*.png
  • docs/images/creport/userutilizationbywckey.png is excluded by !**/*.png
  • docs/images/creport/utilization.png is excluded by !**/*.png
  • docs/images/creport/wckeyutilizationbyuser.png is excluded by !**/*.png
📒 Files selected for processing (24)
  • docs/en/command/cacct.md (1 hunks)
  • docs/en/command/cbatch.md (1 hunks)
  • docs/en/command/ccancel.md (1 hunks)
  • docs/en/command/ceff.md (1 hunks)
  • docs/en/command/cqueue.md (1 hunks)
  • docs/en/command/creport.md (1 hunks)
  • docs/zh/command/cacct.md (1 hunks)
  • docs/zh/command/cbatch.md (1 hunks)
  • docs/zh/command/ccancel.md (1 hunks)
  • docs/zh/command/ceff.md (1 hunks)
  • docs/zh/command/cqueue.md (1 hunks)
  • docs/zh/command/creport.md (1 hunks)
  • mkdocs.yaml (1 hunks)
  • protos/Crane.proto (2 hunks)
  • protos/PublicDefs.proto (1 hunks)
  • scripts/wipe_data.sh (3 hunks)
  • src/CraneCtld/CraneCtld.cpp (1 hunks)
  • src/CraneCtld/CtldPreCompiledHeader.h (1 hunks)
  • src/CraneCtld/DbClient.cpp (11 hunks)
  • src/CraneCtld/DbClient.h (5 hunks)
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp (1 hunks)
  • src/CraneCtld/RpcService/CtldGrpcServer.h (1 hunks)
  • src/Utilities/PublicHeader/String.cpp (1 hunks)
  • src/Utilities/PublicHeader/include/crane/String.h (2 hunks)
✅ Files skipped from review due to trivial changes (2)
  • docs/en/command/cqueue.md
  • docs/zh/command/ccancel.md
🚧 Files skipped from review as they are similar to previous changes (9)
  • docs/en/command/creport.md
  • docs/zh/command/cbatch.md
  • protos/PublicDefs.proto
  • docs/zh/command/ceff.md
  • src/CraneCtld/CtldPreCompiledHeader.h
  • docs/zh/command/cacct.md
  • docs/en/command/ceff.md
  • src/CraneCtld/CraneCtld.cpp
  • docs/zh/command/creport.md
🧰 Additional context used
🧠 Learnings (8)
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/RpcService/CtldGrpcServer.h
  • src/CraneCtld/DbClient.h
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/RpcService/CtldGrpcServer.h
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/RpcService/CtldGrpcServer.h
  • src/CraneCtld/DbClient.h
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.

Applied to files:

  • src/Utilities/PublicHeader/String.cpp
  • src/Utilities/PublicHeader/include/crane/String.h
  • src/CraneCtld/DbClient.h
📚 Learning: 2025-05-09T01:54:39.465Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1509-1517
Timestamp: 2025-05-09T01:54:39.465Z
Learning: The CraneSched project uses C++23, which supports Class Template Argument Deduction (CTAD) for standard containers like std::set and includes ranges support, making std::ranges::views::keys valid without additional headers.

Applied to files:

  • src/CraneCtld/DbClient.h
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/CraneCtld/DbClient.h
📚 Learning: 2025-04-02T10:11:33.562Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/CgroupManager.cpp:685-685
Timestamp: 2025-04-02T10:11:33.562Z
Learning: In the CgroupManager's GetJobBpfMapCgroupsV2 method, the developer has confirmed that cg_ino_job_id_map will always contain the key->cgroup_id element, making the CRANE_ASSERT check appropriate rather than requiring additional error handling.

Applied to files:

  • src/CraneCtld/DbClient.cpp
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.

Applied to files:

  • src/CraneCtld/DbClient.cpp
🧬 Code graph analysis (6)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (2)
src/CraneCtld/DbClient.cpp (4)
  • QueryJobSummary (2712-2816)
  • QueryJobSummary (2712-2714)
  • QueryJobSizeSummary (2818-2943)
  • QueryJobSizeSummary (2818-2820)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_runtime_status (166-166)
src/Utilities/PublicHeader/include/crane/String.h (1)
src/Utilities/PublicHeader/String.cpp (10)
  • FormatTime (544-550)
  • FormatTime (544-544)
  • TryPushHours (618-635)
  • TryPushHours (618-619)
  • TryPushMonths (555-584)
  • TryPushMonths (555-556)
  • TryPushDays (587-615)
  • TryPushDays (587-588)
  • EfficientSplitTimeRange (638-656)
  • EfficientSplitTimeRange (638-639)
src/CraneCtld/RpcService/CtldGrpcServer.h (1)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (6)
  • QueryJobSummary (2058-2066)
  • QueryJobSummary (2058-2061)
  • QueryJobSizeSummary (2068-2080)
  • QueryJobSizeSummary (2068-2071)
  • ActiveAggregationManually (2082-2097)
  • ActiveAggregationManually (2082-2085)
scripts/wipe_data.sh (1)
scripts/wipe_data.py (1)
  • wipe_collection (51-59)
src/CraneCtld/DbClient.h (1)
src/CraneCtld/DbClient.cpp (48)
  • MongodbClient (3057-3081)
  • MongodbClient (3083-3086)
  • Init (3009-3032)
  • Init (3009-3009)
  • InitTableIndexes (2972-3007)
  • InitTableIndexes (2972-2972)
  • CreateCollectionIndex (2081-2103)
  • CreateCollectionIndex (2081-2082)
  • RollupTypeToString (1896-1907)
  • RollupTypeToString (1896-1896)
  • UpdateSummaryLastSuccessTimeSec (1909-1943)
  • UpdateSummaryLastSuccessTimeSec (1909-1910)
  • GetSummaryLastSuccessTimeTm (1945-1992)
  • GetSummaryLastSuccessTimeTm (1945-1946)
  • NeedRollup (1994-2011)
  • NeedRollup (1994-1995)
  • RollupSummary (2013-2047)
  • RollupSummary (2013-2013)
  • ClusterRollupUsage (2049-2079)
  • ClusterRollupUsage (2049-2049)
  • GetCpusAlloc (2657-2666)
  • GetCpusAlloc (2657-2657)
  • GetTotalCpuTime (2668-2677)
  • GetTotalCpuTime (2668-2668)
  • GetTotalCount (2679-2686)
  • GetTotalCount (2679-2679)
  • WriteReply (2688-2710)
  • WriteReply (2688-2691)
  • QueryJobSizeSummaryByJobIds (2368-2520)
  • QueryJobSizeSummaryByJobIds (2368-2370)
  • JobQueryMatch (2522-2566)
  • JobQueryMatch (2522-2524)
  • QueryJobSummary (2712-2816)
  • QueryJobSummary (2712-2714)
  • JobSizeQueryMatch (2568-2632)
  • JobSizeQueryMatch (2568-2570)
  • QueryJobSizeSummary (2818-2943)
  • QueryJobSizeSummary (2818-2820)
  • AppendUnionWithRanges (2635-2655)
  • AppendUnionWithRanges (2635-2638)
  • AggregateJobSummaryByHour (2105-2225)
  • AggregateJobSummaryByHour (2105-2107)
  • AggregateJobSummaryByDayOrMonth (2227-2366)
  • AggregateJobSummaryByDayOrMonth (2227-2230)
  • MongoDbSumaryTh_ (3034-3055)
  • MongoDbSumaryTh_ (3034-3035)
  • MillisecondsToNextHour (2945-2970)
  • MillisecondsToNextHour (2945-2945)
src/CraneCtld/DbClient.cpp (1)
src/CraneCtld/DbClient.h (3)
  • string (71-73)
  • string (377-426)
  • MongodbClient (93-221)
🪛 YAMLlint (1.37.1)
mkdocs.yaml

[error] 190-190: duplication of key "nav" in mapping

(key-duplicates)

🔇 Additional comments (27)
docs/en/command/ccancel.md (1)

212-212: LGTM! Documentation cross-reference added correctly.

The addition of the creport reference in the See Also section is consistent with the PR's objective to introduce creport documentation and integrate it across command references.

scripts/wipe_data.sh (2)

33-47: LGTM! Batch import functions implemented correctly.

The two batch import functions properly distinguish between array-format JSON (batch_insert_collection using --jsonArray) and line-delimited JSON (batch_insert_collection_process), which aligns with standard mongoimport usage patterns.


82-89: LGTM! Mode 10 implementation includes proper error handling.

The batch import for mode 10 correctly checks the exit status and propagates failures, addressing previous review concerns about silent failures.

docs/en/command/cacct.md (1)

277-277: LGTM! Documentation cross-reference added.

docs/en/command/cbatch.md (1)

339-339: LGTM! Documentation cross-reference added.

src/CraneCtld/RpcService/CtldGrpcServer.h (1)

384-397: LGTM! RPC method declarations are correctly defined.

The three new streaming and unary RPC methods follow the established patterns in this service implementation:

  • Proper use of ServerWriter<> for streaming responses
  • Consistent parameter naming and types
  • Correct override specifiers
src/CraneCtld/RpcService/CtldGrpcServer.cpp (3)

2058-2066: LGTM! QueryJobSummary correctly implements readiness check and delegation.

The handler properly checks server readiness and delegates to the database client. Per CraneSched conventions, returning Status::OK with errors communicated in the payload is correct.

If reporting should be restricted to admin users (similar to cacctmgr operations), consider adding CheckCertAndUIDAllowed_ and CheckUidIsAdmin checks. However, if job summaries are intended to be accessible to all authenticated users, the current implementation is appropriate.


2068-2080: LGTM! QueryJobSizeSummary correctly handles readiness and conditional delegation.

The handler includes the readiness guard and properly dispatches to different query methods based on whether specific job IDs are filtered.

Same optional consideration as QueryJobSummary regarding access control—verify whether this should be admin-only or available to all authenticated users.


2082-2097: Fix authorization bypass in ActiveAggregationManually.

The current implementation has a critical security flaw: when uid != 0, it sets ok=false and an error reason (lines 2089-2091), but then continues to execute ClusterRollupUsage() (line 2093) and unconditionally overwrites the response with ok=true (line 2094). This allows unauthorized users to trigger expensive cluster-wide rollups while receiving a success response.

Based on learnings, align with the established admin RPC pattern:

 grpc::Status CraneCtldServiceImpl::ActiveAggregationManually(
     ::grpc::ServerContext *context,
     const ::crane::grpc::ActiveAggregationManuallyRequest *request,
     ::crane::grpc::ActiveAggregationManuallyReply *response) {
   if (!g_runtime_status.srv_ready.load(std::memory_order_acquire))
     return {grpc::StatusCode::UNAVAILABLE, "CraneCtld Server is not ready"};
-  if (request->uid() != 0) {
+
+  // Verify TLS certificate matches UID
+  if (auto msg = CheckCertAndUIDAllowed_(context, request->uid()); msg)
+    return {grpc::StatusCode::UNAUTHENTICATED, msg.value()};
+
+  // Require admin privileges
+  auto res = g_account_manager->CheckUidIsAdmin(request->uid());
+  if (!res) {
     response->set_ok(false);
-    response->set_reason("Only the root user can perform this operation");
+    response->set_reason(CraneErrStr(res.error()));
+    return grpc::Status::OK;
   }
+
+  // Authorized: perform manual rollup
   g_db_client->ClusterRollupUsage();
   response->set_ok(true);
-
+  response->clear_reason();
   return grpc::Status::OK;
 }

This ensures:

  1. TLS certificate is validated and matches the claimed UID
  2. Admin privilege check is enforced via CheckUidIsAdmin
  3. Rollup only executes when authorized
  4. Error responses are not overwritten
  5. Consistent with other admin RPCs in this service

Based on learnings

⛔ Skipped due to learnings
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
docs/zh/command/cqueue.md (1)

208-213: The review comment is incorrect and should be rejected.

The actual file structure shows that NO .zh.md files exist anywhere in the documentation — the naming convention is simply .md for all files. The current state of docs/zh/command/cqueue.md contains:

  • Lines 208-212: References to non-existent files (e.g., cbatch.zh.md, crun.zh.md, etc.)
  • Line 213: Correct reference to creport.md (the file that actually exists)

The review suggests changing line 213 from creport.md to creport.zh.md, but this would create a seventh broken link rather than fix an inconsistency. The actual issue is that lines 208-212 contain broken links that should be corrected to use .md instead of .zh.md.

Line 213 is already correct and should not be modified per the review's suggestion.

Likely an incorrect or invalid review comment.

protos/Crane.proto (4)

987-998: LGTM!

The QueryJobSummary message definitions are well-structured with appropriate filter fields and support for streaming responses via repeated items.


1000-1015: LGTM!

The QueryJobSizeSummary messages support comprehensive filtering including the new nodename_list field, enabling fine-grained queries by node allocation.


1017-1024: LGTM!

The ActiveAggregationManually messages correctly include the uid field for authentication, addressing the past review concern.


1078-1080: LGTM!

The new streaming RPCs are appropriately defined with server-side streaming for queries and unary call for manual aggregation trigger.

src/CraneCtld/DbClient.cpp (9)

1555-1558: LGTM!

The nodename_list field is correctly integrated into the task document schema, maintaining consistency between the field array, value tuple, and documentation comments.

Also applies to: 1567-1589, 1600-1640


1663-1666: LGTM!

The nodename_list integration in TaskInCtldToDocument_ mirrors the EmbeddedDb variant correctly.

Also applies to: 1679-1698, 1709-1739


1896-2079: LGTM!

The rollup infrastructure is well-architected with proper:

  • Timestamp tracking using MongoDB $max operator to prevent backward movement
  • Graceful fallback to default time when no prior rollup exists
  • Correct time comparison logic for hour/day/month boundaries
  • Thread-safe execution with mutex protection

2105-2225: LGTM!

The aggregation pipelines correctly:

  • Multiply per-node cpus_alloc by nodes_alloc to compute total allocated CPUs (confirmed correct in past reviews)
  • Calculate cpus_time as total CPUs × duration
  • Include nodename_list in grouping for node-level granularity
  • Upsert results with replace semantics to avoid duplicates

Also applies to: 2227-2366


2368-2520: LGTM!

The QueryJobSizeSummaryByJobIds implementation correctly:

  • Multiplies per-node cpus_alloc by nodes_alloc for raw task table (confirmed correct in past reviews)
  • Handles bucket assignment including jobs smaller than the first threshold (past review issue addressed at line 2492)
  • Filters by nodename_list when specified

2522-2710: LGTM!

The query helpers demonstrate good engineering practices:

  • Type-safe BSON field extraction with fallback defaults
  • Flexible template-based pipeline construction
  • Batched streaming to prevent large message sizes

2712-2943: LGTM!

The QueryJobSummary and QueryJobSizeSummary implementations demonstrate excellent design:

  • Efficient time range splitting to minimize data access
  • Seamless querying across hour/day/month collections via $unionWith
  • Consistent bucket handling (past review issue addressed at line 2930)
  • Streaming responses with batching for scalability

2945-3032: LGTM!

The initialization sequence is correct:

  • MillisecondsToNextHour properly calculates delay to next hour boundary
  • InitTableIndexes creates comprehensive indexes for efficient queries
  • Init() correctly assigns the timer handle to the class member (line 3015), ensuring it remains alive for the lifetime of the client

3034-3055: LGTM!

The background thread and cleanup logic are thread-safe:

  • Idle loop properly checks stop flag before closing handles
  • Destructor correctly signals stop and joins thread

Also applies to: 3083-3086

src/CraneCtld/DbClient.h (4)

22-22: LGTM!

The new include and type definitions are well-structured:

  • Crane.grpc.pb.h provides necessary gRPC types
  • RollupType enum cleanly represents aggregation granularities
  • JobSizeSummaryKey implements proper hash and equality for AbslHash compatibility

Also applies to: 107-127


129-131: LGTM!

Standard lifecycle methods with proper Init() for two-phase initialization.


372-379: LGTM!

The new collection name constants follow a consistent naming convention, and the batch size of 5000 is reasonable for streaming responses.


388-393: LGTM!

The new private members properly support the rollup infrastructure:

  • Mutex for thread-safe rollup execution
  • Background thread for periodic aggregation
  • Atomic stop flag for safe shutdown
  • Timer handle stored as class member (correctly addresses past review)

Comment on lines +190 to +233
nav:
- Home: index.md
- Deployment:
- Overview: deployment/index.md
- Frontend:
- Deployment: deployment/frontend/frontend.md
- Plugins: deployment/frontend/plugins.md
- Backend:
- Rocky Linux 9: deployment/backend/Rocky9.md
- Ubuntu: deployment/backend/Ubuntu.md
- CentOS 7: deployment/backend/CentOS7.md
- Packaging Guide: deployment/backend/packaging.md
- eBPF Support: deployment/backend/eBPF.md
- Configuration:
- Cluster Configuration: deployment/configuration/config.md
- Database Configuration: deployment/configuration/database.md
- PAM Module: deployment/configuration/pam.md
- Multi-node Deployment: deployment/configuration/multi-node.md
- Commands:
- Job Submission:
- cbatch: command/cbatch.md
- crun: command/crun.md
- calloc: command/calloc.md
- Job Monitoring:
- cqueue: command/cqueue.md
- cinfo: command/cinfo.md
- ceff: command/ceff.md
- cacct: command/cacct.md
- creport: command/creport.md
- Job Control:
- ccancel: command/ccancel.md
- ccontrol: command/ccontrol.md
- Account Management:
- cacctmgr: command/cacctmgr.md
- Reference:
- Exit Codes: reference/exit_code.md
- Use Cases:
- Overview: reference/use_cases.md
- Multi-Node Parallel Job: reference/use_case_multi_node.md
- GPU-Accelerated Training: reference/use_case_gpu.md
# - Parameter Sweep with Job Arrays: reference/use_case_param_sweep.md
# - Chained Jobs with Dependencies: reference/use_case_pipeline.md

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Remove duplicate nav key definition.

YAML does not allow duplicate keys at the same level. The second nav: block (lines 190-232) will override the first one (lines 48-92), causing the loss of the original navigation structure. This is a critical configuration error flagged by YAMLlint.

The second block appears to be a copy-paste with the intent to add creport at line 218. Instead, add only the creport entry to the original nav block.

# Around line 75 in the FIRST nav block (after cacct):
     - Job Monitoring:
         - cqueue: command/cqueue.md
         - cinfo: command/cinfo.md
         - ceff: command/ceff.md
         - cacct: command/cacct.md
+        - creport: command/creport.md
     - Job Control:

Then delete lines 189-232 (the entire duplicate nav block including the empty line 189).

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 YAMLlint (1.37.1)

[error] 190-190: duplication of key "nav" in mapping

(key-duplicates)

🤖 Prompt for AI Agents
In mkdocs.yaml around lines 190 to 232 there is a duplicate top-level "nav"
block that overrides the original navigation; remove the entire duplicate block
(lines 189-232) and instead insert the missing "creport: command/creport.md"
entry into the original nav's "Job Monitoring" section so only one "nav:" exists
and the creport page is preserved.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (3)
docs/en/command/creport.md (1)

146-147: Fix typos and spacing in field descriptions.

A few small doc issues remain:

  • Line 147: add a space after the colon for consistency with other fields: **WCKEY**: wckey name.
  • Lines 240, 280, 320: add a space after the colon in **250-499 CPUs**: CPU minutes in the 250-499 CPUs range.
  • Line 210: PLANEED looks like a typo; change the label to PLANNED (or another intended word) to avoid confusion.

Also applies to: 210-212, 236-243, 276-283, 318-323

scripts/wipe_data.sh (1)

3-3: Update usage string to match implemented modes and fix spacing.

The usage message is still inconsistent and omits the documented import mode. Please align it with the actual modes and spacing, e.g.:

-echo "Usage: $0 mode(1:acct_table | 2:qos_table | 3:task_table | 4:user_table | 5:all | 6:acct_table+qos_table+user_table| 7:hour_table+day_table+month_table |8: summary_time_table)"
+echo "Usage: $0 mode(1:acct_table | 2:qos_table | 3:task_table | 4:user_table | 5:all | 6:acct_table+qos_table+user_table | 7:hour_table+day_table+month_table | 8:summary_time_table | 10:import_tasks)"
mkdocs.yaml (1)

191-233: Remove duplicate top-level nav block (YAML key conflict).

There are two top-level nav: mappings in this file; the second one (lines 191–233) overrides the first and breaks MkDocs/YAML parsing. Since the first nav already contains creport: command/creport.md, delete the entire second nav: block and keep only the primary one.

If you need any items unique to the second block, merge them into the first instead of defining nav twice.

🧹 Nitpick comments (1)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)

2058-2100: New creport RPC handlers look correct and consistent with existing patterns.

  • QueryJobSummary / QueryJobSizeSummary correctly gate on srv_ready and stream via the DbClient helpers.
  • ActiveAggregationManually now does TLS+UID binding and enforces uid == 0 before calling ClusterRollupUsage, with clear ok / reason in the reply and early return for non-root callers.

If you later introduce non-root admin roles for rollups, consider switching the root check to g_account_manager->CheckUidIsAdmin(...), but the current behavior matches the “root-only” docs.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e164202 and e54ed84.

📒 Files selected for processing (5)
  • docs/en/command/creport.md (1 hunks)
  • mkdocs.yaml (2 hunks)
  • scripts/wipe_data.sh (3 hunks)
  • src/CraneCtld/DbClient.cpp (11 hunks)
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-04-02T10:11:33.562Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/CgroupManager.cpp:685-685
Timestamp: 2025-04-02T10:11:33.562Z
Learning: In the CgroupManager's GetJobBpfMapCgroupsV2 method, the developer has confirmed that cg_ino_job_id_map will always contain the key->cgroup_id element, making the CRANE_ASSERT check appropriate rather than requiring additional error handling.

Applied to files:

  • src/CraneCtld/DbClient.cpp
📚 Learning: 2025-05-08T09:35:39.809Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/Pmix/PmixCollRing.cpp:0-0
Timestamp: 2025-05-08T09:35:39.809Z
Learning: In the PMIx implementation for CraneSched, objects referenced in asynchronous gRPC callbacks (like `coll_ring_ctx`) remain valid as long as the parent object (`this`) is not destroyed. The `Coll` class uses shared_ptr management to ensure its lifetime extends through all pending callbacks.

Applied to files:

  • src/CraneCtld/DbClient.cpp
🧬 Code graph analysis (3)
scripts/wipe_data.sh (1)
scripts/wipe_data.py (1)
  • wipe_collection (51-59)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (2)
src/CraneCtld/DbClient.cpp (4)
  • QueryJobSummary (2713-2817)
  • QueryJobSummary (2713-2715)
  • QueryJobSizeSummary (2819-2944)
  • QueryJobSizeSummary (2819-2821)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_runtime_status (166-166)
src/CraneCtld/DbClient.cpp (1)
src/CraneCtld/DbClient.h (3)
  • string (71-73)
  • string (377-426)
  • MongodbClient (93-221)
🪛 YAMLlint (1.37.1)
mkdocs.yaml

[error] 191-191: duplication of key "nav" in mapping

(key-duplicates)

🔇 Additional comments (1)
src/CraneCtld/DbClient.cpp (1)

1555-1644: Rollup, nodename_list, and summary query implementation look solid.

  • nodename_list is consistently added to task documents and carried through hourly/day/month summaries, with matching filter support and composite indexes; this aligns with the new CLI filters.
  • The rollup workflow (RollupTypeToString, Get/UpdateSummaryLastSuccessTime*, NeedRollup, RollupSummary, ClusterRollupUsage, AggregateJobSummaryByHour/DayOrMonth) correctly advances the watermark only after full intervals and uses $merge with whenMatched: "replace" to keep summaries idempotent.
  • QueryJobSummary and both job-size summary paths (QueryJobSizeSummaryByJobIds and QueryJobSizeSummary) correctly split time ranges across hour/day/month collections, use union-with pipelines, and stream batched replies with safe numeric extraction (GetCpusAlloc, GetTotalCpuTime, GetTotalCount, WriteReply).
  • The uvw-based background timer in Init plus MongoDbSumaryTh_ and the destructor cleanly manage the rollup thread and event loop lifetime.

I don’t see correctness or concurrency issues in these additions; they’re ready to ship.

Also applies to: 1897-2080, 2230-2367, 2369-2521, 2693-2944, 2946-3033, 3035-3087

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/CraneCtld/DbClient.cpp (1)

1571-1590: Persist wckey into task documents so rollups and filters can work.

I don’t see a "wckey" field in either TaskInEmbeddedDbToDocument_ or TaskInCtldToDocument_, but the rollup pipelines and query helpers expect it:

  • Hourly rollup does $addFields: { wckey: { $ifNull: ["$wckey", ""] } }.
  • Day/month rollups group and project wckey.
  • QueryJobSummary, QueryJobSizeSummary, and QueryJobSizeSummaryByJobIds all filter and/or group by wckey.

Without actually persisting wckey on the job documents, all summaries will end up with wckey == "", and any filtering by wckey will be ineffective. Please either:

  • Add wckey into the task document field arrays/tuples and populate it from the source RPC structs, or
  • Adjust the rollup/query code to derive wckey from whatever field currently carries that information (e.g., inside extra_attr) and store it explicitly before aggregation.

Also applies to: 1680-1699, 2145-2151, 2173-2178, 2303-2315, 2456-2464, 2559-2565, 2604-2610

🧹 Nitpick comments (2)
src/CraneCtld/DbClient.cpp (2)

2523-2567: Consider supporting partition/nodename filters for QueryJobSummary as well.

JobSizeQueryMatch supports partition_name and nodename_list, but JobQueryMatch only honors account/user/qos/wckey. If creport’s job-summary API is expected to filter by partition or nodename too, you may want to mirror those predicates here for consistency; otherwise this asymmetry is fine.

Also applies to: 2569-2633, 2635-2656


2946-2971: Timer/loop lifetime is fixed; double‑check uvw threading assumptions.

Persisting mongodb_task_timer_handle as a member and shutting the loop down from MongoDbSumaryTh_ resolves the earlier issue where the timer died when Init() returned.

One thing to verify: the timer handle is created and start() is called on the thread that runs Init(), but uvw_loop->run() executes on m_mongodb_sum_thread_. If uvw/libuv require handles to be created/started on the same thread as the running loop, you may want to move timer creation into MongoDbSumaryTh_ (or otherwise ensure all loop interactions happen on the loop thread).

Also applies to: 3010-3033, 3035-3056, 3084-3087

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e54ed84 and 58b6136.

📒 Files selected for processing (1)
  • src/CraneCtld/DbClient.cpp (11 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-04-02T10:11:33.562Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 461
File: src/Craned/CgroupManager.cpp:685-685
Timestamp: 2025-04-02T10:11:33.562Z
Learning: In the CgroupManager's GetJobBpfMapCgroupsV2 method, the developer has confirmed that cg_ino_job_id_map will always contain the key->cgroup_id element, making the CRANE_ASSERT check appropriate rather than requiring additional error handling.

Applied to files:

  • src/CraneCtld/DbClient.cpp
🔇 Additional comments (4)
src/CraneCtld/DbClient.cpp (4)

1555-1644: Nodename_list field is correctly plumbed through task documents and summaries.

The array construction and added field/tuple entries line up, and downstream rollups/indexes already consume nodename_list, so this extension looks sound.

Also applies to: 1664-1742


1897-2012: Rollup watermarking and orchestration logic look correct.

last_success_time is kept monotonic with $max, failed intervals are retried (since cur_start doesn’t advance on exceptions), and NeedRollup gates work on the intended granularity. No functional issues spotted here.

Also applies to: 2014-2080


2106-2367: Hourly/day/month aggregation and index wiring are internally consistent.

Windowing, grouping keys (including nodename_list), and the created indexes all match the way summaries and queries use these collections; the watermark update at the end of each rollup uses the correct cur_start.

Also applies to: 2973-3008


2658-2711: Summary query and bucketing logic are solid.

Time-range splitting, $unionWith usage, numeric extraction helpers, and the revised CPU-size bucketing all behave as expected, avoiding the earlier “small jobs dropped” issue and producing stable streaming batches.

Also applies to: 2713-2817, 2819-2944

Comment on lines +2369 to +2521
bool MongodbClient::QueryJobSizeSummaryByJobIds(
const crane::grpc::QueryJobSizeSummaryRequest* request,
grpc::ServerWriter<::crane::grpc::QueryJobSizeSummaryReply>* stream) {
document filter;
auto grouping_list = request->filter_grouping_list();

if (request->has_filter_start_time()) {
int64_t start_time_sec = request->filter_start_time().seconds();
filter.append(kvp("time_start", [&](sub_document time_start_doc) {
time_start_doc.append(kvp("$gte", start_time_sec));
}));
}

if (request->has_filter_end_time()) {
int64_t end_time_sec = request->filter_end_time().seconds();
filter.append(kvp("time_end", [&](sub_document time_end_doc) {
time_end_doc.append(kvp("$lte", end_time_sec));
}));
}

bool has_accounts_constraint = !request->filter_accounts().empty();
if (has_accounts_constraint) {
filter.append(kvp("account", [&request](sub_document account_doc) {
array account_array;
for (const auto& account : request->filter_accounts()) {
account_array.append(account);
}
account_doc.append(kvp("$in", account_array));
}));
}

bool has_users_constraint = !request->filter_users().empty();
if (has_users_constraint) {
filter.append(kvp("username", [&request](sub_document user_doc) {
array user_array;
for (const auto& user : request->filter_users()) {
user_array.append(user);
}
user_doc.append(kvp("$in", user_array));
}));
}

bool has_qos_constraint = !request->filter_qoss().empty();
if (has_qos_constraint) {
filter.append(kvp("qos", [&request](sub_document qos_doc) {
array qos_array;
for (const auto& qos : request->filter_qoss()) {
qos_array.append(qos);
}
qos_doc.append(kvp("$in", qos_array));
}));
}

bool has_task_ids_constraint = !request->filter_job_ids().empty();
if (has_task_ids_constraint) {
filter.append(kvp("task_id", [&request](sub_document task_id_doc) {
array task_id_array;
for (const auto& task_id : request->filter_job_ids()) {
task_id_array.append(static_cast<std::int32_t>(task_id));
}
task_id_doc.append(kvp("$in", task_id_array));
}));
}

bool has_partitions_constraint = !request->filter_partitions().empty();
if (has_partitions_constraint) {
filter.append(kvp("partition_name", [&request](sub_document partition_doc) {
array partition_array;
for (const auto& partition : request->filter_partitions()) {
partition_array.append(partition);
}
partition_doc.append(kvp("$in", partition_array));
}));
}

bool has_nodename_list_constraint = !request->filter_nodename_list().empty();
if (has_nodename_list_constraint) {
filter.append(
kvp("nodename_list", [&request](sub_document nodename_list_doc) {
array nodename_list_array;
for (const auto& nodename : request->filter_nodename_list()) {
nodename_list_array.append(nodename);
}
nodename_list_doc.append(kvp("$in", nodename_list_array));
}));
}

bool has_wckeys_constraint = !request->filter_wckeys().empty();
if (has_wckeys_constraint) {
filter.append(kvp("wckey", [&request](sub_document wckey_doc) {
array wckey_array;
for (const auto& wckey : request->filter_wckeys()) {
wckey_array.append(wckey);
}
wckey_doc.append(kvp("$in", wckey_array));
}));
}

absl::flat_hash_map<JobSizeSummaryKey, JobSizeSummaryResult> agg_map;
try {
mongocxx::cursor cursor =
(*GetClient_())[m_db_name_][m_task_collection_name_].find(
filter.view());
for (auto view : cursor) {
std::string acc = std::string(view["account"].get_string().value);
std::string wk =
view["wckey"] ? std::string(view["wckey"].get_string().value) : "";
uint32_t cpus_alloc =
static_cast<uint32_t>(view["cpus_alloc"].get_double().value *
view["nodes_alloc"].get_int32().value);
int64_t time_start = view["time_start"].get_int64().value;
int64_t time_end = view["time_end"].get_int64().value;
double total_cpu_time = (time_end - time_start) * cpus_alloc;
if (grouping_list.empty()) {
agg_map[{acc, wk, cpus_alloc}].total_cpu_time += total_cpu_time;
agg_map[{acc, wk, cpus_alloc}].total_count += 1;
} else {
int group_index = 0;
for (const auto group : grouping_list) {
if (cpus_alloc < group) {
break;
}
group_index++;
}
uint32_t bucket = grouping_list[group_index == 0 ? 0 : group_index - 1];
agg_map[{acc, wk, bucket}].total_cpu_time += total_cpu_time;
agg_map[{acc, wk, bucket}].total_count += 1;
}
}
} catch (const bsoncxx::exception& e) {
CRANE_LOGGER_ERROR(m_logger_, e.what());
}

crane::grpc::QueryJobSizeSummaryReply reply;
for (const auto& kv : agg_map) {
crane::grpc::JobSizeSummaryItem item;
item.set_cluster(g_config.CraneClusterName);
item.set_account(kv.first.account);
item.set_wckey(kv.first.wckey);
item.set_cpus_alloc(kv.first.cpus_alloc);
item.set_total_cpu_time(kv.second.total_cpu_time);
item.set_total_count(kv.second.total_count);
reply.add_item_list()->CopyFrom(item);
if (reply.item_list_size() >= MaxJobSummaryBatchSize) {
stream->Write(reply);
reply.clear_item_list();
}
}
if (reply.item_list_size() > 0) {
stream->Write(reply);
}
return true;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Filter out step‑only documents in QueryJobSizeSummaryByJobIds.

This query runs on the raw task collection but doesn’t filter on has_job_info. Step‑only documents inserted by InsertSteps lack fields like account, cpus_alloc, time_start, etc., so accessing them here can throw and be caught, but you’ll get log noise and silently missing jobs.

To align with FetchJobRecords and ensure only fully populated jobs are aggregated, add has_job_info == true to the filter:

 bool MongodbClient::QueryJobSizeSummaryByJobIds(
@@
-  bool has_wckeys_constraint = !request->filter_wckeys().empty();
+  bool has_wckeys_constraint = !request->filter_wckeys().empty();
   if (has_wckeys_constraint) {
@@
       wckey_doc.append(kvp("$in", wckey_array));
     }));
   }
+
+  // Only aggregate documents with complete job information.
+  filter.append(kvp("has_job_info", true));
🤖 Prompt for AI Agents
In src/CraneCtld/DbClient.cpp around lines 2369 to 2521, the query builds a
Mongo filter but does not exclude "step-only" documents lacking job fields,
causing exceptions and noisy logs; add a filter clause requiring has_job_info ==
true (same approach as other filters: append a kvp("has_job_info",
[&](sub_document d){ d.append(kvp("$eq", true)); }) or equivalent boolean
equality) before performing the find so only fully populated job documents are
aggregated.

```
### Sample Output

![creport](../../images/creport/topusr.png)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

将图片的文字内容用代码块格式展示,不再上传原始图片

- Use Cases:
- Overview: reference/use_cases.md
- Multi-Node Parallel Job: reference/use_case_multi_node.md
- GPU-Accelerated Training: reference/use_case_gpu.md
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这一段是为什么

return std::string(Internal::CraneStepStatusStrArr[static_cast<int>(status)]);
}

std::string FormatTime(std::time_t t) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

为什么用time_t?不用chrono?


struct TimeRange {
std::time_t start;
std::time_t end;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

用chrono


## Display User Resource Consumption Ranking
```bash
creport user topusage [--start-time=...] [--end-time=...] [--account=...] ...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

学习下ccontorl的文档结构吧

}

// Try to merge consecutive full-day intervals until the next month's first day.
bool TryPushDays(std::vector<TimeRange> &result, std::time_t &cur,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个东西和string有什么关系?为什么放这里面,而且这个东西不会在其他地方用到吧?

item.set_total_count(GetTotalCount(doc));
item.set_total_cpu_time(GetTotalCpuTime(doc));

reply.add_item_list()->CopyFrom(item);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里直接Swap就行,

std::time_t end);
bool TryPushDays(std::vector<TimeRange> &result, std::time_t &cur,
std::time_t end);
std::vector<TimeRange> EfficientSplitTimeRange(std::time_t start,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

注释下吧,
Convert a time duration to minimal Month/Day/Hour time ranges.

}
}

void MongodbClient::QueryJobSizeSummary(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个没有过滤?和QueryJobSizeSummaryByJobIds有什么区别

e.what());
}

UpdateSummaryLastSuccessTimeSec(RollupType::HOUR, cur_start);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个应该成功才调用?,其他的聚合同理

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants