Skip to content

Conversation

@huerni
Copy link
Collaborator

@huerni huerni commented Dec 23, 2025

Summary by CodeRabbit

  • New Features

    • Optional Lua scripting to validate/modify job submissions before scheduling; submissions may return descriptive reason messages.
    • Per-task environment variable propagation and a new Lua-related error code.
  • Utilities

    • Added time-string parser to convert durations to minutes.
  • Documentation

    • Comprehensive Lua Guide added (English & Chinese) with examples.
  • Configuration

    • New optional JobSubmitLuaScript configuration entry.

✏️ Tip: You can customize this high-level summary in your review settings.

@github-actions github-actions bot added the docs label Dec 23, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 23, 2025

📝 Walkthrough

Walkthrough

Adds optional Lua-based job-submit/modify hooks: CMake build switches and sol2 fetch, a thread-pooled Lua environment, LuaJobHandler exposing CTLD types to scripts, integration points in scheduler and RPC flows to run async Lua checks, protobuf extensions (task env map and Lua error), and docs/config examples.

Changes

Cohort / File(s) Summary
Build / Dependencies
CMakeLists.txt, dependencies/cmake/CMakeLists.txt, dependencies/cmake/lua_sol/CMakeLists.txt, src/CraneCtld/CMakeLists.txt, src/Utilities/PublicHeader/CMakeLists.txt
Adds CRANE_WITH_LUA option and find_package(Lua) with public cache vars LUA_INCLUDE_DIR/LUA_LIBRARIES; adds lua_sol subproject fetching sol2; conditionally links/includes Lua and sol2 into targets.
Protobufs
protos/Crane.proto, protos/PublicDefs.proto
Adds reason string field to SubmitBatchTaskReply; adds map<string,string> env to TaskInfo; extends ErrCode with ERR_LUA_FAILED = 90.
Ctld configuration & init
src/CraneCtld/CraneCtld.cpp, src/CraneCtld/CtldPublicDefs.h, src/CraneCtld/CtldPublicDefs.cpp, etc/config.yaml
Adds JobSubmitLuaScript config field; initializes g_lua_pool when configured; copies env into TaskInfo; adds commented config example.
Lua core & pool
src/Utilities/PublicHeader/include/crane/Lua.h, src/Utilities/PublicHeader/Lua.cpp, related CMakeLists
Adds crane::LuaEnvironment (Init, LoadLuaScript, function/enum registration) and crane::LuaPool (thread-pooled ExecuteLuaScript -> future) with guarded HAVE_LUA code.
Lua job handler
src/CraneCtld/Lua/LuaJobHandler.h, src/CraneCtld/Lua/LuaJobHandler.cpp
New Ctld::LuaJobHandler implementing JobSubmit and JobModify, registering types, helpers, crane.reservations and partition filtering, returning CraneRichError (guarded by HAVE_LUA).
Scheduler & RPC integration
src/CraneCtld/TaskScheduler.h, src/CraneCtld/TaskScheduler.cpp, src/CraneCtld/RpcService/CtldGrpcServer.cpp
Adds JobSubmitLuaCheck and JobModifyLuaCheck optional async checks; RPC paths run Lua validation prior to scheduling and propagate richer error reasons.
Utilities / helpers
src/Utilities/PublicHeader/include/crane/String.h, src/Utilities/PublicHeader/String.cpp, src/Utilities/PublicHeader/include/crane/PublicHeader.h
Adds util::TimeStr2Mins function and inserts "Lua script validation failed" into the internal error string table.
Documentation / site
docs/en/reference/lua_guide.md, docs/zh/reference/lua_guide.md, mkdocs.yaml
Adds comprehensive English/Chinese Lua guide and exposes new nav entry in docs config.
Minor CTLD build & sources
src/CraneCtld/CMakeLists.txt, src/CraneCtld/CraneCtld.cpp, src/CraneCtld/CtldPublicDefs.*
Adds Lua sources/headers to cranectld build and conditionally wires include dirs/libs when Lua found.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Client
    participant RPC as CtldGrpcServer
    participant Scheduler
    participant LuaPool
    participant LuaEnv as LuaEnvironment
    participant Script as Lua Script

    Client->>RPC: SubmitTask(s)
    RPC->>Scheduler: JobSubmitLuaCheck(task)
    alt JobSubmitLuaScript configured
        Scheduler->>LuaPool: ExecuteLuaScript(callback -> uses LuaEnv + script)
        LuaPool->>LuaEnv: Init(script) / LoadLuaScript()
        LuaEnv->>Script: register types/functions & call crane_job_submit(job_desc, part_list)
        Script-->>LuaEnv: return success / error
        LuaEnv-->>LuaPool: CraneRichError (future)
        LuaPool-->>Scheduler: future<CraneRichError>
        Scheduler->>RPC: Lua result (ready)
    else No script configured
        Scheduler-->>RPC: std::nullopt (skip Lua)
    end

    alt Lua validation success
        RPC->>Scheduler: proceed with scheduling
        Scheduler-->>RPC: job accepted (job_id)
        RPC-->>Client: Success response
    else Lua validation failed
        RPC-->>Client: Error response with reason (ERR_LUA_FAILED / message)
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

enhancement, build

Suggested reviewers

  • L-Xiafeng
  • NamelessOIer
  • Nativu5

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 3.13% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title check ❓ Inconclusive The title 'feat:sol lua' is vague and generic, using only tool names without describing the actual feature or change being implemented. Consider using a more descriptive title that explains what Lua/sol2 integration enables, such as 'feat: Add Lua scripting support for job submission' or 'feat: Integrate sol2 for Lua-based job validation'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch dev/lua_sol

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)

637-747: ModifyTask now silently drops IDs that aren’t in Pd/Rn queues when Lua is enabled.

The new Lua gating logic in ModifyTask:

std::list<task_id_t> task_ids;
std::list<std::pair<task_id_t, std::future<CraneRichError>>> futures;
for (const auto task_id : request->task_ids()) {
  auto fut = g_task_scheduler->JobModifyLuaCheck(task_id);
  if (fut)
    futures.emplace_back(task_id, std::move(fut.value()));
}

for (auto& [task_id, fut] : futures) {
  auto rich_err = fut.get();
  if (rich_err.code() != CraneErrCode::SUCCESS) {
    response->add_not_modified_tasks(task_id);
    ... // add reason
  } else {
    task_ids.emplace_back(task_id);
  }
}

Only task IDs that both:

  • Are present in Pd/Rn queues (so JobModifyLuaCheck returns a future), and
  • Pass the Lua check,

end up in task_ids and are later processed by the time-limit / priority / hold branches.

Any requested task_id that is not in the pending or running maps causes JobModifyLuaCheck to return std::nullopt and is never:

  • Added to futures,
  • Added to task_ids, or
  • Added to either modified_tasks or not_modified_tasks.

Previously, ChangeTaskTimeLimit / ChangeTaskPriority / HoldReleaseTaskAsync would return ERR_NON_EXISTENT, and the RPC would report those jobs explicitly as not_modified with a reason. Now, such IDs can effectively disappear from the response, which is confusing for callers.

Coupled with JobModifyLuaCheck’s use of std::nullopt to mean both “Lua disabled” and “job not found”, this makes the behavior hard to reason about.

Suggestion:

  • Treat a std::nullopt return from JobModifyLuaCheck as “task not found in pending/running queues” and immediately push:
    • task_id into not_modified_tasks, and
    • A standardized reason like “Task #X was not found in running or pending queue.”
  • Or, change JobModifyLuaCheck’s signature to explicitly return a status (e.g., std::expected<std::optional<std::future<CraneRichError>>, CraneErrCode>) so ModifyTask can distinguish “Lua disabled” from “job not found”.

This keeps the ModifyTask API behavior consistent with pre-Lua semantics and avoids silently ignoring some requested IDs.

🧹 Nitpick comments (7)
CMakeLists.txt (1)

276-283: Simplify redundant cache variable assignments.

Lines 279-280 assign LUA_INCLUDE_DIR and LUA_LIBRARIES to themselves in the cache, which is redundant since find_package(Lua) already sets these as cache variables. Consider removing these lines:

🔎 Suggested simplification
 find_package(Lua QUIET)
 if(LUA_FOUND)
     add_definitions(-DHAVE_LUA)
-    set(LUA_INCLUDE_DIR ${LUA_INCLUDE_DIR} CACHE PATH "Lua include dir")
-    set(LUA_LIBRARIES ${LUA_LIBRARIES} CACHE STRING "Lua libraries")
     message(STATUS "LUA_INCLUDE_DIR = ${LUA_INCLUDE_DIR}")
     message(STATUS "LUA_LIBRARIES = ${LUA_LIBRARIES}")
 endif()
src/Utilities/PublicHeader/CMakeLists.txt (1)

13-16: Make Lua include directory a public usage requirement (or ensure it comes from sol2).

crane/Lua.h is listed in the PUBLIC headers of Utility_PublicHeader, and it’s expected to include Lua / sol2 headers internally. Currently:

target_include_directories(Utility_PublicHeader PUBLIC include)

if (${LUA_FOUND})
    target_include_directories(Utility_PublicHeader PRIVATE ${LUA_INCLUDE_DIR})
endif ()

If crane/Lua.h (or anything it includes) directly includes Lua headers (e.g. lua.hpp) and you rely on Utility_PublicHeader as a PUBLIC dependency in other targets, those consumers may not see ${LUA_INCLUDE_DIR} because it’s only a PRIVATE include dir here.

Unless sol2 already exposes ${LUA_INCLUDE_DIR} transitively, it would be safer to make this PUBLIC:

-if (${LUA_FOUND})
-    target_include_directories(Utility_PublicHeader PRIVATE ${LUA_INCLUDE_DIR})
-endif ()
+if (LUA_FOUND)
+    target_include_directories(Utility_PublicHeader PUBLIC ${LUA_INCLUDE_DIR})
+endif ()

(Also, using if (LUA_FOUND) instead of if (${LUA_FOUND}) avoids accidental string expansion issues.)

Please double-check that a consumer target which only does target_link_libraries(foo PUBLIC Utility_PublicHeader) can compile code including "crane/Lua.h" without manually adding ${LUA_INCLUDE_DIR}.

Also applies to: 17-26, 28-30

src/CraneCtld/Lua/LuaJobHandler.h (1)

32-60: Consider namespacing and centralizing the g_lua_pool global.

LuaJobHandler itself looks fine, but the global pool:

inline std::unique_ptr<crane::LuaPool> g_lua_pool;

is:

  • Defined in a header (inline, so one instance per process) and
  • Placed in the global namespace, separate from other Ctld-wide globals.

This works technically, but for clarity and encapsulation it would be cleaner to:

  • Move g_lua_pool into a namespace (e.g. namespace Ctld { ... }), and/or
  • Declare it as extern in the header and define it in a single .cpp (e.g. alongside other Ctld globals in CtldPublicDefs.cpp), keeping global state concentrated.

That makes initialization and lifetime easier to audit and keeps the global namespace tidy.

Also applies to: 64-64

src/Utilities/PublicHeader/Lua.cpp (2)

29-29: Redundant temporary construction.

std::make_unique<sol::state>(sol::state()) creates an unnecessary temporary. Use direct construction instead.

Suggested fix
-  m_lua_state_ptr_ = std::make_unique<sol::state>(sol::state());
+  m_lua_state_ptr_ = std::make_unique<sol::state>();

46-59: Code duplication with Init method.

The Lua state initialization logic (lines 52-58) duplicates the code in Init (lines 29-35). Consider extracting a private helper or calling Init with the stored script path to avoid duplication.

Also, line 52 has the same redundant construction pattern.

Suggested approach
 bool LuaEnvironment::LoadLuaScript(const std::vector<std::string>& req_funcs) {
   if (m_lua_state_ptr_ == nullptr) {
     CRANE_DEBUG(
         "Lua state (m_lua_state_) is null when loading script '{}'. "
         "This usually indicates Lua VM initialization failed.",
         m_lua_script_);
-    m_lua_state_ptr_ = std::make_unique<sol::state>(sol::state());
-    m_lua_state_ptr_->open_libraries(sol::lib::base, sol::lib::package,
-                                     sol::lib::table, sol::lib::string);
-    m_crane_table_ = (*m_lua_state_ptr_)["crane"].valid()
-                         ? (*m_lua_state_ptr_)["crane"]
-                         : (*m_lua_state_ptr_).create_named_table("crane");
-    RegisterFunctions_();
+    if (!Init(m_lua_script_)) {
+      return false;
+    }
   }
src/Utilities/PublicHeader/include/crane/Lua.h (1)

44-44: Minor: Extra space in return statement.

-  sol::state& GetLuaState() const { return  *m_lua_state_ptr_; }
+  sol::state& GetLuaState() const { return *m_lua_state_ptr_; }
src/CraneCtld/Lua/LuaJobHandler.cpp (1)

32-139: Consider extracting common Lua setup into a helper.

JobSubmit and JobModify share significant setup code: Lua environment initialization, registering global functions, types, and variables, and loading the script. Extracting this into a private helper would reduce duplication and improve maintainability.

Suggested approach
// Private helper
std::unique_ptr<crane::LuaEnvironment> LuaJobHandler::SetupLuaEnv_(
    const std::string& lua_script, CraneRichError* error) {
  auto lua_env = std::make_unique<crane::LuaEnvironment>();
  if (!lua_env->Init(lua_script)) {
    *error = FormatRichErr(CraneErrCode::ERR_LUA_FAILED,
                           "Failed to init lua environment");
    return nullptr;
  }
  RegisterGlobalFunctions_(*lua_env);
  RegisterTypes_(*lua_env);
  RegisterGlobalVariables_(*lua_env);
  if (!lua_env->LoadLuaScript(kReqFxns)) {
    *error = FormatRichErr(CraneErrCode::ERR_LUA_FAILED,
                           "Failed to load lua script");
    return nullptr;
  }
  return lua_env;
}
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5b4e575 and 734d3d0.

📒 Files selected for processing (24)
  • CMakeLists.txt
  • dependencies/cmake/CMakeLists.txt
  • dependencies/cmake/lua_sol/CMakeLists.txt
  • docs/en/reference/lua_guide.md
  • docs/zh/reference/lua_guide.md
  • etc/config.yaml
  • mkdocs.yaml
  • protos/Crane.proto
  • protos/PublicDefs.proto
  • src/CraneCtld/CMakeLists.txt
  • src/CraneCtld/CraneCtld.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/CraneCtld/Lua/LuaJobHandler.cpp
  • src/CraneCtld/Lua/LuaJobHandler.h
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/TaskScheduler.cpp
  • src/CraneCtld/TaskScheduler.h
  • src/Utilities/PublicHeader/CMakeLists.txt
  • src/Utilities/PublicHeader/Lua.cpp
  • src/Utilities/PublicHeader/String.cpp
  • src/Utilities/PublicHeader/include/crane/Lua.h
  • src/Utilities/PublicHeader/include/crane/PublicHeader.h
  • src/Utilities/PublicHeader/include/crane/String.h
🧰 Additional context used
🧠 Learnings (14)
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/Utilities/PublicHeader/include/crane/Lua.h
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/Lua/LuaJobHandler.cpp
  • src/CraneCtld/CraneCtld.cpp
  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/CraneCtld.cpp
  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-09T01:54:39.465Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1509-1517
Timestamp: 2025-05-09T01:54:39.465Z
Learning: The CraneSched project uses C++23, which supports Class Template Argument Deduction (CTAD) for standard containers like std::set and includes ranges support, making std::ranges::views::keys valid without additional headers.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-03-31T09:29:40.388Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In the CraneSched project, the `m_allowed_certs_` in `VaultClientWrapper` is implemented as a `phmap::parallel_flat_hash_set<std::string>`, which is a thread-safe container designed for concurrent access, making it safe to use in multithreaded contexts without additional synchronization.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-03-31T09:29:40.388Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In the CraneSched project, the `m_allowed_certs_` is an `std::unordered_set<std::string>` which is not inherently thread-safe. The `IsCertAllowed` method in `VaultClientWrapper` involves both read and write operations on this container and could cause race conditions when called from multiple threads.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-12-08T08:11:40.332Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.332Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/CraneCtld/TaskScheduler.h
  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/CraneCtld/TaskScheduler.h
  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
  • src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.

Applied to files:

  • src/CraneCtld/TaskScheduler.cpp
🧬 Code graph analysis (6)
src/Utilities/PublicHeader/include/crane/String.h (1)
src/Utilities/PublicHeader/String.cpp (2)
  • TimeStr2Mins (544-590)
  • TimeStr2Mins (544-544)
src/CraneCtld/TaskScheduler.h (1)
src/CraneCtld/TaskScheduler.cpp (4)
  • JobSubmitLuaCheck (1563-1568)
  • JobSubmitLuaCheck (1563-1563)
  • JobModifyLuaCheck (1570-1592)
  • JobModifyLuaCheck (1570-1570)
src/CraneCtld/Lua/LuaJobHandler.cpp (3)
src/Craned/Supervisor/TaskManager.h (1)
  • job_id (51-51)
src/Craned/Core/SupervisorKeeper.cpp (1)
  • pair (123-183)
src/CraneCtld/CranedMetaContainer.h (1)
  • vector (174-177)
src/CraneCtld/CraneCtld.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (160-160)
src/Utilities/PublicHeader/Lua.cpp (1)
src/Utilities/PublicHeader/String.cpp (2)
  • TimeStr2Mins (544-590)
  • TimeStr2Mins (544-544)
src/CraneCtld/TaskScheduler.cpp (2)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (160-160)
src/CraneCtld/Lua/LuaJobHandler.cpp (4)
  • JobSubmit (32-93)
  • JobSubmit (32-33)
  • JobModify (95-139)
  • JobModify (95-96)
🪛 LanguageTool
docs/en/reference/lua_guide.md

[style] ~35-~35: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...b submission rejected due to error. - crane.* (Crane error codes) — Reject submi...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (27)
protos/Crane.proto (1)

90-90: LGTM! Backward-compatible field addition.

The reason field is correctly placed outside the oneof payload block, allowing it to provide explanatory messages for both successful and failed submissions.

etc/config.yaml (1)

89-90: LGTM! Clear documentation for optional Lua script configuration.

The commented example provides users with guidance on enabling Lua-based job submission validation.

mkdocs.yaml (1)

86-86: LGTM! Documentation navigation updated correctly.

The Lua Guide is properly integrated into both English and Chinese navigation sections, maintaining the bilingual documentation structure.

Also applies to: 124-124

src/CraneCtld/CtldPublicDefs.h (1)

186-187: LGTM! Configuration field addition is straightforward.

The JobSubmitLuaScript field appropriately uses std::string for the Lua script path. An empty string will naturally indicate that Lua validation is disabled.

src/CraneCtld/CMakeLists.txt (2)

22-24: LGTM! Lua job handler sources added.

The new source files are correctly integrated into the cranectld executable build.


70-72: LGTM! Conditional Lua linking is correctly configured.

The conditional linking and include directories properly use LUA_FOUND to enable optional Lua support. Note that sol2 is always linked while the Lua runtime libraries are conditional, which is appropriate for header-only sol2 usage.

Also applies to: 94-96

dependencies/cmake/CMakeLists.txt (1)

37-37: LGTM! Sol2 subdirectory correctly integrated.

The lua_sol subdirectory is properly added to the third-party dependencies build, enabling the FetchContent-based sol2 library integration.

dependencies/cmake/lua_sol/CMakeLists.txt (1)

3-7: sol2 version v3.5.0 is current and appropriate for use.

v3.5.0 was released on March 29, 2025, making it the latest stable version. No known security vulnerabilities were identified for this version. The library is distributed under the MIT License and has no known vulnerabilities.

src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)

262-264: Protobuf ErrCode enum correctly includes ERR_LUA_FAILED = 85.

The error message array is properly aligned. "Lua script validation failed" at array index 84 (line 264) correctly corresponds to the ERR_LUA_FAILED = 85 entry in the protobuf ErrCode enum (protos/PublicDefs.proto:697). The comment "// 85-89" confirms the index alignment.

src/CraneCtld/CtldPublicDefs.cpp (2)

1176-1178: LGTM - Username populated from valid password entry.

The code correctly checks if password_entry is valid before calling SetUsername. This ensures the username field is populated from the password database when available.


1322-1327: LGTM - Environment variables propagated to TaskInfo.

The code correctly copies all environment variables from the task's env map to the protobuf TaskInfo.env field, enabling Lua scripts and other components to access task environment data.

src/Utilities/PublicHeader/include/crane/String.h (1)

210-211: LGTM - New time parsing utility function declared.

The new TimeStr2Mins function provides time string parsing capability. The signature is appropriate with std::string_view for efficient string handling.

src/CraneCtld/TaskScheduler.h (1)

781-784: LGTM - Lua pre-check entry points declared.

The new JobSubmitLuaCheck and JobModifyLuaCheck declarations enable optional Lua-based validation. The return type std::optional<std::future<CraneRichError>> appropriately conveys both optionality (when Lua is disabled) and asynchronous execution.

protos/PublicDefs.proto (2)

471-471: LGTM - Environment map field added to TaskInfo.

The new env field enables per-task environment variables to be carried in TaskInfo, which is essential for Lua-based validation that needs access to task environment data.


696-697: LGTM - New error code for Lua failures.

The ERR_LUA_FAILED error code (85) properly extends the error classification to handle Lua script execution failures in the job submission/modification workflow.

docs/zh/reference/lua_guide.md (1)

1-321: LGTM - Comprehensive Lua integration guide in Chinese.

The documentation thoroughly covers Lua script configuration including required functions, available APIs, data structures, and configuration examples. The sample code demonstrates proper usage of the Lua integration.

src/Utilities/PublicHeader/String.cpp (1)

544-590: LGTM - Robust time string parsing implementation.

The TimeStr2Mins function correctly handles multiple time format variations:

  • Special values: "-1", "INFINITE", "UNLIMITED" → -1
  • Optional day prefix with dash separator
  • Hours:minutes:seconds, minutes:seconds, or minutes-only formats
  • Rounds up by 1 minute when seconds > 0

The implementation validates input characters and returns -1 for malformed input, providing consistent error handling.

src/CraneCtld/CraneCtld.cpp (3)

36-36: LGTM - Lua job handler included.

The include for Lua/LuaJobHandler.h enables Lua-based job submission validation when configured.


253-254: LGTM - Lua script path configuration.

The JobSubmitLuaScript configuration field is properly read with an empty string default, allowing optional Lua integration.


923-924: LGTM - Conditional Lua pool initialization.

The Lua pool is correctly initialized only when JobSubmitLuaScript is configured, avoiding unnecessary resource allocation when Lua support is not needed. The initialization order (after task scheduler, before server start) is appropriate.

src/Utilities/PublicHeader/Lua.cpp (2)

175-182: LGTM!

Clean use of protobuf reflection to dynamically register error codes into the Lua environment.


184-194: LGTM!

Proper validation of required Lua functions with informative error messages.

src/Utilities/PublicHeader/include/crane/Lua.h (1)

93-106: LGTM!

Clean use of C++20 concepts to constrain the callback type, and proper forwarding of the callback.

src/CraneCtld/Lua/LuaJobHandler.cpp (4)

190-205: Verify lifetime of captured lua_env reference in lambdas.

Several lambdas capture lua_env by reference (via [&]) and are stored in usertype definitions. These lambdas call lua_env.GetLuaState().create_table(). If the lua_env reference outlives the referenced object, this would cause use-after-free.

The current code appears safe since lua_env is created at the start of JobSubmit/JobModify and lives until those functions return, after Lua script execution completes. However, consider capturing lua_env.GetLuaState() reference directly to make the dependency clearer and more resilient to refactoring.


143-178: LGTM!

Well-structured global function registration with proper use of std::optional for nullable returns and appropriate error logging.


286-394: LGTM!

Comprehensive and well-typed property bindings for TaskInfo, with proper conversions for timestamps, durations, and container types.


487-506: LGTM!

Clean implementation of lazy-loading for reservations using Lua metatable __pairs, ensuring fresh data is queried on each iteration.

@huerni huerni requested a review from L-Xiafeng December 23, 2025 07:03
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)

662-747: ModifyTask Lua gating breaks behavior when Lua is disabled and hides non‑existent tasks; also incorrect add_not_modified_tasks(false)

The new Lua gating in ModifyTask has a few correctness issues:

  1. When Lua is disabled (JobSubmitLuaScript empty), no tasks are modified at all

    • JobModifyLuaCheck returns std::nullopt when no Lua script is configured.
    • Your first loop builds futures only when fut is non‑nullopt, and task_ids is populated only from successful Lua results in the second loop.
    • With Lua disabled, futures and task_ids remain empty, so the later loops over task_ids never touch any of the requested task IDs. The RPC returns with all lists empty, effectively becoming a no‑op.

    You probably want the old behavior (no Lua gating) when no script is configured. One simple pattern:

    // Start with all requested ids
    std::list<task_id_t> task_ids(request->task_ids().begin(),
                                  request->task_ids().end());
    
    std::list<std::pair<task_id_t, std::future<CraneRichError>>> futures;
    for (auto task_id : task_ids) {
      auto fut = g_task_scheduler->JobModifyLuaCheck(task_id);
      if (fut) futures.emplace_back(task_id, std::move(*fut));
    }
    
    for (auto& [task_id, fut] : futures) {
      auto rich_err = fut.get();
      if (rich_err.code() != CraneErrCode::SUCCESS) {
        response->add_not_modified_tasks(task_id);
        response->add_not_modified_reasons(
            rich_err.description().empty()
                ? CraneErrStr(rich_err.code())
                : rich_err.description());
        task_ids.remove(task_id);  // or track a separate allowed set
      }
    }

    This way:

    • If Lua is not configured (nullopt), the task still flows through and is handled as before.
    • If Lua rejects a task, it is reported in not_modified_* and excluded from the modification loops.
  2. Non‑existent tasks now drop out silently

    JobModifyLuaCheck also returns std::nullopt when the task_id isn’t found in the pending/running maps. With the new logic, such IDs never make it into task_ids and never get an error recorded, whereas previously ChangeTaskTimeLimit / ChangeTaskPriority / HoldReleaseTaskAsync would return ERR_NON_EXISTENT and that would be surfaced to the client. The restructuring above (starting from all requested IDs and only removing those explicitly rejected by Lua) preserves the prior error-reporting behavior.

  3. Typo: add_not_modified_tasks(false)

    In the Hold branch:

    } else {
      response->add_not_modified_tasks(false);
      response->add_not_modified_reasons(
          fmt::format("Failed to hold/release job: {}.", CraneErrStr(err)));
    }

    should use task_id rather than false, otherwise you’re appending 0 as a bogus “not modified” task id:

  •    response->add_not_modified_tasks(false);
    
  •    response->add_not_modified_tasks(task_id);
    
    
    

Addressing these points should restore the original ModifyTask semantics when Lua is off, add proper reporting for Lua‑rejected tasks, and fix the incorrect task id in the error case.

♻️ Duplicate comments (5)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)

512-552: Multi‑submit (SubmitBatchTasks) still bypasses Lua validation and can bypass policy

You’ve correctly wired JobSubmitLuaCheck into:

  • Interactive submissions via CforedStream (TASK_REQUEST), and
  • Single batch submissions via SubmitBatchTask.

However, SubmitBatchTasks (the count > 1 API) still goes directly to SubmitTaskToScheduler without any Lua check:

for (int i = 0; i < task_count; i++) {
  auto task = std::make_unique<TaskInCtld>();
  task->SetFieldsByTaskToCtld(task_to_ctld);

  auto result = g_task_scheduler->SubmitTaskToScheduler(std::move(task));
  results.emplace_back(std::move(result));
}

This allows callers to bypass any site policies enforced in Lua by using the multi-submit API instead of the single-task API.

I’d recommend:

  • Running JobSubmitLuaCheck(task.get()) for each task in the loop, and
  • If Lua returns a non‑SUCCESS CraneRichError, append an appropriate error code to code_list (and a sentinel like 0 in task_id_list if needed), mirroring the single-submit behavior, otherwise enqueue via SubmitTaskToScheduler.

That keeps all submission entry points consistent with respect to Lua-based validation.

src/Utilities/PublicHeader/include/crane/Lua.h (1)

73-90: Fix LuaPool string‑based ExecuteLuaScript: early returns and success result

The ExecuteLuaScript(const std::string& lua_script) overload still has the issues previously noted:

  • If Init(lua_script) fails, the lambda continues and calls LoadLuaScript on a not‑properly‑initialized environment.
  • If LoadLuaScript fails, the lambda continues; only the last failure “wins”.
  • On success, result remains default-constructed instead of being explicitly set to a SUCCESS CraneRichError.

This risks undefined behavior on init failure and makes the success path ambiguous. Consider restructuring as:

  m_thread_pool_->detach_task([lua_script, promise]() {
-   CraneRichError result;
-   auto lua_env = std::make_unique<crane::LuaEnvironment>();
-   if (!lua_env->Init(lua_script))
-     result = FormatRichErr(CraneErrCode::ERR_LUA_FAILED,
-                            "Failed to init lua environment");
-
-   if (!lua_env->LoadLuaScript({}))
-     result = FormatRichErr(CraneErrCode::ERR_LUA_FAILED,
-                            "Failed to load lua script");
-
-   promise->set_value(result);
+   auto lua_env = std::make_unique<crane::LuaEnvironment>();
+   if (!lua_env->Init(lua_script)) {
+     promise->set_value(FormatRichErr(
+         CraneErrCode::ERR_LUA_FAILED, "Failed to init lua environment"));
+     return;
+   }
+
+   if (!lua_env->LoadLuaScript({})) {
+     promise->set_value(FormatRichErr(
+         CraneErrCode::ERR_LUA_FAILED, "Failed to load lua script"));
+     return;
+   }
+
+   promise->set_value(FormatRichErr(CraneErrCode::SUCCESS, ""));
  });

This ensures the promise is set exactly once with a well-defined CraneRichError in all paths.

docs/en/reference/lua_guide.md (1)

73-79: Fix minor inconsistencies and typos in the English Lua guide

The guide is very detailed; a few small fixes will make it match the actual API and examples:

  • In the function table, change the logging row to consistently use crane.log_info / crane.log_debug (not bare log_info / log_debug), matching the example script.
  • In the job_desc attributes table, included_nodes / excluded_nodes are declared as string, but the example treats them as arrays (table.concat(job_desc.included_nodes, ", ")), so the documented type should be table (list of strings).
  • In the crane.reservations table, fix the typo partitonpartition.
  • The meta example uses job_desc.bash_meta; this almost certainly should be batch_meta to match the underlying concept.
  • As in the Chinese doc, the documented signature for crane.set_job_env_field does not match the example call; please align the argument order with the real binding.

Also applies to: 86-112, 175-191, 339-345, 272-279

src/CraneCtld/Lua/LuaJobHandler.cpp (1)

519-529: Potential std::out_of_range from .at() call.

This issue was previously flagged. Line 521 uses .at(actual_account) which throws if the key doesn't exist.

src/Utilities/PublicHeader/Lua.cpp (1)

97-144: Missing validation for empty variadic arguments.

This issue was previously flagged. The logging functions access va[0] without checking if va is empty, causing undefined behavior when called with no arguments.

🧹 Nitpick comments (9)
CMakeLists.txt (1)

89-90: Lua CMake option wiring looks fine; optionally surface missing-Lua as a warning

The CRANE_WITH_LUA option and find_package(Lua QUIET) block are sane and keep Lua fully optional. You may want to emit a message(WARNING ...) when CRANE_WITH_LUA is ON but LUA_FOUND is false so users notice that their Lua config is being ignored instead of silently disabling Lua features.

Also applies to: 279-288

src/Utilities/PublicHeader/String.cpp (1)

544-593: TimeStr2Mins logic is reasonable; consider clarifying 2‑field format and rejecting negative parts

Implementation looks correct for the supported forms (D-HH:MM:SS, HH:MM:SS, MM:SS, MM) and is defensive about malformed input by returning -1. Two small optional refinements:

  • The 2‑field case is interpreted as minutes:seconds, not hours:minutes. If that’s intentional, it’d be good to document it at the call sites / docs; otherwise you might want to swap the mapping.
  • Currently any '-' is allowed in the time part, so inputs like "10:-1:00" will parse with a negative component. If you want to treat such inputs as invalid, you could either disallow '-' after the day prefix, or explicitly reject negative values after SimpleAtoi.
src/CraneCtld/CraneCtld.cpp (1)

36-36: Lua hook wiring into ctld init/config looks good; consider explicit cleanup

The new JobSubmitLuaScript config field and conditional initialization of g_lua_pool are wired in cleanly and fail fast on Init() failure, which is appropriate for a core control daemon.

For symmetry with other globals, you might optionally add g_lua_pool.reset(); in DestroyCtldGlobalVariables() so the Lua thread pool is explicitly torn down alongside g_thread_pool and the DB clients, instead of relying solely on static destruction at process exit.

Also applies to: 253-255, 883-886

src/CraneCtld/Lua/LuaJobHandler.h (3)

20-21: Misleading comment about include order.

The comment "Precompiled header comes first!" appears after CtldPublicDefs.h is included. If CtldPublicDefs.h is the precompiled header, the comment should precede it. Otherwise, remove or relocate this comment.


34-34: explicit is unnecessary on default constructor.

The explicit keyword has no effect on a default constructor since there's no parameter to prevent implicit conversions from.

Suggested fix
-  explicit LuaJobHandler() = default;
+  LuaJobHandler() = default;

64-64: Global g_lua_pool declared outside namespace.

The global g_lua_pool is declared in the global namespace while related code is in Ctld. Consider moving it inside the Ctld namespace for consistency, or document why it needs global scope.

src/CraneCtld/Lua/LuaJobHandler.cpp (1)

44-61: Querying all tasks on each pairs(crane.jobs) iteration.

Every time Lua iterates over crane.jobs, QueryTasksInRam fetches all tasks. For large job counts, this could be expensive. Consider caching or documenting this behavior for script authors.

src/Utilities/PublicHeader/Lua.cpp (2)

28-28: Redundant temporary in make_unique.

std::make_unique<sol::state>(sol::state()) creates a temporary sol::state and then move-constructs into the unique_ptr. Use std::make_unique<sol::state>() for direct construction.

Suggested fix
-  m_lua_state_ptr_ = std::make_unique<sol::state>(sol::state());
+  m_lua_state_ptr_ = std::make_unique<sol::state>();

Apply the same fix at line 46.


41-53: Duplicated initialization logic.

Lines 41-53 duplicate the initialization logic from Init(). Extract a private helper to avoid repetition and ensure consistency.

Suggested refactor
+void LuaEnvironment::InitState_() {
+  m_lua_state_ptr_ = std::make_unique<sol::state>();
+  m_lua_state_ptr_->open_libraries(sol::lib::base, sol::lib::package,
+                                   sol::lib::table, sol::lib::string);
+  m_crane_table_ = (*m_lua_state_ptr_)["crane"].valid()
+                       ? (*m_lua_state_ptr_)["crane"]
+                       : (*m_lua_state_ptr_).create_named_table("crane");
+  RegisterFunctions_();
+}
+
 bool LuaEnvironment::Init(const std::string& script) {
 #ifdef HAVE_LUA
   m_lua_script_ = script;
-  m_lua_state_ptr_ = std::make_unique<sol::state>(sol::state());
-  m_lua_state_ptr_->open_libraries(sol::lib::base, sol::lib::package,
-                                   sol::lib::table, sol::lib::string);
-  m_crane_table_ = (*m_lua_state_ptr_)["crane"].valid()
-                       ? (*m_lua_state_ptr_)["crane"]
-                       : (*m_lua_state_ptr_).create_named_table("crane");
-  RegisterFunctions_();
+  InitState_();
 #endif
   return true;
 }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 734d3d0 and b3b7a53.

📒 Files selected for processing (12)
  • CMakeLists.txt
  • docs/en/reference/lua_guide.md
  • docs/zh/reference/lua_guide.md
  • src/CraneCtld/CraneCtld.cpp
  • src/CraneCtld/Lua/LuaJobHandler.cpp
  • src/CraneCtld/Lua/LuaJobHandler.h
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/TaskScheduler.cpp
  • src/CraneCtld/TaskScheduler.h
  • src/Utilities/PublicHeader/Lua.cpp
  • src/Utilities/PublicHeader/String.cpp
  • src/Utilities/PublicHeader/include/crane/Lua.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/CraneCtld/TaskScheduler.cpp
🧰 Additional context used
🧠 Learnings (12)
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/Lua/LuaJobHandler.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-09T01:54:39.465Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1509-1517
Timestamp: 2025-05-09T01:54:39.465Z
Learning: The CraneSched project uses C++23, which supports Class Template Argument Deduction (CTAD) for standard containers like std::set and includes ranges support, making std::ranges::views::keys valid without additional headers.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/Lua/LuaJobHandler.cpp
📚 Learning: 2025-03-31T09:29:40.388Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In the CraneSched project, the `m_allowed_certs_` in `VaultClientWrapper` is implemented as a `phmap::parallel_flat_hash_set<std::string>`, which is a thread-safe container designed for concurrent access, making it safe to use in multithreaded contexts without additional synchronization.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/TaskScheduler.h
📚 Learning: 2025-03-31T09:29:40.388Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In the CraneSched project, the `m_allowed_certs_` is an `std::unordered_set<std::string>` which is not inherently thread-safe. The `IsCertAllowed` method in `VaultClientWrapper` involves both read and write operations on this container and could cause race conditions when called from multiple threads.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.

Applied to files:

  • src/CraneCtld/TaskScheduler.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/CraneCtld/TaskScheduler.h
🧬 Code graph analysis (4)
src/CraneCtld/CraneCtld.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (160-160)
src/Utilities/PublicHeader/include/crane/Lua.h (1)
src/Utilities/PublicHeader/String.cpp (2)
  • SetCurrentThreadName (371-379)
  • SetCurrentThreadName (371-371)
src/CraneCtld/TaskScheduler.h (1)
src/CraneCtld/TaskScheduler.cpp (4)
  • JobSubmitLuaCheck (1563-1569)
  • JobSubmitLuaCheck (1563-1564)
  • JobModifyLuaCheck (1571-1596)
  • JobModifyLuaCheck (1571-1572)
src/Utilities/PublicHeader/Lua.cpp (2)
src/Utilities/PublicHeader/include/crane/Lua.h (1)
  • Init (62-71)
src/Utilities/PublicHeader/String.cpp (2)
  • TimeStr2Mins (544-593)
  • TimeStr2Mins (544-544)
🪛 LanguageTool
docs/en/reference/lua_guide.md

[style] ~35-~35: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...b submission rejected due to error. - crane.* (Crane error codes) — Reject submi...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🔇 Additional comments (6)
src/CraneCtld/Lua/LuaJobHandler.cpp (4)

73-78: Verify Lua function signature expectations.

The crane_job_submit Lua function is called with (task, part_list, task->uid). Ensure the Lua script contract and documentation match this signature. If the Lua function expects different arguments, runtime errors will occur.


116-124: task_info is unused after SetFieldsOfTaskInfo.

task_info is populated at line 117 and passed to modify() at line 124, but the Lua function receives both task (the mutable TaskInCtld*) and task_info (a copy). Verify this is intentional—if Lua modifies task_info, those changes are lost.


181-486: LGTM on type registrations.

The sol2 usertype bindings for ResourceView, TaskInCtld, TaskInfo, PartitionInfo, and ReservationInfo are well-structured and expose the necessary properties to Lua.


191-206: Capture lua_env by value or store a reference to the underlying state.

The lambdas capture lua_env by reference at lines 191-206, 246, 272, 335, and 390. While the LuaEnvironment object remains valid (held by unique_ptr), the reference parameter goes out of scope when RegisterTypes_ returns. When sol2 invokes these lambdas during Lua property access (synchronously during LoadLuaScript() and Lua function calls), they access a dangling reference parameter—undefined behavior even though the target survives.

Either:

  • Capture lua_env by value: [lua_env] instead of [&]
  • Or store and reuse lua_state: capture it directly and avoid the parameter reference entirely

Likely an incorrect or invalid review comment.

src/Utilities/PublicHeader/Lua.cpp (2)

174-181: LGTM on error table registration.

The RegisterOutputErrTab_ function correctly iterates over the protobuf enum descriptor and exposes all error codes to Lua.


183-193: LGTM on function validation.

The CheckLuaScriptFunctions_ correctly validates that required functions exist and are callable in the Lua state.

Comment on lines +781 to +786
std::optional<std::future<CraneRichError>> JobSubmitLuaCheck(
TaskInCtld* task);

std::optional<std::future<CraneRichError>> JobModifyLuaCheck(
task_id_t task_id);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

JobModifyLuaCheck implementation risks use‑after‑free on internal task maps

The public API shape (std::optional<std::future<CraneRichError>>) is fine, but based on the TaskScheduler.cpp snippet, JobModifyLuaCheck currently does:

  • Lock pending/running map mutex,
  • Find task_id,
  • Capture the map iterator by value in a lambda submitted to g_lua_pool->ExecuteLuaScript,
  • Release the mutex before the async lambda runs, and inside the lambda call pd_iter->second.get() / rn_iter->second.get().

If the underlying m_pending_task_map_ or m_running_task_map_ entry is erased or the map rehashes before the Lua task runs, that iterator (and the TaskInCtld* obtained from it) become invalid, leading to undefined behavior.

It’d be safer to:

  • Capture a stable object for Lua (e.g., a copy/struct projection of just the fields needed by Lua), or
  • Capture a TaskInCtld* plus some lifetime guarantee (e.g., keep the job in a separate holder or delay erasure until Lua completes), or
  • Rework the API so Lua validation is done synchronously under the same lock scope.
🤖 Prompt for AI Agents
In src/CraneCtld/TaskScheduler.h around lines 781-786 the current
JobModifyLuaCheck implementation (in TaskScheduler.cpp) captures a map iterator
and releases the map mutex before the async Lua lambda runs, risking
use-after-free if the map entry is erased or the map rehashes; fix this by
avoiding capturing iterators: either (1) construct a small POD struct containing
only the fields Lua needs and capture that copy in the lambda, or (2) change the
map value to a shared ownership type (e.g., std::shared_ptr<TaskInCtld>) and
capture a shared_ptr to the task in the lambda so the task lifetime is extended
until Lua completes, or (3) perform the Lua validation synchronously under the
lock; implement one of these options, remove any direct iterator capture from
the async lambda, and ensure the mutex is released only after a safe copy or
ownership transfer has been captured.

Comment on lines +25 to +29
#include <BS_thread_pool.hpp>
#include <memory>

#include "crane/Logger.h"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Include header for util::SetCurrentThreadName to make LuaPool self-contained

LuaPool::Init uses util::SetCurrentThreadName("LuaThreadPool") in the thread-pool worker lambda, but this header only includes BS_thread_pool.hpp and crane/Logger.h. Unless util::SetCurrentThreadName is guaranteed to be visible via transitive includes, this will fail to compile.

Given SetCurrentThreadName is defined in util (see src/Utilities/PublicHeader/String.cpp), it’s safer for this header to include its declaration explicitly:

#include "crane/Logger.h"
#include "crane/String.h"  // for util::SetCurrentThreadName

or to move the SetCurrentThreadName usage into a .cpp that already includes crane/String.h.

Also applies to: 62-71, 108-109

Comment on lines +25 to +37
bool LuaEnvironment::Init(const std::string& script) {
#ifdef HAVE_LUA
m_lua_script_ = script;
m_lua_state_ptr_ = std::make_unique<sol::state>(sol::state());
m_lua_state_ptr_->open_libraries(sol::lib::base, sol::lib::package,
sol::lib::table, sol::lib::string);
m_crane_table_ = (*m_lua_state_ptr_)["crane"].valid()
? (*m_lua_state_ptr_)["crane"]
: (*m_lua_state_ptr_).create_named_table("crane");
RegisterFunctions_();
#endif
return true;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Init() always returns true even when HAVE_LUA is undefined.

When HAVE_LUA is not defined, Init() does nothing but returns true. This could mislead callers into thinking Lua is available. Consider returning false or logging a warning when Lua support is disabled.

Suggested fix
 bool LuaEnvironment::Init(const std::string& script) {
+#ifndef HAVE_LUA
+  CRANE_WARN("Lua support is not enabled");
+  return false;
+#else
-#ifdef HAVE_LUA
   m_lua_script_ = script;
   m_lua_state_ptr_ = std::make_unique<sol::state>();
   m_lua_state_ptr_->open_libraries(sol::lib::base, sol::lib::package,
                                    sol::lib::table, sol::lib::string);
   m_crane_table_ = (*m_lua_state_ptr_)["crane"].valid()
                        ? (*m_lua_state_ptr_)["crane"]
                        : (*m_lua_state_ptr_).create_named_table("crane");
   RegisterFunctions_();
-#endif
   return true;
+#endif
 }
🤖 Prompt for AI Agents
In src/Utilities/PublicHeader/Lua.cpp around lines 25 to 37, Init() currently
always returns true even when HAVE_LUA is undefined; update the function so that
when HAVE_LUA is not defined it returns false (and optionally logs a warning
about Lua support being disabled) instead of indicating success, and keep the
existing initialization path under the HAVE_LUA block returning true on success.


if (!g_config.JobSubmitLuaScript.empty()) {
g_lua_pool = std::make_unique<crane::LuaPool>();
if (!g_lua_pool->Init()) std::exit(1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

加下日志吧

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

里面有日志

}

std::list<task_id_t> task_ids;
std::list<std::pair<task_id_t, std::future<CraneRichError>>> futures;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这些地方是要求用list吗?最好换成vector,这里都知道size吧


crane::grpc::TaskInfo task_info;
task->SetFieldsOfTaskInfo(&task_info);
std::list<crane::grpc::PartitionInfo> part_list;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

以及这里的list

@L-Xiafeng
Copy link
Collaborator

可以先提测

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)

662-747: Fix ModifyTask Lua gating and incorrect task id reporting.

Two problems in the new ModifyTask Lua integration:

  1. Tasks with JobModifyLuaCheck returning empty optional are silently dropped.
    You currently do:

    std::list<task_id_t> task_ids;
    std::vector<std::pair<task_id_t, std::future<CraneRichError>>> futures;
    
    for (const auto task_id : request->task_ids()) {
      auto fut = g_task_scheduler->JobModifyLuaCheck(task_id);
      if (fut) futures.emplace_back(task_id, std::move(fut.value()));
    }
    
    for (auto& [task_id, fut] : futures) {
      auto rich_err = fut.get();
      if (rich_err.code() != CraneErrCode::SUCCESS) {
        // push into not_modified_*
      } else {
        task_ids.emplace_back(task_id);
      }
    }

    If JobModifyLuaCheck ever returns std::nullopt (e.g., Lua disabled or job not found), that task_id is neither added to task_ids nor recorded in not_modified_tasks / not_modified_reasons; subsequent attribute handlers iterate only task_ids. This makes some requested task IDs disappear from the reply.

    Either guarantee JobModifyLuaCheck always returns a future, or (more robustly) treat std::nullopt as “no Lua check; allow modification” by inserting such IDs directly into task_ids.

  2. Hold attribute branch reports false instead of the actual task id.
    In the Hold branch:

    } else {
      response->add_not_modified_tasks(false);
      response->add_not_modified_reasons(
          fmt::format("Failed to hold/release job: {}.", CraneErrStr(err)));
    }

    false is converted to 0, so clients see a bogus task id rather than the real task_id.

A possible fix that addresses both issues:

Proposed fix
-  std::list<task_id_t> task_ids;
-  std::vector<std::pair<task_id_t, std::future<CraneRichError>>> futures;
-  futures.reserve(request->task_ids().size());
-  for (const auto task_id : request->task_ids()) {
-    auto fut = g_task_scheduler->JobModifyLuaCheck(task_id);
-    if (fut) futures.emplace_back(task_id, std::move(fut.value()));
-  }
-
-  for (auto& [task_id, fut] : futures) {
-    auto rich_err = fut.get();
-    if (rich_err.code() != CraneErrCode::SUCCESS) {
-      response->add_not_modified_tasks(task_id);
-      if (rich_err.description().empty())
-        response->add_not_modified_reasons(CraneErrStr(rich_err.code()));
-      else
-        response->add_not_modified_reasons(rich_err.description());
-    } else {
-      task_ids.emplace_back(task_id);
-    }
-  }
+  std::list<task_id_t> task_ids;
+  std::vector<std::pair<task_id_t, std::future<CraneRichError>>> futures;
+  futures.reserve(request->task_ids().size());
+
+  for (const auto task_id : request->task_ids()) {
+    auto fut = g_task_scheduler->JobModifyLuaCheck(task_id);
+    if (!fut) {
+      // No Lua check (e.g. Lua disabled) – allow the modification path.
+      task_ids.emplace_back(task_id);
+      continue;
+    }
+    futures.emplace_back(task_id, std::move(fut.value()));
+  }
+
+  for (auto& [task_id, fut] : futures) {
+    auto rich_err = fut.get();
+    if (rich_err.code() != CraneErrCode::SUCCESS) {
+      response->add_not_modified_tasks(task_id);
+      if (rich_err.description().empty())
+        response->add_not_modified_reasons(CraneErrStr(rich_err.code()));
+      else
+        response->add_not_modified_reasons(rich_err.description());
+    } else {
+      task_ids.emplace_back(task_id);
+    }
+  }
@@
   } else if (request->attribute() == ModifyTaskRequest::Hold) {
@@
-    for (auto& [task_id, res] : results) {
+    for (auto& [task_id, res] : results) {
       err = res.get();
       if (err == CraneErrCode::SUCCESS) {
         response->add_modified_tasks(task_id);
@@
       } else {
-        response->add_not_modified_tasks(false);
+        response->add_not_modified_tasks(task_id);
         response->add_not_modified_reasons(
             fmt::format("Failed to hold/release job: {}.", CraneErrStr(err)));
       }
♻️ Duplicate comments (2)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)

512-549: Multi-submit (SubmitBatchTasks) still bypasses Lua validation.

This path still enqueues count copies of task_to_ctld directly into SubmitTaskToScheduler without calling JobSubmitLuaCheck per instance:

for (int i = 0; i < task_count; i++) {
  auto task = std::make_unique<TaskInCtld>();
  task->SetFieldsByTaskToCtld(task_to_ctld);

  auto result = g_task_scheduler->SubmitTaskToScheduler(std::move(task));
  results.emplace_back(std::move(result));
}

If sites rely on Lua scripts to enforce submit‑time policy, callers can continue to circumvent those policies by using this multi-submit RPC instead of the single-submit APIs.

Align this code path with SubmitBatchTask by running JobSubmitLuaCheck for each synthetic task and, on failure, returning an appropriate per‑task error entry (and a sentinel ID) while skipping scheduler submission for that instance.

docs/en/reference/lua_guide.md (1)

73-79: Synchronize function names, types, and typos with the actual Lua bindings.

Several inconsistencies remain between this guide and the bindings / examples:

  • The logging row shows crane.log_error("msg") / log_info("msg") / log_debug("msg"); the latter two should be crane.log_info / crane.log_debug.
  • In job_desc attributes, included_nodes and excluded_nodes are documented as string, but the example uses them as arrays (table.concat(job_desc.included_nodes, ", ") etc.). The type should reflect this (e.g. table or table(string list)).
  • In crane.reservations, partiton is misspelled; should be partition.
  • The “meta (batch_meta / interactive_meta)” snippet uses job_desc.bash_meta instead of batch_meta.

Aligning these with the real API will make the doc less confusing for users relying on it as a reference.

Also applies to: 86-112, 175-191, 339-345

🧹 Nitpick comments (2)
src/CraneCtld/CraneCtld.cpp (1)

36-36: Lua pool initialization path is sound; consider explicit teardown for symmetry.

The flow

  • parses JobSubmitLuaScript into g_config.JobSubmitLuaScript, and
  • initializes g_lua_pool only when the script path is non-empty, failing fast if Init() returns false,

which is correct and keeps Lua completely disabled by default.

For cleaner lifetime symmetry with other globals in DestroyCtldGlobalVariables(), you may optionally add an explicit g_lua_pool.reset(); there so Lua resources are released alongside the scheduler, DB client, etc., rather than relying solely on static destruction.

Also applies to: 253-255, 896-899

src/CraneCtld/Lua/LuaJobHandler.h (1)

32-64: LuaJobHandler API is fine; consider relocating g_lua_pool out of this header.

The LuaJobHandler interface (static JobSubmit / JobModify plus private registration helpers under HAVE_LUA) looks good.

For inline std::unique_ptr<crane::LuaPool> g_lua_pool;, you might consider moving this global into a more central “globals” header or a .cpp with an extern declaration here, to keep this header focused on the handler API and avoid exposing yet another global from a feature-specific header. Not required functionally, but it will make ownership of global state easier to track.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b3b7a53 and e018d62.

📒 Files selected for processing (24)
  • CMakeLists.txt
  • dependencies/cmake/CMakeLists.txt
  • dependencies/cmake/lua_sol/CMakeLists.txt
  • docs/en/reference/lua_guide.md
  • docs/zh/reference/lua_guide.md
  • etc/config.yaml
  • mkdocs.yaml
  • protos/Crane.proto
  • protos/PublicDefs.proto
  • src/CraneCtld/CMakeLists.txt
  • src/CraneCtld/CraneCtld.cpp
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/CraneCtld/Lua/LuaJobHandler.cpp
  • src/CraneCtld/Lua/LuaJobHandler.h
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/TaskScheduler.cpp
  • src/CraneCtld/TaskScheduler.h
  • src/Utilities/PublicHeader/CMakeLists.txt
  • src/Utilities/PublicHeader/Lua.cpp
  • src/Utilities/PublicHeader/String.cpp
  • src/Utilities/PublicHeader/include/crane/Lua.h
  • src/Utilities/PublicHeader/include/crane/PublicHeader.h
  • src/Utilities/PublicHeader/include/crane/String.h
🚧 Files skipped from review as they are similar to previous changes (13)
  • mkdocs.yaml
  • src/CraneCtld/CtldPublicDefs.cpp
  • protos/Crane.proto
  • src/CraneCtld/TaskScheduler.h
  • protos/PublicDefs.proto
  • src/Utilities/PublicHeader/CMakeLists.txt
  • dependencies/cmake/lua_sol/CMakeLists.txt
  • src/CraneCtld/TaskScheduler.cpp
  • etc/config.yaml
  • docs/zh/reference/lua_guide.md
  • CMakeLists.txt
  • src/Utilities/PublicHeader/include/crane/Lua.h
  • dependencies/cmake/CMakeLists.txt
🧰 Additional context used
🧠 Learnings (11)
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/Lua/LuaJobHandler.cpp
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.

Applied to files:

  • src/CraneCtld/CraneCtld.cpp
  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/Lua/LuaJobHandler.cpp
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-09T01:54:39.465Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1509-1517
Timestamp: 2025-05-09T01:54:39.465Z
Learning: The CraneSched project uses C++23, which supports Class Template Argument Deduction (CTAD) for standard containers like std::set and includes ranges support, making std::ranges::views::keys valid without additional headers.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
  • src/CraneCtld/Lua/LuaJobHandler.cpp
📚 Learning: 2025-03-31T09:29:40.388Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In the CraneSched project, the `m_allowed_certs_` in `VaultClientWrapper` is implemented as a `phmap::parallel_flat_hash_set<std::string>`, which is a thread-safe container designed for concurrent access, making it safe to use in multithreaded contexts without additional synchronization.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-03-31T09:29:40.388Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In the CraneSched project, the `m_allowed_certs_` is an `std::unordered_set<std::string>` which is not inherently thread-safe. The `IsCertAllowed` method in `VaultClientWrapper` involves both read and write operations on this container and could cause race conditions when called from multiple threads.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-12-08T08:11:40.332Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.332Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.

Applied to files:

  • src/CraneCtld/RpcService/CtldGrpcServer.cpp
🧬 Code graph analysis (3)
src/CraneCtld/CraneCtld.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
  • g_config (160-160)
src/Utilities/PublicHeader/Lua.cpp (2)
src/Utilities/PublicHeader/include/crane/Lua.h (1)
  • Init (62-71)
src/Utilities/PublicHeader/String.cpp (2)
  • TimeStr2Mins (544-593)
  • TimeStr2Mins (544-544)
src/CraneCtld/Lua/LuaJobHandler.cpp (1)
src/CraneCtld/CtldPublicDefs.h (1)
  • node_num (505-505)
🪛 LanguageTool
docs/en/reference/lua_guide.md

[style] ~35-~35: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...b submission rejected due to error. - crane.* (Crane error codes) — Reject submi...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (5)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)

264-264: LGTM! Trailing commas improve maintainability.

Adding trailing commas to array entries makes future additions cleaner and reduces diff noise.

Also applies to: 271-271

src/Utilities/PublicHeader/String.cpp (1)

544-593: Time string parsing logic looks consistent with SLURM-style formats.

The implementation correctly handles D-HH:MM:SS / HH:MM:SS / MM:SS / MM and the “infinite” sentinels, with simple validation and minute rounding via secs > 0 ? 1 : 0. No issues with the core parsing logic; see the header comment for the declaration/definition type alignment.

src/CraneCtld/CtldPublicDefs.h (1)

186-187: JobSubmitLuaScript config field is well-placed.

Adding JobSubmitLuaScript as a plain std::string toggle for Lua integration fits the existing config structure and defaults safely to “disabled” when empty.

src/CraneCtld/CMakeLists.txt (1)

22-23: CMake wiring for Lua integration is reasonable.

Including Lua/LuaJobHandler.* in cranectld and guarding Lua libraries and headers with LUA_FOUND matches the intended optional-Lua design; this should not affect non-Lua builds.

Also applies to: 69-72, 94-96

src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)

28-29: Interactive-task Lua pre-check integration looks correct.

The CforedStream TASK_REQUEST path now calls JobSubmitLuaCheck and only skips scheduling when Lua returns a non‑SUCCESS CraneRichError, while still submitting as before when Lua is disabled or returns SUCCESS. The result wiring into WriteTaskIdReply preserves previous behavior for non-Lua clusters.

Also applies to: 320-347

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants