[Feature] C++ Extension: Introduce DocNode, NodeTransform, TextSplitterBase, SentenseSplitter#1022
CompromisedKiwi wants to merge 39 commits into LazyAGI:main

Conversation
Summary of Changes

Hello @CompromisedKiwi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a substantial architectural shift by migrating several core RAG (Retrieval Augmented Generation) components, specifically DocNode, NodeTransform, and TextSplitterBase, from Python to C++. The primary motivation behind this migration is to significantly enhance the performance of compute-intensive operations within the RAG pipeline. The changes establish a robust hybrid Python-C++ framework, leveraging pybind11 for interoperability and a structured C++ project layout, ensuring that performance gains are achieved without sacrificing the flexibility of Python-based logic where appropriate.
Code Review
This PR migrates core capabilities such as DocNode / NodeTransform / TextSplitterBase / SentenceSplitter into a C++ extension to improve performance, which is a good direction. The code is cleanly structured into core, adaptor, and binding layers, and uses modern C++ libraries such as pybind11, xxHash, and sentencepiece. However, I found several serious issues that must be resolved before merging:

- There are multiple thread-safety (race condition) and memory-safety (dangling pointer) hazards that can cause crashes or undefined behavior.
- Some pybind11 binding code references unimplemented C++ methods, which will fail to compile.
- There are bugs in argument and return-value handling in the C++/Python interop logic.

I have left specific review comments in the code; please review them carefully. Once these issues are fixed, this will be a very valuable performance contribution.
```cpp
return func(
    pybind11::arg("group_name") = std::any_cast<std::string>(args.at("group_name")),
    pybind11::arg("kb_id") = std::any_cast<std::string>(args.at("kb_id")),
    pybind11::arg("doc_ids") = std::vector<std::string>({std::any_cast<std::string>(args.at("doc_id"))})
```
```cpp
static std::unordered_map<PyObject *, std::weak_ptr<DocumentStore>> &store_cache() {
    static std::unordered_map<PyObject *, std::weak_ptr<DocumentStore>> cache;
    return cache;
}
```
```cpp
[](lazyllm::NodeTransform& self, py::object name, bool copy) -> lazyllm::NodeTransform& {
    if (name.is_none()) return self;
    self.with_name(name.cast<std::string>(), copy);
    return self;
},
```
```cpp
.def("from_sentencepiece_model", &lazyllm::TextSplitterBase::from_sentencepiece_model,
    py::arg("model_path"), py::return_value_policy::reference)
.def("from_tokenizer",
    [](lazyllm::TextSplitterBase& self, py::object tokenizer) -> lazyllm::TextSplitterBase& {
        auto adaptor = std::make_shared<PyTokenizer>(tokenizer);
        self.set_tokenizer(adaptor);
        return self;
    },
```
```cpp
    end_split.token_size <= chunk_size - _overlap) {
    const bool is_sentence = start_split.is_sentence && end_split.is_sentence;
    const int token_size = start_split.token_size + end_split.token_size;
    end_split = ChunkUnit{start_split.text + end_split.text, is_sentence, token_size};
```
```cpp
else if (func_name == "get_node") {
    return func(
        pybind11::arg("group_name") = std::any_cast<std::string>(args.at("group_name")),
        pybind11::arg("uids") = std::vector<std::string>({std::any_cast<std::string>(args.at("uid"))}),
        pybind11::arg("kb_id") = std::any_cast<std::string>(args.at("kb_id")),
        pybind11::arg("display") = true
    ).cast<pybind11::list>()[0].cast<DocNode*>();
}
```
The list returned by the Python function may be empty, and indexing it with [0] directly will crash the program. Check for emptiness before accessing:

```cpp
auto list = func(
    pybind11::arg("group_name") = std::any_cast<std::string>(args.at("group_name")),
    pybind11::arg("uids") = std::vector<std::string>({std::any_cast<std::string>(args.at("uid"))}),
    pybind11::arg("kb_id") = std::any_cast<std::string>(args.at("kb_id")),
    pybind11::arg("display") = true
).cast<pybind11::list>();
if (list.empty()) {
    throw std::runtime_error("DocumentStore's get_node returned an empty list for uid: " + std::any_cast<std::string>(args.at("uid")));
}
return list[0].cast<DocNode*>();
```

```cpp
if (content) {
    if (const auto* s = std::get_if<std::string>(&*content))
        node.set_root_text(std::move(*s));
    else
        node.set_root_texts(std::get<std::vector<std::string>>(*content));
}
else if (text) {
    node.set_root_text(std::move(*text));
}
```
```cpp
.def("check_embedding_state", [](lazyllm::DocNode& node, const std::string& key) {
    while (true) {
        if (node._embedding_vecs.find(key) != node._embedding_vecs.end()) {
            node._pending_embedding_keys.erase(key);
            break;
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
})
```
```cpp
const DocNode* get_root_node() const {
    if (_p_parent_node == nullptr) return this;
    return _p_parent_node->get_root_node();
}
```
The recursive implementation of get_root_node risks a stack overflow when the node hierarchy is very deep. An iterative implementation is both safer and more efficient:

```cpp
const DocNode* get_root_node() const {
    const DocNode* node = this;
    while (node->_p_parent_node) {
        node = node->_p_parent_node;
    }
    return node;
}
```
/gemini review
Code Review
This pull request significantly refactors the C++ backend for LazyLLM, primarily focusing on enhancing document node management and text splitting functionalities, and integrating these with Python. Key changes include upgrading the C++ standard to C++17, introducing a new lazyllm_adaptor library for C++-Python callback invocations, and extensively overhauling the DocNode class with new metadata, embedding, and parent-child relationship features. The build system (CMakeLists.txt) was updated to manage new internal libraries (lazyllm_core, lazyllm_adaptor) and external dependencies (xxhash, tiktoken, utf8proc) via a new third_party.cmake module, along with improved RPATH and installation rules. Python bindings were expanded to expose the new DocNode, NodeTransform, TextSplitterBase, and SentenceSplitter classes, enabling C++ overrides for performance-critical components. Test infrastructure was also improved to ensure correct libstdc++ linking for test executables.

Review comments highlight several areas for improvement:

- the check_embedding_state function uses a busy-wait loop that could cause performance issues and unresponsiveness;
- the cpp_tiktoken dependency is pinned to the master branch, which should be fixed to a specific commit for reproducibility;
- the ThreadPool utility lacks its original license;
- the C++ NodeTransform.with_name method ignores the copy parameter, leading to inconsistent behavior with its Python counterpart;
- the TextSplitterBase.from_tiktoken_encoder binding ignores the allowed_special and disallowed_special parameters, causing inconsistency with the Python API.
```cpp
.def("check_embedding_state", [](lazyllm::DocNode& node, const std::string& key) {
    while (true) {
        if (node._embedding_vecs.find(key) != node._embedding_vecs.end()) {
            node._pending_embedding_keys.erase(key);
            break;
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
})
```
```cmake
FetchContent_Declare(
  cpp_tiktoken
  GIT_REPOSITORY https://github.com/gh-markt/cpp-tiktoken.git
  GIT_TAG master
)
```
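As the review notes, tracking the master branch makes builds non-reproducible: the fetched sources change whenever upstream moves. A sketch of the fix (the SHA below is a placeholder, not a real cpp-tiktoken commit):

```cmake
FetchContent_Declare(
  cpp_tiktoken
  GIT_REPOSITORY https://github.com/gh-markt/cpp-tiktoken.git
  # Pin to an exact commit (placeholder SHA shown) for reproducible builds.
  GIT_TAG 0123456789abcdef0123456789abcdef01234567
)
```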
```cpp
// https://github.com/progschj/ThreadPool

#ifndef THREAD_POOL_H
#define THREAD_POOL_H

#include <vector>
#include <queue>
#include <memory>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <future>
#include <functional>
#include <stdexcept>

class ThreadPool {
public:
    ThreadPool(size_t);
    template<class F, class... Args>
    auto enqueue(F&& f, Args&&... args)
        -> std::future<typename std::result_of<F(Args...)>::type>;
    ~ThreadPool();
private:
    // need to keep track of threads so we can join them
    std::vector< std::thread > workers;
    // the task queue
    std::queue< std::function<void()> > tasks;

    // synchronization
    std::mutex queue_mutex;
    std::condition_variable condition;
    bool stop;
};

// the constructor just launches some amount of workers
inline ThreadPool::ThreadPool(size_t threads)
    : stop(false)
{
    for(size_t i = 0;i<threads;++i)
        workers.emplace_back(
            [this]
            {
                for(;;)
                {
                    std::function<void()> task;

                    {
                        std::unique_lock<std::mutex> lock(this->queue_mutex);
                        this->condition.wait(lock,
                            [this]{ return this->stop || !this->tasks.empty(); });
                        if(this->stop && this->tasks.empty())
                            return;
                        task = std::move(this->tasks.front());
                        this->tasks.pop();
                    }

                    task();
                }
            }
        );
}

// add new work item to the pool
template<class F, class... Args>
auto ThreadPool::enqueue(F&& f, Args&&... args)
    -> std::future<typename std::result_of<F(Args...)>::type>
{
    using return_type = typename std::result_of<F(Args...)>::type;

    auto task = std::make_shared< std::packaged_task<return_type()> >(
        std::bind(std::forward<F>(f), std::forward<Args>(args)...)
    );

    std::future<return_type> res = task->get_future();
    {
        std::unique_lock<std::mutex> lock(queue_mutex);

        // don't allow enqueueing after stopping the pool
        if(stop)
            throw std::runtime_error("enqueue on stopped ThreadPool");

        tasks.emplace([task](){ (*task)(); });
    }
    condition.notify_one();
    return res;
}

// the destructor joins all threads
inline ThreadPool::~ThreadPool()
{
    {
        std::unique_lock<std::mutex> lock(queue_mutex);
        stop = true;
    }
    condition.notify_all();
    for(std::thread &worker: workers)
        worker.join();
}

#endif
```
```cpp
"with_name",
[](lazyllm::NodeTransform& self, py::object name, bool copy) -> lazyllm::NodeTransform& {
    (void)copy;
    if (name.is_none()) return self;
    self._name = name.cast<std::string>();
    return self;
},
```
```cpp
.def("from_tiktoken_encoder",
    [](lazyllm::TextSplitterBase& self,
       const std::string& encoding_name,
       py::object model_name,
       py::object /*allowed_special*/,
       py::object /*disallowed_special*/,
       py::kwargs /*kwargs*/) -> lazyllm::TextSplitterBase& {
        if (model_name.is_none()) {
            return self.from_tiktoken_encoder(encoding_name, std::nullopt);
        }
        return self.from_tiktoken_encoder(encoding_name, model_name.cast<std::string>());
    },
    py::arg("encoding_name") = "gpt2",
    py::arg("model_name") = py::none(),
    py::arg("allowed_special") = py::none(),
    py::arg("disallowed_special") = "all",
    py::return_value_policy::reference
)
```
📌 PR Description
lazyllm_cpp extension update, covering four RAG-related classes: DocNode, NodeTransform, _TextSplitterBase, SentenceSplitter.

Architecture (core / adaptor / binding):

- core layer (csrc/core/include, csrc/core/src): DocNode, NodeTransform, TextSplitterBase, SentenceSplitter, the Tokenizer interface, and the split and merge strategies (string_view, concurrency).
- adaptor layer (csrc/adaptor): std::any argument encoding/decoding, GIL acquisition, and a unified call entry point; AdaptorBaseWrapper and DocumentStore (caches Python objects, callback wrapper).
- binding layer (csrc/binding): export_doc_node.cpp, export_node_transform.cpp, export_text_splitter_base.cpp, export_sentence_splitter.cpp.

Third-party dependencies (declared in CMake)
Dependency declarations live in csrc/cmake/third_party.cmake:

- pybind11: implements the C++/Python binding layer.
- Python3 (Interpreter + Development).
- xxHash: high-performance hashing (e.g. content-hash paths).
- cpp_tiktoken: tokenizer encode/decode (the TiktokenTokenizer backend); pulls in transitive dependencies such as pcre2.
- utf8proc: Unicode text processing.
- ThreadPool (header-only, vendored locally at csrc/core/include/thread_pool.hpp from progschj/ThreadPool): used for parallel execution in NodeTransform::batch_forward.

Design philosophy: separating core and binding
This PR consistently follows the principle that the core layer owns the algorithms while the binding layer owns the Python semantics:

- The binding layer absorbs kwargs handling, naming compatibility, and polymorphic input types; arguments cross the boundary as std::any.

Where string_view is and is not used:

- Accelerated by string_view: the TextSplitterBase::split_text input view (csrc/core/src/text_splitter_base.cpp:31); split_recursive, split_by_functions, and split_text_while_keeping_separator operate on vector<string_view>, cutting intermediate copies.
- Not converted to string_view: the merge_chunks stage still outputs vector<string> (ownership is required for safe decoding and concatenation downstream); SentenceSplitter maintains Chunk.text during merging, because overlap backfill, concatenation, and trimming all need a stable, owned string.

TODO

- Tokenizer::encode(string_view) still performs a std::string(view) copy inside TiktokenTokenizer (csrc/core/include/tokenizer.hpp:33).
- With the split path now zero-copy via string_view, the merge path can be pushed further toward zero-copy to reduce materialization.
- DocNode lifetimes are currently managed by NodeTransform; they will later be managed by each node's parent (except the root).

Progress
🔍 Related Issue

✅ Type of Change

🧪 How Has This Been Tested?

⚡ Usage After Update
# Example

🔄 Refactor Before & After (only when Type is Refactor)
Before:

After: