[Feature] C++ Extension: Introduce DocNode, NodeTransform, TextSplitterBase, SentenseSplitter#1022
CompromisedKiwi wants to merge 39 commits into LazyAGI:main

Conversation
Summary of Changes

Hello @CompromisedKiwi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a substantial architectural shift by migrating several core RAG (Retrieval Augmented Generation) components, specifically DocNode, NodeTransform, and TextSplitterBase, from Python to C++. The primary motivation behind this migration is to significantly enhance the performance of compute-intensive operations within the RAG pipeline. The changes establish a robust hybrid Python-C++ framework, leveraging pybind11 for interoperability and a structured C++ project layout, ensuring that performance gains are achieved without sacrificing the flexibility of Python-based logic where appropriate.
Code Review
This PR migrates core capabilities such as DocNode / NodeTransform / TextSplitterBase / SentenceSplitter into a C++ extension to improve performance, which is a good direction. The code is cleanly structured into core, adaptor, and binding layers, and uses modern C++ libraries such as pybind11, xxHash, and sentencepiece. However, I found several serious issues that must be resolved before merging:

- There are multiple thread-safety (race condition) and memory-safety (dangling pointer) hazards that can cause crashes or undefined behavior.
- Some pybind11 binding code references unimplemented C++ methods, which will fail to compile.
- There are bugs in argument and return-value handling in the C++/Python interop logic.

I have left specific review comments in the code; please review them carefully. Once these issues are fixed, this will be a very valuable performance contribution.
```cpp
return func(
    pybind11::arg("group_name") = std::any_cast<std::string>(args.at("group_name")),
    pybind11::arg("kb_id") = std::any_cast<std::string>(args.at("kb_id")),
    pybind11::arg("doc_ids") = std::vector<std::string>({std::any_cast<std::string>(args.at("doc_id"))})
```
```cpp
static std::unordered_map<PyObject *, std::weak_ptr<DocumentStore>> &store_cache() {
    static std::unordered_map<PyObject *, std::weak_ptr<DocumentStore>> cache;
    return cache;
}
```
```cpp
[](lazyllm::NodeTransform& self, py::object name, bool copy) -> lazyllm::NodeTransform& {
    if (name.is_none()) return self;
    self.with_name(name.cast<std::string>(), copy);
    return self;
},
```
```cpp
.def("from_sentencepiece_model", &lazyllm::TextSplitterBase::from_sentencepiece_model,
    py::arg("model_path"), py::return_value_policy::reference)
.def("from_tokenizer",
    [](lazyllm::TextSplitterBase& self, py::object tokenizer) -> lazyllm::TextSplitterBase& {
        auto adaptor = std::make_shared<PyTokenizer>(tokenizer);
        self.set_tokenizer(adaptor);
        return self;
    },
```
```cpp
    end_split.token_size <= chunk_size - _overlap) {
    const bool is_sentence = start_split.is_sentence && end_split.is_sentence;
    const int token_size = start_split.token_size + end_split.token_size;
    end_split = ChunkUnit{start_split.text + end_split.text, is_sentence, token_size};
```
```cpp
else if (func_name == "get_node") {
    return func(
        pybind11::arg("group_name") = std::any_cast<std::string>(args.at("group_name")),
        pybind11::arg("uids") = std::vector<std::string>({std::any_cast<std::string>(args.at("uid"))}),
        pybind11::arg("kb_id") = std::any_cast<std::string>(args.at("kb_id")),
        pybind11::arg("display") = true
    ).cast<pybind11::list>()[0].cast<DocNode*>();
}
```
The list returned by the Python function may be empty, and indexing it with [0] directly will crash the program. Check for emptiness before accessing:

```cpp
auto list = func(
    pybind11::arg("group_name") = std::any_cast<std::string>(args.at("group_name")),
    pybind11::arg("uids") = std::vector<std::string>({std::any_cast<std::string>(args.at("uid"))}),
    pybind11::arg("kb_id") = std::any_cast<std::string>(args.at("kb_id")),
    pybind11::arg("display") = true
).cast<pybind11::list>();
if (list.empty()) {
    throw std::runtime_error("DocumentStore's get_node returned an empty list for uid: " + std::any_cast<std::string>(args.at("uid")));
}
return list[0].cast<DocNode*>();
```

```cpp
if (content) {
    if (const auto* s = std::get_if<std::string>(&*content))
        node.set_root_text(std::move(*s));
    else
        node.set_root_texts(std::get<std::vector<std::string>>(*content));
}
else if (text) {
    node.set_root_text(std::move(*text));
}
```
```cpp
.def("check_embedding_state", [](lazyllm::DocNode& node, const std::string& key) {
    while (true) {
        if (node._embedding_vecs.find(key) != node._embedding_vecs.end()) {
            node._pending_embedding_keys.erase(key);
            break;
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
})
```
```cpp
const DocNode* get_root_node() const {
    if (_p_parent_node == nullptr) return this;
    return _p_parent_node->get_root_node();
}
```
The recursive implementation of get_root_node risks a stack overflow when the node hierarchy is very deep. An iterative implementation is both safer and more efficient:

```cpp
const DocNode* get_root_node() const {
    const DocNode* node = this;
    while (node->_p_parent_node) {
        node = node->_p_parent_node;
    }
    return node;
}
```
/gemini review
Code Review
This pull request significantly refactors the C++ backend for LazyLLM, primarily focusing on enhancing document node management and text splitting functionalities, and integrating these with Python. Key changes include upgrading the C++ standard to C++17, introducing a new lazyllm_adaptor library for C++-Python callback invocations, and extensively overhauling the DocNode class with new metadata, embedding, and parent-child relationship features. The build system (CMakeLists.txt) was updated to manage new internal libraries (lazyllm_core, lazyllm_adaptor) and external dependencies (xxhash, tiktoken, utf8proc) via a new third_party.cmake module, along with improved RPATH and installation rules. Python bindings were expanded to expose the new DocNode, NodeTransform, TextSplitterBase, and SentenceSplitter classes, enabling C++ overrides for performance-critical components. Test infrastructure was also improved to ensure correct libstdc++ linking for test executables.

Review comments highlight several areas for improvement:

- the check_embedding_state function uses a busy-wait loop that could cause performance issues and unresponsiveness;
- the cpp_tiktoken dependency is pinned to the master branch, which should be fixed to a specific commit for reproducibility;
- the ThreadPool utility lacks its original license;
- the C++ NodeTransform.with_name method ignores the copy parameter, leading to inconsistent behavior with its Python counterpart;
- the TextSplitterBase.from_tiktoken_encoder binding ignores the allowed_special and disallowed_special parameters, causing inconsistency with the Python API.
```cpp
.def("check_embedding_state", [](lazyllm::DocNode& node, const std::string& key) {
    while (true) {
        if (node._embedding_vecs.find(key) != node._embedding_vecs.end()) {
            node._pending_embedding_keys.erase(key);
            break;
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
})
```
```cmake
FetchContent_Declare(
  cpp_tiktoken
  GIT_REPOSITORY https://github.com/gh-markt/cpp-tiktoken.git
  GIT_TAG master
)
```
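As the review notes, tracking the master branch makes builds non-reproducible: the fetched sources change whenever upstream moves. A sketch of the fix (the SHA below is a placeholder, not a real cpp-tiktoken commit):

```cmake
FetchContent_Declare(
  cpp_tiktoken
  GIT_REPOSITORY https://github.com/gh-markt/cpp-tiktoken.git
  # Pin to an exact commit (placeholder SHA shown) for reproducible builds.
  GIT_TAG 0123456789abcdef0123456789abcdef01234567
)
```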
```cpp
// https://github.com/progschj/ThreadPool

#ifndef THREAD_POOL_H
#define THREAD_POOL_H

#include <vector>
#include <queue>
#include <memory>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <future>
#include <functional>
#include <stdexcept>

class ThreadPool {
public:
    ThreadPool(size_t);
    template<class F, class... Args>
    auto enqueue(F&& f, Args&&... args)
        -> std::future<typename std::result_of<F(Args...)>::type>;
    ~ThreadPool();
private:
    // need to keep track of threads so we can join them
    std::vector< std::thread > workers;
    // the task queue
    std::queue< std::function<void()> > tasks;

    // synchronization
    std::mutex queue_mutex;
    std::condition_variable condition;
    bool stop;
};

// the constructor just launches some amount of workers
inline ThreadPool::ThreadPool(size_t threads)
    : stop(false)
{
    for(size_t i = 0;i<threads;++i)
        workers.emplace_back(
            [this]
            {
                for(;;)
                {
                    std::function<void()> task;

                    {
                        std::unique_lock<std::mutex> lock(this->queue_mutex);
                        this->condition.wait(lock,
                            [this]{ return this->stop || !this->tasks.empty(); });
                        if(this->stop && this->tasks.empty())
                            return;
                        task = std::move(this->tasks.front());
                        this->tasks.pop();
                    }

                    task();
                }
            }
        );
}

// add new work item to the pool
template<class F, class... Args>
auto ThreadPool::enqueue(F&& f, Args&&... args)
    -> std::future<typename std::result_of<F(Args...)>::type>
{
    using return_type = typename std::result_of<F(Args...)>::type;

    auto task = std::make_shared< std::packaged_task<return_type()> >(
        std::bind(std::forward<F>(f), std::forward<Args>(args)...)
    );

    std::future<return_type> res = task->get_future();
    {
        std::unique_lock<std::mutex> lock(queue_mutex);

        // don't allow enqueueing after stopping the pool
        if(stop)
            throw std::runtime_error("enqueue on stopped ThreadPool");

        tasks.emplace([task](){ (*task)(); });
    }
    condition.notify_one();
    return res;
}

// the destructor joins all threads
inline ThreadPool::~ThreadPool()
{
    {
        std::unique_lock<std::mutex> lock(queue_mutex);
        stop = true;
    }
    condition.notify_all();
    for(std::thread &worker: workers)
        worker.join();
}

#endif
```
```cpp
"with_name",
[](lazyllm::NodeTransform& self, py::object name, bool copy) -> lazyllm::NodeTransform& {
    (void)copy;
    if (name.is_none()) return self;
    self._name = name.cast<std::string>();
    return self;
},
```
```cpp
.def("from_tiktoken_encoder",
    [](lazyllm::TextSplitterBase& self,
       const std::string& encoding_name,
       py::object model_name,
       py::object /*allowed_special*/,
       py::object /*disallowed_special*/,
       py::kwargs /*kwargs*/) -> lazyllm::TextSplitterBase& {
        if (model_name.is_none()) {
            return self.from_tiktoken_encoder(encoding_name, std::nullopt);
        }
        return self.from_tiktoken_encoder(encoding_name, model_name.cast<std::string>());
    },
    py::arg("encoding_name") = "gpt2",
    py::arg("model_name") = py::none(),
    py::arg("allowed_special") = py::none(),
    py::arg("disallowed_special") = "all",
    py::return_value_policy::reference
)
```
📌 PR Description
lazyllm_cpp extension update, covering four RAG-related classes: DocNode, NodeTransform, _TextSplitterBase, SentenceSplitter.

Architecture (core / adaptor / binding):

- core layer (csrc/core/include, csrc/core/src): DocNode, NodeTransform, TextSplitterBase, SentenceSplitter, the Tokenizer interface, and the split and merge strategies (string_view, concurrency).
- adaptor layer (csrc/adaptor): std::any argument encoding/decoding, GIL acquisition, and a unified call entry point; AdaptorBaseWrapper and DocumentStore (caches Python objects, callback wrapper).
- binding layer (csrc/binding): export_doc_node.cpp, export_node_transform.cpp, export_text_splitter_base.cpp, export_sentence_splitter.cpp.

Third-party dependencies (declared in CMake)
Dependency declarations live in csrc/cmake/third_party.cmake:

- pybind11: implements the C++/Python binding layer.
- Python3 (Interpreter + Development).
- xxHash: high-performance hashing (e.g. content-hash paths).
- cpp_tiktoken: tokenizer encode/decode (the TiktokenTokenizer backend); pulls in transitive dependencies such as pcre2.
- utf8proc: Unicode text processing.
- ThreadPool (header-only, vendored locally at csrc/core/include/thread_pool.hpp from progschj/ThreadPool): used for parallel execution in NodeTransform::batch_forward.

Design philosophy: separating core and binding
This PR consistently follows the principle that the core layer owns the algorithms while the binding layer owns the Python semantics:

- The binding layer absorbs kwargs handling, naming compatibility, and polymorphic input types; arguments cross the boundary as std::any.

Where string_view is and is not used:

- Accelerated by string_view: the TextSplitterBase::split_text input view (csrc/core/src/text_splitter_base.cpp:31); split_recursive, split_by_functions, and split_text_while_keeping_separator operate on vector<string_view>, cutting intermediate copies.
- Not converted to string_view: the merge_chunks stage still outputs vector<string> (ownership is required for safe decoding and concatenation downstream); SentenceSplitter maintains Chunk.text during merging, because overlap backfill, concatenation, and trimming all need a stable, owned string.

TODO

- Tokenizer::encode(string_view) still performs a std::string(view) copy inside TiktokenTokenizer (csrc/core/include/tokenizer.hpp:33).
- With the split path now zero-copy via string_view, the merge path can be pushed further toward zero-copy to reduce materialization.
- DocNode lifetimes are currently managed by NodeTransform; they will later be managed by each node's parent (except the root).

Progress
🔍 Related Issue

✅ Type of Change

🧪 How Has This Been Tested?

⚡ Usage After Update
# Example

🔄 Refactor Before & After (only when Type is Refactor)
Before:

After: