
[Feature] C++ Extension: Introduce DocNode, NodeTransform, TextSplitterBase, SentenseSplitter#1022

Open
CompromisedKiwi wants to merge 39 commits into LazyAGI:main from CompromisedKiwi:yzh/migrate_doc_node

Conversation

@CompromisedKiwi
Collaborator

@CompromisedKiwi CompromisedKiwi commented Feb 7, 2026

📌 PR Description

Update to the lazyllm_cpp extension, covering four RAG-related classes: DocNode, NodeTransform, _TextSplitterBase, SentenceSplitter.

Architecture (core / adaptor / binding)

  1. core layer (csrc/core/include, csrc/core/src)
  • Responsibility: pure C++ data structures and algorithms, independent of Python object semantics.
  • Contents: DocNode, NodeTransform, TextSplitterBase, SentenceSplitter, the Tokenizer interface, and the split/merge strategies.
  • Goal: keep the code open to performance work (string_view, concurrency).
  2. adaptor layer (csrc/adaptor)
  • Responsibility: bridging callbacks to Python objects (std::any argument encoding/decoding, GIL acquisition, a single call entry point).
  • Contents: AdaptorBaseWrapper and DocumentStore (caches Python objects and wraps their callbacks).
  • Goal: pull the cross-language call machinery out of the algorithms and exports, so core/binding never duplicate bridging logic.
  3. binding layer (csrc/binding)
  • Responsibility: pybind11 exports, trampolines, and Python-semantics compatibility (naming, arguments, return types, kwargs tolerance).
  • Contents: export_doc_node.cpp, export_node_transform.cpp, export_text_splitter_base.cpp, export_sentence_splitter.cpp.
  • Goal: concentrate all Python-coupled behavior in binding, keeping core stable, predictable, and optimizable.
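To illustrate the adaptor idea in isolation (hypothetical names; the real AdaptorBaseWrapper also acquires the GIL before calling into Python), a sketch of decoding std::any arguments at a single dispatch point:

```cpp
#include <any>
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Hypothetical sketch: type-erased std::any arguments are decoded once,
// at a single entry point, so neither core nor binding repeats the plumbing.
using AnyArgs = std::map<std::string, std::any>;

struct CallbackAdaptor {
    // Stand-in for the cached Python callable.
    std::function<std::string(const std::string&, const std::string&)> callback;

    std::string dispatch(const AnyArgs& args) const {
        // Decode the type-erased payload, then forward typed values.
        const auto& group = std::any_cast<const std::string&>(args.at("group_name"));
        const auto& uid = std::any_cast<const std::string&>(args.at("uid"));
        return callback(group, uid);
    }
};
```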

Third-party dependencies (declared via CMake)

Dependency declarations live in csrc/cmake/third_party.cmake:

  1. pybind11: the C++/Python binding layer.
  2. Python3 (Interpreter + Development).
  3. xxHash: fast hashing (e.g. content-hash code paths).
  4. cpp_tiktoken: tokenizer encode/decode (the TiktokenTokenizer backend).
    • Note: it pulls in transitive dependencies such as pcre2.
  5. utf8proc: Unicode text processing support.
  6. ThreadPool (header-only, vendored locally).
    • Location: csrc/core/include/thread_pool.hpp
    • Origin: progschj/ThreadPool (header-only)
    • Use: parallel execution in NodeTransform::batch_forward.
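ThreadPool::enqueue returns a std::future, so batch_forward can fan tasks out and join on the futures. A runnable sketch of that pattern, with std::async standing in for the vendored pool (batch_transform is an illustrative name, not the real API):

```cpp
#include <cassert>
#include <future>
#include <string>
#include <vector>

// Sketch of the fan-out/join pattern a batch_forward can use. std::async
// stands in for ThreadPool::enqueue, which hands back the same kind of
// std::future-based handle.
std::vector<std::string> batch_transform(
    const std::vector<std::string>& docs,
    std::string (*transform)(const std::string&)) {
    std::vector<std::future<std::string>> futures;
    futures.reserve(docs.size());
    for (const auto& doc : docs)
        futures.push_back(std::async(std::launch::async, transform, std::cref(doc)));
    std::vector<std::string> out;
    out.reserve(futures.size());
    for (auto& f : futures) out.push_back(f.get());  // join in submission order
    return out;
}
```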

Design philosophy: separating core from binding

This PR consistently follows the principle that core owns the algorithms and binding owns the Python semantics:

  1. core knows nothing about Python's dynamic behavior, kwargs, naming compatibility, or polymorphic input types.
  2. binding reproduces the Python experience via trampolines plus lambda adapters.
  3. adaptor handles the cross-language callback machinery, so core/binding never have to deal with the GIL or std::any repeatedly.

Where string_view is (and is not) used

  1. Hot paths already using string_view
  • The TextSplitterBase::split_text input view: csrc/core/src/text_splitter_base.cpp:31
  • Recursive and rule-based splitting: split_recursive, split_by_functions, split_text_while_keeping_separator
  • The static separator-splitting helpers return vector<string_view>, avoiding intermediate copies.
  2. Paths not yet fully converted to string_view
  • The merge_chunks stage still outputs vector<string> (ownership is required for safe decode/concatenation downstream).
  • SentenceSplitter keeps an owned Chunk.text while merging (overlap backfill, concatenation, and trimming all need a stable, owned string).
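As an illustration of the string_view approach (a sketch, not the actual splitter), a separator split that returns views into the caller's buffer; this is also why the merge stage must keep owned strings, since views are only valid while the source text lives:

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// Illustrative sketch: each piece is a zero-copy view into the caller's
// buffer, with the separator kept at the end of its chunk.
std::vector<std::string_view> split_keep_separator(std::string_view text, char sep) {
    std::vector<std::string_view> out;
    size_t start = 0;
    while (start < text.size()) {
        size_t pos = text.find(sep, start);
        if (pos == std::string_view::npos) {
            out.push_back(text.substr(start));
            break;
        }
        out.push_back(text.substr(start, pos - start + 1));  // include separator
        start = pos + 1;
    }
    return out;
}
```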

TODO

  • Tokenizer::encode(string_view): TiktokenTokenizer still copies internally via std::string(view) (csrc/core/include/tokenizer.hpp:33).
  • If a tokenizer backend ever supports true zero-copy string_view input, the merge path can be pushed further toward zero-copy, reducing materialization.
  • DocNode lifetimes are currently managed by NodeTransform; this will later move to each node's parent (except for the root).

Progress

  • DocNode
  • NodeTransform
  • TextSplitterBase
  • SentenceSplitter

🔍 Related Issue

✅ Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Refactor (no functionality change, code structure optimized)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update (changes to docs only)
  • Performance optimization

🧪 How Has This Been Tested?

⚡ Usage After Update

# Example

🔄 Refactor Before & After (only for Refactor)

Before:

After:

⚠️ Additional Notes

@CompromisedKiwi CompromisedKiwi requested review from a team as code owners February 7, 2026 06:15
@gemini-code-assist

Summary of Changes

Hello @CompromisedKiwi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a substantial architectural shift by migrating several core RAG (Retrieval Augmented Generation) components, specifically DocNode, NodeTransform, and TextSplitterBase, from Python to C++. The primary motivation behind this migration is to significantly enhance the performance of compute-intensive operations within the RAG pipeline. The changes establish a robust hybrid Python-C++ framework, leveraging pybind11 for interoperability and a structured C++ project layout, ensuring that performance gains are achieved without sacrificing the flexibility of Python-based logic where appropriate.

Highlights

  • C++ Migration of Core RAG Components: Key components like DocNode, NodeTransform, and TextSplitterBase have been reimplemented in C++ to boost performance for compute-intensive tasks.
  • Hybrid Python-C++ Architecture: A new 'adaptor' layer (lazyllm_adaptor) facilitates seamless interaction, allowing C++ code to call back into Python for specific functionalities, maintaining flexibility.
  • Enhanced Build System: The CMake configuration has been significantly updated to support the new C++ codebase, manage external dependencies (pybind11, xxHash, sentencepiece) efficiently, and enforce modern C++ standards (C++17).
Changelog
  • .gitignore
    • Added '.cache/' to ignored files.
  • csrc/CMakeLists.txt
    • Updated C++ standard to C++17.
    • Added '-Werror' and '-Wshadow' compile options.
    • Refactored third-party dependency inclusion to 'cmake/third_party.cmake'.
    • Adjusted source file globbing for 'lazyllm_core' to 'core/src/*.cpp'.
    • Linked 'lazyllm_core' with 'xxhash' and 'sentencepiece'.
    • Introduced 'lazyllm_adaptor' static library for Python callback mechanisms.
    • Updated 'lazyllm_cpp' binding sources and linked it with 'lazyllm_adaptor'.
  • csrc/README.md
    • Renamed from 'csrc/include/README.md'.
  • csrc/adaptor/adaptor.cpp
    • New file, includes 'adaptor_base_wrapper.hpp' and 'document_store.hpp'.
  • csrc/adaptor/adaptor_base_wrapper.hpp
    • New file, defines 'AdaptorBaseWrapper' for Python object callbacks.
  • csrc/adaptor/document_store.hpp
    • New file, defines 'NodeGroup' and 'DocumentStore' for C++ interaction with Python document stores.
  • csrc/binding/export_add_doc_str.cpp
    • Renamed from 'csrc/binding/doc.cpp'.
    • Function 'exportDoc' renamed to 'exportAddDocStr'.
  • csrc/binding/export_doc_node.cpp
    • New file, contains 'pybind11' bindings for 'lazyllm::DocNode'.
  • csrc/binding/export_node_transform.cpp
    • New file, contains 'pybind11' bindings for 'lazyllm::NodeTransform'.
  • csrc/binding/export_text_splitter_base.cpp
    • New file, contains 'pybind11' bindings for 'lazyllm::TextSplitterBase' and 'lazyllm::_TokenTextSplitter'.
  • csrc/binding/lazyllm.cpp
    • Updated to include new binding export functions and remove old 'DocNode' bindings.
  • csrc/binding/lazyllm.hpp
    • Updated to declare new binding export functions.
  • csrc/cmake/tests.cmake
    • Updated 'googletest' URL to a specific commit hash.
  • csrc/cmake/third_party.cmake
    • New file, centralizes 'FetchContent' for 'pybind11', 'xxHash', and 'sentencepiece'.
  • csrc/core/include/adaptor_base.hpp
    • New file, defines 'AdaptorBase' for C++-Python callback interface.
  • csrc/core/include/doc_node.hpp
    • New file, provides the C++ implementation of 'DocNode'.
  • csrc/core/include/node_transform.hpp
    • New file, provides the C++ implementation of 'NodeTransform' with thread pool support.
  • csrc/core/include/text_splitter_base.hpp
    • New file, provides the C++ implementation of 'TextSplitterBase' for text splitting.
  • csrc/core/include/thread_pool.hpp
    • New file, implements a generic C++ thread pool.
  • csrc/core/include/tokenizer.hpp
    • New file, defines 'Tokenizer' interface and 'SentencePieceTokenizer'.
  • csrc/core/include/utils.hpp
    • New file, contains general utility functions and RAG metadata keys.
  • csrc/core/src/doc_node.cpp
    • New file, includes 'doc_node.hpp'.
  • csrc/core/src/text_splitter_base.cpp
    • New file, includes 'text_splitter_base.hpp'.
  • csrc/core/src/utils.cpp
    • New file, includes 'utils.hpp'.
  • csrc/include/doc_node.h
    • Removed old 'DocNode' header.
  • csrc/scripts/config_cmake.sh
    • New script for CMake configuration.
  • csrc/src/doc_node.cpp
    • Removed old 'DocNode' implementation file.
  • lazyllm/tools/rag/doc_node.py
    • Changed '_embedding_state' to 'embedding_state' (public attribute).
  • lazyllm/tools/rag/utils.py
    • Updated 'parallel_do_embedding' to directly access 'node.embedding_state'.
Ignored Files
  • Ignored by pattern: .github/workflows/** (2)
    • .github/workflows/main.yml
    • .github/workflows/publish_release.yml
Activity
  • The author 'CompromisedKiwi' initiated this feature branch to migrate Python RAG components to C++.
  • The pull request introduces a new C++ project structure under 'csrc/' with core logic, adaptor layer, and pybind11 bindings.
  • Dependencies like 'pybind11', 'xxHash', and 'sentencepiece' are now managed via CMake's 'FetchContent'.
  • The C++ standard has been upgraded to C++17.
  • The 'DocNode' and 'NodeTransform' components are fully migrated and bound, with 'TextSplitterBase' partially implemented.
  • Minor Python code adjustments were made to align with the new C++ 'DocNode' structure.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR migrates core capabilities such as DocNode / NodeTransform / TextSplitterBase / SentenceSplitter to a C++ extension for performance, which is a good direction. The code is cleanly structured into core, adaptor, and binding layers, and uses modern C++ libraries such as pybind11, xxHash, and sentencepiece. However, I found several serious issues that must be resolved before merging:

  • Multiple thread-safety (race condition) and memory-safety (dangling pointer) hazards that can cause crashes or undefined behavior.
  • Some pybind11 bindings reference unimplemented C++ methods, which will fail to compile.
  • Several bugs in argument and return-value handling in the C++/Python interop logic.

I have left specific review comments in the code; please go through them carefully. Once these are fixed, this will be a very valuable performance contribution.

return func(
    pybind11::arg("group_name") = std::any_cast<std::string>(args.at("group_name")),
    pybind11::arg("kb_id") = std::any_cast<std::string>(args.at("kb_id")),
    pybind11::arg("doc_ids") = std::vector<std::string>({std::any_cast<std::string>(args.at("doc_id"))})


critical

There are two problems here:

  1. The doc_ids value is fetched from args with the wrong key "doc_id"; it should be "doc_ids".
  2. The value's type should be std::vector<std::string>, not std::string.

As written, this will throw std::out_of_range and fail the type cast.

                pybind11::arg("doc_ids") = std::any_cast<const std::vector<std::string>&>(args.at("doc_ids"))

Comment on lines +113 to +116
static std::unordered_map<PyObject *, std::weak_ptr<DocumentStore>> &store_cache() {
    static std::unordered_map<PyObject *, std::weak_ptr<DocumentStore>> cache;
    return cache;
}


critical

store_cache() returns a static unordered_map, but access to it (lookup and insertion) is unsynchronized in a multithreaded environment, which is a race condition. from_store may be called from multiple threads concurrently, so access to the cache must be protected with a mutex.
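A minimal sketch of the requested locking, with the key type simplified to std::string (the real cache keys on PyObject* and stores DocumentStore objects):

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

struct DocumentStore {};  // stand-in for the real store wrapper

// Guard the static cache with a mutex so concurrent lookups and
// insertions are serialized.
std::shared_ptr<DocumentStore> from_store(const std::string& key) {
    static std::unordered_map<std::string, std::weak_ptr<DocumentStore>> cache;
    static std::mutex cache_mutex;

    std::lock_guard<std::mutex> lock(cache_mutex);
    if (auto it = cache.find(key); it != cache.end())
        if (auto sp = it->second.lock()) return sp;  // entry still alive: reuse it
    auto sp = std::make_shared<DocumentStore>();
    cache[key] = sp;
    return sp;
}
```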

Comment on lines +40 to +44
[](lazyllm::NodeTransform& self, py::object name, bool copy) -> lazyllm::NodeTransform& {
    if (name.is_none()) return self;
    self.with_name(name.cast<std::string>(), copy);
    return self;
},


critical

The with_name binding here calls self.with_name(...), but the C++ class lazyllm::NodeTransform (defined in node_transform.hpp) has no with_name method. This will fail to compile.

Comment on lines +64 to +71
.def("from_sentencepiece_model", &lazyllm::TextSplitterBase::from_sentencepiece_model,
     py::arg("model_path"), py::return_value_policy::reference)
.def("from_tokenizer",
     [](lazyllm::TextSplitterBase& self, py::object tokenizer) -> lazyllm::TextSplitterBase& {
         auto adaptor = std::make_shared<PyTokenizer>(tokenizer);
         self.set_tokenizer(adaptor);
         return self;
     },


critical

The from_sentencepiece_model and from_tokenizer bindings here depend on the from_sentencepiece_model and set_tokenizer methods of TextSplitterBase. Neither exists in the TextSplitterBase class definition in text_splitter_base.hpp, so this will fail to compile. The PR description notes that TextSplitterBase is still in progress, but code that has been committed should at least be complete enough to build.

end_split.token_size <= chunk_size - _overlap) {
const bool is_sentence = start_split.is_sentence && end_split.is_sentence;
const int token_size = start_split.token_size + end_split.token_size;
end_split = ChunkUnit{start_split.text + end_split.text, is_sentence, token_size};


critical

_merge 函数中,start_split.text + end_split.text 会创建一个临时的 std::string 对象。然而,ChunkUnitview 成员是 std::string_view 类型,它会指向这个临时字符串的内存。当该语句结束后,临时字符串被销毁,end_split.view 就成了一个悬垂指针(dangling pointer),后续对它的访问将导致未定义行为。在第 167 行也存在同样的问题。这是一个严重的内存安全问题。

Comment on lines +90 to +97
else if (func_name == "get_node") {
    return func(
        pybind11::arg("group_name") = std::any_cast<std::string>(args.at("group_name")),
        pybind11::arg("uids") = std::vector<std::string>({std::any_cast<std::string>(args.at("uid"))}),
        pybind11::arg("kb_id") = std::any_cast<std::string>(args.at("kb_id")),
        pybind11::arg("display") = true
    ).cast<pybind11::list>()[0].cast<DocNode*>();
}


high

The list returned by the Python function may be empty; indexing it directly with [0] will crash. Check that the list is non-empty before indexing.

            auto list = func(
                pybind11::arg("group_name") = std::any_cast<std::string>(args.at("group_name")),
                pybind11::arg("uids") = std::vector<std::string>({std::any_cast<std::string>(args.at("uid"))}),
                pybind11::arg("kb_id") = std::any_cast<std::string>(args.at("kb_id")),
                pybind11::arg("display") = true
            ).cast<pybind11::list>();
            if (list.empty()) {
                throw std::runtime_error("DocumentStore's get_node returned an empty list for uid: " + std::any_cast<std::string>(args.at("uid")));
            }
            return list[0].cast<DocNode*>();

Comment on lines +75 to +83
if (content) {
    if (const auto* s = std::get_if<std::string>(&*content))
        node.set_root_text(std::move(*s));
    else
        node.set_root_texts(std::get<std::vector<std::string>>(*content));
}
else if (text) {
    node.set_root_text(std::move(*text));
}


medium

DocNode::set_root_text takes a const std::string&&, which is unusual. Applying std::move to a const object actually copies rather than moves, which defeats the intent of std::move and is easy to misread.

Consider providing two overloads of set_root_text in doc_node.hpp, const std::string& and std::string&&, to handle the copy and move cases explicitly and to improve readability and efficiency.
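A sketch of the suggested overload pair (class trimmed to the relevant members):

```cpp
#include <cassert>
#include <string>
#include <utility>

// Sketch of the overload pair: a const& overload that copies and a &&
// overload that actually moves, instead of `const std::string&&`
// (which silently copies under std::move).
class DocNode {
public:
    void set_root_text(const std::string& text) { _root_text = text; }        // copies
    void set_root_text(std::string&& text) { _root_text = std::move(text); }  // moves
    const std::string& root_text() const { return _root_text; }
private:
    std::string _root_text;
};
```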

Comment on lines +273 to +281
.def("check_embedding_state", [](lazyllm::DocNode& node, const std::string& key) {
    while (true) {
        if (node._embedding_vecs.find(key) != node._embedding_vecs.end()) {
            node._pending_embedding_keys.erase(key);
            break;
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
})


medium

check_embedding_state uses a while(true) loop with a sleep, which is busy-waiting and blocks the calling Python thread. For a library, this can seriously degrade the host application's performance. Consider exposing a non-blocking check or an asynchronous mechanism.
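A sketch of a non-busy-waiting alternative, assuming the embedding producer can notify a condition variable when it stores a vector (member names mirror the snippet above but are illustrative):

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch: block on a condition_variable that the producer notifies, with a
// timeout so callers are never stuck forever, instead of a sleep loop.
struct EmbeddingState {
    std::unordered_map<std::string, std::vector<float>> embedding_vecs;
    std::set<std::string> pending_keys;
    std::mutex mtx;
    std::condition_variable cv;

    void set_embedding(const std::string& key, std::vector<float> vec) {
        { std::lock_guard<std::mutex> lock(mtx); embedding_vecs[key] = std::move(vec); }
        cv.notify_all();  // wake any waiters for this key
    }

    // Returns false on timeout instead of spinning.
    bool wait_embedding(const std::string& key, std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mtx);
        bool ready = cv.wait_for(lock, timeout,
            [&] { return embedding_vecs.count(key) != 0; });
        if (ready) pending_keys.erase(key);
        return ready;
    }
};
```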

Comment on lines +83 to +86
const DocNode* get_root_node() const {
    if (_p_parent_node == nullptr) return this;
    return _p_parent_node->get_root_node();
}


medium

The recursive implementation of get_root_node risks a stack overflow when the node hierarchy is deep. An iterative implementation is safer and more efficient.

Suggested change
const DocNode* get_root_node() const {
    if (_p_parent_node == nullptr) return this;
    return _p_parent_node->get_root_node();
}
const DocNode* get_root_node() const {
    const DocNode* node = this;
    while (node->_p_parent_node) {
        node = node->_p_parent_node;
    }
    return node;
}

@CompromisedKiwi CompromisedKiwi changed the title [Feature] C++ Extension: DocNode, NodeTransform, TextSplitterBase, SentenseSplitter [Feature] C++ Extension: Introduce DocNode, NodeTransform, TextSplitterBase, SentenseSplitter Feb 13, 2026
@wzh1994
Copy link
Contributor

wzh1994 commented Mar 9, 2026

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly refactors the C++ backend for LazyLLM, primarily focusing on enhancing document node management and text splitting functionalities, and integrating these with Python. Key changes include upgrading the C++ standard to C++17, introducing a new lazyllm_adaptor library for C++-Python callback invocations, and extensively overhauling the DocNode class with new metadata, embedding, and parent-child relationship features. The build system (CMakeLists.txt) was updated to manage new internal libraries (lazyllm_core, lazyllm_adaptor) and external dependencies (xxhash, tiktoken, utf8proc) via a new third_party.cmake module, along with improved RPATH and installation rules. Python bindings were expanded to expose the new DocNode, NodeTransform, TextSplitterBase, and SentenceSplitter classes, enabling C++ overrides for performance-critical components. Test infrastructure was also improved to ensure correct libstdc++ linking for test executables. Review comments highlight several areas for improvement: the check_embedding_state function uses a busy-wait loop that could cause performance issues and unresponsiveness; the cpp_tiktoken dependency is pinned to master branch, which should be fixed to a specific commit for reproducibility; the ThreadPool utility lacks its original license; the C++ NodeTransform.with_name method ignores the copy parameter, leading to inconsistent behavior with its Python counterpart; and the TextSplitterBase.from_tiktoken_encoder binding ignores allowed_special and disallowed_special parameters, causing inconsistency with the Python API.

Comment on lines +273 to +281
.def("check_embedding_state", [](lazyllm::DocNode& node, const std::string& key) {
while (true) {
if (node._embedding_vecs.find(key) != node._embedding_vecs.end()) {
node._pending_embedding_keys.erase(key);
break;
}
std::this_thread::sleep_for(std::chrono::seconds(1));
}
})


high

check_embedding_state uses a while(true) loop with a one-second sleep. This is busy-waiting: it blocks the calling Python thread until the embedding is ready, which can hurt performance and make the application unresponsive. Consider an asynchronous mechanism, or at minimum add a timeout so the call cannot block indefinitely.

Comment on lines +22 to +26
FetchContent_Declare(
cpp_tiktoken
GIT_REPOSITORY https://github.com/gh-markt/cpp-tiktoken.git
GIT_TAG master
)


medium

The cpp_tiktoken dependency is fetched from the master branch. If master changes, builds become unreproducible. Pin it to a specific commit hash or release tag to keep builds stable.

        GIT_REPOSITORY https://github.com/gh-markt/cpp-tiktoken.git
        GIT_TAG <replace with a specific commit hash or tag>

Comment on lines +1 to +99
// https://github.com/progschj/ThreadPool

#ifndef THREAD_POOL_H
#define THREAD_POOL_H
#include <vector>
#include <queue>
#include <memory>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <future>
#include <functional>
#include <stdexcept>

class ThreadPool {
public:
    ThreadPool(size_t);
    template<class F, class... Args>
    auto enqueue(F&& f, Args&&... args)
        -> std::future<typename std::result_of<F(Args...)>::type>;
    ~ThreadPool();
private:
    // need to keep track of threads so we can join them
    std::vector< std::thread > workers;
    // the task queue
    std::queue< std::function<void()> > tasks;

    // synchronization
    std::mutex queue_mutex;
    std::condition_variable condition;
    bool stop;
};

// the constructor just launches some amount of workers
inline ThreadPool::ThreadPool(size_t threads)
    : stop(false)
{
    for(size_t i = 0; i < threads; ++i)
        workers.emplace_back(
            [this]
            {
                for(;;)
                {
                    std::function<void()> task;

                    {
                        std::unique_lock<std::mutex> lock(this->queue_mutex);
                        this->condition.wait(lock,
                            [this]{ return this->stop || !this->tasks.empty(); });
                        if(this->stop && this->tasks.empty())
                            return;
                        task = std::move(this->tasks.front());
                        this->tasks.pop();
                    }

                    task();
                }
            }
        );
}

// add new work item to the pool
template<class F, class... Args>
auto ThreadPool::enqueue(F&& f, Args&&... args)
    -> std::future<typename std::result_of<F(Args...)>::type>
{
    using return_type = typename std::result_of<F(Args...)>::type;

    auto task = std::make_shared< std::packaged_task<return_type()> >(
        std::bind(std::forward<F>(f), std::forward<Args>(args)...)
    );

    std::future<return_type> res = task->get_future();
    {
        std::unique_lock<std::mutex> lock(queue_mutex);

        // don't allow enqueueing after stopping the pool
        if(stop)
            throw std::runtime_error("enqueue on stopped ThreadPool");

        tasks.emplace([task](){ (*task)(); });
    }
    condition.notify_one();
    return res;
}

// the destructor joins all threads
inline ThreadPool::~ThreadPool()
{
    {
        std::unique_lock<std::mutex> lock(queue_mutex);
        stop = true;
    }
    condition.notify_all();
    for(std::thread &worker: workers)
        worker.join();
}

#endif


medium

This file appears to be a copy of the progschj/ThreadPool library, but it is missing the original license. When vendoring third-party code, including its license is essential, both for legal compliance and out of respect for the original author. The upstream repository uses the zlib license; please add the license text at the top of the file.

Comment on lines +73 to +79
"with_name",
[](lazyllm::NodeTransform& self, py::object name, bool copy) -> lazyllm::NodeTransform& {
(void)copy;
if (name.is_none()) return self;
self._name = name.cast<std::string>();
return self;
},


medium

The with_name method in the C++ binding ignores the copy parameter. The Python implementation it overrides (lazyllm.tools.rag.transform.base.NodeTransform.with_name) creates a copy of the object when copy=True. This inconsistency can cause unexpected side effects (mutating the original when a copy was expected). The C++ implementation should honor the copy flag and return a new instance when copy=True; since NodeTransform is a base class, this likely requires adding a virtual clone() method.
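A sketch of the suggested virtual clone() approach (names and signatures are illustrative, not the actual NodeTransform API):

```cpp
#include <cassert>
#include <memory>
#include <string>

// Sketch: a virtual clone() lets copy=True semantics work through the base
// class, since derived classes override clone() to preserve their type.
class NodeTransform {
public:
    explicit NodeTransform(std::string name = "") : _name(std::move(name)) {}
    virtual ~NodeTransform() = default;
    virtual std::unique_ptr<NodeTransform> clone() const {
        return std::make_unique<NodeTransform>(*this);
    }
    // Copy path of with_name: rename a clone, leave the original untouched.
    std::unique_ptr<NodeTransform> with_name_copy(const std::string& name) const {
        auto dup = clone();
        dup->_name = name;
        return dup;
    }
    const std::string& name() const { return _name; }
protected:
    std::string _name;
};
```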

Comment on lines +128 to +145
.def("from_tiktoken_encoder",
     [](lazyllm::TextSplitterBase& self,
        const std::string& encoding_name,
        py::object model_name,
        py::object /*allowed_special*/,
        py::object /*disallowed_special*/,
        py::kwargs /*kwargs*/) -> lazyllm::TextSplitterBase& {
         if (model_name.is_none()) {
             return self.from_tiktoken_encoder(encoding_name, std::nullopt);
         }
         return self.from_tiktoken_encoder(encoding_name, model_name.cast<std::string>());
     },
     py::arg("encoding_name") = "gpt2",
     py::arg("model_name") = py::none(),
     py::arg("allowed_special") = py::none(),
     py::arg("disallowed_special") = "all",
     py::return_value_policy::reference
)


medium

The from_tiktoken_encoder binding ignores the allowed_special and disallowed_special parameters, whereas the Python implementation _TextSplitterBase forwards them to tiktoken.get_encoding(...).encode. The current C++ TiktokenTokenizer does not appear to support these options. To stay consistent with the Python API, the C++ implementation should support them as well.
