[Feature] C++ Extension: Introduce DocNode, NodeTransform, TextSplitterBase, SentenseSplitter
#1022
Open
CompromisedKiwi wants to merge 43 commits into LazyAGI:main from CompromisedKiwi:yzh/migrate_doc_node
Changes from 20 commits
Commits
e011555 workflow fix (CompromisedKiwi)
bfb16fa coarce migration (CompromisedKiwi)
54e43c1 underline (CompromisedKiwi)
894bb73 c++17 (CompromisedKiwi)
b0a1ca0 rename (CompromisedKiwi)
8e7e017 save (CompromisedKiwi)
fa1d7d2 Merge branch 'main' into yzh/migrate_doc_node (CompromisedKiwi)
7ea108e undo workflow fix (CompromisedKiwi)
f0f6657 refactor (CompromisedKiwi)
1854448 adaptor (CompromisedKiwi)
1484fc7 finish doc_node init (CompromisedKiwi)
a69f82f children (CompromisedKiwi)
a6cfceb doc_node hpp (CompromisedKiwi)
0170a0e DocNode done (CompromisedKiwi)
459cfd4 pending review (CompromisedKiwi)
5ea167c NodeTransform done (CompromisedKiwi)
e4070f8 rename (CompromisedKiwi)
6017ffa save (CompromisedKiwi)
cc7ab7e Merge branch 'main' into yzh/migrate_doc_node (CompromisedKiwi)
615b7b0 Module (CompromisedKiwi)
0b193c8 map_params (CompromisedKiwi)
0d88ea6 save (CompromisedKiwi)
02cbec4 Integrate utf8proc to split text to readable chars. (CompromisedKiwi)
af7e617 UnicodeProcessor (CompromisedKiwi)
1c7ee82 text splitter base cpp finish (CompromisedKiwi)
9ef9bd8 keys (CompromisedKiwi)
068ca98 export (CompromisedKiwi)
19e00dd sentence_splitter (CompromisedKiwi)
e0c3acc compile_options (CompromisedKiwi)
06aa586 tests in cpp side (CompromisedKiwi)
a214e35 libstdc++.so.6 (CompromisedKiwi)
e865ab6 DocNode manage itself. (CompromisedKiwi)
2fd8583 finish cpp side tests (CompromisedKiwi)
ac9dad3 cpp env switch (CompromisedKiwi)
4ab5a93 no need to test cpp override (CompromisedKiwi)
b38affc cpp tests passed. (CompromisedKiwi)
79218fb merge (CompromisedKiwi)
ee3ecbc install and third parties so. (CompromisedKiwi)
42252a7 Reuse python side tests. (CompromisedKiwi)
06eabd4 LD_PRELOAD (CompromisedKiwi)
fa73e50 feat: add cpp_class decorator for C++ class replacement (CompromisedKiwi)
08f3333 docnode cpp ext repaired (CompromisedKiwi)
2c893df save (CompromisedKiwi)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -7,6 +7,7 @@ test/ | |
| dist/ | ||
| tmp/ | ||
| build | ||
| .cache/ | ||
| *.lock | ||
| *.db | ||
| mkdocs.yml | ||
|
|
||
File renamed without changes.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| #include "adaptor_base_wrapper.hpp" | ||
| #include "document_store.hpp" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| #pragma once | ||
|
|
||
| #include <memory> | ||
| #include <mutex> | ||
| #include <string> | ||
| #include <unordered_map> | ||
| #include <vector> | ||
|
|
||
| #include <pybind11/pybind11.h> | ||
|
|
||
| #include "adaptor_base.hpp" | ||
|
|
||
|
|
||
| namespace lazyllm { | ||
|
|
||
| class LAZYLLM_HIDDEN AdaptorBaseWrapper : public AdaptorBase { | ||
| pybind11::object _py_obj; | ||
| public: | ||
| AdaptorBaseWrapper(const pybind11::object &obj) : _py_obj(obj) {} | ||
| virtual ~AdaptorBaseWrapper() = default; | ||
|
|
||
| std::any call( | ||
| const std::string& func_name, | ||
| const std::unordered_map<std::string, std::any>& args) const override final | ||
| { | ||
| pybind11::gil_scoped_acquire gil; | ||
| pybind11::object func = pybind11::getattr(_py_obj, func_name.c_str(), pybind11::none()); | ||
| return call_impl(func_name, func, args); | ||
| } | ||
|
|
||
| virtual std::any call_impl( | ||
| const std::string& func_name, | ||
| const pybind11::object& func, | ||
| const std::unordered_map<std::string, std::any>& args) const = 0; | ||
| }; | ||
|
|
||
| } |
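
The wrapper above makes `call` final, takes the GIL once, and leaves only the dispatch logic to `call_impl` in subclasses. A minimal sketch of that shape without pybind11 follows. `LockedAdaptor`, `EchoAdaptor`, and the `std::mutex` standing in for the GIL are all hypothetical names chosen for illustration, not part of the PR.

```cpp
#include <any>
#include <cassert>
#include <mutex>
#include <stdexcept>
#include <string>
#include <unordered_map>

using Args = std::unordered_map<std::string, std::any>;

// Hypothetical stand-in for AdaptorBaseWrapper: the public entry point is
// non-virtual, takes a lock first, then forwards to the subclass hook.
class LockedAdaptor {
public:
    virtual ~LockedAdaptor() = default;

    std::any call(const std::string& func_name, const Args& args) const {
        // Assumption: this mutex plays the role of pybind11::gil_scoped_acquire.
        std::lock_guard<std::mutex> guard(_lock);
        return call_impl(func_name, args);
    }

protected:
    // Subclasses implement only the per-function dispatch.
    virtual std::any call_impl(const std::string& func_name, const Args& args) const = 0;

private:
    mutable std::mutex _lock;
};

// Hypothetical subclass with a single supported function.
class EchoAdaptor : public LockedAdaptor {
protected:
    std::any call_impl(const std::string& func_name, const Args& args) const override {
        if (func_name == "echo")
            return std::any_cast<std::string>(args.at("text"));
        throw std::runtime_error("Unknown function: " + func_name);
    }
};
```

Keeping the locking in one non-overridable place means a subclass cannot forget it. Note that `args.at(...)` throws `std::out_of_range` when the caller and the dispatcher disagree on a key name, and `std::any_cast` throws `std::bad_any_cast` when they disagree on the stored type.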
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,119 @@ | ||
| #pragma once | ||
|
|
||
| #include <memory> | ||
| #include <string> | ||
| #include <unordered_map> | ||
| #include <vector> | ||
|
|
||
| #include <pybind11/pybind11.h> | ||
| #include <pybind11/stl.h> | ||
|
|
||
| #include "adaptor_base_wrapper.hpp" | ||
| #include "doc_node.hpp" | ||
|
|
||
| namespace lazyllm { | ||
|
|
||
| struct NodeGroup { | ||
| enum class Type { | ||
| ORIGINAL, CHUNK, SUMMARY, IMAGE_INFO, QUESTION_ANSWER, OTHER | ||
| }; | ||
| std::string _parent; | ||
| std::string _display_name; | ||
| Type _type; | ||
| NodeGroup( | ||
| const std::string& parent, | ||
| const std::string& display_name, | ||
| const Type& type = Type::ORIGINAL) : | ||
| _parent(parent), _display_name(display_name), _type(type) {} | ||
| }; | ||
|
|
||
| class LAZYLLM_HIDDEN DocumentStore : public AdaptorBaseWrapper { | ||
| public: | ||
| DocumentStore() = delete; | ||
| explicit DocumentStore( | ||
| const pybind11::object& store, | ||
| const std::unordered_map<std::string, NodeGroup> &map) : | ||
| AdaptorBaseWrapper(store), _node_groups_map(map) {} | ||
|
|
||
| // Cache-aware factory to avoid rebuilding adaptor for the same Python store. | ||
| static std::shared_ptr<DocumentStore> from_store( | ||
| const pybind11::object& store, const std::unordered_map<std::string, NodeGroup>& map) { | ||
| if (store.is_none()) return nullptr; | ||
|
|
||
| pybind11::gil_scoped_acquire gil; | ||
| PyObject *key = store.ptr(); | ||
| auto &cache = store_cache(); | ||
| auto it = cache.find(key); | ||
| if (it != cache.end()) { | ||
| if (auto existing = it->second.lock()) | ||
| return existing; | ||
| } | ||
| auto created = std::make_shared<DocumentStore>(store, map); | ||
| cache[key] = created; | ||
| return created; | ||
| } | ||
|
|
||
| DocNode::Children get_node_children(const DocNode* node) const { | ||
| DocNode::Children out; | ||
| auto& kb_id = std::any_cast<std::string&>(node->_p_global_metadata->at(std::string(RAG_KEY_KB_ID))); | ||
| auto& doc_id = std::any_cast<std::string&>(node->_p_global_metadata->at(std::string(RAG_KEY_DOC_ID))); | ||
| auto& group_name = node->get_group_name(); | ||
| for(auto& [current_group_name, group] : _node_groups_map) { | ||
| if (group._parent != group_name) continue; | ||
| if (!std::any_cast<bool>(call("is_group_active", {{"group", current_group_name}}))) continue; | ||
| auto nodes_in_group = std::any_cast<std::vector<DocNode*>>(call("get_nodes", { | ||
| {"group_name", current_group_name}, | ||
| {"kb_id", kb_id}, | ||
| {"doc_ids", std::vector<std::string>({doc_id})} | ||
| })); | ||
|
|
||
| std::vector<DocNode*> children; | ||
| children.reserve(nodes_in_group.size()); | ||
| for (auto* n : nodes_in_group) | ||
| if (n->get_parent_node() == node) children.push_back(n); | ||
| out[current_group_name] = children; | ||
| } | ||
| return out; | ||
| } | ||
|
|
||
| private: | ||
| std::unordered_map<std::string, NodeGroup> _node_groups_map; | ||
|
|
||
| std::any call_impl( | ||
| const std::string& func_name, | ||
| const pybind11::object& func, | ||
| const std::unordered_map<std::string, std::any>& args) const override | ||
| { | ||
| if (func_name == "is_group_active") { | ||
| return func(args.at("group")).cast<bool>(); | ||
| } | ||
| else if (func_name == "get_node") { | ||
| return func( | ||
| pybind11::arg("group_name") = std::any_cast<std::string>(args.at("group_name")), | ||
| pybind11::arg("uids") = std::vector<std::string>({std::any_cast<std::string>(args.at("uid"))}), | ||
| pybind11::arg("kb_id") = std::any_cast<std::string>(args.at("kb_id")), | ||
| pybind11::arg("display") = true | ||
| ).cast<pybind11::list>()[0].cast<DocNode*>(); | ||
| } | ||
| else if (func_name == "get_nodes") { | ||
| return func( | ||
| pybind11::arg("group_name") = std::any_cast<std::string>(args.at("group_name")), | ||
| pybind11::arg("kb_id") = std::any_cast<std::string>(args.at("kb_id")), | ||
| pybind11::arg("doc_ids") = std::vector<std::string>({std::any_cast<std::string>(args.at("doc_id"))}) | ||
| ).cast<std::vector<DocNode*>>(); | ||
| } | ||
| else if (func_name == "get_node_children") { | ||
| return get_node_children(std::any_cast<DocNode*>(args.at("node"))); | ||
| } | ||
|
|
||
| throw std::runtime_error("Unknown DocumentStore function: " + func_name); | ||
| } | ||
|
|
||
| // Cache by Python object identity to ensure one wrapper per store instance. | ||
| static std::unordered_map<PyObject *, std::weak_ptr<DocumentStore>> &store_cache() { | ||
| static std::unordered_map<PyObject *, std::weak_ptr<DocumentStore>> cache; | ||
| return cache; | ||
| } | ||
|
Comment on lines +113 to +116 |
||
| }; | ||
|
|
||
| } // namespace lazyllm | ||
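
The `from_store` factory and `store_cache` above cache one wrapper per Python object identity through `std::weak_ptr`. The same idea can be sketched without pybind11 as below. `Handle`, `Wrapper`, `from_handle`, and `cache_size` are hypothetical stand-ins, and a `std::mutex` takes the place of the GIL; this is an illustration of the pattern, not the PR's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for the Python store object.
struct Handle { int id; };

// Hypothetical stand-in for DocumentStore.
class Wrapper {
public:
    explicit Wrapper(const Handle* h) : _handle(h) {}
    const Handle* handle() const { return _handle; }

    // Return the live wrapper for this handle, or create and cache a new one.
    static std::shared_ptr<Wrapper> from_handle(const Handle* h) {
        if (h == nullptr) return nullptr;
        std::lock_guard<std::mutex> guard(cache_mutex());
        auto& cache = wrapper_cache();
        auto it = cache.find(h);
        if (it != cache.end()) {
            if (auto existing = it->second.lock()) return existing;
            cache.erase(it);  // the cached wrapper has expired
        }
        auto created = std::make_shared<Wrapper>(h);
        cache[h] = created;
        return created;
    }

    static std::size_t cache_size() {
        std::lock_guard<std::mutex> guard(cache_mutex());
        return wrapper_cache().size();
    }

private:
    const Handle* _handle;

    // Keyed by address, so identity (not value) decides cache hits.
    static std::unordered_map<const Handle*, std::weak_ptr<Wrapper>>& wrapper_cache() {
        static std::unordered_map<const Handle*, std::weak_ptr<Wrapper>> cache;
        return cache;
    }
    static std::mutex& cache_mutex() {
        static std::mutex m;
        return m;
    }
};
```

Because the map holds only `weak_ptr` values, it never extends a wrapper's lifetime. This sketch erases an expired entry when it is looked up; in the diff above, an expired entry is only overwritten when the same key is seen again, so entries for stores that are never revisited stay in the map.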
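The list returned by the Python function may be empty, so accessing it directly with `[0]` will crash the program. Consider checking whether the list is empty before accessing it.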
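
One way to act on this comment is to route the first-element access through a small guard that fails with a descriptive error instead of indexing blindly. The sketch below uses `std::vector` so that it stands alone. `first_or_throw` is a hypothetical helper name, and in the pybind11 code the analogous step would be checking the returned list's length before indexing it.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: return the first element, or throw an error that names
// the call which produced the empty result.
template <typename T>
const T& first_or_throw(const std::vector<T>& items, const std::string& what) {
    if (items.empty())
        throw std::runtime_error(what + " returned an empty list");
    return items.front();
}
```

Whether an empty result should throw or map to a null node is a design decision for the PR author; the sketch only shows the throwing variant.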