refactor: replace DocToDbProcessor with SchemaExtractor#1026
refactor: replace DocToDbProcessor with SchemaExtractor#1026leon2526 wants to merge 5 commits intoLazyAGI:mainfrom
Conversation
Summary of ChangesHello @leon2526, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refactors the internal mechanisms for connecting to SQL managers and extracting database schemas within the Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request refactors the document processing logic to replace the legacy DocToDbProcessor with the new SchemaExtractor, which is a good step towards a more robust, Pydantic-based schema definition. The accompanying tests have also been updated to use pytest and cover the new implementation.
However, I've found a critical issue in lazyllm/tools/rag/document.py within the connect_sql_manager method. The logic for creating a SchemaExtractor instance on-the-fly is incorrect and will lead to a TypeError at runtime. The current test suite doesn't catch this because it always initializes the Document with a pre-configured SchemaExtractor instance. Please see my detailed comment in the file.
I recommend addressing this issue and considering adding a test case for when Document is initialized without a SchemaExtractor to prevent future regressions.
lazyllm/tools/rag/document.py
Outdated
|
|
||
| assert rows or schma, 'doc_table_schma must be given' | ||
|
|
||
| extractor = self._schema_extractor or SchemaExtractor(sql_manager) |
There was a problem hiding this comment.
This line introduces a critical issue with two failure modes:
-
If
self._schema_extractorisNone, the code attempts to instantiateSchemaExtractor(sql_manager). This will fail with aTypeErrorbecause theSchemaExtractorconstructor requiresdb_config(a dictionary) andllmas its first two positional arguments. The current call incorrectly passes anSqlManagerobject fordb_configand omits the requiredllm. -
If
self._schema_extractorwas initialized with anLLMBaseinstance,extractorhere would be thatLLMBaseinstance. The subsequent call tocompare_schemawould then fail with anAttributeErrorbecause it expects aSchemaExtractorinstance and calls methods likeregister_schema_set.
To fix this, the function must ensure it has a valid SchemaExtractor instance. This might involve ensuring self._impl._create_schema_extractor() is called beforehand and then consistently using the resulting extractor instance. The state management of _schema_extractor between Document and DocImpl may also need to be reviewed for consistency.
|
/gemini review |
There was a problem hiding this comment.
Code Review
这个 PR 通过用新的 SchemaExtractor 类替换 DocToDbProcessor 来重构文档处理逻辑,并添加了相应的单元测试。这些变更是朝着更健壮的、具备模式感知能力的提取机制迈出的良好一步。
然而,我发现了一些需要解决的关键问题:
- 在
SchemaExtractor的初始化以及从Document调用它的方式上存在一些 bug,这将导致运行时错误。 Document.extract_db_schema中的llm参数现在被忽略了,这是一个功能上的退步。lazyllm/tools/rag/document.py中的update_database方法仍在使用旧的_doc_to_db_processor,没有被更新或移除,这将导致AttributeError。
请查看我的具体评论以获取详细信息和建议。
| if not isinstance(llm, LLMBase): | ||
| raise TypeError('llm must be an instance of LLMBase') |
There was a problem hiding this comment.
llm 参数现在是可选的,默认值为 None,但是这里的检查 if not isinstance(llm, LLMBase): 会在 llm 为 None 时引发 TypeError。这会阻止在没有 LLM 的情况下初始化 SchemaExtractor,而这似乎是本次重构的一个预期用例。这个检查应该被更新以处理 None 的情况。
| if not isinstance(llm, LLMBase): | |
| raise TypeError('llm must be an instance of LLMBase') | |
| if llm is not None and not isinstance(llm, LLMBase): |
lazyllm/tools/rag/document.py
Outdated
|
|
||
| assert rows or schma, 'doc_table_schma must be given' | ||
|
|
||
| extractor = self._schema_extractor or SchemaExtractor(sql_manager) |
There was a problem hiding this comment.
| self._db_config = { | ||
| 'db_type': sql_manager._db_type, | ||
| 'user': getattr(sql_manager, '_user', None), | ||
| 'password': getattr(sql_manager, '_password', None), | ||
| 'host': getattr(sql_manager, '_host', None), | ||
| 'port': getattr(sql_manager, '_port', None), | ||
| 'db_name': getattr(sql_manager, '_db_name', None), | ||
| 'options_str': getattr(sql_manager, '_options_str', None), | ||
| } |
| def extract_db_schema( | ||
| self, llm: Union[OnlineChatModule, TrainableModule], print_schema: bool = False | ||
| ) -> DocInfoSchema: | ||
| file_paths = self._list_all_files_in_dataset() | ||
| schema = extract_db_schema_from_files(file_paths, llm) | ||
| ) -> SchemaSetInfo: |
|
/gimini review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request is a significant refactoring that replaces the legacy DocToDbProcessor with the more flexible and powerful SchemaExtractor. The changes are well-structured, and the introduction of Pydantic models for schema definition is a great improvement. The new unit tests for SchemaExtractor are comprehensive and add confidence in the new implementation.
I have a few suggestions to improve encapsulation and security, mainly related to avoiding access to private attributes of other classes and using parameterized SQL queries to prevent potential injection vulnerabilities. Overall, this is a solid improvement to the codebase.
| self._db_config = { | ||
| 'db_type': sql_manager._db_type, | ||
| 'user': getattr(sql_manager, '_user', None), | ||
| 'password': getattr(sql_manager, '_password', None), | ||
| 'host': getattr(sql_manager, '_host', None), | ||
| 'port': getattr(sql_manager, '_port', None), | ||
| 'db_name': getattr(sql_manager, '_db_name', None), | ||
| 'options_str': getattr(sql_manager, '_options_str', None), | ||
| } |
There was a problem hiding this comment.
Accessing private attributes of SqlManager (e.g., _db_type, _user) makes SchemaExtractor tightly coupled to the internal implementation of SqlManager. This can lead to maintenance issues if SqlManager is refactored in the future.
Consider adding a public method to SqlManager, such as get_db_config(), to expose its configuration in a controlled way. This would improve encapsulation and make the code more robust and maintainable.
| raise ValueError(f'Invalid type for schema extractor: {type(self._schema_extractor)}') | ||
| if sql_manager == self._schema_extractor.sql_manager: | ||
| return self._schema_extractor | ||
| return SchemaExtractor(sql_manager=sql_manager, llm=self._schema_extractor._llm) |
There was a problem hiding this comment.
Accessing the private attribute _llm from the _schema_extractor object breaks encapsulation. It's better to expose it via a public property in the SchemaExtractor class.
For example, you could add @property def llm(self): return self._llm to SchemaExtractor.
| return SchemaExtractor(sql_manager=sql_manager, llm=self._schema_extractor._llm) | |
| return SchemaExtractor(sql_manager=sql_manager, llm=self._schema_extractor.llm) |
| sql_manager.execute_query( | ||
| f'select * from {bind_table} where algo_id=\'{algo_id}\' limit 1' | ||
| ) |
There was a problem hiding this comment.
The SQL query is constructed using an f-string, which can be vulnerable to SQL injection if algo_id contains malicious characters. While it seems safe in the context of this test where algo_id is controlled, it's a good practice to always use parameterized queries to prevent such vulnerabilities.
The sql_manager.execute_query method should ideally support query parameters. If it does, please refactor this to use them. For example, if it uses SQLAlchemy, it would look something like this:
from sqlalchemy import text
In SqlManager.execute_query
session.execute(text('... WHERE algo_id = :algo_id'), {'algo_id': algo_id})
If execute_query doesn't support parameters, it would be a good improvement to add that capability for security.
📌 PR 内容 / PR Description
Refactored the document processing logic by replacing the legacy DocToDbProcessor class with SchemaExtractor.
Implemented comprehensive unit tests using pytest to verify the functionality of SchemaExtractor.
Updated relevant references across the codebase to ensure compatibility.
🔍 相关 Issue / Related Issue
✅ 变更类型 / Type of Change
[ ] 修复 Bug / Bug fix (non-breaking change that fixes an issue)
[ ] 新功能 / New feature (non-breaking change that adds functionality)
[x] 重构 / Refactor (no functionality change, code structure optimized)
[ ] 重大变更 / Breaking change (fix or feature that would cause existing functionality to change)
[ ] 文档更新 / Documentation update (changes to docs only)
[ ] 性能优化 / Performance optimization
🧪 如何测试 / How Has This Been Tested?
Added new test cases in LazyLLM/tests/charge_tests/Tools/test_schema_extractor.py directory specifically for SchemaExtractor.
Ran pytest locally, and all tests passed.
📷 截图 / Demo (Optional)
⚡ 更新后的用法示例 / Usage After Update
🔄 重构前 / 重构后对比 (仅当 Type 为 Refactor) / Refactor Before & After (only for Refactor)
重构前 / Before:
重构后 / After:
Please ensure that any downstream components relying on DocToDbProcessor are updated to use SchemaExtractor.
The DocToDbProcessor class has been deprecated in this PR.