refactor: replace DocToDbProcessor with SchemaExtractor #1026

Open
leon2526 wants to merge 5 commits into LazyAGI:main from leon2526:liuqinda

Conversation

@leon2526

📌 PR Description

Refactored the document processing logic by replacing the legacy DocToDbProcessor class with SchemaExtractor.

Implemented comprehensive unit tests using pytest to verify the functionality of SchemaExtractor.

Updated relevant references across the codebase to ensure compatibility.

🔍 Related Issue

✅ Type of Change

[ ] Bug fix (non-breaking change that fixes an issue)

[ ] New feature (non-breaking change that adds functionality)

[x] Refactor (no functionality change, code structure optimized)

[ ] Breaking change (fix or feature that would cause existing functionality to change)

[ ] Documentation update (changes to docs only)

[ ] Performance optimization

🧪 How Has This Been Tested?

Added new test cases for SchemaExtractor in LazyLLM/tests/charge_tests/Tools/test_schema_extractor.py.

Ran pytest locally, and all tests passed.

📷 Screenshots / Demo (Optional)

⚡ Usage After Update

# 1. Bind schema (Pydantic model)
documents.connect_sql_manager(sql_manager=extractor.sql_manager, schma=MySchema)

# 2. Extract and save to DB
documents.start()
documents.extract_db_schema(llm=llm)

🔄 Refactor Before & After (only for Refactor)

Before:

def connect_sql_manager(
    self,
    sql_manager: SqlManager,
    schma: Optional[DocInfoSchema] = None,
    force_refresh: bool = True,
): ...

After:

def connect_sql_manager(
    self,
    sql_manager: SqlManager,
    schma: Optional[BaseModel] = None,
    force_refresh: bool = True,
): ...

⚠️ Additional Notes

Please ensure that any downstream components relying on DocToDbProcessor are updated to use SchemaExtractor.

The DocToDbProcessor class has been deprecated in this PR.

@leon2526 requested a review from a team as a code owner on February 10, 2026, 12:00
@gemini-code-assist

Summary of Changes

Hello @leon2526, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the internal mechanisms for connecting to SQL managers and extracting database schemas within the lazyllm framework. By deprecating DocToDbProcessor in favor of SchemaExtractor, the change aims to modernize the data modeling approach, enhance modularity, and improve the overall reliability and testability of the schema extraction process. The update ensures that schema definitions are handled more explicitly and consistently across the codebase.

Highlights

  • Refactored Document Processing: The legacy DocToDbProcessor class has been replaced with the new SchemaExtractor for improved document processing logic.
  • Enhanced Schema Management: The Document class's connect_sql_manager method now accepts Pydantic BaseModel for schema definition, allowing for more robust and type-safe schema handling.
  • Comprehensive Unit Tests: New unit tests have been implemented using pytest to thoroughly verify the functionality and integration of the SchemaExtractor.

Changelog

  • lazyllm/tools/rag/doc_to_db/extractor.py
    • Updated SchemaExtractor constructor to include an optional sql_manager parameter.
    • Stored the sql_manager as an instance variable within SchemaExtractor.
    • Modified _lazy_init to conditionally initialize _sql_manager only if it has not been provided during instantiation.
  • lazyllm/tools/rag/document.py
    • Removed imports related to DocInfoSchema, DocToDbProcessor, and extract_db_schema_from_files.
    • Added imports for SchemaSetInfo and Table_ALGO_KB_SCHEMA.
    • Eliminated the _doc_to_db_processor instance variable.
    • Refactored connect_sql_manager to accept BaseModel for schema, utilize SchemaExtractor for schema comparison and registration, and manage schema binding.
    • Updated get_sql_manager to retrieve the SQL manager from the _schema_extractor instance.
    • Modified extract_db_schema to call _forward('_analyze_schema_by_llm') and return SchemaSetInfo.
  • tests/charge_tests/Tools/test_doc_to_db.py
    • Rewrote the entire test file from unittest to pytest for a more modern testing approach.
    • Defined ReadingReportSchema using Pydantic BaseModel for structured schema testing.
    • Introduced helper functions (_fetch_bind_row, _get_table_name, _get_count, _connect_and_get_table) to streamline test logic.
    • Implemented setup_class, teardown_class, and setup_method for proper test environment management.
    • Added new test cases to validate connect_sql_manager behavior, including schema requirement, bind creation, data extraction, and refresh logic.

Activity

  • The author, leon2526, initiated a refactoring effort to replace DocToDbProcessor with SchemaExtractor.
  • Comprehensive unit tests were developed and executed locally, with all tests passing, to ensure the correctness of the new SchemaExtractor functionality.
  • The pull request description includes 'Before' and 'After' code snippets illustrating the changes in the connect_sql_manager method signature.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the document processing logic to replace the legacy DocToDbProcessor with the new SchemaExtractor, which is a good step towards a more robust, Pydantic-based schema definition. The accompanying tests have also been updated to use pytest and cover the new implementation.

However, I've found a critical issue in lazyllm/tools/rag/document.py within the connect_sql_manager method. The logic for creating a SchemaExtractor instance on-the-fly is incorrect and will lead to a TypeError at runtime. The current test suite doesn't catch this because it always initializes the Document with a pre-configured SchemaExtractor instance. Please see my detailed comment in the file.

I recommend addressing this issue and considering adding a test case for when Document is initialized without a SchemaExtractor to prevent future regressions.


assert rows or schma, 'doc_table_schma must be given'

extractor = self._schema_extractor or SchemaExtractor(sql_manager)

critical

This line introduces a critical issue with two failure modes:

  1. If self._schema_extractor is None, the code attempts to instantiate SchemaExtractor(sql_manager). This will fail with a TypeError because the SchemaExtractor constructor requires db_config (a dictionary) and llm as its first two positional arguments. The current call incorrectly passes an SqlManager object for db_config and omits the required llm.

  2. If self._schema_extractor was initialized with an LLMBase instance, extractor here would be that LLMBase instance. The subsequent call to compare_schema would then fail with an AttributeError because it expects a SchemaExtractor instance and calls methods like register_schema_set.

To fix this, the function must ensure it has a valid SchemaExtractor instance. This might involve ensuring self._impl._create_schema_extractor() is called beforehand and then consistently using the resulting extractor instance. The state management of _schema_extractor between Document and DocImpl may also need to be reviewed for consistency.

@leon2526
Author

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

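This PR refactors the document processing logic by replacing DocToDbProcessor with the new SchemaExtractor class and adds corresponding unit tests. These changes are a good step toward a more robust, schema-aware extraction mechanism.

However, I found several critical issues that need to be addressed:

  • There are bugs in how SchemaExtractor is initialized and in how Document calls it, which will cause runtime errors.
  • The llm parameter of Document.extract_db_schema is now ignored, which is a functional regression.
  • The update_database method in lazyllm/tools/rag/document.py still uses the old _doc_to_db_processor; it was neither updated nor removed, which will cause an AttributeError.

Please see my specific comments for details and suggestions.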

Comment on lines 74 to 75
if not isinstance(llm, LLMBase):
    raise TypeError('llm must be an instance of LLMBase')

critical

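The llm parameter is now optional and defaults to None, but the check here, if not isinstance(llm, LLMBase):, raises a TypeError when llm is None. This prevents initializing SchemaExtractor without an LLM, which appears to be an intended use case of this refactor. The check should be updated to handle the None case.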

Suggested change
- if not isinstance(llm, LLMBase):
-     raise TypeError('llm must be an instance of LLMBase')
+ if llm is not None and not isinstance(llm, LLMBase):
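
The suggested check can be tried in isolation. The following is a minimal sketch, assuming a stand-in LLMBase class and a hypothetical helper validate_llm (neither is taken from the PR); it only illustrates that None passes while any other non-LLMBase value is still rejected.

```python
from typing import Optional


class LLMBase:
    """Hypothetical stand-in for lazyllm's LLMBase, used only in this sketch."""


def validate_llm(llm: Optional[LLMBase] = None) -> Optional[LLMBase]:
    """Hypothetical helper mirroring the suggested optional type check."""
    # None is accepted so that an extractor can be built without an LLM.
    if llm is not None and not isinstance(llm, LLMBase):
        raise TypeError('llm must be an instance of LLMBase')
    return llm
```

With this shape, validate_llm() and validate_llm(LLMBase()) both succeed, while validate_llm('some-model-name') raises TypeError.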


assert rows or schma, 'doc_table_schma must be given'

extractor = self._schema_extractor or SchemaExtractor(sql_manager)

critical

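The call SchemaExtractor(sql_manager) is incorrect. In SchemaExtractor.__init__, sql_manager is a keyword-only argument, so this call raises a TypeError. Even if it were a positional argument, it would be wrongly assigned to db_config and cause errors later. The call should use keyword arguments, for example SchemaExtractor(sql_manager=sql_manager, ...). In addition, the llm instance that SchemaExtractor may need at initialization is not provided; this also needs to be resolved.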
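
The two failure modes described in this comment can be reproduced with plain functions. This is a sketch under stated assumptions: strict_init and loose_init are hypothetical signatures invented for illustration, not the real SchemaExtractor.__init__, and manager is a placeholder object rather than an SqlManager.

```python
def strict_init(*, sql_manager=None, llm=None):
    """Hypothetical signature in which every parameter is keyword-only."""
    return {'sql_manager': sql_manager, 'llm': llm}


def loose_init(db_config=None, llm=None, *, sql_manager=None):
    """Hypothetical signature in which db_config is the first positional parameter."""
    return {'db_config': db_config, 'llm': llm, 'sql_manager': sql_manager}


manager = object()  # placeholder for an SqlManager instance

try:
    strict_init(manager)  # a positional argument is rejected outright
    positional_error = None
except TypeError as exc:
    positional_error = exc

misbound = loose_init(manager)             # manager silently lands in db_config
correct = loose_init(sql_manager=manager)  # manager is bound as intended
```

Passing sql_manager by keyword avoids both problems regardless of which of the two signatures is in effect.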

Comment on lines +81 to +89
self._db_config = {
    'db_type': sql_manager._db_type,
    'user': getattr(sql_manager, '_user', None),
    'password': getattr(sql_manager, '_password', None),
    'host': getattr(sql_manager, '_host', None),
    'port': getattr(sql_manager, '_port', None),
    'db_name': getattr(sql_manager, '_db_name', None),
    'options_str': getattr(sql_manager, '_options_str', None),
}

high

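Directly accessing private members of sql_manager (for example _db_type, _user) breaks encapsulation and tightly couples SchemaExtractor to the internal implementation of SqlManager. This will make future changes to SqlManager difficult and error-prone. A better approach is for SqlManager to provide a public method (for example get_config()) that exposes its configuration details.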

Comment on lines 293 to +295
  def extract_db_schema(
      self, llm: Union[OnlineChatModule, TrainableModule], print_schema: bool = False
- ) -> DocInfoSchema:
-     file_paths = self._list_all_files_in_dataset()
-     schema = extract_db_schema_from_files(file_paths, llm)
+ ) -> SchemaSetInfo:

high

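After the refactor, the llm parameter of the extract_db_schema method is no longer used. The method now calls _analyze_schema_by_llm, which relies on the llm instance configured in SchemaExtractor when the Document was initialized. This is misleading for API users and is a functional regression compared with the previous behavior of using the passed-in llm directly. Please either remove the llm parameter from the method signature or update the implementation to use it.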
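
One way to honor the per-call llm argument is to let it override the instance configured at construction time. The sketch below assumes a hypothetical ExtractorSketch class and analyze_schema method; it is not the PR's implementation and only shows the fallback order.

```python
class ExtractorSketch:
    """Hypothetical stand-in for an extractor that may hold a default llm."""

    def __init__(self, llm=None):
        self._llm = llm

    def analyze_schema(self, llm=None):
        # Prefer the llm passed for this call; fall back to the configured one.
        active = llm if llm is not None else self._llm
        if active is None:
            raise ValueError('an llm must be supplied at init or per call')
        # Real code would run the schema analysis with `active`;
        # the sketch just returns it so the choice is observable.
        return active
```

The other option named in the review, removing the parameter from the signature, avoids the fallback logic entirely but changes the public API.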

@leon2526
Author

/gimini review

@leon2526
Author

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a significant refactoring that replaces the legacy DocToDbProcessor with the more flexible and powerful SchemaExtractor. The changes are well-structured, and the introduction of Pydantic models for schema definition is a great improvement. The new unit tests for SchemaExtractor are comprehensive and add confidence in the new implementation.

I have a few suggestions to improve encapsulation and security, mainly related to avoiding access to private attributes of other classes and using parameterized SQL queries to prevent potential injection vulnerabilities. Overall, this is a solid improvement to the codebase.

Comment on lines +81 to +89
self._db_config = {
    'db_type': sql_manager._db_type,
    'user': getattr(sql_manager, '_user', None),
    'password': getattr(sql_manager, '_password', None),
    'host': getattr(sql_manager, '_host', None),
    'port': getattr(sql_manager, '_port', None),
    'db_name': getattr(sql_manager, '_db_name', None),
    'options_str': getattr(sql_manager, '_options_str', None),
}

medium

Accessing private attributes of SqlManager (e.g., _db_type, _user) makes SchemaExtractor tightly coupled to the internal implementation of SqlManager. This can lead to maintenance issues if SqlManager is refactored in the future.

Consider adding a public method to SqlManager, such as get_db_config(), to expose its configuration in a controlled way. This would improve encapsulation and make the code more robust and maintainable.
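
A minimal sketch of such an accessor follows. SqlManagerSketch is a hypothetical stand-in whose private attribute names are copied from the snippet under review; get_db_config is the name proposed in this comment, not an existing SqlManager method.

```python
from typing import Any, Dict, Optional


class SqlManagerSketch:
    """Hypothetical stand-in for SqlManager, for illustration only."""

    def __init__(self, db_type: str, user: Optional[str] = None,
                 password: Optional[str] = None, host: Optional[str] = None,
                 port: Optional[int] = None, db_name: Optional[str] = None,
                 options_str: Optional[str] = None):
        self._db_type = db_type
        self._user = user
        self._password = password
        self._host = host
        self._port = port
        self._db_name = db_name
        self._options_str = options_str

    def get_db_config(self) -> Dict[str, Any]:
        # Build a fresh dict on every call so callers cannot mutate internal state.
        return {
            'db_type': self._db_type,
            'user': self._user,
            'password': self._password,
            'host': self._host,
            'port': self._port,
            'db_name': self._db_name,
            'options_str': self._options_str,
        }
```

A caller such as SchemaExtractor could then store sql_manager.get_db_config() without touching any underscore-prefixed attribute.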

        raise ValueError(f'Invalid type for schema extractor: {type(self._schema_extractor)}')
    if sql_manager == self._schema_extractor.sql_manager:
        return self._schema_extractor
    return SchemaExtractor(sql_manager=sql_manager, llm=self._schema_extractor._llm)

medium

Accessing the private attribute _llm from the _schema_extractor object breaks encapsulation. It's better to expose it via a public property in the SchemaExtractor class.

For example, you could add @property def llm(self): return self._llm to SchemaExtractor.

Suggested change
- return SchemaExtractor(sql_manager=sql_manager, llm=self._schema_extractor._llm)
+ return SchemaExtractor(sql_manager=sql_manager, llm=self._schema_extractor.llm)
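
Put together, the property and the rebinding logic could look roughly like this. SchemaExtractorSketch and rebind_extractor are hypothetical names used only for illustration; the real class has more constructor parameters and state.

```python
class SchemaExtractorSketch:
    """Hypothetical stand-in showing read-only access to the llm."""

    def __init__(self, sql_manager=None, llm=None):
        self._sql_manager = sql_manager
        self._llm = llm

    @property
    def llm(self):
        # Read-only view; no setter is defined, so assignment raises AttributeError.
        return self._llm

    @property
    def sql_manager(self):
        return self._sql_manager


def rebind_extractor(extractor, sql_manager):
    """Hypothetical helper: reuse the extractor if the manager is unchanged."""
    if sql_manager == extractor.sql_manager:
        return extractor
    # Only public properties are used, so the caller never touches _llm.
    return SchemaExtractorSketch(sql_manager=sql_manager, llm=extractor.llm)
```

Because the property has no setter, outside code can read the llm but cannot replace it by accident.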

Comment on lines +43 to +45
sql_manager.execute_query(
    f'select * from {bind_table} where algo_id=\'{algo_id}\' limit 1'
)

medium

The SQL query is constructed using an f-string, which can be vulnerable to SQL injection if algo_id contains malicious characters. While it seems safe in the context of this test where algo_id is controlled, it's a good practice to always use parameterized queries to prevent such vulnerabilities.

The sql_manager.execute_query method should ideally support query parameters. If it does, please refactor this to use them. For example, if it uses SQLAlchemy, it would look something like this:
from sqlalchemy import text

# In SqlManager.execute_query
session.execute(text('... WHERE algo_id = :algo_id'), {'algo_id': algo_id})

If execute_query doesn't support parameters, it would be a good improvement to add that capability for security.
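
The effect of binding parameters can be shown with the standard-library sqlite3 module, swapped in here for SQLAlchemy so the example runs without extra dependencies. The table name bind_table, its columns, and the helper fetch_bind_row are assumptions made for this sketch; only values can be bound this way, so a table name still has to come from trusted code.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Hypothetical table layout, for illustration only.
conn.execute('CREATE TABLE bind_table (algo_id TEXT, kb_id TEXT)')
conn.execute('INSERT INTO bind_table VALUES (?, ?)', ('algo_1', 'kb_1'))


def fetch_bind_row(connection, algo_id):
    # The value travels separately from the SQL text, so it is never parsed as SQL.
    cursor = connection.execute(
        'SELECT * FROM bind_table WHERE algo_id = ? LIMIT 1', (algo_id,)
    )
    return cursor.fetchone()


row = fetch_bind_row(conn, 'algo_1')
# An injection-style value is treated as a literal string and matches nothing.
injected = fetch_bind_row(conn, "x' OR '1'='1")
```

With the f-string version, the second lookup would have matched every row; with a bound parameter it returns no row at all.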
