refactor: replace DocToDbProcessor with SchemaExtractor by leon2526 · Pull Request #1026 · LazyAGI/LazyLLM

leon2526 · 2026-02-10T12:00:07Z

📌 PR Description

Refactored the document processing logic by replacing the legacy DocToDbProcessor class with SchemaExtractor.

Implemented comprehensive unit tests using pytest to verify the functionality of SchemaExtractor.

Updated relevant references across the codebase to ensure compatibility.

🔍 Related Issue

✅ Type of Change

[ ] Bug fix (non-breaking change that fixes an issue)

[ ] New feature (non-breaking change that adds functionality)

[x] Refactor (no functionality change, code structure optimized)

[ ] Breaking change (fix or feature that would cause existing functionality to change)

[ ] Documentation update (changes to docs only)

[ ] Performance optimization

🧪 How Has This Been Tested?

Added new test cases in LazyLLM/tests/charge_tests/Tools/test_schema_extractor.py specifically for SchemaExtractor.

Ran pytest locally, and all tests passed.

📷 Demo (Optional)

⚡ Usage After Update

# 1. Bind schema (Pydantic model)
documents.connect_sql_manager(sql_manager=extractor.sql_manager, schma=MySchema)

# 2. Extract and save to DB
documents.start()
documents.extract_db_schema(llm=llm)
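For readers unfamiliar with the schema argument, here is a minimal sketch of what a model such as MySchema could look like. The class name ReadingReportSchema is borrowed from the test changelog later in this thread; the fields are hypothetical and not taken from the PR. It assumes pydantic is installed.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class ReadingReportSchema(BaseModel):
    """Hypothetical schema: field names are illustrative, not from the PR."""

    title: str = Field(..., description='Title of the book being reported on')
    author: Optional[str] = Field(None, description='Author of the book, if known')
    rating: Optional[int] = Field(None, description='Reader rating from 1 to 5')


# Pydantic validates field types when an instance is created.
report = ReadingReportSchema(title='Dune', author='Frank Herbert', rating=5)

try:
    ReadingReportSchema(author='nobody')  # 'title' is required, so this fails
    missing_title_rejected = False
except ValidationError:
    missing_title_rejected = True
```

A class like this would presumably be passed as schma=ReadingReportSchema; the parameter is spelled schma in the signature shown in this PR.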

🔄 Refactor Before & After (only for Refactor)

Before:

def connect_sql_manager(
    self,
    sql_manager: SqlManager,
    schma: Optional[DocInfoSchema] = None,
    force_refresh: bool = True,
): ...

After:

def connect_sql_manager(
    self,
    sql_manager: SqlManager,
    schma: Optional[BaseModel] = None,
    force_refresh: bool = True,
): ...

⚠️ Additional Notes

Please ensure that any downstream components relying on DocToDbProcessor are updated to use SchemaExtractor.

The DocToDbProcessor class has been deprecated in this PR.

gemini-code-assist · 2026-02-10T12:00:30Z

Summary of Changes

Hello @leon2526, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the internal mechanisms for connecting to SQL managers and extracting database schemas within the lazyllm framework. By deprecating DocToDbProcessor in favor of SchemaExtractor, the change aims to modernize the data modeling approach, enhance modularity, and improve the overall reliability and testability of the schema extraction process. The update ensures that schema definitions are handled more explicitly and consistently across the codebase.

Highlights

Refactored Document Processing: The legacy DocToDbProcessor class has been replaced with the new SchemaExtractor for improved document processing logic.
Enhanced Schema Management: The Document class's connect_sql_manager method now accepts Pydantic BaseModel for schema definition, allowing for more robust and type-safe schema handling.
Comprehensive Unit Tests: New unit tests have been implemented using pytest to thoroughly verify the functionality and integration of the SchemaExtractor.

Changelog

lazyllm/tools/rag/doc_to_db/extractor.py
- Updated SchemaExtractor constructor to include an optional sql_manager parameter.
- Stored the sql_manager as an instance variable within SchemaExtractor.
- Modified _lazy_init to conditionally initialize _sql_manager only if it has not been provided during instantiation.
lazyllm/tools/rag/document.py
- Removed imports related to DocInfoSchema, DocToDbProcessor, and extract_db_schema_from_files.
- Added imports for SchemaSetInfo and Table_ALGO_KB_SCHEMA.
- Eliminated the _doc_to_db_processor instance variable.
- Refactored connect_sql_manager to accept BaseModel for schema, utilize SchemaExtractor for schema comparison and registration, and manage schema binding.
- Updated get_sql_manager to retrieve the SQL manager from the _schema_extractor instance.
- Modified extract_db_schema to call _forward('_analyze_schema_by_llm') and return SchemaSetInfo.
tests/charge_tests/Tools/test_doc_to_db.py
- Rewrote the entire test file from unittest to pytest for a more modern testing approach.
- Defined ReadingReportSchema using Pydantic BaseModel for structured schema testing.
- Introduced helper functions (_fetch_bind_row, _get_table_name, _get_count, _connect_and_get_table) to streamline test logic.
- Implemented setup_class, teardown_class, and setup_method for proper test environment management.
- Added new test cases to validate connect_sql_manager behavior, including schema requirement, bind creation, data extraction, and refresh logic.
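The extractor.py entries in the changelog above (an optional sql_manager stored on the instance, with _lazy_init creating one only when none was supplied) can be sketched roughly as follows. FakeSqlManager and ExtractorSketch are hypothetical stand-ins written for illustration; the real constructor may differ.

```python
from typing import Any, Optional


class FakeSqlManager:
    """Stand-in for SqlManager; only records the config it was built from."""

    def __init__(self, db_config: dict):
        self.db_config = db_config


class ExtractorSketch:
    """Illustrative only: mirrors the described behaviour, not the real class."""

    def __init__(self, db_config: Optional[dict] = None, llm: Any = None, *,
                 sql_manager: Optional[FakeSqlManager] = None):
        self._db_config = db_config or {}
        self._llm = llm
        self._sql_manager = sql_manager  # stored as an instance variable
        self._initialized = False

    def _lazy_init(self) -> None:
        if self._initialized:
            return
        # Only build a manager when the caller did not inject one.
        if self._sql_manager is None:
            self._sql_manager = FakeSqlManager(self._db_config)
        self._initialized = True

    @property
    def sql_manager(self) -> FakeSqlManager:
        self._lazy_init()
        return self._sql_manager


injected = FakeSqlManager({'db_type': 'sqlite'})
with_injected = ExtractorSketch(sql_manager=injected)
without_injected = ExtractorSketch(db_config={'db_type': 'sqlite'})
```

Deferring construction this way lets an injected manager be reused untouched, while callers that pass only a config still get a working manager on first access.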

Activity

The author, leon2526, initiated a refactoring effort to replace DocToDbProcessor with SchemaExtractor.
Comprehensive unit tests were developed and executed locally, with all tests passing, to ensure the correctness of the new SchemaExtractor functionality.
The pull request description includes 'Before' and 'After' code snippets illustrating the changes in the connect_sql_manager method signature.


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, and Terms of Service, and learn how to configure Gemini Code Assist in GitHub. Gemini can make mistakes, so double-check its output and use code with caution.

gemini-code-assist

Code Review

This pull request refactors the document processing logic to replace the legacy DocToDbProcessor with the new SchemaExtractor, which is a good step towards a more robust, Pydantic-based schema definition. The accompanying tests have also been updated to use pytest and cover the new implementation.

However, I've found a critical issue in lazyllm/tools/rag/document.py within the connect_sql_manager method. The logic for creating a SchemaExtractor instance on-the-fly is incorrect and will lead to a TypeError at runtime. The current test suite doesn't catch this because it always initializes the Document with a pre-configured SchemaExtractor instance. Please see my detailed comment in the file.

I recommend addressing this issue and considering adding a test case for when Document is initialized without a SchemaExtractor to prevent future regressions.

gemini-code-assist · 2026-02-10T12:03:41Z

lazyllm/tools/rag/document.py

+
+        assert rows or schma, 'doc_table_schma must be given'
+
+        extractor = self._schema_extractor or SchemaExtractor(sql_manager)


This line introduces a critical issue with two failure modes:

If self._schema_extractor is None, the code attempts to instantiate SchemaExtractor(sql_manager). This will fail with a TypeError because the SchemaExtractor constructor requires db_config (a dictionary) and llm as its first two positional arguments. The current call incorrectly passes an SqlManager object for db_config and omits the required llm.

If self._schema_extractor was initialized with an LLMBase instance, extractor here would be that LLMBase instance. The subsequent call to compare_schema would then fail with an AttributeError because it expects a SchemaExtractor instance and calls methods like register_schema_set.

To fix this, the function must ensure it has a valid SchemaExtractor instance. This might involve ensuring self._impl._create_schema_extractor() is called beforehand and then consistently using the resulting extractor instance. The state management of _schema_extractor between Document and DocImpl may also need to be reviewed for consistency.
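The two failure modes described above, and one possible way to guard against them, can be reproduced with small stand-in classes. Everything here (FakeLLM, ExtractorSketch, resolve_extractor) is hypothetical; the constructor signature simply follows the reviewer's description that db_config and llm are required positional arguments.

```python
class FakeLLM:
    """Stand-in for an LLMBase instance."""


class ExtractorSketch:
    """Stand-in whose constructor requires db_config and llm."""

    def __init__(self, db_config: dict, llm: FakeLLM):
        self.db_config = db_config
        self.llm = llm


def resolve_extractor(current, db_config: dict) -> ExtractorSketch:
    """Return a usable extractor or fail with a clear message."""
    if isinstance(current, ExtractorSketch):
        return current
    if isinstance(current, FakeLLM):
        # A bare LLM was stored: wrap it instead of returning it as-is.
        return ExtractorSketch(db_config, current)
    raise ValueError(f'Invalid type for schema extractor: {type(current)}')


# Failure mode 1: a single positional argument leaves `llm` unfilled.
try:
    ExtractorSketch(object())
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Failure mode 2: `x or fallback` returns x itself when x is a truthy LLM.
stored = FakeLLM()
naive = stored or ExtractorSketch({}, FakeLLM())
resolved = resolve_extractor(stored, {'db_type': 'sqlite'})
```

An explicit type check makes the choice visible at the call site, whereas the `or` shortcut silently returns whatever truthy object happens to be stored.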

leon2526 · 2026-02-11T09:03:31Z

/gemini review

gemini-code-assist

Code Review

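This PR refactors the document processing logic by replacing DocToDbProcessor with the new SchemaExtractor class and adds corresponding unit tests. These changes are a good step toward a more robust, schema-aware extraction mechanism.

However, I found several critical issues that need to be addressed:

There are bugs in how SchemaExtractor is initialized and in how Document calls it, which will cause runtime errors.
The llm parameter of Document.extract_db_schema is now ignored, which is a functional regression.
The update_database method in lazyllm/tools/rag/document.py still uses the old _doc_to_db_processor and has been neither updated nor removed, which will cause an AttributeError.

Please see my specific comments for details and suggestions.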

gemini-code-assist · 2026-02-11T09:05:44Z

lazyllm/tools/rag/doc_to_db/extractor.py

        if not isinstance(llm, LLMBase):
            raise TypeError('llm must be an instance of LLMBase')


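The llm parameter is now optional and defaults to None, but the check if not isinstance(llm, LLMBase): raises a TypeError when llm is None. This prevents SchemaExtractor from being initialized without an LLM, which appears to be an intended use case of this refactor. The check should be updated to handle None.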

Suggested change

-        if not isinstance(llm, LLMBase):
+        if llm is not None and not isinstance(llm, LLMBase):
             raise TypeError('llm must be an instance of LLMBase')
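As a self-contained illustration of the suggested check, the sketch below uses a stand-in class (LLMBaseSketch) and a hypothetical helper (validate_llm) in place of the real LLMBase and constructor.

```python
from typing import Optional


class LLMBaseSketch:
    """Stand-in for LLMBase."""


def validate_llm(llm: Optional[LLMBaseSketch]) -> Optional[LLMBaseSketch]:
    """Accept None or an LLMBaseSketch instance; reject anything else."""
    if llm is not None and not isinstance(llm, LLMBaseSketch):
        raise TypeError('llm must be an instance of LLMBase')
    return llm


model = LLMBaseSketch()
accepted_none = validate_llm(None)
accepted_model = validate_llm(model)
```

With this form, None passes through so the extractor can be built without a model, while a wrong type still fails early.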

gemini-code-assist · 2026-02-11T09:05:45Z

lazyllm/tools/rag/document.py

+
+        assert rows or schma, 'doc_table_schma must be given'
+
+        extractor = self._schema_extractor or SchemaExtractor(sql_manager)


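The call SchemaExtractor(sql_manager) is incorrect. In SchemaExtractor.__init__, sql_manager is a keyword-only argument, so this call raises a TypeError. Even if it were a positional parameter, it would be wrongly bound to db_config and cause errors later. The call should use a keyword argument, for example SchemaExtractor(sql_manager=sql_manager, ...). In addition, the llm instance that SchemaExtractor may need at initialization is not provided, and this also needs to be resolved.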

gemini-code-assist · 2026-02-11T09:05:45Z

lazyllm/tools/rag/doc_to_db/extractor.py

+            self._db_config = {
+                'db_type': sql_manager._db_type,
+                'user': getattr(sql_manager, '_user', None),
+                'password': getattr(sql_manager, '_password', None),
+                'host': getattr(sql_manager, '_host', None),
+                'port': getattr(sql_manager, '_port', None),
+                'db_name': getattr(sql_manager, '_db_name', None),
+                'options_str': getattr(sql_manager, '_options_str', None),
+            }


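Directly accessing private members of sql_manager (for example _db_type and _user) breaks encapsulation and tightly couples SchemaExtractor to the internal implementation of SqlManager. This makes future changes to SqlManager difficult and error-prone. A better approach is for SqlManager to provide a public method (for example get_config()) that exposes its configuration details.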

gemini-code-assist · 2026-02-11T09:05:45Z

lazyllm/tools/rag/document.py

    def extract_db_schema(
        self, llm: Union[OnlineChatModule, TrainableModule], print_schema: bool = False
-    ) -> DocInfoSchema:
-        file_paths = self._list_all_files_in_dataset()
-        schema = extract_db_schema_from_files(file_paths, llm)
+    ) -> SchemaSetInfo:


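After the refactor, the llm parameter of extract_db_schema is no longer used. The method now calls _analyze_schema_by_llm, which relies on the llm instance configured in SchemaExtractor when the Document was initialized. This is misleading for API users and is a functional regression compared with the previous behavior, which used the passed-in llm directly. Please either remove the llm parameter from the method signature or update the implementation to use it.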
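One way to honor the argument, sketched here under the assumption that a model may also be configured at construction time, is to prefer the explicitly passed llm and fall back to the configured one. DocumentSketch is a hypothetical stand-in, and the "LLM" is reduced to a plain callable for illustration.

```python
from typing import Callable, Optional

# An "LLM" here is just a callable from prompt text to response text.
LLM = Callable[[str], str]


class DocumentSketch:
    """Illustrative only; not the real Document class."""

    def __init__(self, default_llm: Optional[LLM] = None):
        self._default_llm = default_llm

    def extract_db_schema(self, llm: Optional[LLM] = None) -> str:
        # Prefer the explicitly passed model, then fall back to the configured one.
        chosen = llm if llm is not None else self._default_llm
        if chosen is None:
            raise ValueError('no llm given and none configured')
        return chosen('describe the schema')


doc = DocumentSketch(default_llm=lambda prompt: 'from default')
```

Calling doc.extract_db_schema() uses the configured model, while doc.extract_db_schema(llm=other) uses the one passed in, so the parameter is never silently ignored.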

leon2526 · 2026-02-11T09:32:43Z

/gimini review

leon2526 · 2026-02-11T10:10:31Z

/gemini review

gemini-code-assist

Code Review

This pull request is a significant refactoring that replaces the legacy DocToDbProcessor with the more flexible and powerful SchemaExtractor. The changes are well-structured, and the introduction of Pydantic models for schema definition is a great improvement. The new unit tests for SchemaExtractor are comprehensive and add confidence in the new implementation.

I have a few suggestions to improve encapsulation and security, mainly related to avoiding access to private attributes of other classes and using parameterized SQL queries to prevent potential injection vulnerabilities. Overall, this is a solid improvement to the codebase.

gemini-code-assist · 2026-02-11T10:12:31Z

lazyllm/tools/rag/doc_to_db/extractor.py

+            self._db_config = {
+                'db_type': sql_manager._db_type,
+                'user': getattr(sql_manager, '_user', None),
+                'password': getattr(sql_manager, '_password', None),
+                'host': getattr(sql_manager, '_host', None),
+                'port': getattr(sql_manager, '_port', None),
+                'db_name': getattr(sql_manager, '_db_name', None),
+                'options_str': getattr(sql_manager, '_options_str', None),
+            }


Accessing private attributes of SqlManager (e.g., _db_type, _user) makes SchemaExtractor tightly coupled to the internal implementation of SqlManager. This can lead to maintenance issues if SqlManager is refactored in the future.

Consider adding a public method to SqlManager, such as get_db_config(), to expose its configuration in a controlled way. This would improve encapsulation and make the code more robust and maintainable.
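A rough sketch of such an accessor follows. SqlManagerSketch is a hypothetical stand-in whose private attribute names follow the diff above; get_db_config is the method name proposed in this comment and is not confirmed to exist in SqlManager.

```python
from typing import Any, Dict, Optional


class SqlManagerSketch:
    """Stand-in for SqlManager; attribute names follow the diff above."""

    def __init__(self, db_type: str, user: Optional[str] = None,
                 password: Optional[str] = None, host: Optional[str] = None,
                 port: Optional[int] = None, db_name: Optional[str] = None,
                 options_str: Optional[str] = None):
        self._db_type = db_type
        self._user = user
        self._password = password
        self._host = host
        self._port = port
        self._db_name = db_name
        self._options_str = options_str

    def get_db_config(self) -> Dict[str, Any]:
        """Return a fresh dict so callers cannot mutate internal state."""
        return {
            'db_type': self._db_type,
            'user': self._user,
            'password': self._password,
            'host': self._host,
            'port': self._port,
            'db_name': self._db_name,
            'options_str': self._options_str,
        }


manager = SqlManagerSketch('sqlite', db_name='demo.db')
db_config = manager.get_db_config()
```

With an accessor like this, the extractor could store sql_manager.get_db_config() and would no longer depend on the manager's private attribute names.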

gemini-code-assist · 2026-02-11T10:12:31Z

lazyllm/tools/rag/document.py

+            raise ValueError(f'Invalid type for schema extractor: {type(self._schema_extractor)}')
+        if sql_manager == self._schema_extractor.sql_manager:
+            return self._schema_extractor
+        return SchemaExtractor(sql_manager=sql_manager, llm=self._schema_extractor._llm)


Accessing the private attribute _llm from the _schema_extractor object breaks encapsulation. It's better to expose it via a public property in the SchemaExtractor class.

For example, you could add @property def llm(self): return self._llm to SchemaExtractor.

Suggested change

-        return SchemaExtractor(sql_manager=sql_manager, llm=self._schema_extractor._llm)
+        return SchemaExtractor(sql_manager=sql_manager, llm=self._schema_extractor.llm)
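The property-based access can be sketched with a stand-in class. ExtractorSketch and extractor_for are hypothetical names, and plain strings stand in for the manager and model objects.

```python
class ExtractorSketch:
    """Stand-in for SchemaExtractor exposing read-only properties."""

    def __init__(self, sql_manager, llm=None):
        self._sql_manager = sql_manager
        self._llm = llm

    @property
    def llm(self):
        """Read-only access to the configured model."""
        return self._llm

    @property
    def sql_manager(self):
        return self._sql_manager


def extractor_for(current: ExtractorSketch, sql_manager) -> ExtractorSketch:
    """Reuse the current extractor when the manager matches, else build a new one."""
    if sql_manager == current.sql_manager:
        return current
    return ExtractorSketch(sql_manager=sql_manager, llm=current.llm)


base = ExtractorSketch(sql_manager='manager-a', llm='fake-llm')
same = extractor_for(base, 'manager-a')
other = extractor_for(base, 'manager-b')
```

Because the property defines no setter, callers can read the model but cannot reassign it from outside the class.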

gemini-code-assist · 2026-02-11T10:12:32Z

tests/charge_tests/Tools/test_schema_extractor.py

+        sql_manager.execute_query(
+            f'select * from {bind_table} where algo_id=\'{algo_id}\' limit 1'
+        )


The SQL query is constructed using an f-string, which can be vulnerable to SQL injection if algo_id contains malicious characters. While it seems safe in the context of this test where algo_id is controlled, it's a good practice to always use parameterized queries to prevent such vulnerabilities.

The sql_manager.execute_query method should ideally support query parameters. If it does, please refactor this to use them. For example, if it uses SQLAlchemy, it would look something like this:
from sqlalchemy import text

# In SqlManager.execute_query
session.execute(text('... WHERE algo_id = :algo_id'), {'algo_id': algo_id})

If execute_query doesn't support parameters, it would be a good improvement to add that capability for security.
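To make the point concrete without depending on SqlManager, the sketch below uses the standard-library sqlite3 module in place of SQLAlchemy. The helper fetch_bind_row, the table bind_demo, and its columns are all hypothetical. Because table names cannot be bound as parameters, the sketch validates the identifier separately.

```python
import re
import sqlite3

_IDENTIFIER = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')


def fetch_bind_row(conn: sqlite3.Connection, bind_table: str, algo_id: str):
    """Fetch one row for algo_id, binding the value instead of formatting it in."""
    # Table names cannot be bound as parameters, so validate them separately.
    if not _IDENTIFIER.fullmatch(bind_table):
        raise ValueError(f'unsafe table name: {bind_table!r}')
    cursor = conn.execute(
        f'SELECT * FROM {bind_table} WHERE algo_id = ? LIMIT 1', (algo_id,)
    )
    return cursor.fetchone()


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE bind_demo (algo_id TEXT, kb_id TEXT)')
conn.execute("INSERT INTO bind_demo VALUES ('algo-1', 'kb-1')")

row = fetch_bind_row(conn, 'bind_demo', 'algo-1')
# A classic injection string is treated as a plain value and matches nothing.
injected = fetch_bind_row(conn, 'bind_demo', "x' OR '1'='1")
```

The bound value never becomes part of the SQL text, so quoting inside algo_id cannot change the structure of the query.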

refactor: replace DocToDbProcessor with SchemaExtractor

df6cf9f

leon2526 requested a review from a team as a code owner February 10, 2026 12:00

gemini-code-assist bot reviewed Feb 10, 2026

View reviewed changes

liuqinda added 2 commits February 11, 2026 15:32

refactor: replace DocToDbProcessor with SchemaExtractor

79d5882

replace DocToDBProcessor with SchemaExtractor and fixed some mistake

ad14e6d

gemini-code-assist bot reviewed Feb 11, 2026

View reviewed changes

fix some error

14323da

fix lint

7a26eeb

leon2526 requested a deployment to protected February 11, 2026 10:08 — with GitHub Actions Waiting

gemini-code-assist bot reviewed Feb 11, 2026

View reviewed changes

