Make some optimizations to the Document module by wzh1994 · Pull Request #1009 · LazyAGI/LazyLLM

wzh1994 · 2026-02-04T08:04:29Z

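📌 PR Description

Make SchemaExtractor inherit from ModuleBase so that its associated models are started together with it. Also allow SchemaExtractor to be set separately for each Document when multiple Documents reuse the same DocListManager.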

✅ Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Refactor (no functionality change, code structure optimized)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update (changes to docs only)
Performance optimization

🧪 How Has This Been Tested?

Unit tests

gemini-code-assist · 2026-02-04T08:04:46Z

Summary of Changes

Hello @wzh1994, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refines the Document module's architecture by integrating the SchemaExtractor more deeply into the lazyllm module system. This change enhances the manageability and associated startup of schema extraction processes within the RAG framework, providing greater flexibility in how document schemas are handled. Additionally, it introduces a new RichDocNode type and optimizes metadata handling for better code clarity and potential performance gains.

Highlights

SchemaExtractor Module Integration: The SchemaExtractor class now inherits from ModuleBase, allowing it to be managed and integrated more seamlessly within the lazyllm module system, enabling associated startup and lifecycle management.
Method Renaming for ModuleBase Compliance: The primary execution method in SchemaExtractor has been renamed from __call__ to forward, aligning with the standard interface for ModuleBase subclasses.
RichDocNode Introduction: A new document node type, RichDocNode, has been introduced and made available for use within the RAG module.
Enhanced Schema Extractor Management in Document Module: The Document module now registers schema_extractor instances as submodules and ensures they are properly passed down to DocImpl when adding new knowledge base groups, improving the flexibility and configurability of schema extraction.
Optimized Metadata Handling: The metadata initialization logic in parsing_service/impl.py has been refactored for improved conciseness and efficiency, using itertools.repeat for default metadata generation.
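The `__call__` to `forward` rename and the submodule registration described in the highlights can be illustrated with a small stand-in. This is only a sketch: `ModuleBaseSketch`, `register_submodule`, `start`, and both subclasses are hypothetical names written for this illustration, and they are not the lazyllm `ModuleBase` API.

```python
class ModuleBaseSketch:
    """Hypothetical stand-in for a module base class (not lazyllm's ModuleBase)."""

    def __init__(self):
        self.submodules = []
        self.started = False

    def register_submodule(self, module):
        # Only track objects that are themselves modules; ignore None or plain values.
        if isinstance(module, ModuleBaseSketch):
            self.submodules.append(module)

    def start(self):
        # Starting a parent also starts every registered child ("associated startup").
        for sub in self.submodules:
            sub.start()
        self.started = True
        return self

    def __call__(self, *args, **kwargs):
        # The base class owns __call__ and delegates to forward().
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError


class SchemaExtractorSketch(ModuleBaseSketch):
    def forward(self, text):
        # Placeholder logic: a real extractor would call a model here.
        return {"length": len(text)}


class DocumentSketch(ModuleBaseSketch):
    def __init__(self, schema_extractor=None):
        super().__init__()
        self.schema_extractor = schema_extractor
        self.register_submodule(schema_extractor)

    def forward(self, text):
        if self.schema_extractor is None:
            return {}
        return self.schema_extractor(text)


extractor = SchemaExtractorSketch()
doc = DocumentSketch(schema_extractor=extractor).start()
result = doc("hello")
```

In this sketch, a subclass that overrode `__call__` directly would bypass whatever the base class does around `forward`, which is one plausible reason for the rename.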

Changelog

lazyllm/tools/rag/__init__.py
- Imported RichDocNode from .doc_node.
- Added RichDocNode to the module's __all__ export list.
lazyllm/tools/rag/doc_to_db/extractor.py
- Imported ModuleBase from lazyllm.module.
- Modified SchemaExtractor to inherit from ModuleBase.
- Renamed the __call__ method to forward within SchemaExtractor.
lazyllm/tools/rag/document.py
- Added _schema_extractor as a registered submodule in the Document._Manager constructor.
- Introduced a new private helper method _register_submodules to centralize the registration of ModuleBase instances.
- Updated add_kb_group to accept and pass a schema_extractor argument to the DocImpl constructor.
- Removed the direct assignment of _schema_extractor in the Document class's __init__ method, now passing it through the _Manager constructor.
lazyllm/tools/rag/parsing_service/impl.py
- Imported repeat from itertools.
- Refactored the metadata initialization logic in the add_doc method to use itertools.repeat and dictionary merging for more efficient and cleaner code.
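The metadata refactor in the last changelog item can be sketched as follows. Only the comprehension pattern is taken from the PR diff; the constant values and the `build_metadatas` function name are assumptions made for this example.

```python
from itertools import repeat

# Assumed placeholder values; the real constants live in the lazyllm code base.
RAG_DOC_ID = "docid"
RAG_DOC_PATH = "doc_path"
RAG_KB_ID = "kb_id"
DEFAULT_KB_ID = "__default__"


def build_metadatas(ids, input_files, metadatas=None, kb_id=None):
    """Hypothetical helper mirroring the merge pattern from the PR."""
    temp_metas = [
        {RAG_DOC_ID: doc_id, RAG_DOC_PATH: path, RAG_KB_ID: kb_id or DEFAULT_KB_ID}
        for doc_id, path in zip(ids, input_files)
    ]
    # repeat({}) is infinite, but zip stops at the shorter iterable (temp_metas).
    # Every merge builds a new dict, so the shared {} from repeat is never mutated.
    return [{**temp, **metadata}
            for metadata, temp in zip(metadatas or repeat({}), temp_metas)]


defaults_only = build_metadatas(["a", "b"], ["/x/a.txt", "/x/b.txt"])
with_user = build_metadatas(["a"], ["/x/a.txt"], [{"author": "me", RAG_KB_ID: "kb1"}])
```

Because the user-supplied dictionary is unpacked last, its keys override the generated defaults, including the knowledge base ID.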

Activity

The author indicates that unit tests were performed to verify the changes.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

The pull request successfully integrates SchemaExtractor as a ModuleBase, which is a positive step towards better module management and lifecycle control within the lazyllm framework. The refactoring of submodule registration into a dedicated _register_submodules method enhances code reusability and maintains consistency. The changes also include a more concise approach to metadata handling in the parsing service, with an identified improvement for handling potential None values in metadata merging. Overall, these changes contribute to a more robust and maintainable codebase.

gemini-code-assist · 2026-02-04T08:07:50Z

lazyllm/tools/rag/parsing_service/impl.py

-                metadata.setdefault(RAG_KB_ID, kb_id or DEFAULT_KB_ID)
+            temp_metas = [{RAG_DOC_ID: doc_id, RAG_DOC_PATH: path, RAG_KB_ID: kb_id or DEFAULT_KB_ID}
+                          for doc_id, path in zip(ids, input_files)]
+            metadatas = [{**temp, **(metadata)} for metadata, temp in zip(metadatas or repeat({}), temp_metas)]


The current metadata merging logic might raise a TypeError if individual items within the metadatas list are None. For example, if metadatas is [None, {'user_key': 'user_value'}], the dictionary unpacking **(metadata) will fail for the None item. It's safer to ensure that metadata is always a dictionary before unpacking it.

Suggested change

- metadatas = [{**temp, **(metadata)} for metadata, temp in zip(metadatas or repeat({}), temp_metas)]
+ metadatas = [{**temp, **(metadata or {})} for metadata, temp in zip(metadatas or repeat({}), temp_metas)]
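The failure mode and the suggested guard can be reproduced in isolation. This is a sketch with made-up data; `merge` and its `guard` flag are hypothetical and exist only to compare the two variants side by side.

```python
from itertools import repeat


def merge(temp_metas, metadatas=None, guard=False):
    """Hypothetical helper: guard=True applies the reviewer's `metadata or {}` fix."""
    return [{**temp, **((metadata or {}) if guard else metadata)}
            for metadata, temp in zip(metadatas or repeat({}), temp_metas)]


temp_metas = [{"docid": "a"}, {"docid": "b"}]
user_metas = [None, {"user_key": "user_value"}]

try:
    merge(temp_metas, user_metas)
    failed = False
except TypeError:
    # Unpacking None with ** raises TypeError because None is not a mapping.
    failed = True

fixed = merge(temp_metas, user_metas, guard=True)
```

With the guard in place, a `None` entry simply contributes no extra keys, so the first document keeps only its generated defaults.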

fix schema_extractor bug

e772a42

wzh1994 requested a review from a team as a code owner February 4, 2026 08:04

wzh1994 had a problem deploying to protected February 4, 2026 08:06 — with GitHub Actions Error

gemini-code-assist bot reviewed Feb 4, 2026


wzh1994 added 2 commits February 6, 2026 21:50

temp

9c714de

Merge remote-tracking branch 'origin/main' into wzh/doc_process

3c417a3

wzh1994 requested a deployment to protected February 12, 2026 07:07 — with GitHub Actions Waiting

wzh1994 wants to merge 3 commits into LazyAGI:main from wzh1994:wzh/doc_process
