feat: integrate structured extraction, multimodal pipeline and role-b…#2799
Closed
…ased LLM routing

- JSON structured entity extraction with native mode for OpenAI/Ollama/Gemini
- Three-stage multimodal pipeline (PARSE → ANALYZE → PROCESS) with docx support
- Deferred .docx upload to pipeline via pending_parse format
- Interchange JSONL with para_id positions and *.blocks.assets image extraction
- Per-role LLM/VLM routing (extract/keyword/query/vlm) with runtime updates
- Role-specific provider options, cross-provider kwargs isolation
- Relation merge robustness with VDB upsert timeouts
- 460 offline tests pass

Made-with: Cursor
Collaborator
This PR overwrote code on the main branch, so I have resubmitted the change as a new PR, #2807.
Description
This PR integrates a complete multimodal document pipeline with role-based LLM routing into LightRAG, building on the earlier JSON structured extraction work. It replaces synchronous API-layer document extraction with a proper three-stage pipeline, adds per-role model isolation for all four LLM roles (extract/keyword/query/VLM), and implements full DOCX multimodal support including image extraction, paragraph position tracking, and structured interchange format.
Changes Made
1. JSON Structured Entity Extraction
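A minimal sketch of how the provider-specific JSON-mode switches named in this section could be selected in one place. The helper `json_mode_kwargs` and the exact shape of each options dict are assumptions made for illustration; they are not taken from the PR's code.

```python
def json_mode_kwargs(provider: str, use_json: bool = True) -> dict:
    """Return illustrative request options that ask a provider for JSON output.

    `use_json` stands in for a switch such as ENTITY_EXTRACTION_USE_JSON.
    """
    if not use_json:
        return {}
    # Hypothetical mapping; how each option is passed to the real client
    # (top-level kwarg, generation config, ...) depends on the SDK in use.
    options = {
        "openai": {"response_format": {"type": "json_object"}},
        "ollama": {"format": "json"},
        "gemini": {"response_mime_type": "application/json"},
    }
    try:
        return dict(options[provider.lower()])
    except KeyError:
        raise ValueError(f"no native JSON mode known for provider {provider!r}") from None
```

Keeping the mapping in a single function means a caller never has to know which keyword a given provider expects, and an unsupported provider fails loudly instead of silently returning free-form text.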
- Native JSON mode for OpenAI (`response_format: json_object`), Ollama (`format="json"`), and Gemini (`response_mime_type="application/json"`)
- `ENTITY_EXTRACTION_USE_JSON` (default: true)

2. Three-Stage Multimodal Document Pipeline
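The status flow listed in this section (PENDING → PARSING → ANALYZING → PROCESSING → PROCESSED/FAILED) can be pictured as a small state machine. The sketch below is illustrative only: the `DocStatus` enum, its string values, and the rule that any in-flight status may move to FAILED are assumptions, not the PR's implementation.

```python
from enum import Enum


class DocStatus(str, Enum):
    # Names follow the status flow in this PR; the string values are assumed.
    PENDING = "pending"
    PARSING = "parsing"
    ANALYZING = "analyzing"
    PROCESSING = "processing"
    PROCESSED = "processed"
    FAILED = "failed"


# Happy-path order: each status may only advance to the one that follows it.
_HAPPY_PATH = [
    DocStatus.PENDING,
    DocStatus.PARSING,
    DocStatus.ANALYZING,
    DocStatus.PROCESSING,
    DocStatus.PROCESSED,
]
_NEXT = dict(zip(_HAPPY_PATH, _HAPPY_PATH[1:]))
_TERMINAL = {DocStatus.PROCESSED, DocStatus.FAILED}


def advance(current: DocStatus, target: DocStatus) -> DocStatus:
    """Validate a status change and return the new status."""
    if current in _TERMINAL:
        raise ValueError(f"{current.name} is terminal")
    # Assumption: a document can fail from any non-terminal status.
    if target is DocStatus.FAILED or _NEXT.get(current) is target:
        return target
    raise ValueError(f"illegal transition {current.name} -> {target.name}")
```

For example, `advance(DocStatus.PENDING, DocStatus.PARSING)` succeeds, while trying to jump from `PENDING` straight to `PROCESSING` raises `ValueError`.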
- PENDING → PARSING → ANALYZING → PROCESSING → PROCESSED/FAILED
- `LIGHTRAG_PARSER` env var and filename hints (supports `native`/`mineru`/`docling`)

3. DOCX Upload Deferred to Pipeline
- `.docx` uploads no longer do synchronous content extraction at the API layer
- `pending_parse` format; the pipeline's `parse_native()` handles structured extraction
- `parse_docx_to_interchange_jsonl` with heading semantics, para_id positions, and image asset extraction

4. Interchange JSONL Enhancements
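To illustrate where the `paraid` values in the `positions` field come from, here is a small sketch that reads `w14:paraId` attributes from WordprocessingML using only the standard library. The function `collect_para_ids` and the inline sample XML are hypothetical; the PR's own parser may work differently (for instance through python-docx).

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML namespaces.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W14_NS = "http://schemas.microsoft.com/office/word/2010/wordml"


def collect_para_ids(document_xml: str) -> list[str]:
    """Return the w14:paraId of every <w:p> that carries one, in document order."""
    root = ET.fromstring(document_xml)
    ids = []
    for para in root.iter(f"{{{W_NS}}}p"):
        para_id = para.get(f"{{{W14_NS}}}paraId")
        if para_id:  # paragraphs written by older tools may lack the attribute
            ids.append(para_id)
    return ids


# Tiny stand-in for word/document.xml; the IDs reuse the example from this PR.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}" xmlns:w14="{W14_NS}"><w:body>'
    '<w:p w14:paraId="1D6D83BF"><w:r><w:t>First</w:t></w:r></w:p>'
    '<w:p w14:paraId="68ED75D1"><w:r><w:t>Second</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>No id</w:t></w:r></w:p>'
    '</w:body></w:document>'
)

print(collect_para_ids(SAMPLE))  # ['1D6D83BF', '68ED75D1']
```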
- `positions` field now populated with `paraid` entries from Word `w14:paraId` attributes
- `engine_capabilities` includes `"i"` when embedded images are detected
- `asset_dir` flag set when `*.blocks.assets` directory is created

5. Image Binary Extraction & Assets Directory
- `extract_docx_images()` extracts embedded images via `doc.part.rels` relationship API
- `*.blocks.assets/` directory alongside interchange JSONL
- `_extract_drawing_info()` extended to return `r:embed` relationship ID
- `Paragraph` dataclass extended with `drawing_rIds` for image-paragraph association

6. Role-Based LLM/VLM Routing
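The example variable name in this section, `EXTRACT_OPENAI_LLM_TEMPERATURE`, suggests a `<ROLE>_<PROVIDER>_LLM_<OPTION>` naming pattern for role-specific provider options. The sketch below shows one way such a lookup could work. The helper `role_option_from_env` and the fallback to an unprefixed `<PROVIDER>_LLM_<OPTION>` variable are assumptions, not the behaviour of `options_dict_for_role()`.

```python
import os
from typing import Mapping, Optional


def role_option_from_env(
    role: str,
    provider: str,
    option: str,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Look up a provider option, preferring the role-specific variable.

    Hypothetical helper: tries <ROLE>_<PROVIDER>_LLM_<OPTION> first, then
    the unprefixed <PROVIDER>_LLM_<OPTION> (the fallback is an assumption).
    """
    env = os.environ if env is None else env
    base_key = f"{provider}_LLM_{option}".upper()
    role_key = f"{role}_{base_key}".upper()
    if role_key in env:
        return env[role_key]
    return env.get(base_key)
```

Passing the environment as a mapping keeps the helper easy to test: with `{"EXTRACT_OPENAI_LLM_TEMPERATURE": "0.1", "OPENAI_LLM_TEMPERATURE": "0.7"}`, the `extract` role resolves to `"0.1"` while the `query` role resolves to `"0.7"`.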
- `extract`, `keyword`, `query`, `vlm`: each with its own function, concurrency queue, timeout, and provider options
- `update_llm_role_config()` with atomic rollback on failure
- `options_dict_for_role()` (e.g., `EXTRACT_OPENAI_LLM_TEMPERATURE`)
- `role_model` instead of falling back to global config
- `ollama` kwargs won't pollute `openai` role calls
- `/generate` and `/chat` bypass paths use `query_llm_model_kwargs`

7. Relation Merge Robustness
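The commit message summarizes this item as "relation merge robustness with VDB upsert timeouts". One common way to bound a slow vector-database write is to wrap it in `asyncio.wait_for`, as in the sketch below. The function `upsert_with_timeout` and its return convention are hypothetical; what the PR actually does on timeout (skip, retry, or fail the merge) is not stated here.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def upsert_with_timeout(
    upsert: Callable[[Any], Awaitable[Any]],
    payload: Any,
    timeout_s: float = 5.0,
) -> bool:
    """Run an async upsert, giving up after `timeout_s` seconds.

    Returns True when the upsert finished in time, False when it timed out.
    `upsert` stands in for any async vector-database write call.
    """
    try:
        await asyncio.wait_for(upsert(payload), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        # wait_for cancels the pending upsert before raising, so the caller
        # can decide how to proceed without a task left running.
        return False
```

Returning a boolean rather than re-raising lets the caller keep merging the remaining relations and report the slow writes afterwards, which is one possible reading of "robustness" in this context.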
Test Results
Offline Test Suite
End-to-End Test with Real Multimodal DOCX
Tested with Chapter 2 of a real academic paper (79 paragraphs, 6 tables, 6 embedded images, 25 OMML formulas):
- `extraction_format`: `interchange_jsonl` (not `legacy`)
- `format_version`: `2.0`
- `engine_capabilities`: `["t", "i"]` (tables + images)
- `positions` field: `paraid` entries (e.g., `["1D6D83BF", "68ED75D1"]`)
- `*.blocks.assets` directory

Sample Q&A: Image Structure Details
Question: "What specific visual elements and components does Figure 2-1 contain? Please describe the structural layout of this figure in detail."
Answer (abridged):
Related Issues
Checklist