feat: integrate structured extraction, multimodal pipeline and role-b… by MrGidea · Pull Request #2799 · HKUDS/LightRAG

MrGidea · 2026-03-19T13:21:14Z

Description

This PR integrates a complete multimodal document pipeline with role-based LLM routing into LightRAG, building on the earlier JSON structured extraction work. It replaces synchronous API-layer document extraction with a three-stage pipeline, adds per-role model isolation for all four LLM roles (extract/keyword/query/VLM), and implements full DOCX multimodal support, including image extraction, paragraph position tracking, and a structured interchange format.

Changes Made

1. JSON Structured Entity Extraction

Replace delimiter-based extraction with JSON structured output for improved robustness
Support native JSON mode for OpenAI (response_format: json_object), Ollama (format="json"), and Gemini (response_mime_type="application/json")
Provider fallback logic when native JSON formatting is unsupported
Backward compatibility via ENTITY_EXTRACTION_USE_JSON (default: true)
Auto-detect JSON vs delimiter format during cache rebuild
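
The provider options and the format auto-detection listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `NATIVE_JSON_KWARGS`, `json_mode_kwargs`, and `detect_extraction_format` are hypothetical names, and the detection rule (anything that parses as a JSON object or array counts as JSON) is an assumption.

```python
import json

# Request options for native JSON mode, as named in this PR description.
# The dictionary itself and its keys are illustrative, not LightRAG API.
NATIVE_JSON_KWARGS = {
    "openai": {"response_format": {"type": "json_object"}},
    "ollama": {"format": "json"},
    "gemini": {"response_mime_type": "application/json"},
}


def json_mode_kwargs(provider: str) -> dict:
    """Return native JSON-mode options, or {} to signal a prompt-only fallback."""
    return dict(NATIVE_JSON_KWARGS.get(provider, {}))


def detect_extraction_format(raw: str) -> str:
    """Guess whether a cached LLM response is JSON or delimiter-based text."""
    try:
        parsed = json.loads(raw.strip())
    except json.JSONDecodeError:
        return "delimiter"
    # A bare scalar such as 42 is valid JSON but not a structured extraction.
    return "json" if isinstance(parsed, (dict, list)) else "delimiter"
```

Returning an empty dict for unknown providers lets the caller fall back to prompt-level JSON instructions without special-casing each backend.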

2. Three-Stage Multimodal Document Pipeline

PARSE: Structured DOCX extraction with heading-aware semantic chunking, table structure preservation, and OMML formula extraction
ANALYZE: VLM-based multimodal analysis with sidecar writeback for drawings, tables, and equations
PROCESS: Entity/relation extraction from text and multimodal chunks, graph and vector store construction
Pipeline status tracking: PENDING → PARSING → ANALYZING → PROCESSING → PROCESSED/FAILED
Parser engine routing via LIGHTRAG_PARSER env var and filename hints (supports native/mineru/docling)
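
The status flow above behaves like a small state machine. The sketch below is one way to model it, assuming that any non-terminal stage may fail; `DocStage`, `ALLOWED`, and `advance` are hypothetical names and do not correspond to identifiers in the PR.

```python
from enum import Enum


class DocStage(str, Enum):
    PENDING = "PENDING"
    PARSING = "PARSING"
    ANALYZING = "ANALYZING"
    PROCESSING = "PROCESSING"
    PROCESSED = "PROCESSED"
    FAILED = "FAILED"


# Allowed transitions. Assumption: every non-terminal stage may go to FAILED.
ALLOWED = {
    DocStage.PENDING: {DocStage.PARSING, DocStage.FAILED},
    DocStage.PARSING: {DocStage.ANALYZING, DocStage.FAILED},
    DocStage.ANALYZING: {DocStage.PROCESSING, DocStage.FAILED},
    DocStage.PROCESSING: {DocStage.PROCESSED, DocStage.FAILED},
    DocStage.PROCESSED: set(),
    DocStage.FAILED: set(),
}


def advance(current: DocStage, new: DocStage) -> DocStage:
    """Return the new stage, or raise if the transition skips a stage."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```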

3. DOCX Upload Deferred to Pipeline

.docx uploads no longer do synchronous content extraction at the API layer
Files are enqueued with pending_parse format; the pipeline's parse_native() handles structured extraction
Supports parse_docx_to_interchange_jsonl with heading semantics, para_id positions, and image asset extraction

4. Interchange JSONL Enhancements

positions field now populated with paraid entries from Word w14:paraId attributes
engine_capabilities includes "i" when embedded images are detected
asset_dir flag set when *.blocks.assets directory is created
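
For readers unfamiliar with `w14:paraId`, the following sketch shows how such IDs can be read from the `word/document.xml` part of a DOCX using only the standard library. It is not the PR's parser; `collect_para_ids` is a hypothetical helper, and the function expects the raw XML bytes of that part.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespaces; w14 is the Word 2010 extension that carries paraId.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W14_NS = "http://schemas.microsoft.com/office/word/2010/wordml"


def collect_para_ids(document_xml: bytes) -> list[str]:
    """Return the w14:paraId of every paragraph that has one, in document order."""
    root = ET.fromstring(document_xml)
    ids = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        para_id = paragraph.get(f"{{{W14_NS}}}paraId")
        if para_id:  # older documents may lack the attribute
            ids.append(para_id)
    return ids
```

Paragraphs without the attribute are skipped rather than given a synthetic ID, so the resulting list only contains values that can be traced back to the source file.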

5. Image Binary Extraction & Assets Directory

New extract_docx_images() extracts embedded images via doc.part.rels relationship API
Images written to *.blocks.assets/ directory alongside interchange JSONL
_extract_drawing_info() extended to return r:embed relationship ID
Paragraph dataclass extended with drawing_rIds for image-paragraph association
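
As a rough illustration of writing embedded images to an assets directory, the sketch below reads the `word/media/` entries directly from the DOCX zip container. Note that this swaps in a different technique from the PR, which resolves images through the `doc.part.rels` relationship API; `extract_media` and its parameters are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_media(docx_path: Path, assets_dir: Path) -> list[Path]:
    """Copy every file under word/media/ in the DOCX into assets_dir."""
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            # Create the directory lazily so image-free documents leave no trace.
            assets_dir.mkdir(parents=True, exist_ok=True)
            # Keep only the base name to avoid writing outside assets_dir.
            target = assets_dir / Path(name).name
            target.write_bytes(archive.read(name))
            written.append(target)
    return written
```

Unlike the relationship-based approach, scanning the zip cannot tell which paragraph references which image, which is why the PR tracks `r:embed` IDs per paragraph.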

6. Role-Based LLM/VLM Routing

Four independent roles: extract, keyword, query, vlm — each with its own function, concurrency queue, timeout, and provider options
Runtime reconfiguration via update_llm_role_config() with atomic rollback on failure
Per-role provider option overrides via options_dict_for_role() (e.g., EXTRACT_OPENAI_LLM_TEMPERATURE)
Ollama role functions now explicitly bind role_model instead of falling back to global config
Cross-provider kwargs isolation: base ollama kwargs won't pollute openai role calls
Ollama API /generate and /chat bypass paths use query_llm_model_kwargs
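
The rollback and per-role override behavior can be pictured with the sketch below. It is an assumption-laden illustration, not the PR's `update_llm_role_config()` or `options_dict_for_role()`: `RoleConfig`, `RoleRegistry`, and `role_option` are hypothetical, the default values are placeholders, and the `{ROLE}_{PROVIDER}_LLM_{OPTION}` naming with fallback to the unprefixed variable is inferred from the single example `EXTRACT_OPENAI_LLM_TEMPERATURE`.

```python
from dataclasses import dataclass, replace
from typing import Callable, Mapping, Optional

ROLES = ("extract", "keyword", "query", "vlm")


@dataclass(frozen=True)
class RoleConfig:
    model: str
    timeout: float = 180.0  # placeholder default, not taken from the PR
    max_async: int = 4      # placeholder default, not taken from the PR


class RoleRegistry:
    """Holds one independent config per role and restores it on failed updates."""

    def __init__(self, configs: Mapping[str, RoleConfig]):
        self._configs = dict(configs)

    def get(self, role: str) -> RoleConfig:
        return self._configs[role]

    def update(self, role: str, rebuild: Callable[[RoleConfig], None], **changes) -> RoleConfig:
        if role not in ROLES:
            raise KeyError(role)
        previous = self._configs[role]
        candidate = replace(previous, **changes)
        self._configs[role] = candidate
        try:
            rebuild(candidate)  # e.g. re-create the client for this role
        except Exception:
            self._configs[role] = previous  # roll back, other roles untouched
            raise
        return candidate


def role_option(env: Mapping[str, str], role: str, provider: str, option: str) -> Optional[str]:
    """Prefer ROLE_PROVIDER_LLM_OPTION, then fall back to PROVIDER_LLM_OPTION."""
    base = f"{provider}_LLM_{option}".upper()
    return env.get(f"{role.upper()}_{base}", env.get(base))
```

Because `RoleConfig` is frozen, an update always produces a new object, so restoring the previous config is a single assignment.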

7. Relation Merge Robustness

Defensive timeout handling for relation VDB upserts
Fine-grained logging around entity/relation upsert stages
Improved observability for edge-processing waits and pending tasks
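
Defensive timeout handling around an async upsert typically looks something like the sketch below. It assumes the upsert is an awaitable callable; `upsert_with_timeout`, its parameters, and the logger name are hypothetical and not taken from the PR.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable

logger = logging.getLogger("relation_merge_sketch")


async def upsert_with_timeout(
    upsert: Callable[[Any], Awaitable[None]],
    payload: Any,
    timeout_s: float,
) -> bool:
    """Run one upsert, returning False instead of hanging if it exceeds timeout_s."""
    try:
        await asyncio.wait_for(upsert(payload), timeout=timeout_s)
    except asyncio.TimeoutError:
        logger.warning("relation upsert timed out after %.2fs", timeout_s)
        return False
    return True
```

Returning a boolean leaves the retry or fail decision to the caller, so a single slow vector-store write does not stall the rest of the merge stage.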

Test Results

Offline Test Suite

315 passed, 1 skipped, 0 failed (48.97s)

End-to-End Test with Real Multimodal DOCX

Tested with Chapter 2 of a real academic paper (79 paragraphs, 6 tables, 6 embedded images, 25 OMML formulas):

| Verification Point | Result |
| --- | --- |
| `extraction_format` | `interchange_jsonl` (not `legacy`) |
| `format_version` | `2.0` |
| `engine_capabilities` | `["t", "i"]` (tables + images) |
| `positions` field | Contains `paraid` entries (e.g., `["1D6D83BF", "68ED75D1"]`) |
| `*.blocks.assets` directory | Created with 6 extracted PNG images |
| Formula Q&A | Correctly explained formulas (2-1) and (2-2) for self-attention |
| Table Q&A | Correctly listed 8 comparison features from Table 2-1 |
| Image Q&A | Correctly described all 6 figures with structural details |
| Image detail Q&A | Described CLIP model diagram components: dual encoders, N×N similarity matrix, arrows, color scheme |

Sample Q&A: Image Structure Details

Question: "What specific visual elements and components does Figure 2-1 contain? Please describe the structural layout of this figure in detail."

Answer (abridged):

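Figure 2-1 is themed "CLIP model structure and contrastive learning" and uses a layout split into upper and lower regions with left-right symmetry. The upper part shows the model structure (a dual-tower design with the "image encoder" on the left and the "text encoder" on the right); the lower part shows the N×N similarity matrix and the symmetric cross-entropy loss. A thick vertical arrow points from the model-structure region to the matrix and is labeled "compute cosine similarity"; a looping dashed arrow labeled "backpropagation optimization" closes the training loop. The image side uses blue tones and the text side uses green tones...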

Related Issues

Supersedes closed PR #2684, "feat: Entity extraction uses JSON structured output instead of delimiter-based text"

Checklist

Changes tested locally
Pre-commit checks pass (ruff-format, ruff, trailing-whitespace, end-of-file, requirements-txt-fixer)
315 offline tests pass
End-to-end multimodal DOCX test verified (upload → parse → extract → query)
Unit tests added for role isolation, runtime updates, rollback, provider options, Ollama kwargs

…ased LLM routing
- JSON structured entity extraction with native mode for OpenAI/Ollama/Gemini
- Three-stage multimodal pipeline (PARSE → ANALYZE → PROCESS) with docx support
- Deferred .docx upload to pipeline via pending_parse format
- Interchange JSONL with para_id positions and *.blocks.assets image extraction
- Per-role LLM/VLM routing (extract/keyword/query/vlm) with runtime updates
- Role-specific provider options, cross-provider kwargs isolation
- Relation merge robustness with VDB upsert timeouts
- 460 offline tests pass

Made-with: Cursor

danielaskdd · 2026-03-20T06:57:48Z

This PR overwrote code from the main branch, so I have resubmitted it as a new PR, #2807.

danielaskdd added the "tracked" label (Issue is tracked by project) Mar 19, 2026

danielaskdd closed this Mar 20, 2026

MrGidea wants to merge 1 commit into HKUDS:dev from MrGidea:feat/multimodal-pipeline-dev
