Use pymupdf and pymupdf4llm to parse PDF, support tables and images #376

xiaohuangpin · 2025-10-23T01:15:30Z

Pull Request

Description

The original PDF parser did not support image extraction. This new PDF parser uses the PyMuPDF library, which provides robust support for both text and images, and converts tables into Markdown tables.
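The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: it assumes PyMuPDF is installed (`pip install pymupdf`; older releases are imported as `fitz`) and relies on `Page.get_text`, `Page.find_tables`, `Table.to_markdown`, `Page.get_images`, and `Document.extract_image`. The function names `parse_pdf` and `image_markdown`, the `images` directory, and the file naming scheme are hypothetical.

```python
from pathlib import Path


def image_markdown(page_no: int, index: int, ext: str, image_dir: str = "images") -> str:
    """Build a Markdown image reference (hypothetical naming scheme)."""
    return f"![page {page_no} image {index}]({image_dir}/page{page_no}_img{index}.{ext})"


def parse_pdf(pdf_path: str, image_dir: str = "images") -> str:
    """Return Markdown containing page text, tables, and links to extracted images."""
    # Imported lazily so this module loads even when PyMuPDF is not installed.
    import pymupdf

    out_dir = Path(image_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    with pymupdf.open(pdf_path) as doc:
        for page_no, page in enumerate(doc, start=1):
            # Plain text of the page; note that this also includes the text of table cells.
            parts.append(page.get_text().strip())
            # Detected tables, rendered as Markdown tables.
            for table in page.find_tables().tables:
                parts.append(table.to_markdown())
            # Embedded images: the first item of each entry is the image xref.
            for index, info in enumerate(page.get_images(full=True), start=1):
                image = doc.extract_image(info[0])
                name = f"page{page_no}_img{index}.{image['ext']}"
                (out_dir / name).write_bytes(image["image"])
                parts.append(image_markdown(page_no, index, image["ext"], image_dir))
    return "\n\n".join(p for p in parts if p)
```

Calling `parse_pdf("sample.pdf")` (with a hypothetical input file) would write the embedded images under `images/` and return one Markdown string for downstream chunking.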

Type of Change

[ ] 🐛 Bug fix
[✔] ✨ New feature
[ ] 💥 Breaking change
[ ] 📚 Documentation update
[ ] 🎨 Code refactoring
[ ] ⚡ Performance improvement
[ ] 🧪 Test related
[ ] 🔧 Configuration change
[ ] 🐳 Docker related
[ ] 🎨 Frontend UI/UX

Scope

[ ] Backend API
[ ] Frontend UI
[ ] Database
[✔] Document Reader Service
[ ] MCP Server
[ ] Docker Configuration
[ ] Configuration

Testing

[ ] Unit tests
[ ] Integration tests
[✔] Manual testing
[ ] Frontend testing
[ ] API testing

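Checklist

[ ] Code follows the project's coding standards
[✔] Self-review of the code has been performed
[ ] Appropriate comments have been added to the code changes
[ ] Relevant documentation has been updated
[ ] Changes do not generate new warnings
[ ] Test cases have been added to show that the fix is effective or the feature works
[ ] New features and changes have been reflected in the relevant documentation
[ ] Breaking changes are clearly stated in the description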

Screenshots/Recordings

Database Migration

[ ] Database migration required
[✔] No database migration required

begoniezhao · 2025-10-23T10:20:57Z

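Dear xiaohuangpin,
Thank you very much for submitting this MR. The implementation is concise and efficient, it effectively addresses a real pain point with PDF parsing, and it is impressive work. Unfortunately, we cannot merge it into this project for now.
The reason is that this project is released under the MIT license, while pymupdf4llm, which your change uses, is licensed under AGPL 3.0. The two licenses are incompatible, so we cannot use it directly. That said, we always welcome and look forward to other high-quality open-source implementations from you that meet this project's licensing requirements.
Thank you again for your effort and support. We look forward to further discussion and collaboration.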
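A license concern like this one can be caught before review with a small dependency check. The sketch below is only an illustration under stated assumptions, not part of this project: the deny-list `COPYLEFT_MARKERS` and the function names `is_blocked` and `blocked_packages` are hypothetical, and the check only reads the license fields and classifiers that installed packages declare through the standard `importlib.metadata` module, so it is no substitute for a proper license review.

```python
from importlib import metadata

# Hypothetical deny-list for an MIT-licensed project; adjust to the actual policy.
COPYLEFT_MARKERS = ("AGPL", "Affero")


def is_blocked(license_text: str) -> bool:
    """Return True if the declared license text mentions a deny-listed license."""
    lowered = license_text.lower()
    return any(marker.lower() in lowered for marker in COPYLEFT_MARKERS)


def blocked_packages(names):
    """Map each installed package in `names` with a deny-listed license to its declared license."""
    blocked = {}
    for name in names:
        try:
            meta = metadata.metadata(name)
        except metadata.PackageNotFoundError:
            continue  # not installed, nothing to inspect
        # Collect every place a package may declare its license.
        fields = [meta.get("License") or "", meta.get("License-Expression") or ""]
        fields += meta.get_all("Classifier") or []
        declared = "; ".join(f for f in fields if f)
        if is_blocked(declared):
            blocked[name] = declared
    return blocked
```

For example, `blocked_packages(["pymupdf", "pymupdf4llm"])` would report those packages if they are installed and declare an AGPL license in their metadata.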

is911 · 2025-12-10T15:41:32Z

@xiaohuangpin Use Docling instead. It's MIT license and quite advanced.

begoniezhao · 2025-12-15T03:20:45Z

@is911 Welcome to contribute to the Docling version implementation. Pull requests are welcome.

use pymupdf and pymupdf4llm parse pdf,suppost table and image

843376c

3 participants