High-performance Python libraries for PDF processing, data extraction, and LLM document pipelines.
This organisation maintains Python libraries for working with PDF and other document formats — from low-level manipulation to LLM-ready data extraction.
PyMuPDF — core library
The foundation of everything here. PyMuPDF wraps the MuPDF C engine and exposes a full Python API for reading, rendering, editing, and converting PDF, XPS, EPUB, MOBI, CBZ, and image files.
```bash
pip install pymupdf
```

```python
import pymupdf

doc = pymupdf.open("report.pdf")
page = doc[0]
print(page.get_text())           # extract text
pix = page.get_pixmap(dpi=150)   # render to image
pix.save("page.png")
```

Key capabilities: text and image extraction · page rendering at any DPI · annotation create/edit/delete · redaction · PDF creation and merging · encryption · OCR via Tesseract · form fields · 10+ output formats
→ Documentation · PyPI · Changelog
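Rendering at a chosen DPI is a scaling from the 72 points per inch that PDF uses as its base unit. The following is a minimal sketch, not an official recipe: `pixel_size` and `render_all` are hypothetical helper names, and the output file naming is an assumption made for illustration.

```python
from pathlib import Path

PDF_POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def pixel_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Estimate the pixel dimensions of a page rendered at the given DPI."""
    return (
        round(width_pt * dpi / PDF_POINTS_PER_INCH),
        round(height_pt * dpi / PDF_POINTS_PER_INCH),
    )


def render_all(pdf_path: str, out_dir: str, dpi: int = 150) -> list[Path]:
    """Render every page of a document to PNG and return the written paths."""
    import pymupdf  # imported lazily so pixel_size works without PyMuPDF installed

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)
            # page.number is 0-based; file names here are 1-based (an arbitrary choice)
            target = out / f"page-{page.number + 1:03d}.png"
            pix.save(str(target))
            written.append(target)
    return written
```

By this estimate, a US Letter page (612 × 792 pt) rendered at 150 DPI is about 1275 × 1650 pixels; the pixmap PyMuPDF actually produces may differ by a pixel because of rounding.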
pymupdf4llm — PDF → LLM-ready data
Turn any document into clean, structured data for RAG pipelines, vector stores, and LLM ingestion — in one line. No GPU, no cloud, no tokens required.
```bash
pip install pymupdf4llm
```

```python
import pymupdf4llm

md = pymupdf4llm.to_markdown("paper.pdf")   # Markdown
data = pymupdf4llm.to_json("paper.pdf")     # JSON with bboxes
text = pymupdf4llm.to_text("paper.pdf")     # plain text
```

Key capabilities: layout-aware extraction · multi-column reading order · table detection → Markdown · smart hybrid OCR (only where needed) · page chunking with metadata · LlamaIndex and LangChain integrations · 10–250× cheaper than vision-LLM approaches
→ Documentation · PyPI · Live demo
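For vector-store ingestion, page chunking is usually the starting point. The sketch below assumes that `to_markdown(..., page_chunks=True)` returns one dict per page with a `text` field and a `metadata` dict. The metadata key names used here (`page`, `file_path`) are assumptions that may vary between versions, so the code reads them defensively. `chunks_to_records` and `load_records` are hypothetical helpers, not part of pymupdf4llm.

```python
def chunks_to_records(chunks: list[dict]) -> list[dict]:
    """Flatten per-page chunk dicts into simple records for indexing."""
    records = []
    for index, chunk in enumerate(chunks):
        text = (chunk.get("text") or "").strip()
        if not text:
            continue  # skip pages that produced no text
        meta = chunk.get("metadata") or {}
        # Fall back to the position in the list if no page number is present
        page = meta.get("page", index + 1)
        source = meta.get("file_path", "document")
        records.append({"id": f"{source}#p{page}", "page": page, "text": text})
    return records


def load_records(path: str) -> list[dict]:
    """Extract a document page by page and convert it to records."""
    import pymupdf4llm  # imported lazily; requires `pip install pymupdf4llm`

    return chunks_to_records(pymupdf4llm.to_markdown(path, page_chunks=True))
```

Each record carries a stable `id` and page number, which makes it easier to cite the source page when a retrieved chunk is shown to a user.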
langchain-pymupdf4llm — LangChain integration
A drop-in LangChain document loader and parser backed by pymupdf4llm. Extracts PDF content as Markdown and feeds it directly into any LangChain retrieval chain.
```bash
pip install langchain-pymupdf4llm
```

```python
from langchain_pymupdf4llm import PyMuPDF4LLMLoader

loader = PyMuPDF4LLMLoader("document.pdf", mode="single")
docs = loader.load()
```

→ PyPI
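Before handing documents to a retriever, it can help to check what the loader produced. The sketch below relies only on the `page_content` and `metadata` attributes that LangChain documents carry. The `source` metadata key and the `mode="page"` option are assumptions here, and `chars_per_source` and `inspect_pdf` are hypothetical helpers.

```python
from collections import Counter
from types import SimpleNamespace


def chars_per_source(docs) -> dict:
    """Total the extracted characters per source file."""
    totals = Counter()
    for doc in docs:
        totals[doc.metadata.get("source", "unknown")] += len(doc.page_content)
    return dict(totals)


def inspect_pdf(path: str) -> dict:
    """Load a PDF one document per page and summarise the extracted text."""
    from langchain_pymupdf4llm import PyMuPDF4LLMLoader  # lazy import

    loader = PyMuPDF4LLMLoader(path, mode="page")  # assumed per-page mode
    return chars_per_source(loader.lazy_load())


# Stand-in objects for demonstration; real ones come from loader.load()
sample = [
    SimpleNamespace(page_content="# Intro", metadata={"source": "a.pdf"}),
    SimpleNamespace(page_content="Body text", metadata={"source": "a.pdf"}),
    SimpleNamespace(page_content="", metadata={}),
]
totals = chars_per_source(sample)
```

A source with a total of zero characters is a hint that the file is image-only and may need OCR.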
pymupdf4llm-mcp — MCP server
An MCP (Model Context Protocol) server exposing pymupdf4llm as a tool. Gives any MCP-compatible AI client (Claude, Cursor, Windsurf) direct access to PDF-to-Markdown extraction.
```bash
uvx pymupdf4llm-mcp@latest stdio
```

→ PyPI
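Many MCP clients register servers through a JSON file containing an `mcpServers` map. The sketch below generates such an entry from the command shown above. The server label `pymupdf4llm` is an arbitrary name chosen for illustration, and the file location and exact schema vary by client, so treat this as a starting point and check the client's own documentation.

```python
import json


def mcp_client_config(server_name: str = "pymupdf4llm") -> dict:
    """Build a client config entry that launches the server over stdio."""
    return {
        "mcpServers": {
            server_name: {
                "command": "uvx",
                "args": ["pymupdf4llm-mcp@latest", "stdio"],
            }
        }
    }


config_json = json.dumps(mcp_client_config(), indent=2)
print(config_json)  # paste the output into the client's MCP configuration file
```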
PyMuPDF-Utilities — demos & examples
A collection of working example scripts, Jupyter notebooks, and GUI demos built on PyMuPDF. Covers image handling, annotation workflows, data extraction patterns, and more — useful as a recipe book alongside the official documentation.
pymupdf-fonts — optional font collection
An optional font package extending the fonts available for text output in PyMuPDF. Includes additional Unicode-compatible typefaces beyond the 14 standard PDF fonts.
```bash
pip install pymupdf-fonts
```

| Format | PyMuPDF | pymupdf4llm |
|---|---|---|
| PDF (all versions, encrypted) | ✅ | ✅ |
| XPS / OpenXPS | ✅ | ✅ |
| EPUB / MOBI / FB2 | ✅ | ✅ |
| CBZ / CBT (comic book) | ✅ | — |
| Images (PNG, JPG, TIFF…) | ✅ | ✅ (with OCR) |
| DOCX / XLSX / PPTX / HWP | — | ✅ (Pro only) |
| SVG | ✅ (limited) | — |
All repositories in this organisation are available under the GNU AGPL v3 for open-source use, and under a commercial licence for proprietary applications. Commercial licences are available from Artifex Software — the creators and maintainers of MuPDF, PyMuPDF, and this organisation.
Maintained by Artifex Software, Inc. · pymupdf.io