🤝 This Blueprint was a result of an EleutherAI <> mozilla.ai collaboration, as part of their work on Open Datasets for LLM Training.
Parse and convert Documents with Docling
This blueprint shows how to convert various unstructured documents (PDFs, DOCX, HTML, etc.) to Markdown or other formats using the Docling CLI or a locally hosted demo UI, with special attention to OCR capabilities and image-handling options.
- Quick-start
- How it Works
- Features & Configuration
- Hardware requirements
- Troubleshooting
- License
- Contributing
We have built a simple graphical demo of Docling to showcase some basic functionality. For the full set of features, use the Docling CLI, which is covered after the demo instructions.
To run the demo locally, first clone the repository:
git clone https://github.com/mozilla-ai/document-to-markdown.git
Then, navigate to the directory, create a virtual environment and install the requirements:
cd document-to-markdown/demo
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Finally, run the demo:
python app.py
This starts a local server; you can access the demo at http://127.0.0.1:7860.
Install Docling using pip:
pip install docling
Basic usage to convert a PDF to Markdown:
# Convert a local file
docling path/to/document.pdf
# Convert from a URL
docling https://arxiv.org/pdf/2408.09869
For advanced OCR with multiple languages:
docling path/to/document.pdf --ocr-lang en,fr,de
Docling is a document processing tool that parses various formats and provides a unified representation. The CLI simplifies access to its features:
- Document Parsing: Docling parses your document and extracts text, tables, images, and structure
- Layout Analysis: For PDFs, it analyzes page layout to determine reading order
- OCR Processing: For scanned documents, it applies OCR to extract text
- Markdown Conversion: The parsed document is converted to Markdown format
- Image Handling: Images can be embedded, referenced, or replaced with placeholders
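The steps above can also be driven from Python instead of the CLI. The following is a minimal sketch, assuming Docling's `DocumentConverter` class and the `export_to_markdown()` method on the converted document; the `convert_to_markdown` helper and its `converter` parameter are hypothetical names introduced here for illustration.

```python
def convert_to_markdown(source, converter=None):
    """Parse `source` (a local path or URL) and return it as Markdown text.

    `converter` is a hypothetical injection point: any object with a
    `convert(source)` method whose result exposes `.document`.
    """
    if converter is None:
        # Imported lazily so this module loads even without Docling installed.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    # Parsing, layout analysis, and OCR all happen inside convert().
    result = converter.convert(source)

    # The unified document representation is then exported as Markdown.
    return result.document.export_to_markdown()
```

With Docling installed, `convert_to_markdown("path/to/document.pdf")` should return the Markdown as a string. The first call may take a while if models need to be downloaded.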
Note: These are only a few samples of the full set of features of Docling! Visit https://github.com/docling-project/docling for an up-to-date list of all the features and configurations.
Docling supports multiple OCR engines:
# Specify languages
docling path/to/document.pdf --ocr-lang en,fr,de
# Disable OCR entirely
docling path/to/document.pdf --no-ocr
# Use the Tesseract engine
docling path/to/document.pdf --ocr-engine tesseract
# Install RapidOCR first
pip install rapidocr_onnxruntime
# Then use it with Docling
docling path/to/document.pdf --ocr-engine rapidocr
# Install OcrMac first
pip install ocrmac
# Then use it with Docling
docling path/to/document.pdf --ocr-engine ocrmac
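When the OCR settings come from a script or a config file, it can help to assemble the flags in one place. This is a small sketch that only builds the argument list from the flags shown above (`--no-ocr`, `--ocr-engine`, `--ocr-lang`); the helper names `ocr_args` and `docling_command` are hypothetical.

```python
def ocr_args(enabled=True, engine=None, languages=None):
    """Return the OCR-related CLI arguments as a list of strings."""
    if not enabled:
        # --no-ocr turns OCR off, so engine and language settings are moot.
        return ["--no-ocr"]

    args = []
    if engine:
        args += ["--ocr-engine", engine]
    if languages:
        # The CLI takes one comma-separated value, e.g. "en,fr,de".
        args += ["--ocr-lang", ",".join(languages)]
    return args


def docling_command(source, **ocr_options):
    """Build the full command line for converting one document."""
    return ["docling", source, *ocr_args(**ocr_options)]
```

The resulting list can be passed to `subprocess.run`. Language codes are engine-specific, so a value that works for one engine may need adjusting for another.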
With the VLM pipeline, a Vision Language Model (here SmolDocling) can describe images:
docling path/to/document.pdf --pipeline vlm --vlm-model smoldocling
We can also use EfficientNet-B0 Document Image Classifier to classify images:
docling path/to/document.pdf --enrich-picture-classes
# Enable the code enrichment model
docling path/to/document.pdf --enrich-code
# Enable the formula enrichment model
docling path/to/document.pdf --enrich-formula
On Apple Silicon Macs, the VLM pipeline automatically uses MLX acceleration for better performance.
Control how images appear in your Markdown output:
docling path/to/document.pdf --image-export-mode embedded
Embeds images directly in the Markdown file using Base64 encoding, creating a self-contained document.
docling path/to/document.pdf --image-export-mode referenced
Saves images as separate files and references them using relative paths in the Markdown.
docling path/to/document.pdf --image-export-mode placeholder
Replaces images with placeholder text in the Markdown.
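To make the difference between the three modes concrete, the sketch below shows roughly what each one produces for a single image. It is an illustration only, not Docling's own code: the `image_markdown` helper, the `Image` alt text, the default file name, and the `<!-- image -->` placeholder string are all assumptions, and Docling's actual output may differ in detail.

```python
import base64


def image_markdown(image_bytes, mode, filename="figure.png"):
    """Return a Markdown snippet for one PNG image in the given mode."""
    if mode == "embedded":
        # The image travels inside the Markdown file as a Base64 data URI.
        payload = base64.b64encode(image_bytes).decode("ascii")
        return f"![Image](data:image/png;base64,{payload})"
    if mode == "referenced":
        # The image lives in a separate file; Markdown holds a relative path.
        return f"![Image]({filename})"
    if mode == "placeholder":
        # No image data at all, only a marker where the image used to be.
        return "<!-- image -->"
    raise ValueError(f"unknown image mode: {mode!r}")
```

Embedded output is portable but grows by roughly a third over the raw image size because of Base64 encoding, while referenced output keeps the Markdown small at the cost of managing extra files.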
Convert multiple files at once:
docling ./documents/ --from pdf --to md --output ./markdown_files
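If one problematic file should not interrupt the rest of a batch, an alternative is to call the CLI once per file and collect the failures. The following is a sketch under that assumption, using only the `--to` and `--output` flags shown above; `convert_each` and its `run` parameter are hypothetical names, with `run` defaulting to `subprocess.run`.

```python
import subprocess
from pathlib import Path


def convert_each(input_dir, output_dir, run=subprocess.run):
    """Convert every PDF in `input_dir` to Markdown, one CLI call per file.

    Returns the list of input paths whose conversion exited with an error.
    """
    failures = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        cmd = ["docling", str(pdf), "--to", "md", "--output", str(output_dir)]
        completed = run(cmd)
        if completed.returncode != 0:
            # Remember the failure and carry on with the remaining files.
            failures.append(pdf)
    return failures
```

For example, `convert_each("./documents", "./markdown_files")` processes the same directory as the command above and returns the files that need a second look. The single-command form is simpler and avoids restarting Docling for each document, so prefer it when every input is expected to convert cleanly.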
- OS: Windows, macOS, or Linux
- Python 3.10 or higher
- Minimum RAM: 8GB
- Disk space: 4GB for models and dependencies
- GPU: optional
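Two of these requirements can be checked from Python before installing anything. This is a rough preflight sketch using only the standard library; it covers the Python version and free disk space, and leaves RAM out because the standard library has no portable way to query it. The `check_requirements` helper and its parameters are hypothetical.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)      # "Python 3.10 or higher"
MIN_FREE_DISK_GB = 4      # "4GB for models and dependencies"


def check_requirements(path=".", version_info=None, free_bytes=None):
    """Return a list of human-readable problems; empty means all checks pass."""
    if version_info is None:
        version_info = sys.version_info
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free

    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append("Python 3.10 or higher is required")
    if free_bytes < MIN_FREE_DISK_GB * 1024**3:
        problems.append("at least 4 GB of free disk space is required")
    return problems
```

Calling `check_requirements()` with no arguments inspects the running interpreter and the current directory's disk.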
If you encounter OCR problems:
# Try a different OCR engine
docling path/to/document.pdf --ocr-engine tesseract
# Force OCR on the entire page
docling path/to/document.pdf --force-ocr
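Trying engines one after another can be automated. The sketch below runs the CLI with each engine named in this document until one exits successfully; `first_working_engine` and its `run` parameter are hypothetical names, and the engine order is an arbitrary assumption.

```python
import subprocess

# Engines mentioned in this document, tried in this (arbitrary) order.
ENGINES = ("tesseract", "rapidocr", "ocrmac")


def first_working_engine(source, engines=ENGINES, run=subprocess.run):
    """Return the first OCR engine whose conversion exits cleanly, else None."""
    for engine in engines:
        completed = run(["docling", source, "--ocr-engine", engine])
        if completed.returncode == 0:
            return engine
    return None
```

A zero exit code only shows that the engine ran, for instance that it is installed and available on this platform. It says nothing about recognition quality, so the output still needs a manual check.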
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
Contributions are welcome! To get started, you can check out the CONTRIBUTING.md file.