Parse and convert Documents with Docling

Blueprints Hub | Documentation | Contributing

🤝 This Blueprint was a result of an EleutherAI <> mozilla.ai collaboration, as part of their work on Open Datasets for LLM Training.

This blueprint guides you through converting unstructured documents (PDF, DOCX, HTML, etc.) to Markdown or other formats using the Docling CLI or a locally hosted demo UI, with special attention to OCR capabilities and image handling options.

Quick-start

We have built a simple graphical demo of Docling to showcase some basic functionality. For the full set of features, see the section Local CLI for the full Docling experience! You can try the demo in two ways:

HF Spaces Demo

Local Demo

You can also run the demo locally. First, clone the repository:

git clone https://github.com/mozilla-ai/document-to-markdown.git

Then, navigate to the directory, create a virtual environment and install the requirements:

cd document-to-markdown/demo
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Finally, run the demo:

python app.py

This will start a local server, and you can access the demo at http://127.0.0.1:7860.

Local CLI for the full Docling experience!

Install Docling using pip:

pip install docling

Basic usage to convert a PDF to Markdown:

# Convert a local file
docling path/to/document.pdf

# Convert from a URL
docling https://arxiv.org/pdf/2408.09869

For advanced OCR with multiple languages:

docling path/to/document.pdf --ocr-lang en,fr,de
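When the CLI is driven from a script, the commands above can be assembled programmatically. The following is a minimal sketch, assuming the `docling` executable is on your `PATH`. The helper names `build_docling_command` and `run_docling` are hypothetical, and the only option used is the `--ocr-lang` flag shown above.

```python
import subprocess


def build_docling_command(source, ocr_langs=None):
    """Return the argument list for one `docling` CLI call.

    `source` is a local path or a URL. `ocr_langs` is an optional list of
    language codes, joined into the comma-separated form used above.
    """
    command = ["docling", source]
    if ocr_langs:
        command += ["--ocr-lang", ",".join(ocr_langs)]
    return command


def run_docling(source, ocr_langs=None):
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_docling_command(source, ocr_langs), check=True)
```

Passing the arguments as a list, rather than a single shell string, avoids quoting problems with paths that contain spaces.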

How it Works

Docling is a document processing tool that parses various formats and provides a unified representation. The CLI simplifies access to its features:

Document Parsing: Docling parses your document and extracts text, tables, images, and structure
Layout Analysis: For PDFs, it analyzes page layout to determine reading order
OCR Processing: For scanned documents, it applies OCR to extract text
Markdown Conversion: The parsed document is converted to Markdown format
Image Handling: Images can be embedded, referenced, or replaced with placeholders
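The same steps can also be driven from Python instead of the CLI. The sketch below is illustrative only: it relies on Docling's Python API (`DocumentConverter`, `convert`, `export_to_markdown`), which this README does not otherwise cover, so check it against the Docling documentation for your installed version. The helper names `markdown_path_for` and `convert_to_markdown` are hypothetical.

```python
from pathlib import Path


def markdown_path_for(source: str, output_dir: str = ".") -> Path:
    """Derive an output file name such as `report.md` from `report.pdf`."""
    return Path(output_dir) / (Path(source).stem + ".md")


def convert_to_markdown(source: str, output_dir: str = ".") -> Path:
    """Parse `source` with Docling and write its Markdown export to disk."""
    # Imported lazily so the helper above works without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()       # default pipeline options
    result = converter.convert(source)    # parse the document
    markdown = result.document.export_to_markdown()

    target = markdown_path_for(source, output_dir)
    target.write_text(markdown, encoding="utf-8")
    return target
```

Calling `convert_to_markdown("path/to/document.pdf")` would write `document.md` to the current directory, assuming Docling and its models are installed.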

Features & Configuration

Note: These are only a few samples of the full set of features of Docling! Visit https://github.com/docling-project/docling for an up-to-date list of all the features and configurations.

OCR Options

Docling supports multiple OCR engines:

EasyOCR (Default)

# Specify languages
docling path/to/document.pdf --ocr-lang en,fr,de

# Disable OCR entirely
docling path/to/document.pdf --no-ocr

Tesseract OCR

docling path/to/document.pdf --ocr-engine tesseract

RapidOCR

# Install RapidOCR first
pip install rapidocr_onnxruntime

# Then use it with Docling
docling path/to/document.pdf --ocr-engine rapidocr

OcrMac (macOS only)

# Install OcrMac first
pip install ocrmac

# Then use it with Docling
docling path/to/document.pdf --ocr-engine ocrmac

Parse Images with SmolDocling

The VLM pipeline uses a vision-language model, here SmolDocling, to describe images:

docling path/to/document.pdf --pipeline vlm --vlm-model smoldocling

We can also use the EfficientNet-B0 Document Image Classifier to classify images:

docling path/to/document.pdf --enrich-picture-classes

Parse Code

docling path/to/document.pdf --enrich-code

Parse Formulas

docling path/to/document.pdf --enrich-formula

On Apple Silicon Macs, this automatically uses MLX acceleration for better performance.

Image Embedding Options

Control how images appear in your Markdown output:

Embedded Images (Data URLs)

docling path/to/document.pdf --image-export-mode embedded

Embeds images directly in the Markdown file using Base64 encoding, creating a self-contained document.
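To make the idea of an embedded image concrete, here is a minimal sketch of how a Base64 data URL is formed and placed in a Markdown image link. It illustrates the general data-URL technique, not Docling's internal code; the function names `to_data_url` and `markdown_image` are hypothetical.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a `data:` URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def markdown_image(image_bytes: bytes, alt_text: str = "Image") -> str:
    """Build a Markdown image tag whose target is the embedded data URL."""
    return f"![{alt_text}]({to_data_url(image_bytes)})"
```

Base64 represents every 3 bytes as 4 characters, so embedded images make the Markdown file roughly a third larger than the images themselves.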

Referenced Images

docling path/to/document.pdf --image-export-mode referenced

Saves images as separate files and references them using relative paths in the Markdown.

Placeholder Images

docling path/to/document.pdf --image-export-mode placeholder

Replaces images with placeholder text in the Markdown.

Batch Processing

Convert multiple files at once:

docling ./documents/ --from pdf --to md --output ./markdown_files
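After a batch run it can be useful to check that every input produced an output. The sketch below assumes each input file maps to a Markdown file with the same stem in the output directory (for example `report.pdf` to `report.md`); that naming is an assumption to verify against your own output. The helper names `plan_batch` and `missing_outputs` are hypothetical.

```python
from pathlib import Path


def plan_batch(input_dir, output_dir, suffix=".pdf"):
    """Pair each matching input file with its expected Markdown path."""
    sources = sorted(Path(input_dir).glob(f"*{suffix}"))
    return [(src, Path(output_dir) / f"{src.stem}.md") for src in sources]


def missing_outputs(input_dir, output_dir, suffix=".pdf"):
    """Return the input files whose expected Markdown file does not exist."""
    return [src for src, dst in plan_batch(input_dir, output_dir, suffix)
            if not dst.exists()]
```

For the command above, `missing_outputs("./documents", "./markdown_files")` would list the PDFs that still need attention.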

Hardware requirements:

OS: Windows, macOS, or Linux
Python 3.10 or higher
Minimum RAM: 8GB
Disk space: 4GB for models and dependencies
GPU: optional

Troubleshooting

OCR Issues

If you encounter OCR problems:

# Try a different OCR engine
docling path/to/document.pdf --ocr-engine tesseract

# Force OCR on the entire page
docling path/to/document.pdf --force-ocr
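Trying engines one after another can be scripted. This is a minimal sketch, assuming the `docling` executable is on your `PATH` and that the listed engines are installed. The function name `first_working_engine` and its `run` parameter are hypothetical; `run` exists only so the command runner can be swapped out.

```python
import subprocess


def first_working_engine(source, engines=("tesseract", "rapidocr"),
                         run=subprocess.run):
    """Try each OCR engine in order; return the first that exits with 0.

    Returns None if every engine fails.
    """
    for engine in engines:
        completed = run(["docling", source, "--ocr-engine", engine])
        if completed.returncode == 0:
            return engine
    return None
```

An exit code of 0 only shows that the conversion finished; it says nothing about recognition quality, so inspect the resulting Markdown before settling on an engine.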

License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.

Contributing

Contributions are welcome! To get started, you can check out the CONTRIBUTING.md file.
