|
26 | 26 |
|
27 | 27 | </div> |
28 | 28 |
|
29 | | -# Blueprint title |
| 29 | +# Converting Documents to Markdown with Docling CLI |
30 | 30 |
|
31 | | -This blueprint guides you to ... |
| 31 | +This blueprint shows how to convert unstructured documents (PDF, DOCX, HTML, and others) to Markdown with the Docling command-line interface, with particular attention to OCR and image-handling options. |
32 | 32 |
|
| 33 | +## Pre-requisites |
33 | 34 |
|
| 35 | +- **System requirements**: |
| 36 | + - OS: Windows, macOS, or Linux |
| 37 | + - Python 3.10 or higher |
| 38 | + - Minimum RAM: 8GB |
| 39 | + - Disk space: 4GB for models and dependencies |
| 40 | + - GPU: optional |
| 41 | + |
| 42 | +- **Dependencies**: |
| 43 | + - Core Python dependencies are installed automatically with Docling; optional OCR engines (Tesseract, RapidOCR, OcrMac) need to be installed separately |
34 | 44 |
|
35 | 45 | ## Quick-start |
36 | 46 |
|
| 47 | +Install Docling using pip: |
| 48 | + |
| 49 | +```bash |
| 50 | +pip install docling |
| 51 | +``` |
| 52 | + |
| 53 | +Basic usage to convert a PDF to Markdown: |
| 54 | + |
| 55 | +```bash |
| 56 | +# Convert a local file |
| 57 | +docling path/to/document.pdf |
| 58 | + |
| 59 | +# Convert from a URL |
| 60 | +docling https://arxiv.org/pdf/2408.09869 |
| 61 | +``` |
| 62 | + |
| 63 | +For advanced OCR with multiple languages: |
| 64 | + |
| 65 | +```bash |
| 66 | +docling path/to/document.pdf --ocr-lang en,fr,de |
| 67 | +``` |
| 68 | + |
| 69 | +To use the SmolDocling Vision Language Model (VLM) pipeline: |
| 70 | + |
| 71 | +```bash |
| 72 | +docling path/to/document.pdf --pipeline vlm --vlm-model smoldocling |
| 73 | +``` |
37 | 74 |
|
38 | 75 | ## How it Works |
39 | 76 |
|
| 77 | +Docling is a document processing tool that parses many formats into a unified document representation. A conversion run from the CLI goes through these stages: |
40 | 78 |
|
41 | | -## Pre-requisites |
| 79 | +1. **Document Parsing**: Docling parses your document and extracts text, tables, images, and structure |
| 80 | +2. **Layout Analysis**: For PDFs, it analyzes page layout to determine reading order |
| 81 | +3. **OCR Processing**: For scanned documents, it applies OCR to extract text |
| 82 | +4. **Markdown Conversion**: The parsed document is converted to Markdown format |
| 83 | +5. **Image Handling**: Images can be embedded, referenced, or replaced with placeholders |
42 | 84 |
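The stages above map onto a handful of CLI flags. As a minimal sketch, the helper below assembles a command line from an input path, an OCR choice, and an output directory, and prints it instead of running it. `build_docling_cmd` is a hypothetical name introduced here for illustration and is not part of Docling; it uses only flags that appear elsewhere in this README and assumes paths without spaces.

```bash
# Hypothetical helper (not part of Docling): print the docling command
# that matches a few common choices. Assumes paths contain no spaces.
build_docling_cmd() {
  input="$1"            # file path or URL to convert
  ocr_langs="${2:-}"    # "none" disables OCR, empty keeps the default
  out_dir="${3:-.}"     # where the Markdown output is written
  cmd="docling $input --to md --output $out_dir"
  case "$ocr_langs" in
    none) cmd="$cmd --no-ocr" ;;                 # skip the OCR stage
    "")   ;;                                     # keep the default OCR settings
    *)    cmd="$cmd --ocr-lang $ocr_langs" ;;    # OCR with explicit languages
  esac
  printf '%s\n' "$cmd"
}

build_docling_cmd report.pdf en,fr ./out
# prints: docling report.pdf --to md --output ./out --ocr-lang en,fr
```

Printing the command first makes it easy to review the flags before running a long conversion.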
|
43 | | -- **System requirements**: |
44 | | - - OS: Windows, macOS, or Linux |
45 | | - - Python 3.10 or higher |
46 | | - - Minimum RAM: |
47 | | - - Disk space: |
| 85 | +### OCR Options |
48 | 86 |
|
49 | | -- **Dependencies**: |
50 | | - - Dependencies listed in `pyproject.toml` |
| 87 | +Docling supports multiple OCR engines: |
| 88 | + |
| 89 | +#### EasyOCR (Default) |
| 90 | + |
| 91 | +```bash |
| 92 | +# Specify languages |
| 93 | +docling path/to/document.pdf --ocr-lang en,fr,de |
| 94 | + |
| 95 | +# Disable OCR entirely |
| 96 | +docling path/to/document.pdf --no-ocr |
| 97 | +``` |
| 98 | + |
| 99 | +#### Tesseract OCR |
| 100 | + |
| 101 | +```bash |
| 102 | +docling path/to/document.pdf --ocr-engine tesseract |
| 103 | +``` |
| 104 | + |
| 105 | +#### RapidOCR |
51 | 106 |
|
| 107 | +```bash |
| 108 | +# Install RapidOCR first |
| 109 | +pip install rapidocr_onnxruntime |
| 110 | + |
| 111 | +# Then use it with Docling |
| 112 | +docling path/to/document.pdf --ocr-engine rapidocr |
| 113 | +``` |
| 114 | + |
| 115 | +#### OcrMac (macOS only) |
| 116 | + |
| 117 | +```bash |
| 118 | +# Install OcrMac first |
| 119 | +pip install ocrmac |
| 120 | + |
| 121 | +# Then use it with Docling |
| 122 | +docling path/to/document.pdf --ocr-engine ocrmac |
| 123 | +``` |
| 124 | + |
| 125 | +### VLM Pipeline with SmolDocling |
| 126 | + |
| 127 | +For complex documents, the Vision Language Model pipeline with [SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview) can provide better results: |
| 128 | + |
| 129 | +```bash |
| 130 | +docling path/to/document.pdf --pipeline vlm --vlm-model smoldocling |
| 131 | +``` |
| 132 | + |
| 133 | +On Apple Silicon Macs, this automatically uses MLX acceleration for better performance. |
| 134 | + |
| 135 | +### Image Embedding Options |
| 136 | + |
| 137 | +Control how images appear in your Markdown output: |
| 138 | + |
| 139 | +#### Embedded Images (Data URLs, Default) |
| 140 | + |
| 141 | +```bash |
| 142 | +docling path/to/document.pdf --image-export-mode embedded |
| 143 | +``` |
| 144 | + |
| 145 | +Embeds images directly in the Markdown file using Base64 encoding, creating a self-contained document. |
| 146 | + |
| 147 | +#### Referenced Images |
| 148 | + |
| 149 | +```bash |
| 150 | +docling path/to/document.pdf --image-export-mode referenced |
| 151 | +``` |
| 152 | + |
| 153 | +Saves images as separate files and references them using relative paths in the Markdown. |
| 154 | + |
| 155 | +#### Placeholder Images |
| 156 | + |
| 157 | +```bash |
| 158 | +docling path/to/document.pdf --image-export-mode placeholder |
| 159 | +``` |
| 160 | + |
| 161 | +Replaces images with placeholder text in the Markdown. |
| 162 | + |
| 163 | +### Batch Processing |
| 164 | + |
| 165 | +Convert multiple files at once: |
| 166 | + |
| 167 | +```bash |
| 168 | +docling ./documents/ --from pdf --to md --output ./markdown_files |
| 169 | +``` |
52 | 170 |
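When one bad file should not hide the results for the rest, converting files one at a time and reporting each outcome can help. The loop below is a minimal sketch under stated assumptions: `convert_each` is a hypothetical helper, and the `DOCLING` variable is an assumption introduced here so the command can be swapped out for a dry run.

```bash
# Hypothetical helper (not part of Docling): convert every PDF in a
# directory one by one and report which files succeeded or failed.
convert_each() {
  src_dir="$1"   # directory containing the PDFs
  out_dir="$2"   # directory for the Markdown output
  for f in "$src_dir"/*.pdf; do
    [ -e "$f" ] || continue   # the glob matched nothing
    # DOCLING defaults to the real CLI; override it for a dry run.
    if ${DOCLING:-docling} "$f" --to md --output "$out_dir"; then
      echo "ok: $f"
    else
      echo "failed: $f" >&2
    fi
  done
}
```

Setting `DOCLING=echo` before calling `convert_each ./documents ./markdown_files` prints the arguments each run would receive without converting anything.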
|
53 | 171 | ## Troubleshooting |
54 | 172 |
|
| 173 | +### OCR Issues |
| 174 | + |
| 175 | +If you encounter OCR problems: |
| 176 | + |
| 177 | +```bash |
| 178 | +# Try a different OCR engine |
| 179 | +docling path/to/document.pdf --ocr-engine tesseract |
| 180 | + |
| 181 | +# Force OCR over the full page, replacing any existing text layer |
| 182 | +docling path/to/document.pdf --force-ocr |
| 183 | +``` |
55 | 184 |
|
56 | 185 | ## License |
57 | 186 |
|
58 | 187 | This project is licensed under the Apache 2.0 License. See the [LICENSE](LICENSE) file for details. |
59 | 188 |
|
60 | 189 | ## Contributing |
61 | 190 |
|
62 | | -Contributions are welcome! To get started, you can check out the [CONTRIBUTING.md](CONTRIBUTING.md) file. |
| 191 | +Contributions are welcome! To get started, you can check out the [CONTRIBUTING.md](CONTRIBUTING.md) file. |