Commit 08bcc33

First version
1 parent bfb83a2 commit 08bcc33

1 file changed

+140
-11
lines changed

README.md

Lines changed: 140 additions & 11 deletions
@@ -26,37 +26,166 @@
2626

2727
</div>
2828

29-
# Blueprint title
29+
# Converting Documents to Markdown with Docling CLI
3030

31-
This blueprint guides you to ...
31+
This blueprint shows you how to convert unstructured documents (PDF, DOCX, HTML, etc.) to Markdown using the Docling command-line interface, with special attention to OCR capabilities and image handling options.
3232

33+
## Pre-requisites
3334

35+
- **System requirements**:
36+
- OS: Windows, macOS, or Linux
37+
- Python 3.10 or higher
38+
- Minimum RAM: 8GB
39+
- Disk space: 4GB for models and dependencies
40+
- GPU: optional
41+
42+
- **Dependencies**:
43+
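- Core Python dependencies are installed automatically with Docling; optional OCR engines such as RapidOCR and OcrMac need a separate `pip install`, and the Tesseract engine requires Tesseract to be installed on the system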
3444

3545
## Quick-start
3646

47+
Install Docling using pip:
48+
49+
```bash
50+
pip install docling
51+
```
52+
53+
Basic usage to convert a PDF to Markdown:
54+
55+
```bash
56+
# Convert a local file
57+
docling path/to/document.pdf
58+
59+
# Convert from a URL
60+
docling https://arxiv.org/pdf/2408.09869
61+
```
62+
63+
For advanced OCR with multiple languages:
64+
65+
```bash
66+
docling path/to/document.pdf --ocr-lang en,fr,de
67+
```
68+
69+
To use the SmolDocling Vision Language Model (VLM) pipeline:
70+
71+
```bash
72+
docling path/to/document.pdf --pipeline vlm --vlm-model smoldocling
73+
```
3774

3875
## How it Works
3976

77+
Docling is a document processing tool that parses various formats and provides a unified representation. The CLI simplifies access to its features:
4078

41-
## Pre-requisites
79+
1. **Document Parsing**: Docling parses your document and extracts text, tables, images, and structure
80+
2. **Layout Analysis**: For PDFs, it analyzes page layout to determine reading order
81+
3. **OCR Processing**: For scanned documents, it applies OCR to extract text
82+
4. **Markdown Conversion**: The parsed document is converted to Markdown format
83+
5. **Image Handling**: Images can be embedded, referenced, or replaced with placeholders
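
As a rough illustration of how the steps above map onto CLI flags, the sketch below only prints a command; it does not run Docling. The helper name `build_docling_cmd` and the sample file name are hypothetical, and the image flag is written as `--image-export-mode`, the name shown by `docling --help` in recent releases; verify it against your installed version.

```bash
# Hypothetical helper: print (not run) a docling command whose flags mirror the steps above.
build_docling_cmd() {
  local doc="$1" langs="${2:-en}" images="${3:-embedded}"
  # --from pdf         -> parsing and layout analysis of a PDF input
  # --ocr, --ocr-lang  -> OCR processing
  # --to md            -> Markdown conversion
  # --image-export-mode -> image handling
  printf 'docling %s --from pdf --ocr --ocr-lang %s --to md --image-export-mode %s\n' \
    "$doc" "$langs" "$images"
}

build_docling_cmd report.pdf en,fr referenced
```

Printing the command first makes it easy to review the flags before running a long conversion.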
4284

43-
- **System requirements**:
44-
- OS: Windows, macOS, or Linux
45-
- Python 3.10 or higher
46-
- Minimum RAM:
47-
- Disk space:
85+
### OCR Options
4886

49-
- **Dependencies**:
50-
- Dependencies listed in `pyproject.toml`
87+
Docling supports multiple OCR engines:
88+
89+
#### EasyOCR (Default)
90+
91+
```bash
92+
# Specify languages
93+
docling path/to/document.pdf --ocr-lang en,fr,de
94+
95+
# Disable OCR entirely
96+
docling path/to/document.pdf --no-ocr
97+
```
98+
99+
#### Tesseract OCR
100+
101+
```bash
102+
docling path/to/document.pdf --ocr-engine tesseract
103+
```
104+
105+
#### RapidOCR
51106

107+
```bash
108+
# Install RapidOCR first
109+
pip install rapidocr_onnxruntime
110+
111+
# Then use it with Docling
112+
docling path/to/document.pdf --ocr-engine rapidocr
113+
```
114+
115+
#### OcrMac (macOS only)
116+
117+
```bash
118+
# Install OcrMac first
119+
pip install ocrmac
120+
121+
# Then use it with Docling
122+
docling path/to/document.pdf --ocr-engine ocrmac
123+
```
124+
125+
### VLM Pipeline with SmolDocling
126+
127+
For complex documents, the Vision Language Model pipeline with [SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview) can provide better results:
128+
129+
```bash
130+
docling path/to/document.pdf --pipeline vlm --vlm-model smoldocling
131+
```
132+
133+
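On Apple Silicon Macs, this pipeline can use MLX acceleration for better performance; depending on your Docling version, this may require the `mlx-vlm` package to be installed (`pip install mlx-vlm`).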
134+
135+
### Image Embedding Options
136+
137+
Control how images appear in your Markdown output:
138+
139+
#### Embedded Images (Data URLs)
140+
141+
```bash
142+
docling path/to/document.pdf --image-export-mode embedded
143+
```
144+
145+
Embeds images directly in the Markdown file using Base64 encoding, creating a self-contained document.
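
To make the embedded form concrete, this minimal sketch builds a Base64 data URL by hand. The helper `to_data_url` and the stand-in payload are illustrative assumptions and not part of Docling, which generates the equivalent string for each image itself.

```bash
# Hypothetical helper (not part of Docling): wrap a file's bytes in a data URL.
to_data_url() {
  local mime="$1" file="$2"
  # tr removes the line wrapping that some base64 implementations add
  printf 'data:%s;base64,%s' "$mime" "$(base64 < "$file" | tr -d '\n')"
}

# Demo with a tiny stand-in payload instead of a real image file
tmp="$(mktemp)"
printf 'hello' > "$tmp"
data_url="$(to_data_url image/png "$tmp")"
echo "![Image]($data_url)"   # prints ![Image](data:image/png;base64,aGVsbG8=)
rm -f "$tmp"
```

Because Base64 expands data by roughly a third, embedded mode trades a larger Markdown file for having no external image files to keep track of.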
146+
147+
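#### Referenced Images
148+
149+
```bash
150+
docling path/to/document.pdf --image-export-mode referenced
151+
```
152+
153+
Saves images as separate files and references them using relative paths in the Markdown. Recent Docling releases default to `embedded` rather than `referenced`, so pass this flag explicitly when you want separate image files.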
154+
155+
#### Placeholder Images
156+
157+
```bash
158+
docling path/to/document.pdf --image-export-mode placeholder
159+
```
160+
161+
Replaces images with placeholder text in the Markdown.
162+
163+
### Batch Processing
164+
165+
Convert multiple files at once:
166+
167+
```bash
168+
docling ./documents/ --from pdf --to md --output ./markdown_files
169+
```
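
When a directory contains a few problematic files, it can help to convert one file at a time so a single failure does not stop the whole run. The sketch below is one way to do that; `convert_dir` and the `DOCLING_CMD` override are hypothetical names introduced here, not Docling features.

```bash
# Sketch: convert PDFs one by one and report failures instead of aborting.
# DOCLING_CMD is a hypothetical override; set it to "echo docling" for a dry run.
DOCLING_CMD="${DOCLING_CMD:-docling}"

convert_dir() {
  local in_dir="$1" out_dir="$2" failed=0 f
  mkdir -p "$out_dir"
  for f in "$in_dir"/*.pdf; do
    [ -e "$f" ] || continue              # skip when the glob matches nothing
    if ! $DOCLING_CMD "$f" --to md --output "$out_dir"; then
      echo "FAILED: $f" >&2
      failed=$((failed + 1))
    fi
  done
  echo "$failed file(s) failed" >&2
  [ "$failed" -eq 0 ]                    # non-zero exit status if anything failed
}
```

A call such as `convert_dir ./documents ./markdown_files` then mirrors the batch command above, with a per-file failure report on stderr.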
52170

53171
## Troubleshooting
54172

173+
### OCR Issues
174+
175+
If you encounter OCR problems:
176+
177+
```bash
178+
# Try a different OCR engine
179+
docling path/to/document.pdf --ocr-engine tesseract
180+
181+
# Force OCR on the entire page
182+
docling path/to/document.pdf --force-ocr
183+
```
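
If you are unsure which engine suits a document, the commands above can be chained into a simple fallback. This is a minimal sketch: `convert_with_fallback` and `DOCLING_CMD` are hypothetical names, the engine order is arbitrary, and it assumes `docling` exits with a non-zero status when a conversion fails.

```bash
# Sketch: try OCR engines in turn until one conversion succeeds.
# DOCLING_CMD is a hypothetical override, useful for dry runs and testing.
DOCLING_CMD="${DOCLING_CMD:-docling}"

convert_with_fallback() {
  local doc="$1" engine
  for engine in easyocr tesseract rapidocr; do
    if $DOCLING_CMD "$doc" --ocr-engine "$engine"; then
      echo "converted with $engine" >&2
      return 0
    fi
    echo "engine $engine failed, trying next" >&2
  done
  return 1                               # every engine failed
}
```

Run it as `convert_with_fallback path/to/document.pdf`; engines that are not installed simply fail and the loop moves on to the next one.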
55184

56185
## License
57186

58187
This project is licensed under the Apache 2.0 License. See the [LICENSE](LICENSE) file for details.
59188

60189
## Contributing
61190

62-
Contributions are welcome! To get started, you can check out the [CONTRIBUTING.md](CONTRIBUTING.md) file.
191+
Contributions are welcome! To get started, you can check out the [CONTRIBUTING.md](CONTRIBUTING.md) file.
