diff --git a/docs/examples/agent_skill/docling-document-intelligence/EXAMPLE.md b/docs/examples/agent_skill/docling-document-intelligence/EXAMPLE.md new file mode 100644 index 0000000000..b6993b1646 --- /dev/null +++ b/docs/examples/agent_skill/docling-document-intelligence/EXAMPLE.md @@ -0,0 +1,99 @@ +# Using the Docling agent skill + +[Agent Skills](https://agentskills.io/specification) are folders of instructions that AI coding agents (Cursor, Claude Code, GitHub Copilot, etc.) can load when relevant. + +## Where this bundle lives + +- **Cursor (local):** `~/.cursor/skills/docling-document-intelligence/` (or copy this folder there). +- **Docling repository (docs + PRs):** `docs/examples/agent_skill/docling-document-intelligence/` in [github.com/docling-project/docling](https://github.com/docling-project/docling). + +The two trees are kept in sync; use either source. + +## Install (copy into your agent's skills directory) + +```bash +# From a checkout of the Docling repo +cp -r docs/examples/agent_skill/docling-document-intelligence ~/.cursor/skills/ + +# Or copy from another machine / archive into e.g. ~/.claude/skills/ +``` + +No extra config is required beyond installing Python dependencies (below). + +## Usage + +Open your agent-enabled IDE and ask, for example: + +``` +Parse report.pdf and give me a structural outline +``` + +``` +Convert https://arxiv.org/pdf/2408.09869 to markdown +``` + +``` +Chunk invoice.pdf for RAG ingestion with 512 token chunks +``` + +``` +Process scanned.pdf using the VLM pipeline +``` + +The agent should read `SKILL.md`, match the task, and run the appropriate +`docling` CLI command or Python API call. 
+ +## Running the docling CLI directly + +```bash +pip install docling docling-core + +# Basic conversion to Markdown +docling report.pdf --output /tmp/ + +# JSON output +docling report.pdf --to json --output /tmp/ + +# Custom OCR engine +docling report.pdf --ocr-engine rapidocr --output /tmp/ + +# VLM pipeline +docling scanned.pdf --pipeline vlm --output /tmp/ + +# VLM with specific model +docling scanned.pdf --pipeline vlm --vlm-model granite_docling --output /tmp/ + +# Remote VLM services +docling doc.pdf --pipeline vlm --enable-remote-services --output /tmp/ +``` + +## Evaluate and refine + +```bash +docling report.pdf --to json --output /tmp/ +docling report.pdf --to md --output /tmp/ +python3 scripts/docling-evaluate.py /tmp/report.json --markdown /tmp/report.md +``` + +If the report shows `warn` or `fail`, follow `recommended_actions`, re-convert +with `docling` using the suggested flags, and optionally append a note to +`improvement-log.md` (see `SKILL.md` section 7). + +## What the skill covers + +| Task | How to ask | +|---|---| +| Parse PDF / DOCX / PPTX / HTML / image | "parse this file" | +| Convert to Markdown | "convert to markdown" | +| Export as structured JSON | "export as JSON" | +| Chunk for RAG | "chunk for RAG", "prepare for ingestion" | +| Analyze structure | "show me the headings and tables" | +| Use VLM pipeline | "use the VLM pipeline", "process scanned PDF" | +| Use remote inference | "use vLLM", "call the API pipeline" | + +## Further reading + +- [Agent Skills specification](https://agentskills.io/specification) +- [Docling documentation](https://docling-project.github.io/docling/) +- [Docling CLI reference](https://docling-project.github.io/docling/reference/cli/) +- [Docling GitHub](https://github.com/docling-project/docling) diff --git a/docs/examples/agent_skill/docling-document-intelligence/README.md b/docs/examples/agent_skill/docling-document-intelligence/README.md new file mode 100644 index 0000000000..65382ccb9d --- /dev/null 
+++ b/docs/examples/agent_skill/docling-document-intelligence/README.md @@ -0,0 +1,43 @@ +# Docling agent skill (Cursor & compatible assistants) + +This folder is an **[Agent Skill](https://agentskills.io/specification)**-style bundle for AI coding assistants: structured instructions (`SKILL.md`), a pipeline reference (`pipelines.md`), and a quality evaluator (`scripts/docling-evaluate.py`). + +Conversion is done via the **`docling` CLI** (included with `pip install docling`). +The evaluator provides a **convert → evaluate → refine** feedback loop that the +existing CLI does not cover. + +It complements the official [Docling documentation](https://docling-project.github.io/docling/) and the [`docling` CLI reference](https://docling-project.github.io/docling/reference/cli/). + +The same layout is published in the Docling repo at `docs/examples/agent_skill/docling-document-intelligence/` (for docs and PRs). + +## Contents + +| Path | Purpose | +|------|---------| +| [`SKILL.md`](SKILL.md) | Full skill instructions (pipelines, chunking, evaluation loop) | +| [`pipelines.md`](pipelines.md) | Standard vs VLM pipelines, OCR engines, API notes | +| [`EXAMPLE.md`](EXAMPLE.md) | Installing into `~/.cursor/skills/`; running the CLI and evaluator | +| [`improvement-log.md`](improvement-log.md) | Optional template for local "what worked" notes | +| [`scripts/docling-evaluate.py`](scripts/docling-evaluate.py) | Heuristic quality report on JSON (+ optional Markdown) | +| [`scripts/requirements.txt`](scripts/requirements.txt) | Minimal pip deps for the evaluator | + +## Quick start + +```bash +pip install docling docling-core + +# Convert to Markdown +docling https://arxiv.org/pdf/2408.09869 --output /tmp/ + +# Convert to JSON +docling https://arxiv.org/pdf/2408.09869 --to json --output /tmp/ + +# Evaluate quality +python3 scripts/docling-evaluate.py /tmp/2408.09869.json --markdown /tmp/2408.09869.md +``` + +Use `--pipeline vlm` for vision-model pipelines; see `SKILL.md` and 
`pipelines.md`. + +## License + +MIT (aligned with [Docling](https://github.com/docling-project/docling)). diff --git a/docs/examples/agent_skill/docling-document-intelligence/SKILL.md b/docs/examples/agent_skill/docling-document-intelligence/SKILL.md new file mode 100644 index 0000000000..7e3927a680 --- /dev/null +++ b/docs/examples/agent_skill/docling-document-intelligence/SKILL.md @@ -0,0 +1,393 @@ +--- +name: docling-document-intelligence +description: > + Parse, convert, chunk, and analyze documents using Docling. Use this skill + when the user provides a document (PDF, DOCX, PPTX, HTML, image) as a file + path or URL and wants to: extract text or structured content, convert to + Markdown or JSON, chunk the document for RAG ingestion, analyze document + structure (headings, tables, figures, reading order), or run quality + evaluation with iterative pipeline tuning. Triggers: "parse this PDF", + "convert to markdown", "chunk for RAG", "extract tables", "analyze document + structure", "prepare for ingestion", "process document", "evaluate docling + output", "improve conversion quality". +license: MIT +compatibility: Requires Python 3.10+, docling>=2.81.0, docling-core>=2.67.1 +metadata: + author: docling-project + version: "2.0" + upstream: https://github.com/docling-project/docling +allowed-tools: Bash(docling:*) Bash(python3:*) Bash(pip:*) +--- + +# Docling Document Intelligence Skill + +Use this skill to parse, convert, chunk, and analyze documents with Docling. +It handles both local file paths and URLs, and outputs either Markdown or +structured JSON (`DoclingDocument`). + +Conversion uses the **`docling` CLI** (installed with `pip install docling`). +The Python API is used only for features the CLI does not expose (chunking, +VLM remote-API endpoint configuration, hybrid `force_backend_text` mode). 
+ +## Scope + +| Task | Covered | +|---|---| +| Parse PDF / DOCX / PPTX / HTML / image | ✅ | +| Convert to Markdown | ✅ | +| Export as DoclingDocument JSON | ✅ | +| Chunk for RAG (hybrid: heading + token) | ✅ (Python API) | +| Analyze structure (headings, tables, figures) | ✅ (Python API) | +| OCR for scanned PDFs | ✅ (auto-enabled) | +| Multi-source batch conversion | ✅ | + +## Step-by-Step Instructions + +### 1. Resolve the input + +Determine whether the user supplied a **local path** or a **URL**. +The `docling` CLI accepts both directly. + +```bash +docling path/to/file.pdf +docling https://example.com/a.pdf +``` + +### 2. Choose a pipeline + +Docling has two pipeline families. Pick based on document type and hardware. + +| Pipeline | CLI flag | Best for | Key tradeoff | +|---|---|---|---| +| **Standard** (default) | `--pipeline standard` | Born-digital PDFs, speed | No GPU needed; OCR for scanned pages | +| **VLM** | `--pipeline vlm` | Complex layouts, handwriting, formulas | Needs GPU; slower | + +See [pipelines.md](pipelines.md) for the full decision matrix, OCR engine table +(EasyOCR, RapidOCR, Tesseract, macOS), and VLM model presets. + +### 3. Convert the document + +#### CLI (preferred for straightforward conversions) + +```bash +# Markdown (default output) +docling report.pdf --output /tmp/ + +# JSON (structured, lossless) +docling report.pdf --to json --output /tmp/ + +# VLM pipeline +docling report.pdf --pipeline vlm --output /tmp/ + +# VLM with specific model +docling report.pdf --pipeline vlm --vlm-model granite_docling --output /tmp/ + +# Custom OCR engine +docling report.pdf --ocr-engine tesserocr --output /tmp/ + +# Disable OCR or tables for speed +docling report.pdf --no-ocr --output /tmp/ +docling report.pdf --no-tables --output /tmp/ + +# Remote VLM services +docling report.pdf --pipeline vlm --enable-remote-services --output /tmp/ +``` + +The CLI writes output files to the `--output` directory, named after the +input file (e.g. 
`report.pdf` → `report.md` or `report.json`). + +**CLI reference:** + +#### Python API (for advanced features) + +Use the Python API when you need features the CLI does not expose: +chunking, VLM remote-API endpoint configuration, or hybrid +`force_backend_text` mode. + +**Docling 2.81+ API note:** `DocumentConverter(format_options=...)` expects +`dict[InputFormat, FormatOption]` (e.g. `InputFormat.PDF` → `PdfFormatOption`). +Using string keys like `{"pdf": PdfPipelineOptions(...)}` fails at runtime with +`AttributeError: 'PdfPipelineOptions' object has no attribute 'backend'`. + +**Standard pipeline (default):** +```python +from docling.document_converter import DocumentConverter, PdfFormatOption +from docling.datamodel.base_models import InputFormat +from docling.datamodel.pipeline_options import PdfPipelineOptions + +converter = DocumentConverter() +result = converter.convert("report.pdf") + +converter = DocumentConverter( + format_options={ + InputFormat.PDF: PdfFormatOption( + pipeline_options=PdfPipelineOptions(do_ocr=True, do_table_structure=True), + ), + } +) +result = converter.convert("report.pdf") +``` + +**VLM pipeline — local (GraniteDocling via HF Transformers):** +```python +from docling.document_converter import DocumentConverter, PdfFormatOption +from docling.datamodel.base_models import InputFormat +from docling.datamodel.pipeline_options import VlmPipelineOptions +from docling.datamodel import vlm_model_specs +from docling.pipeline.vlm_pipeline import VlmPipeline + +pipeline_options = VlmPipelineOptions( + vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS, + generate_page_images=True, +) +converter = DocumentConverter( + format_options={ + InputFormat.PDF: PdfFormatOption( + pipeline_cls=VlmPipeline, + pipeline_options=pipeline_options, + ) + } +) +result = converter.convert("report.pdf") +``` + +**VLM pipeline — remote API (vLLM / LM Studio / Ollama):** + +This is only available via the Python API; the CLI does not expose endpoint +URL, 
model name, or API key configuration. + +```python +from docling.document_converter import DocumentConverter, PdfFormatOption +from docling.datamodel.base_models import InputFormat +from docling.datamodel.pipeline_options import VlmPipelineOptions +from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat +from docling.pipeline.vlm_pipeline import VlmPipeline + +vlm_opts = ApiVlmOptions( + url="http://localhost:8000/v1/chat/completions", + params=dict(model="ibm-granite/granite-docling-258M", max_tokens=4096), + prompt="Convert this page to docling.", + response_format=ResponseFormat.DOCTAGS, + timeout=120, +) +pipeline_options = VlmPipelineOptions( + vlm_options=vlm_opts, + generate_page_images=True, + enable_remote_services=True, # required — gates all outbound HTTP +) +converter = DocumentConverter( + format_options={ + InputFormat.PDF: PdfFormatOption( + pipeline_cls=VlmPipeline, + pipeline_options=pipeline_options, + ) + } +) +result = converter.convert("report.pdf") +``` + +**Hybrid mode (force_backend_text) — Python API only:** + +Uses deterministic PDF text extraction for text regions while routing +images and tables through the VLM. Reduces hallucination on text-heavy pages. + +```python +pipeline_options = VlmPipelineOptions( + vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS, + force_backend_text=True, + generate_page_images=True, +) +``` + +`result.document` is a `DoclingDocument` object in all cases. + +### 4. Choose output format + +**Markdown** (default, human-readable): +```bash +docling report.pdf --to md --output /tmp/ +``` +Or via Python: `result.document.export_to_markdown()` + +**JSON / DoclingDocument** (structured, lossless): +```bash +docling report.pdf --to json --output /tmp/ +``` +Or via Python: `result.document.export_to_dict()` + +> If the user does not specify a format, ask: "Should I output Markdown or +> structured JSON (DoclingDocument)?" + +### 5. 
Chunk for RAG (hybrid strategy)
+
+Chunking is only available via the Python API.
+
+Default: **hybrid chunker** — splits first by heading hierarchy, then
+subdivides oversized sections by token count. This preserves semantic
+boundaries while respecting model context limits.
+
+The tokenizer API changed in docling-core 2.8.0. Pass a `BaseTokenizer`
+object, not a raw string:
+
+**HuggingFace tokenizer (default):**
+```python
+from docling.chunking import HybridChunker
+from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
+
+tokenizer = HuggingFaceTokenizer.from_pretrained(
+    model_name="sentence-transformers/all-MiniLM-L6-v2",
+    max_tokens=512,
+)
+chunker = HybridChunker(tokenizer=tokenizer, merge_peers=True)
+chunks = list(chunker.chunk(result.document))
+
+for chunk in chunks:
+    embed_text = chunker.contextualize(chunk)
+    print(chunk.meta.headings)  # heading breadcrumb list
+    print(chunk.meta.doc_items[0].prov[0].page_no)  # page of the chunk's first item
+```
+
+**OpenAI tokenizer (for OpenAI embedding models):**
+```python
+import tiktoken
+from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer
+
+tokenizer = OpenAITokenizer(
+    tokenizer=tiktoken.encoding_for_model("text-embedding-3-small"),
+    max_tokens=8192,
+)
+# Requires: pip install 'docling-core[chunking-openai]'
+```
+
+For chunking strategies and tokenizer details, see the Docling documentation
+on chunking and `HybridChunker`.
+
+### 6. 
Analyze document structure + +Use the `DoclingDocument` object directly to inspect structure: + +```python +doc = result.document + +for item, level in doc.iterate_items(): + if hasattr(item, 'label') and item.label.name == 'SECTION_HEADER': + print(f"{'#' * level} {item.text}") + +for table in doc.tables: + print(table.export_to_dataframe()) # pandas DataFrame + print(table.export_to_markdown()) + +for picture in doc.pictures: + print(picture.caption_text(doc)) # caption if present +``` + +For the full API surface, see Docling's structure and table export docs. + +### 7. Evaluate output and iterate (required for "best effort" conversions) + +After **every** conversion where the user cares about fidelity (not quick +previews), run the bundled evaluator on the JSON export, then refine the +pipeline if needed. This is how the agent **checks its work** and **improves +the run** without guessing. + +**Step A — Produce JSON and optional Markdown** + +```bash +docling "" --to json --output /tmp/ +docling "" --to md --output /tmp/ +``` + +**Step B — Evaluate** + +```bash +python3 scripts/docling-evaluate.py /tmp/.json --markdown /tmp/.md +``` + +If the user expects tables (invoices, spreadsheets in PDF), add +`--expect-tables`. Tighten gates with `--fail-on-warn` in CI-style checks. + +The script prints a JSON report to stdout: `status` (`pass` | `warn` | `fail`), +`metrics`, `issues`, and `recommended_actions` (concrete `docling` CLI +flags to try next). + +**Step C — Refinement loop (max 3 attempts unless the user says otherwise)** + +1. If `status` is `warn` or `fail`, apply **one** primary change from + `recommended_actions` (e.g. switch `--pipeline vlm`, change + `--ocr-engine`, ensure tables are enabled). +2. Re-convert with `docling`, re-run `scripts/docling-evaluate.py`. +3. Stop when `status` is `pass`, or after 3 iterations — then summarize what + worked and any remaining issues for the user. 
+ +**Step D — Self-improvement log (skill memory)** + +After a successful pass **or** after the final iteration, append one entry to +[improvement-log.md](improvement-log.md) in this skill directory: + +- Source type (e.g. scanned PDF, digital PDF, DOCX) +- First-run problems (from `issues`) +- Pipeline + flags that fixed or best mitigated them +- Final `status` and one line of subjective quality notes + +This log is optional for the user to git-ignore; it is for **local** learning +so future runs on similar documents start closer to the right pipeline. + +### 8. Agent quality checklist (manual, if script unavailable) + +If `scripts/docling-evaluate.py` cannot run, still verify: + +| Check | Action if bad | +|---|---| +| Page count matches source (roughly) | Re-run; try `--pipeline vlm` if layout is complex | +| Markdown is not near-empty | Enable OCR / VLM | +| Tables missing when visually obvious | Remove `--no-tables`; try `--pipeline vlm` | +| `\ufffd` replacement characters | Different `--ocr-engine` or `--pipeline vlm` | +| Same line repeated many times | `--pipeline vlm` or hybrid `force_backend_text` (Python API) | + +## Common Edge Cases + +| Situation | Handling | +|---|---| +| Scanned / image-only PDF | Standard pipeline with OCR, or `--pipeline vlm` for best quality | +| Password-protected PDF | `--pdf-password PASSWORD`; will raise `ConversionError` if wrong | +| Very large document (500+ pages) | Standard pipeline with `--no-tables` for speed | +| Complex layout / multi-column | `--pipeline vlm`; standard may misorder reading flow | +| Handwriting or formulas | `--pipeline vlm` only — standard OCR will not handle these | +| URL behind auth | Pre-download to temp file; pass local path | +| Tables with merged cells | `table.export_to_markdown()` handles spans; VLM often more accurate | +| Non-UTF-8 encoding | Docling normalises internally; no special handling needed | +| VLM hallucinating text | `force_backend_text=True` via Python API for hybrid mode | 
+| VLM API call blocked | `--enable-remote-services` (CLI) or `enable_remote_services=True` (Python) | +| Apple Silicon | `--vlm-model granite_docling` with MLX backend, or `GRANITEDOCLING_MLX` preset (Python API) | + +## Pipeline reference + +Full decision matrix, all OCR engine options, VLM model presets, and API +server configuration: [pipelines.md](pipelines.md) + +## Output conventions + +- Always report the number of pages and conversion status. +- When evaluation is in scope, report evaluator `status`, top `issues`, and + which refinement attempt produced the final output. +- For Markdown output: wrap in a fenced code block only if the user will copy/paste it; otherwise render directly. +- For JSON output: pretty-print with `indent=2` unless the user specifies otherwise. +- For chunks: report total chunk count, min/max/avg token counts. +- For structure analysis: summarise heading tree + table count + figure count before going into detail. + +## Dependencies + +```bash +pip install docling docling-core +# For OpenAI tokenizer support: +pip install 'docling-core[chunking-openai]' +``` + +The `docling` CLI is included with the `docling` package — no separate install needed. + +Check installed versions (prefer distribution metadata — `docling` may not set `__version__`): + +```python +from importlib.metadata import version +print(version("docling"), version("docling-core")) +``` diff --git a/docs/examples/agent_skill/docling-document-intelligence/improvement-log.md b/docs/examples/agent_skill/docling-document-intelligence/improvement-log.md new file mode 100644 index 0000000000..092c4043d2 --- /dev/null +++ b/docs/examples/agent_skill/docling-document-intelligence/improvement-log.md @@ -0,0 +1,20 @@ +# Docling agent skill — improvement log + +Agents may append a short entry after running **evaluate → refine** on a document +so similar files are faster to process next time. 
This file is optional and is +not tracked by every user; it is meant for **local** learning. + +## Template (copy for each entry) + +```markdown +### YYYY-MM-DD — +- **Source type:** (e.g. scanned PDF / digital PDF / DOCX / URL) +- **Issues (first run):** … +- **Pipeline / flags that helped:** … +- **Final evaluator status:** pass | warn | fail +- **Notes:** … +``` + +## Entries + +_(None — add your own after running conversions.)_ diff --git a/docs/examples/agent_skill/docling-document-intelligence/pipelines.md b/docs/examples/agent_skill/docling-document-intelligence/pipelines.md new file mode 100644 index 0000000000..d52208fa18 --- /dev/null +++ b/docs/examples/agent_skill/docling-document-intelligence/pipelines.md @@ -0,0 +1,253 @@ +# Docling Pipelines Reference + +Docling has two pipeline families for PDFs: **standard** (parse + OCR + layout/tables) +and **VLM** (page images through a vision-language model). The `docling` CLI +exposes both via `--pipeline standard` (default) and `--pipeline vlm`. +The right choice depends on document type, hardware, and latency budget. + +--- + +## Decision matrix + +| Document type | Recommended pipeline | Reason | +|---|---|---| +| Born-digital PDF (text selectable) | Standard | Fast, accurate, no GPU needed | +| Scanned PDF / image-only | Standard + OCR or VLM | Depends on quality | +| Complex layout (multi-column, dense tables) | VLM | Better structural understanding | +| Handwriting, formulas, figures with embedded text | VLM | Only viable option | +| Air-gapped / no GPU | Standard | Runs on CPU | +| Production scale, GPU server available | VLM (vLLM) | Best throughput | +| Apple Silicon / local dev | VLM (MLX) | MPS acceleration | +| Speed-critical, accuracy secondary | Standard, no tables | Fastest path | + +--- + +## Pipeline 1: Standard PDF Pipeline + +Uses deterministic PDF parsing (docling-parse) + optional neural OCR + neural +table structure detection. 
+ +### CLI usage + +```bash +# Default (standard pipeline, OCR + tables enabled) +docling report.pdf --output /tmp/ + +# Custom OCR engine +docling report.pdf --ocr-engine tesserocr --output /tmp/ + +# Disable OCR or tables +docling report.pdf --no-ocr --output /tmp/ +docling report.pdf --no-tables --output /tmp/ +``` + +### Python API + +```python +from docling.document_converter import DocumentConverter, PdfFormatOption +from docling.datamodel.base_models import InputFormat +from docling.datamodel.pipeline_options import PdfPipelineOptions + +# Minimal — library defaults (standard PDF pipeline) +converter = DocumentConverter() + +# Explicit PdfPipelineOptions (docling 2.81+): use InputFormat.PDF + PdfFormatOption. +# Do not use format_options={"pdf": opts}; that raises AttributeError on pipeline options. +opts = PdfPipelineOptions( + do_ocr=True, # False = skip OCR entirely + do_table_structure=True, # False = skip table detection (faster) +) +converter = DocumentConverter( + format_options={ + InputFormat.PDF: PdfFormatOption(pipeline_options=opts), + } +) +``` + +### OCR engine options + +All engines are plug-and-play via the CLI `--ocr-engine` flag or the Python +`ocr_options` parameter. Default is EasyOCR. 
+ +#### CLI flags + +| Engine | CLI flag | Notes | +|--------|----------|-------| +| EasyOCR | `--ocr-engine easyocr` (default) | No extra pip beyond docling defaults | +| RapidOCR | `--ocr-engine rapidocr` | Lightweight; see Docling notes on read-only FS | +| Tesseract (Python) | `--ocr-engine tesserocr` | Needs `pip install tesserocr` and system Tesseract | +| Tesseract (CLI) | `--ocr-engine tesseract` | Shells out to `tesseract` binary | +| macOS Vision | `--ocr-engine ocrmac` | macOS only | + +#### Python API + +```python +# EasyOCR (default — no extra install needed) +from docling.datamodel.pipeline_options import PdfPipelineOptions +opts = PdfPipelineOptions(do_ocr=True) # uses EasyOCR by default + +# Tesseract (requires system Tesseract + pip install tesserocr — see Docling install docs) +from docling.datamodel.pipeline_options import TesseractOcrOptions +opts = PdfPipelineOptions(do_ocr=True, ocr_options=TesseractOcrOptions()) + +# RapidOCR (lightweight, no C deps) +from docling.datamodel.pipeline_options import RapidOcrOptions +opts = PdfPipelineOptions(do_ocr=True, ocr_options=RapidOcrOptions()) + +# macOS native OCR +from docling.datamodel.pipeline_options import OcrMacOptions +opts = PdfPipelineOptions(do_ocr=True, ocr_options=OcrMacOptions()) +``` + +--- + +## Pipeline 2: VLM Pipeline — local inference + +Processes each page as an image through a vision-language model. Replaces the +standard layout detection + OCR stack entirely. 
+ +### CLI usage + +```bash +# Default VLM model (granite_docling) +docling report.pdf --pipeline vlm --output /tmp/ + +# Specific model +docling report.pdf --pipeline vlm --vlm-model smoldocling --output /tmp/ +``` + +### Python API + +```python +from docling.document_converter import DocumentConverter, PdfFormatOption +from docling.datamodel.base_models import InputFormat +from docling.datamodel.pipeline_options import VlmPipelineOptions +from docling.datamodel import vlm_model_specs +from docling.pipeline.vlm_pipeline import VlmPipeline + +pipeline_options = VlmPipelineOptions( + vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS, + generate_page_images=True, +) + +converter = DocumentConverter( + format_options={ + InputFormat.PDF: PdfFormatOption( + pipeline_cls=VlmPipeline, + pipeline_options=pipeline_options, + ) + } +) +``` + +### Available model presets + +| CLI `--vlm-model` | Python preset (`vlm_model_specs`) | Backend | Device | Notes | +|---|---|---|---|---| +| `granite_docling` | `GRANITEDOCLING_TRANSFORMERS` | HF Transformers | CPU/GPU | Default | +| `smoldocling` | `SMOLDOCLING_TRANSFORMERS` | HF Transformers | CPU/GPU | Lighter | +| (Python API only) | `GRANITEDOCLING_VLLM` | vLLM | GPU | Fast batch | +| (Python API only) | `GRANITEDOCLING_MLX` | MLX | Apple MPS | M-series Macs | + +### Hybrid mode: PDF text + VLM for images/tables + +Set `force_backend_text=True` (Python API only) to use deterministic text +extraction for normal text regions while routing images and tables through the +VLM. Reduces hallucination risk on text-heavy pages. + +```python +pipeline_options = VlmPipelineOptions( + vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS, + force_backend_text=True, # <-- hybrid mode + generate_page_images=True, +) +``` + +--- + +## Pipeline 3: VLM Pipeline — remote API + +Sends page images to any OpenAI-compatible endpoint. Works with vLLM, +LM Studio, Ollama, or a hosted model API. 
+ +This is available via the CLI with `--pipeline vlm --enable-remote-services`, +but endpoint URL, model name, and API key configuration require the Python API. + +### CLI usage (basic) + +```bash +docling report.pdf --pipeline vlm --enable-remote-services --output /tmp/ +``` + +### Python API (full configuration) + +```python +from docling.document_converter import DocumentConverter, PdfFormatOption +from docling.datamodel.base_models import InputFormat +from docling.datamodel.pipeline_options import VlmPipelineOptions +from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat +from docling.pipeline.vlm_pipeline import VlmPipeline + +vlm_opts = ApiVlmOptions( + url="http://localhost:8000/v1/chat/completions", + params=dict( + model="ibm-granite/granite-docling-258M", + max_tokens=4096, + ), + headers={"Authorization": "Bearer YOUR_KEY"}, # omit if not needed + prompt="Convert this page to docling.", + response_format=ResponseFormat.DOCTAGS, + timeout=120, + scale=2.0, +) + +pipeline_options = VlmPipelineOptions( + vlm_options=vlm_opts, + generate_page_images=True, + enable_remote_services=True, # required — gates any HTTP call +) + +converter = DocumentConverter( + format_options={ + InputFormat.PDF: PdfFormatOption( + pipeline_cls=VlmPipeline, + pipeline_options=pipeline_options, + ) + } +) +``` + +**`enable_remote_services=True` is mandatory** for API pipelines. Docling +blocks outbound HTTP by default as a safety measure. 
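
For orientation, the request such an endpoint receives is a standard OpenAI-style chat completion with the page image inlined as a data URI. The sketch below only illustrates that wire shape (field names follow the OpenAI chat-completions schema, not Docling internals), and `page_request` is a hypothetical helper:

```python
import base64


def page_request(png_bytes: bytes, model: str, prompt: str, max_tokens: int = 4096) -> dict:
    """Build an OpenAI-style chat-completions body for one page image."""
    data_uri = "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_uri}},
                {"type": "text", "text": prompt},
            ],
        }],
    }


body = page_request(b"\x89PNG", "ibm-granite/granite-docling-258M",
                    "Convert this page to docling.")
print(body["messages"][0]["content"][1]["text"])
```

Any server in the table below that accepts this schema (vLLM, LM Studio, Ollama, hosted APIs) can sit behind `ApiVlmOptions.url`.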
+ +### Common API targets + +| Server | Default URL | Notes | +|---|---|---| +| vLLM | `http://localhost:8000/v1/chat/completions` | Best throughput | +| LM Studio | `http://localhost:1234/v1/chat/completions` | Local dev | +| Ollama | `http://localhost:11434/v1/chat/completions` | Model: `ibm/granite-docling:258m` | +| OpenAI-compatible cloud | Provider URL | Set Authorization header | + +--- + +## VLM install requirements + +Local inference requires PyTorch + Transformers: + +```bash +pip install docling[vlm] +# or manually: +pip install torch transformers accelerate +``` + +MLX (Apple Silicon only): +```bash +pip install mlx mlx-lm +``` + +vLLM backend (server-side): +```bash +pip install vllm +vllm serve ibm-granite/granite-docling-258M +``` diff --git a/docs/examples/agent_skill/docling-document-intelligence/scripts/docling-evaluate.py b/docs/examples/agent_skill/docling-document-intelligence/scripts/docling-evaluate.py new file mode 100644 index 0000000000..a6d9d11d0a --- /dev/null +++ b/docs/examples/agent_skill/docling-document-intelligence/scripts/docling-evaluate.py @@ -0,0 +1,296 @@ +#!/usr/bin/env python3 +# SPDX-License-Identifier: MIT +""" +Evaluate a Docling JSON export and suggest pipeline / option changes. 
+ +Typical flow (agent or human): + + docling input.pdf --to json --output /tmp/ + docling input.pdf --to md --output /tmp/ + python3 scripts/docling-evaluate.py /tmp/input.json --markdown /tmp/input.md + +Exit codes: 0 = pass; 1 = fail or --fail-on-warn with status warn +""" + +from __future__ import annotations + +import argparse +import json +import sys +from collections import Counter +from pathlib import Path +from typing import Any + + +def load_document(path: Path): + data = json.loads(path.read_text(encoding="utf-8")) + try: + from docling_core.types.doc.document import DoclingDocument + + return DoclingDocument.model_validate(data), data + except Exception: + return None, data + + +def page_numbers_from_doc(doc) -> set[int]: + pages: set[int] = set() + for item, _ in doc.iterate_items(): + for prov in getattr(item, "prov", None) or []: + p = getattr(prov, "page_no", None) + if p is not None: + pages.add(int(p)) + return pages + + +def collect_text_samples(doc, limit: int = 200) -> list[str]: + texts: list[str] = [] + for item, _ in doc.iterate_items(): + t = getattr(item, "text", None) + if t and str(t).strip(): + texts.append(str(t).strip()) + if len(texts) >= limit: + break + return texts + + +def metrics_from_doc(doc) -> dict[str, Any]: + n_tables = len(getattr(doc, "tables", []) or []) + n_pictures = len(getattr(doc, "pictures", []) or []) + n_headers = 0 + n_text_items = 0 + total_chars = 0 + for item, _ in doc.iterate_items(): + label = getattr(getattr(item, "label", None), "name", None) or "" + if label == "SECTION_HEADER": + n_headers += 1 + t = getattr(item, "text", None) + if t: + n_text_items += 1 + total_chars += len(str(t)) + + pages = page_numbers_from_doc(doc) + n_pages = len(pages) if pages else 0 + density = (total_chars / n_pages) if n_pages else total_chars + + samples = collect_text_samples(doc) + rep = Counter(samples) + top_rep = rep.most_common(1)[0] if rep else ("", 0) + dup_ratio = ( + sum(c for _, c in rep.items() if c > 2) / 
max(len(rep), 1) if rep else 0.0 + ) + + md = "" + try: + md = doc.export_to_markdown() + except Exception: + pass + + replacement = md.count("\ufffd") + sum(str(t).count("\ufffd") for t in samples) + + return { + "page_count": n_pages, + "section_headers": n_headers, + "text_items": n_text_items, + "total_text_chars": total_chars, + "chars_per_page": round(density, 2), + "tables": n_tables, + "pictures": n_pictures, + "markdown_chars": len(md), + "replacement_chars": replacement, + "most_repeated_text_count": int(top_rep[1]) if top_rep else 0, + "duplicate_heavy": dup_ratio > 0.15 and len(samples) > 10, + } + + +def heuristic_metrics(data: dict) -> dict[str, Any]: + """Fallback when DoclingDocument cannot be validated (older export / drift).""" + texts = data.get("texts") or [] + tables = data.get("tables") or [] + body = data.get("body") or {} + children = body.get("children") if isinstance(body, dict) else None + n_children = len(children) if isinstance(children, list) else 0 + char_sum = 0 + for t in texts: + if isinstance(t, dict): + char_sum += len(str(t.get("text") or "")) + return { + "page_count": 0, + "section_headers": 0, + "text_items": len(texts), + "total_text_chars": char_sum, + "chars_per_page": 0.0, + "tables": len(tables), + "pictures": len(data.get("pictures") or []), + "markdown_chars": 0, + "replacement_chars": 0, + "most_repeated_text_count": 0, + "duplicate_heavy": False, + "heuristic_only": True, + "body_children": n_children, + } + + +def evaluate( + m: dict[str, Any], + *, + expect_tables: bool, + min_chars_per_page: float, + min_markdown_chars: int, +) -> tuple[str, list[str], list[str]]: + issues: list[str] = [] + actions: list[str] = [] + + if m.get("heuristic_only"): + issues.append("Could not load full DoclingDocument; metrics are partial.") + actions.append( + "Ensure docling-core matches export; re-export with: docling --to json --output " + ) + + cpp = m.get("chars_per_page") or 0 + if m.get("page_count", 0) >= 2 and cpp < 
min_chars_per_page: + issues.append( + f"Low text density ({cpp} chars/page); likely scan, image-heavy PDF, or extraction gap." + ) + actions.append( + "Retry: docling --ocr-engine tesserocr (or rapidocr, ocrmac)" + ) + actions.append("Retry: docling --pipeline vlm") + + if m.get("replacement_chars", 0) > 5: + issues.append( + "Unicode replacement characters detected; OCR may be garbling text." + ) + actions.append("Retry: docling --ocr-engine tesserocr (or rapidocr)") + actions.append( + "Retry: docling --pipeline vlm (use force_backend_text=True via Python API for hybrid)" + ) + + if m.get("duplicate_heavy") or (m.get("most_repeated_text_count", 0) > 8): + issues.append( + "Repeated text blocks; possible layout/OCR loop or bad reading order." + ) + actions.append("Retry: docling --pipeline vlm") + actions.append( + "If using VLM: try force_backend_text=True via Python API for text-heavy pages" + ) + + if expect_tables and m.get("tables", 0) == 0: + issues.append("No tables detected but tables were expected.") + actions.append( + "Retry: docling (tables are enabled by default; remove --no-tables if set)" + ) + actions.append( + "Retry: docling --pipeline vlm (better for merged-cell or visual tables)" + ) + + mc = m.get("markdown_chars", 0) + if mc > 0 and mc < min_markdown_chars and m.get("page_count", 0) >= 1: + issues.append(f"Markdown export is very short ({mc} chars) for the page count.") + actions.append( + "Retry: docling --pipeline vlm (or try different --ocr-engine)" + ) + + if m.get("text_items", 0) == 0 and m.get("page_count", 0) == 0: + issues.append( + "No text items and no page provenance; export may be empty or invalid." 
+ ) + actions.append( + "Verify source file opens correctly; retry with: docling --pipeline standard" + ) + + seen = set() + uniq_actions = [] + for a in actions: + if a not in seen: + seen.add(a) + uniq_actions.append(a) + + if not issues: + return "pass", [], [] + + severe = m.get("text_items", 0) == 0 or ( + m.get("page_count", 0) >= 1 and mc < 50 and mc > 0 + ) + status = "fail" if severe or m.get("replacement_chars", 0) > 20 else "warn" + return status, issues, uniq_actions + + +def parse_args(): + p = argparse.ArgumentParser(description="Evaluate Docling JSON export quality") + p.add_argument( + "json_path", type=Path, help="Path to DoclingDocument JSON (export_to_dict)" + ) + p.add_argument( + "--markdown", + type=Path, + default=None, + help="Optional markdown file to cross-check length", + ) + p.add_argument("--expect-tables", action="store_true") + p.add_argument("--min-chars-per-page", type=float, default=120.0) + p.add_argument("--min-markdown-chars", type=int, default=200) + p.add_argument("--fail-on-warn", action="store_true") + p.add_argument( + "--quiet", action="store_true", help="Only print JSON report to stdout" + ) + return p.parse_args() + + +def main() -> None: + args = parse_args() + if not args.json_path.is_file(): + print(json.dumps({"error": f"not found: {args.json_path}"}), file=sys.stderr) + sys.exit(1) + + doc, raw = load_document(args.json_path) + if doc is not None: + m = metrics_from_doc(doc) + else: + m = heuristic_metrics(raw) + + if args.markdown and args.markdown.is_file(): + md_len = len(args.markdown.read_text(encoding="utf-8")) + m["markdown_file_chars"] = md_len + if m.get("markdown_chars", 0) == 0: + m["markdown_chars"] = md_len + + status, issues, actions = evaluate( + m, + expect_tables=args.expect_tables, + min_chars_per_page=args.min_chars_per_page, + min_markdown_chars=args.min_markdown_chars, + ) + + report = { + "status": status, + "metrics": m, + "issues": issues, + "recommended_actions": actions, + 
"next_steps_for_agent": [ + "Re-run docling with flags from recommended_actions.", + "Re-export JSON and run this script again until status is pass.", + "Append a row to improvement-log.md (see SKILL.md).", + ], + } + + print(json.dumps(report, indent=2, ensure_ascii=False)) + if not args.quiet: + print(f"\nstatus={status}", file=sys.stderr) + if issues: + print("issues:", file=sys.stderr) + for i in issues: + print(f" - {i}", file=sys.stderr) + if actions: + print("recommended_actions:", file=sys.stderr) + for a in actions: + print(f" - {a}", file=sys.stderr) + + if status == "fail": + sys.exit(1) + if status == "warn" and args.fail_on_warn: + sys.exit(1) + sys.exit(0) + + +if __name__ == "__main__": + main() diff --git a/docs/examples/agent_skill/docling-document-intelligence/scripts/requirements.txt b/docs/examples/agent_skill/docling-document-intelligence/scripts/requirements.txt new file mode 100644 index 0000000000..21a5e56c5f --- /dev/null +++ b/docs/examples/agent_skill/docling-document-intelligence/scripts/requirements.txt @@ -0,0 +1,3 @@ +# pip install -r scripts/requirements.txt +docling>=2.81.0 +docling-core>=2.67.1 diff --git a/docs/examples/index.md b/docs/examples/index.md index f7d0fdcaac..4e910609e7 100644 --- a/docs/examples/index.md +++ b/docs/examples/index.md @@ -7,6 +7,7 @@ Here some of our picks to get you started: - 📤 [{==\[:fontawesome-solid-flask:{ title="beta feature" } beta\]==} structured data extraction](./extraction.ipynb) - examples for ✍️ [serialization](./serialization.ipynb) and ✂️ [chunking](./hybrid_chunking.ipynb), including [user-defined customizations](./advanced_chunking_and_serialization.ipynb) - 🖼️ [picture annotations](./pictures_description.ipynb) and [enrichments](./enrich_doclingdocument.py) +- 🤝 [**Agent skill**](./agent_skill/docling-document-intelligence/README.md) for Cursor and other assistants (`SKILL.md`, pipeline reference, `docling-convert.py` / `docling-evaluate.py` helpers) 👈 ... 
and there is much more: explore all the examples using the navigation menu on the side diff --git a/mkdocs.yml b/mkdocs.yml index 029f1d883e..7aba8cbd3a 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -80,6 +80,7 @@ nav: - Plugins: concepts/plugins.md - Examples: - Examples: examples/index.md + - "🤝 Agent skill (Cursor / assistants)": examples/agent_skill/docling-document-intelligence/README.md - 🔀 Conversion: - "Simple conversion": examples/minimal.py - "Custom conversion": examples/custom_convert.py