
feat: switch to threaded PDF pipeline (docling 2.99)#163

Merged
selloriwoo merged 1 commit into main from feat/docling-sys-threaded-pdf-pipeline
Jun 11, 2026


@selloriwoo selloriwoo commented Jun 9, 2026

Collaborator

Summary

Switch the docling PDF converter to docling's threaded pipeline (ThreadedStandardPdfPipeline + ThreadedPdfPipelineOptions), which processes pages through concurrent stages (parse / layout / table) connected by bounded queues. Multi-page PDFs convert faster than with the sequential StandardPdfPipeline.
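For readers unfamiliar with the pattern, here is a minimal sketch of concurrent stages connected by bounded queues. It is not docling's implementation: `run_stage`, `run_pipeline`, and the `parse` / `layout` / `table` functions are hypothetical placeholders, and error handling is left out.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the page stream


def run_stage(func, inbox, outbox):
    """Apply func to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))  # blocks while outbox is full (backpressure)


def run_pipeline(pages, stages, queue_size=4):
    """Run pages through stages concurrently; results come back in page order."""
    queues = [queue.Queue(maxsize=queue_size) for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=run_stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stages)
    ]

    def feed():
        for page in pages:
            queues[0].put(page)
        queues[0].put(_DONE)

    workers.append(threading.Thread(target=feed))
    for worker in workers:
        worker.start()

    results = []
    while True:  # drain the last queue while the stages keep working
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results


# Placeholder stages; real stages would parse, run layout, and extract tables.
def parse(page_no):
    return {"page": page_no, "text": f"text of page {page_no}"}


def layout(page):
    return {**page, "layout": "single-column"}


def table(page):
    return {**page, "tables": 0}


if __name__ == "__main__":
    converted = run_pipeline(range(1, 11), [parse, layout, table], queue_size=2)
    print(len(converted), "pages converted")
```

Because each stage is a single thread reading from a FIFO queue, page order is preserved, and the bounded queues keep a fast stage from running arbitrarily far ahead of a slow one.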

The threaded backend requires docling-parse v6 (PR docling-project/docling#3377, landed in docling 2.96), so this also bumps docling.

Changes

  • python/run_docling.py: build_converter now wires ThreadedStandardPdfPipeline via PdfFormatOption(pipeline_cls=...) and builds ThreadedPdfPipelineOptions. Option fields and defaults are unchanged.
  • python/pyproject.toml: docling>=2.91.0 → docling>=2.99.0.
  • python/uv.lock: docling 2.99.0, docling-parse 6.2.0 (+ dependency restructure into docling-slim; macOS-only ocrmac/pyobjc dropped in favour of rapidocr, matching docling 2.99 defaults).
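A rough sketch of the wiring described for build_converter, for reference. The class names and the `pipeline_cls` argument come from this PR; the module paths are an assumption based on docling releases that ship the threaded pipeline and may differ in 2.99, and the real function in python/run_docling.py sets option fields that are not shown here.

```python
def build_converter():
    """Return a DocumentConverter that uses the threaded PDF pipeline."""
    # Imports are local so this module still loads where docling is absent.
    # The module paths below are assumptions, not taken from the PR diff.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling.pipeline.threaded_standard_pdf_pipeline import (
        ThreadedStandardPdfPipeline,
    )

    # Library defaults only; the PR keeps its existing option fields unchanged.
    options = ThreadedPdfPipelineOptions()

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=ThreadedStandardPdfPipeline,
                pipeline_options=options,
            )
        }
    )
```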

Test

  • cargo test -p agent-k --features internal translate_pdf_from_financebench -- --ignored passes (283s): builds the PyInstaller bundle with docling-parse v6 and converts a real FinanceBench PDF end-to-end through the Rust → bundle path.
  • Threaded pipeline confirmed active (pipeline_cls = ThreadedStandardPdfPipeline).

@selloriwoo selloriwoo requested review from jhlee525 and nuri-yoo June 9, 2026 05:43
@selloriwoo selloriwoo self-assigned this Jun 9, 2026
@selloriwoo selloriwoo requested a review from grf53 June 9, 2026 05:52

@nuri-yoo nuri-yoo left a comment

Collaborator


LGTM


grf53 commented Jun 9, 2026

Contributor

Before merging, passing the translate_pdf_from_financebench test is not enough. It seems essential to also verify, for several PDF documents (perhaps sampled from https://github.com/brekkylab/knowledge-base-examples), that the results match what the existing Docling setup produces, or at least show no quality degradation.

Additionally, and optionally: what is the approximate range of the performance improvement?
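One way such a comparison could be scripted, as a sketch only: it assumes each docling version has already written one Markdown file per PDF into its own directory. The directory layout, the `*.md` pattern, and the helper names are all hypothetical.

```python
import difflib
from pathlib import Path


def normalize(text):
    """Drop trailing whitespace so purely cosmetic differences do not count."""
    return [line.rstrip() for line in text.strip().splitlines()]


def similarity(baseline, candidate):
    """Line-level similarity in [0, 1]; 1.0 means identical after normalizing."""
    matcher = difflib.SequenceMatcher(None, normalize(baseline), normalize(candidate))
    return matcher.ratio()


def compare_dirs(baseline_dir, candidate_dir, pattern="*.md"):
    """Map each baseline file name to its similarity with the candidate output.

    A file missing from candidate_dir scores 0.0.
    """
    report = {}
    for base in sorted(Path(baseline_dir).glob(pattern)):
        other = Path(candidate_dir) / base.name
        if not other.exists():
            report[base.name] = 0.0
            continue
        report[base.name] = similarity(
            base.read_text(encoding="utf-8"),
            other.read_text(encoding="utf-8"),
        )
    return report
```

Any file scoring below 1.0 would then be a candidate for manual review of quality differences.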


selloriwoo commented Jun 9, 2026

Collaborator Author

Spec: Ryzen 5600X, RAM: 32GB, NVIDIA RTX 3070 (8 GiB)

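| PDF | Pages | docling 2.91 (s) | docling 2.99 (s) | Speedup |
| --- | --- | --- | --- | --- |
| 10K | 158 | 125.28 | 113.29 | 1.11× |
| 10K | 165 | 131.75 | 111.38 | 1.18× |
| 10K | 250 | 120.54 | 80.72 | 1.49× |
| 10K | 346 | 157.94 | 114.78 | 1.38× |
| 10K | 107 | 85.58 | 63.95 | 1.34× |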

These timings compare docling 2.91 (baseline) with docling 2.99 (multithreading on), based on 5 PDFs.

In our experiments, the output of the single-threaded (2.91) and multithreaded (2.99) runs did not differ.
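To summarize the range of improvement, the per-file speedups can be recomputed from the timings in the table above. This is a small sketch; `RUNS`, `speedups`, and `summarize` are illustrative names, not part of the repository.

```python
from statistics import geometric_mean

# (pages, seconds on docling 2.91, seconds on docling 2.99), copied from the table.
RUNS = [
    (158, 125.28, 113.29),
    (165, 131.75, 111.38),
    (250, 120.54, 80.72),
    (346, 157.94, 114.78),
    (107, 85.58, 63.95),
]


def speedups(runs):
    """Per-file speedup: baseline time divided by threaded time."""
    return [before / after for _, before, after in runs]


def summarize(runs):
    """Return min, max, geometric-mean, and total-time speedup."""
    ratios = speedups(runs)
    total_before = sum(before for _, before, _ in runs)
    total_after = sum(after for _, _, after in runs)
    return {
        "min": min(ratios),
        "max": max(ratios),
        "geomean": geometric_mean(ratios),
        "overall": total_before / total_after,
    }


if __name__ == "__main__":
    for name, value in summarize(RUNS).items():
        print(f"{name}: {value:.2f}x")
```

On these five files the speedup ranges from 1.11× to 1.49×, and total conversion time improves by roughly 1.28×.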

@grf53 grf53 left a comment

Contributor


Thank you for your confirmations!

@selloriwoo selloriwoo merged commit 292cf78 into main Jun 11, 2026
@selloriwoo selloriwoo deleted the feat/docling-sys-threaded-pdf-pipeline branch June 11, 2026 00:54