Description
Bug
When using hybrid mode (--hybrid docling-fast, --hybrid-mode auto), the Java CLI crashes with a NullPointerException during the post-processing stage. The docling-fast backend processes pages successfully (returns 200 OK), but the Java process fails in HybridDocumentProcessor.postProcess() and produces no JSON output.
The crash occurs in the verapdf list processing code: same area as #134 but a different method (ListUtils.isContainsHeading vs ListLabelsUtils.haveDifferentSuffixChars). The defensive try-catch added in #134 does not cover this code path.
Stack trace:
SEVERE: Exception during processing file /tmp/tmp0i17_px2/chunk.pdf: null
java.lang.NullPointerException
    at org.verapdf.wcag.algorithms.semanticalgorithms.utils.ListUtils.isContainsHeading(ListUtils.java:211)
    at org.verapdf.wcag.algorithms.semanticalgorithms.utils.ListUtils.checkChildrenListInterval(ListUtils.java:157)
    at org.verapdf.wcag.algorithms.semanticalgorithms.utils.ListUtils.getChildrenListIntervals(ListUtils.java:130)
    at org.opendataloader.pdf.processors.ListProcessor.processListsFromTextNodes(ListProcessor.java:350)
    at org.opendataloader.pdf.processors.HybridDocumentProcessor.postProcess(HybridDocumentProcessor.java:422)
    at org.opendataloader.pdf.processors.HybridDocumentProcessor.processDocument(HybridDocumentProcessor.java:166)
    at org.opendataloader.pdf.processors.HybridDocumentProcessor.processDocument(HybridDocumentProcessor.java:78)
    at org.opendataloader.pdf.processors.DocumentProcessor.processFile(DocumentProcessor.java:73)
    at org.opendataloader.pdf.api.OpenDataLoaderPDF.processFile(OpenDataLoaderPDF.java:32)
    at org.opendataloader.pdf.cli.CLIMain.processFile(CLIMain.java:113)
    at org.opendataloader.pdf.cli.CLIMain.processPath(CLIMain.java:92)
    at org.opendataloader.pdf.cli.CLIMain.main(CLIMain.java:64)
The Java process logs SEVERE but exits with code 0, so the calling Python code receives no error. It only discovers the problem when no JSON output file exists. Reproducible on every document tested (50-page technical manuals with tables and lists). The triage routes ~24 pages to Java and ~26 to docling-fast.
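Because the exit code is 0, a caller that can capture the Java log could scan it for SEVERE lines instead. The following is a minimal sketch, not part of opendataloader-pdf: the function name `find_failed_files` is hypothetical, the regex assumes the message format shown in the stack trace above, and how the log text is captured is left open.

```python
import re

# Assumed format, taken from the log line in this report:
#   SEVERE: Exception during processing file <path>: <message>
SEVERE_RE = re.compile(
    r"^SEVERE: Exception during processing file (?P<path>.+): (?P<msg>.*)$"
)


def find_failed_files(log_text):
    """Return (path, message) pairs for each SEVERE processing failure in log_text."""
    failures = []
    for line in log_text.splitlines():
        match = SEVERE_RE.match(line.strip())
        if match:
            failures.append((match.group("path"), match.group("msg")))
    return failures
```

A non-empty result can then be treated as a failure by the caller even though the process exited with code 0.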
...
Steps to reproduce
- Start the hybrid backend:
  opendataloader-pdf-hybrid --port 5002
- Convert a PDF with hybrid mode:
  opendataloader-pdf --hybrid docling-fast --hybrid-url http://localhost:5002 --hybrid-timeout 60000 --hybrid-fallback --table-method cluster -f json -o output/ input.pdf
  Or via Python:
  import opendataloader_pdf
  opendataloader_pdf.convert(
      input_path="document.pdf",
      output_dir="output/",
      format="json",
      hybrid="docling-fast",
      hybrid_url="http://localhost:5002",
      hybrid_timeout="60000",
      hybrid_fallback=True,
      table_method="cluster",
  )
- The docling-fast backend processes successfully (200 OK in logs).
- Java crashes in postProcess with NullPointerException: no JSON output is produced.
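Until the crash is fixed, the missing output can at least be detected explicitly after the conversion step. This is a minimal sketch under one stated assumption: the JSON for `name.pdf` is written to the output directory as `name.json`. The helper name `missing_json_outputs` is hypothetical and not part of the library.

```python
from pathlib import Path


def missing_json_outputs(input_paths, output_dir):
    """Return the input paths that have no matching JSON file in output_dir.

    Assumption: the output for "name.pdf" is written as "name.json".
    """
    out = Path(output_dir)
    return [
        path
        for path in input_paths
        if not (out / (Path(path).stem + ".json")).is_file()
    ]
```

Calling this right after `opendataloader_pdf.convert(...)` with the same input path and `output_dir`, and raising an error if the returned list is non-empty, turns the silent failure into an explicit one.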
...
Version
1.10.1 (pip install opendataloader-pdf[hybrid])
...
Java version
OpenJDK 11 (default-jre-headless, from pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime base image, Ubuntu 22.04)
...