
NullPointerException in ListUtils.isContainsHeading during hybrid mode postProcess #220

@odinsseo

Bug

When using hybrid mode (--hybrid docling-fast, --hybrid-mode auto), the Java CLI crashes with a NullPointerException during the post-processing stage. The docling-fast backend processes pages successfully (returns 200 OK), but the Java process fails in HybridDocumentProcessor.postProcess() and produces no JSON output.

The crash occurs in the veraPDF list-processing code, the same area as #134 but in a different method (ListUtils.isContainsHeading rather than ListLabelsUtils.haveDifferentSuffixChars). The defensive try-catch added in #134 does not cover this code path.

Stack trace:

SEVERE: Exception during processing file /tmp/tmp0i17_px2/chunk.pdf: null
java.lang.NullPointerException
    at org.verapdf.wcag.algorithms.semanticalgorithms.utils.ListUtils.isContainsHeading(ListUtils.java:211)
    at org.verapdf.wcag.algorithms.semanticalgorithms.utils.ListUtils.checkChildrenListInterval(ListUtils.java:157)
    at org.verapdf.wcag.algorithms.semanticalgorithms.utils.ListUtils.getChildrenListIntervals(ListUtils.java:130)
    at org.opendataloader.pdf.processors.ListProcessor.processListsFromTextNodes(ListProcessor.java:350)
    at org.opendataloader.pdf.processors.HybridDocumentProcessor.postProcess(HybridDocumentProcessor.java:422)
    at org.opendataloader.pdf.processors.HybridDocumentProcessor.processDocument(HybridDocumentProcessor.java:166)
    at org.opendataloader.pdf.processors.HybridDocumentProcessor.processDocument(HybridDocumentProcessor.java:78)
    at org.opendataloader.pdf.processors.DocumentProcessor.processFile(DocumentProcessor.java:73)
    at org.opendataloader.pdf.api.OpenDataLoaderPDF.processFile(OpenDataLoaderPDF.java:32)
    at org.opendataloader.pdf.cli.CLIMain.processFile(CLIMain.java:113)
    at org.opendataloader.pdf.cli.CLIMain.processPath(CLIMain.java:92)
    at org.opendataloader.pdf.cli.CLIMain.main(CLIMain.java:64)

The Java process logs SEVERE but exits with code 0, so the calling Python code receives no error and only discovers the failure when the expected JSON output file is missing. The crash is reproducible on every document tested (50-page technical manuals with tables and lists). The triage routes ~24 pages to Java and ~26 to docling-fast.
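
Until the exit code reflects the failure, a caller can guard against the silent crash by checking for the output file itself. The following is a minimal sketch, not part of opendataloader-pdf: the helper name require_json_output and the exception class are hypothetical, and the output filename (input stem plus ".json" inside the output directory) is an assumption that should be checked against the actual naming.

```python
from pathlib import Path


class ConversionOutputMissing(RuntimeError):
    """Raised when a conversion finished without writing its JSON file."""


def require_json_output(input_path, output_dir):
    """Return the expected JSON path, or raise if it is missing or empty.

    Assumption (not confirmed by the issue): the output is named after
    the input's stem, e.g. document.pdf -> <output_dir>/document.json.
    """
    expected = Path(output_dir) / (Path(input_path).stem + ".json")
    if not expected.is_file() or expected.stat().st_size == 0:
        raise ConversionOutputMissing(
            f"no JSON output at {expected}; check the Java log for SEVERE entries"
        )
    return expected
```

Calling require_json_output("document.pdf", "output/") immediately after the conversion call turns the silent failure into an exception at the point where it happens, rather than later when the file is first read.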

...

Steps to reproduce

  1. Start the hybrid backend:
    opendataloader-pdf-hybrid --port 5002

  2. Convert a PDF with hybrid mode:
    opendataloader-pdf --hybrid docling-fast --hybrid-url http://localhost:5002 --hybrid-timeout 60000 --hybrid-fallback --table-method cluster -f json -o output/ input.pdf

    Or via Python:
    import opendataloader_pdf

    opendataloader_pdf.convert(
        input_path="document.pdf",
        output_dir="output/",
        format="json",
        hybrid="docling-fast",
        hybrid_url="http://localhost:5002",
        hybrid_timeout="60000",
        hybrid_fallback=True,
        table_method="cluster",
    )

  3. The docling-fast backend processes successfully (200 OK in logs).

  4. Java crashes in postProcess with NullPointerException: no JSON output is produced.
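
Because the exit code is 0, another way to detect the failure in step 4 is to scan the captured CLI log for the SEVERE line shown in the stack trace. This is a rough sketch under assumptions: the function name find_failed_files is hypothetical, the pattern is derived only from the single log line quoted in this report, and which stream the Java process logs to (stderr is assumed here) has not been verified.

```python
import re

# Pattern taken from the one log line quoted in this issue:
#   SEVERE: Exception during processing file <path>: <message>
_SEVERE_RE = re.compile(
    r"^SEVERE: Exception during processing file (?P<path>.+?): (?P<msg>.*)$"
)


def find_failed_files(log_text):
    """Return (path, message) pairs for every SEVERE processing failure."""
    failures = []
    for line in log_text.splitlines():
        match = _SEVERE_RE.match(line.strip())
        if match:
            failures.append((match.group("path"), match.group("msg")))
    return failures
```

The log text could come from subprocess.run([...], capture_output=True, text=True).stderr when the CLI is invoked directly. A non-empty result would then be treated as a failed conversion even though the return code is 0.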

...

Version

1.10.1 (pip install opendataloader-pdf[hybrid])

...

Java version

OpenJDK 11 (default-jre-headless, from pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime base image, Ubuntu 22.04)

...

Labels

bug