Overlapping paragraphs of text in a table cause processing to run forever in PDF #1264

Open
docling-project/docling-ibm-models
#93
@cklee1967

Description

Bug

I have a bank statement PDF with a table of transactions. On this page, 6 paragraphs of text overlap the table. The convert process never finishes. When I cancel the process, the traceback shows:
...

docling/pipeline/base_pipeline.py", line 45, in execute
conv_res = self._build_document(conv_res)

docling/pipeline/base_pipeline.py", line 163, in _build_document
for p in pipeline_pages: # Must exhaust!

docling/pipeline/base_pipeline.py", line 127, in _apply_on_pages
yield from page_batch

docling/models/page_assemble_model.py", line 68, in __call__
for page in page_batch:

docling/models/table_structure_model.py", line 257, in __call__
tf_output = self.tf_predictor.multi_table_predict(

docling_ibm_models/tableformer/data_management/tf_predictor.py", line 485, in multi_table_predict
tf_responses, predict_details = self.predict(

docling_ibm_models/tableformer/data_management/tf_predictor.py", line 815, in predict
matching_details = self._post_processor.process(

docling_ibm_models/tableformer/data_management/matching_post_processor.py", line 1353, in process
aligned_table_cells2 = self._align_table_cells_to_pdf(

docling_ibm_models/tableformer/data_management/matching_post_processor.py", line 559, in _align_table_cells_to_pdf
x1s.append(found_cell["bbox"][0])

Steps to reproduce

Upload a PDF containing a table (with headers and vertical lines separating columns) that has text paragraphs overlapping all the columns.
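For anyone trying to confirm the hang without waiting indefinitely, a rough sketch of a timeout check follows. The helper names `finishes_within` and `convert_with_docling` are hypothetical and not part of docling; the only docling call assumed is `DocumentConverter().convert(...)`. It runs the conversion in a daemon thread and reports whether it came back within a time limit. It does not stop the stuck conversion, it only detects it.

```python
import threading
import time


def finishes_within(func, timeout_s, *args, **kwargs):
    """Run func in a daemon thread.

    Returns True if func returned (or raised) within timeout_s seconds,
    False if it was still running when the timeout expired.
    """
    done = threading.Event()

    def worker():
        try:
            func(*args, **kwargs)
        finally:
            # Set the flag whether func returned normally or raised.
            done.set()

    # daemon=True so a stuck worker does not keep the interpreter alive.
    threading.Thread(target=worker, daemon=True).start()
    return done.wait(timeout_s)


def convert_with_docling(pdf_path):
    """Hypothetical wrapper around docling's converter (assumes docling is installed)."""
    # Imported lazily so finishes_within can be used without docling.
    from docling.document_converter import DocumentConverter

    return DocumentConverter().convert(pdf_path)
```

Calling `finishes_within(convert_with_docling, 300, "statement.pdf")` with a placeholder path and an arbitrary 300 second limit would return `False` if the conversion is still running at that point, which is what this report would predict for the affected page.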

Docling version

2.28.4

Python version

3.10.15
