OCR PDF Structure Issue #1506

bikramnayak · 2025-04-03T09:02:53Z


I’m working with a scanned PDF that contains a table with two columns, where each column has two lines of text. When I run it through OCRmyPDF, the structure of the resulting text layer is wrong. Tesseract processes the text line by line, so OCRmyPDF generates a separate span for each piece of content: one span for row 1, cell 1, another for row 1, cell 2, then separate spans for row 2, cell 1 and row 2, cell 2. This causes accessibility problems for screen readers because the content is not structured as a table. Is there any way to ensure the table is interpreted correctly by screen readers?

jbarlow83 · 2025-04-11T21:00:12Z

Maintainer

We depend on Tesseract or the OCR engine to generate usable information. I think one would get better results from improving the OCR engine than trying to repair the layout at OCRmyPDF's level.
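
For illustration only, the reordering the question asks about amounts to grouping line boxes by column before emitting text, instead of emitting them in the line-by-line order the OCR engine returns. The following is a minimal sketch under stated assumptions: `LineBox`, `group_by_column`, and the `column_gap` threshold are hypothetical names invented for this example, the coordinates are made up, and none of this is part of the OCRmyPDF or Tesseract API.

```python
from dataclasses import dataclass


@dataclass
class LineBox:
    """Hypothetical stand-in for one OCR text line and its position."""
    text: str
    x: float  # left edge of the line
    y: float  # top edge of the line (grows downward)


def group_by_column(lines, column_gap=50.0):
    """Group line boxes whose left edges are close together, then
    order each group top to bottom.

    column_gap is an assumed tolerance, not a value from any library.
    """
    columns = []
    for line in sorted(lines, key=lambda b: b.x):
        # Join the current column if the left edge is near its first line.
        if columns and abs(line.x - columns[-1][0].x) < column_gap:
            columns[-1].append(line)
        else:
            columns.append([line])
    return [[b.text for b in sorted(col, key=lambda b: b.y)] for col in columns]


# Line-by-line order, as described in the question (made-up coordinates).
lines = [
    LineBox("cell 1, line 1", x=10, y=10),
    LineBox("cell 2, line 1", x=300, y=10),
    LineBox("cell 1, line 2", x=10, y=30),
    LineBox("cell 2, line 2", x=300, y=30),
]

cells = group_by_column(lines)
for cell in cells:
    print(" / ".join(cell))
```

A fixed left-edge threshold like this only works for simple, well-aligned tables, which is part of why fixing the layout in the OCR engine is likely to give better results than repairing it afterwards.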
