OCR PDF Structure Issue #1506
bikramnayak
started this conversation in
General
Replies: 1 comment
We depend on Tesseract, or whichever OCR engine is in use, to generate usable information. I think you would get better results from improving the OCR engine than from trying to repair the layout at OCRmyPDF's level.
I’m working with a scanned PDF that contains a table with two columns, where each column has two lines of text. When I convert the scanned PDF using OCRmyPDF, I’m encountering an issue with the resulting content. Tesseract processes the text line by line, but this causes OCRmyPDF to generate separate spans for each piece of content. Specifically, it creates a span for row 1, cell 1, then another span for row 1, cell 2, followed by separate spans for row 2, cell 1, and row 2, cell 2. This results in accessibility problems for screen readers, as the content is not structured properly. Is there any way to resolve this issue and ensure the table is interpreted correctly by screen readers?
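To make the reading-order problem concrete, here is a minimal sketch in plain Python. It is not OCRmyPDF or Tesseract code: the `Span` type, the coordinates, and the `column_split` value are all hypothetical and exist only for illustration. It shows how ordering text line by line interleaves the two cells, while grouping by column keeps each cell's lines together.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for one recognized line of text."""
    text: str
    x: float  # left edge of the span
    y: float  # top edge of the span (smaller means higher on the page)


def line_order(spans):
    """Order spans top to bottom, then left to right (line by line)."""
    return [s.text for s in sorted(spans, key=lambda s: (s.y, s.x))]


def cell_order(spans, column_split):
    """Order spans column by column so each cell's lines stay together.

    column_split is an assumed x coordinate separating the two columns.
    """
    left = sorted((s for s in spans if s.x < column_split), key=lambda s: s.y)
    right = sorted((s for s in spans if s.x >= column_split), key=lambda s: s.y)
    return [s.text for s in left + right]


# Two cells side by side, each holding two lines of text (made-up positions).
spans = [
    Span("cell 1, line 1", 50, 100),
    Span("cell 2, line 1", 300, 100),
    Span("cell 1, line 2", 50, 120),
    Span("cell 2, line 2", 300, 120),
]

print(line_order(spans))       # interleaved, as described in the question
print(cell_order(spans, 200))  # each cell read in full before the next
```

A fixed `column_split` only works for this toy layout. Real table detection has to come from the OCR engine's layout analysis, which is why the reply above points to the engine rather than to post-processing.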