Description
Explanation
When extracting text from rotated pages in layout mode, neither of the current options produces useful output:
- If `strip_rotated=True`, a warning is issued and there is no output.
- If `strip_rotated=False`, a warning is issued and the output is garbled.
I propose adding an optional keyword argument `orientation: {"infer", 0, 90, 180, 270} = "infer"` to `PageObject.extract_text`. `"infer"` could either use `page['/Rotate']` or the actual rotation of the text. The name is not important to me: `orientation`, `layout_mode_orientation`, `rotation`, etc. would all be fine.
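To make the proposed semantics concrete, here is a minimal sketch of how the argument could be resolved to a concrete angle. `resolve_orientation` is a hypothetical helper, not part of pypdf, and it assumes that `"infer"` falls back to the page's `/Rotate` value (the first of the two options above).

```python
from typing import Literal, Union

# Hypothetical type for the proposed keyword argument.
Orientation = Union[Literal["infer"], int]

VALID_ANGLES = (0, 90, 180, 270)


def resolve_orientation(orientation: Orientation = "infer", page_rotate: int = 0) -> int:
    """Return the angle (in degrees) that layout mode should treat as upright.

    Assumption: "infer" uses the page's /Rotate entry, normalized to
    the range 0..270. An explicit angle always wins over /Rotate.
    """
    if orientation == "infer":
        angle = page_rotate % 360  # e.g. -90 becomes 270
    else:
        angle = orientation
    if angle not in VALID_ANGLES:
        raise ValueError(f"orientation must be 'infer' or one of {VALID_ANGLES}, got {angle!r}")
    return angle
```

With this resolution rule, an explicit `orientation=0` on a page whose `/Rotate` is 90 would still select the unrotated text, which is what makes per-group extraction possible.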
I think it's best to add a keyword argument rather than to implicitly use `page['/Rotate']`, so that one can extract different groups of rotated text from the same page. For example, a page header/footer may have 0 rotation while the page content is rotated 90 degrees. There is value in being able to extract each.
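As a rough illustration of what extracting "each group" could mean, the sketch below buckets text fragments by the rotation implied by their text matrix. The helpers `text_rotation` and `group_by_rotation` are hypothetical, and the fragments are hand-written `(text, matrix)` pairs rather than real pypdf output. The angle is measured counterclockwise in PDF user space; how it should map onto the `/Rotate` convention is left open here.

```python
import math
from collections import defaultdict


def text_rotation(tm):
    """Estimate rotation from a 6-element matrix [a, b, c, d, e, f].

    The angle of the transformed x-axis is atan2(b, a); it is snapped
    to the nearest multiple of 90 degrees and normalized to 0..270.
    """
    a, b = tm[0], tm[1]
    angle = math.degrees(math.atan2(b, a))
    return int(round(angle / 90.0)) * 90 % 360


def group_by_rotation(fragments):
    """Group (text, matrix) pairs by their snapped rotation angle."""
    groups = defaultdict(list)
    for text, tm in fragments:
        groups[text_rotation(tm)].append(text)
    return dict(groups)


# Hand-written example data: an upright header and a rotated body line.
fragments = [
    ("Header", [1, 0, 0, 1, 72, 770]),
    ("Body line", [0, 1, -1, 0, 500, 100]),
]
groups = group_by_rotation(fragments)
```

An `orientation` argument could then select one of these groups and lay it out as if it were upright, instead of discarding or garbling it.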
Code Example
from pypdf import PdfReader
reader = PdfReader("./rotated-page.pdf")
# Proposed behavior: for a page rotated 90 degrees, these three calls are equivalent.
reader.pages[0].extract_text(extraction_mode="layout")
reader.pages[0].extract_text(extraction_mode="layout", orientation="infer")
reader.pages[0].extract_text(extraction_mode="layout", orientation=90)
# Collect different sections of a page while preserving the layout of each.
header = reader.pages[0].extract_text(extraction_mode="layout", orientation=0)
body = reader.pages[0].extract_text(extraction_mode="layout", orientation=90)