Support rotated pages with extraction_mode="layout" #3270

Open
@hackowitz-af

Explanation

When extracting text from rotated pages in layout mode, neither of the current options produces useful output:

  • If layout_mode_strip_rotated=True, a warning is issued and there is no output.
  • If layout_mode_strip_rotated=False, a warning is issued and the output is garbled.

I propose adding an optional keyword argument, orientation: {"infer", 0, 90, 180, 270} = "infer", to PageObject.extract_text. With "infer", the orientation could be taken either from page['/Rotate'] or from the actual rotation of the text. I have no preference on the name: orientation, layout_mode_orientation, rotation, etc. are all the same to me.
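To make the second "infer" option concrete, here is a minimal sketch of inferring the orientation from the rotation of the text itself. It is not pypdf code: matrix_rotation and infer_orientation are hypothetical helpers, and the input is assumed to be a list of already-combined text matrices in the PDF [a, b, c, d, e, f] form. Whether page['/Rotate'] should map directly onto orientation as the fallback is also an assumption.

```python
import math
from collections import Counter


def matrix_rotation(matrix):
    """Rotation of a PDF matrix [a, b, c, d, e, f], snapped to 0/90/180/270.

    The angle is measured counter-clockwise from the direction in which
    the matrix maps the x axis, i.e. the text's writing direction.
    """
    a, b = matrix[0], matrix[1]
    angle = math.degrees(math.atan2(b, a)) % 360
    return int(round(angle / 90.0)) % 4 * 90


def infer_orientation(text_matrices, page_rotate=0):
    """Hypothetical "infer": the most common text rotation on the page.

    Falls back to the page's /Rotate value (an assumption) when the page
    has no text fragments at all.
    """
    counts = Counter(matrix_rotation(m) for m in text_matrices)
    if not counts:
        return page_rotate % 360
    return counts.most_common(1)[0][0]


# Two fragments rotated 90 degrees and one upright fragment.
matrices = [
    [0, 1, -1, 0, 100, 50],
    [0, 1, -1, 0, 120, 50],
    [1, 0, 0, 1, 50, 700],
]
dominant = infer_orientation(matrices)
```

Taking the majority rotation would pick the body orientation on a page whose header and footer are upright, which is one reason to prefer it over reading /Rotate alone.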

I think it's best to add a keyword argument rather than to use page['/Rotate'] implicitly, so that differently rotated groups of text can be extracted from the same page. For example, a page header/footer may have 0 rotation while the page content is rotated 90 degrees. There is value in being able to extract each separately.
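As a rough illustration of the header/body case, the sketch below groups text fragments by their snapped rotation and maps each group's positions into an upright frame, which is roughly what a layout routine would need before it could be run once per orientation. Everything here is hypothetical (the helper names, and the fragment format of text plus a [a, b, c, d, e, f] matrix); it does not reflect how pypdf's layout mode is implemented.

```python
import math
from collections import defaultdict


def snapped_rotation(matrix):
    """Counter-clockwise rotation of the matrix, snapped to 0/90/180/270."""
    a, b = matrix[0], matrix[1]
    return round(math.degrees(math.atan2(b, a)) / 90) % 4 * 90


def to_upright(x, y, rotation):
    """Rotate a position by -rotation so the group's text reads left to right."""
    return {0: (x, y), 90: (y, -x), 180: (-x, -y), 270: (-y, x)}[rotation]


def group_by_orientation(fragments):
    """Map rotation -> [(x, y, text), ...] in top-to-bottom, left-to-right order."""
    groups = defaultdict(list)
    for text, matrix in fragments:
        rotation = snapped_rotation(matrix)
        x, y = to_upright(matrix[4], matrix[5], rotation)
        groups[rotation].append((x, y, text))
    for items in groups.values():
        # Higher y is nearer the top of the (upright) page in PDF coordinates.
        items.sort(key=lambda item: (-item[1], item[0]))
    return dict(groups)


# An upright header and two body lines rotated 90 degrees.
fragments = [
    ("Header", [1, 0, 0, 1, 50, 700]),
    ("line 2", [0, 1, -1, 0, 120, 50]),
    ("line 1", [0, 1, -1, 0, 100, 50]),
]
groups = group_by_orientation(fragments)
header = [text for _, _, text in groups[0]]
body = [text for _, _, text in groups[90]]
```

With the groups separated this way, orientation=0 and orientation=90 would each select one group and lay it out independently.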

rotated-page.pdf

Code Example

from pypdf import PdfReader
reader = PdfReader("./rotated-page.pdf")

# all to the same effect, for a 90-degree rotated page...
reader.pages[0].extract_text(extraction_mode="layout")
reader.pages[0].extract_text(extraction_mode="layout", orientation="infer")
reader.pages[0].extract_text(extraction_mode="layout", orientation=90)

# to collect different sections of a page, while preserving the layout of each.
header = reader.pages[0].extract_text(extraction_mode="layout", orientation=0)
body = reader.pages[0].extract_text(extraction_mode="layout", orientation=90)

Labels: is-feature (A feature request), workflow-text-extraction (From a user's perspective, text extraction is the affected feature/workflow)