'extract_text' text matrix seems to be sometimes broken with v5.1.0 #2932

Open
@remi-braun

Description

Extracting text from a bounding box used to return every word inside it. With v5.1.0, at least one word is missing.

Environment

Both Linux and Windows.
v5.0.1 has been tested and is fine.

Code + PDF

With this PDF:
EMSR718_AOI02_DEL_PRODUCT_18000_map_v1.pdf

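from pypdf import PageObject


def extract_map_text(
    page: PageObject,
    x_min: float = 0,
    x_max: float = 1,
    y_min: float = 0,
    y_max: float = 1,
    sep=";",
):
    """
    Extract the text of the given PDF page that lies inside a window
    expressed as fractions of the crop box.

    Args:
        page (PageObject): PDF page
        x_min (float): Lower x bound, as a fraction of the crop box's right edge
        x_max (float): Upper x bound, as a fraction of the crop box's right edge
        y_min (float): Lower y bound, as a fraction of the crop box's top edge
        y_max (float): Upper y bound, as a fraction of the crop box's top edge
        sep (str): Separator used to join the extracted text parts

    Returns:
        str: Extracted text

    """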
    parts = []

    # Crop box extents, read once instead of on every visitor call
    width = float(page.cropbox.right)
    height = float(page.cropbox.top)

    def visitor_right(text, cm, tm, font_dict, font_size):
        # tm[4] and tm[5] are the x and y translation of the text matrix
        x = tm[4]
        y = tm[5]
        in_window = (
            x_min * width < x < x_max * width
            and y_min * height < y < y_max * height
        )
        if in_window and text not in ["!", "", " "]:
            parts.append(text)

    page.extract_text(orientations=0, visitor_text=visitor_right)
    page_txt = (
        sep.join([p for p in parts if p not in ["\n"]])
        .replace("\n", " ")
        .replace("\x00", "")
        .replace("\xa0", " ")
    )
    return page_txt

Running this snippet:

extract_map_text(
    page, x_min=0.8, y_min=0.6, y_max=0.8, sep=" "
).replace("  ", " ")

With pypdf v5.1.0, the output is:

'3.5 km Potentially Affected Built-up and Transportations Built-Up 1 No. 0.9 km Flooded area 33.1 ha Potentially affected population ~ 200'

With pypdf v5.0.1, the output is:

'3.5 km Potentially Affected Built-up and Transportations Built-Up 1 No. Road 0.9 km Flooded area 33.1 ha Potentially affected population ~ 200'
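
To make the difference between the two outputs explicit, here is a minimal sketch that compares them word by word. The helper name `missing_words` is only illustrative and is not part of pypdf. The two strings are the outputs quoted above.

```python
from collections import Counter

out_v510 = (
    "3.5 km Potentially Affected Built-up and Transportations Built-Up 1 No. "
    "0.9 km Flooded area 33.1 ha Potentially affected population ~ 200"
)
out_v501 = (
    "3.5 km Potentially Affected Built-up and Transportations Built-Up 1 No. "
    "Road 0.9 km Flooded area 33.1 ha Potentially affected population ~ 200"
)


def missing_words(old: str, new: str) -> list:
    """Return the words that occur more often in `old` than in `new`."""
    # Counter subtraction keeps only the positive differences in word counts
    return list((Counter(old.split()) - Counter(new.split())).elements())


print(missing_words(out_v501, out_v510))
```

The comparison ignores word order, which is enough here because only one word differs.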

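The word "Road" is missing. After some checks, I see that in the new version the text matrix passed to the visitor gives x, y = 0, 0 for "Road", which is unexpected. With x_min=0.8 and y_min=0.6, a fragment reported at (0, 0) falls outside the window and is filtered out.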
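
For anyone who wants to reproduce the check, this is roughly how the positions can be inspected. It is a sketch, not the exact code I ran: the helper names `collect_text_positions` and `fragments_at_origin` are made up for illustration, and the only pypdf call used is `page.extract_text` with the same `orientations` and `visitor_text` arguments as in the function above.

```python
def collect_text_positions(page):
    """Return (text, x, y) for every non-blank fragment seen by the visitor."""
    records = []

    def visitor(text, cm, tm, font_dict, font_size):
        # tm[4] and tm[5] are the x and y translation of the text matrix
        if text.strip():
            records.append((text, tm[4], tm[5]))

    page.extract_text(orientations=0, visitor_text=visitor)
    return records


def fragments_at_origin(records):
    """Return the fragments whose reported position is exactly (0, 0)."""
    return [text for text, x, y in records if x == 0 and y == 0]
```

Running both helpers on the same page under v5.0.1 and v5.1.0 and comparing the results should show which fragments changed position. Here `page` would be a page of the attached PDF, for example one taken from `PdfReader(...).pages`.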

Metadata

Labels: is-regression (Regression introduced as a side-effect of another change), workflow-text-extraction (From a user's perspective, text extraction is the affected feature/workflow)