Description
Text extraction used to return every word inside the bounding box. With v5.1.0, at least one word is missing.
Environment
Both Linux and Windows.
pypdf v5.1.0 shows the problem; v5.0.1 has been tested and works as expected.
Code + PDF
With this PDF:
EMSR718_AOI02_DEL_PRODUCT_18000_map_v1.pdf
from pypdf import PageObject


def extract_map_text(
    page: PageObject,
    x_min: float = 0,
    x_max: float = 1,
    y_min: float = 0,
    y_max: float = 1,
    sep=";",
):
    """
    Extract the text from the given page (in PDF)

    Args:
        page (PageObject): PDF page
        x_min (float): Lower x bound, as a fraction of the page width (cropbox.right)
        x_max (float): Upper x bound, as a fraction of the page width (cropbox.right)
        y_min (float): Lower y bound, as a fraction of the page height (cropbox.top)
        y_max (float): Upper y bound, as a fraction of the page height (cropbox.top)
        sep (str): Separator used to join the extracted fragments

    Returns:
        str: Extracted text
    """
    parts = []

    def visitor_right(text, cm, tm, font_dict, font_size):
        # tm[4] and tm[5] are the translation components of the text matrix
        x = tm[4]
        y = tm[5]
        in_window = (
            float(x_max * float(page.cropbox.right))
            > x
            > float(x_min * float(page.cropbox.right))
        ) and (
            float(y_max * float(page.cropbox.top))
            > y
            > float(y_min * float(page.cropbox.top))
        )
        # Keep only meaningful fragments located inside the window
        if in_window and text not in ["!", "", " "]:
            parts.append(text)

    page.extract_text(orientations=0, visitor_text=visitor_right)
    page_txt = (
        sep.join([p for p in parts if p not in ["\n"]])
        .replace("\n", " ")
        .replace("\x00", "")
        .replace("\xa0", " ")
    )
    return page_txt
Running this snippet:
extract_map_text(
    page, x_min=0.8, y_min=0.6, y_max=0.8, sep=" "
).replace("  ", " ")
With pypdf v5.1.0, the output is:
'3.5 km Potentially Affected Built-up and Transportations Built-Up 1 No. 0.9 km Flooded area 33.1 ha Potentially affected population ~ 200'
With pypdf v5.0.1, the output is:
'3.5 km Potentially Affected Built-up and Transportations Built-Up 1 No. Road 0.9 km Flooded area 33.1 ha Potentially affected population ~ 200'
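The word "Road" is missing. After some checks, I see that in the new version the x, y position passed to the visitor for "Road" is (0, 0), which is unexpected. A fragment at (0, 0) falls outside the requested window, so the visitor filters it out.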
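To make the check above reproducible, here is a minimal sketch that records the position pypdf passes to the visitor for every fragment and lists those reported at the origin. The helper names (`make_position_logger`, `fragments_at_origin`, `collect_fragments`) and the tolerance are hypothetical and only for illustration; the only pypdf call assumed is `page.extract_text(visitor_text=...)`, as used in the function above.

```python
from typing import Any, Callable, List, Tuple

# (text, x, y) as seen by the visitor; hypothetical alias for this sketch
Fragment = Tuple[str, float, float]


def make_position_logger(fragments: List[Fragment]) -> Callable[..., None]:
    """Build a visitor that stores each non-blank fragment with its tm position."""

    def visitor(text: str, cm: Any, tm: Any, font_dict: Any, font_size: Any) -> None:
        # Skip empty or whitespace-only fragments
        if text.strip():
            fragments.append((text, float(tm[4]), float(tm[5])))

    return visitor


def fragments_at_origin(fragments: List[Fragment], tol: float = 1e-6) -> List[Fragment]:
    """Return the fragments whose reported position is (0, 0) within tol."""
    return [f for f in fragments if abs(f[1]) <= tol and abs(f[2]) <= tol]


def collect_fragments(page: Any) -> List[Fragment]:
    """Run text extraction on a pypdf page and return the logged fragments."""
    fragments: List[Fragment] = []
    page.extract_text(visitor_text=make_position_logger(fragments))
    return fragments
```

Calling `fragments_at_origin(collect_fragments(page))` under both pypdf versions and comparing the two lists should show whether "Road" is the only fragment affected.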