bug/incorrect text extraction by partition_pdf with hi_res strategy #4092

@VishwaRajput

Describe the bug
The text of one element does not match what is written in the PDF.
I have a PDF that contains tables. I am extracting elements for my RAG application with the partition_pdf function, using the hi_res strategy with the yolox model. The text is simple and repeats throughout the PDF, but the model seems to miss one particular spot: the actual text is "AUTOSAR Administration", while the element text returned by partition_pdf is "Teton".

To Reproduce
from unstructured.documents.elements import ListItem, NarrativeText, Table, Title
from unstructured.partition.pdf import partition_pdf

# Partition the PDF with the hi_res strategy and the yolox layout model.
elements = partition_pdf(
    filename="mypdf.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    model_name="yolox",
)

# Print the page number, type, and text of every extracted element.
for i, element in enumerate(elements):
    print(f"\nElement {i+1}:")
    print(f"  Page Number: {element.metadata.page_number}")
    print(f"  Type: {type(element).__name__}")
    print(f"  Text: {element.text}")

    if isinstance(element, Table):
        # Tables also carry an HTML rendering of the inferred structure.
        print(f"  This is a Table element. \n {element.metadata.text_as_html}")
    elif isinstance(element, Title):
        print(f"  This is a Title element. Category Depth: {element.metadata.category_depth}")
    elif isinstance(element, NarrativeText):
        print("  This is a Narrative Text element.")
    elif isinstance(element, ListItem):
        print("  This is a List Item element.")
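Since the affected text repeats throughout the PDF, it may help to narrow the output down to the elements that contain the unexpected string. The following is a minimal sketch, not part of the original report: `find_elements_containing` is a hypothetical helper, and it assumes only that each element exposes `.text` and `.metadata.page_number`, as the elements in the loop above do. The sample objects are stand-ins so the snippet runs without the PDF.

```python
from types import SimpleNamespace


def find_elements_containing(elements, needle):
    """Return (index, page_number, type_name) for each element whose text contains needle."""
    hits = []
    for index, element in enumerate(elements):
        text = getattr(element, "text", "") or ""
        if needle in text:
            page = getattr(element.metadata, "page_number", None)
            hits.append((index, page, type(element).__name__))
    return hits


# Stand-in objects that mimic only the attributes used above
# (.text and .metadata.page_number); they are not real unstructured elements.
sample = [
    SimpleNamespace(text="2.1.0 Teton Harmonization", metadata=SimpleNamespace(page_number=2)),
    SimpleNamespace(text="2.0.0 AUTOSAR Administration", metadata=SimpleNamespace(page_number=2)),
]
print(find_elements_containing(sample, "Teton"))
```

With the real data, the list returned by partition_pdf could be passed in place of `sample` to see on which page and in which element type "Teton" appears.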

Expected behavior
The table HTML should be as follows:

2007-01-31 | 2.1.0 | AUTOSAR Administration | e Harmonization of the document with other specifications (e.g. RTE) e Introduction of a new concept to support calibration and measurement - harmonized with RTE e Description of needs of the Software Component Template toward AUTOSAR services and of the interaction of the Software Component Template and
2006-05-18 | 2.0.0 | AUTOSAR Administration | Second
2005-05-09 | 1.0.0 | AUTOSAR Administration | Initial release

But instead it is as below (note "Teton" in the first row):

2007-01-31 | 2.1.0 | Teton | e Harmonization of the document with other specifications (e.g. RTE) e Introduction of a new concept to support calibration and measurement - harmonized with RTE e Description of needs of the Software Component Template toward AUTOSAR services and of the interaction of the Software Component Template and
2006-05-18 | 2.0.0 | AUTOSAR Administration | Second
2005-05-09 | 1.0.0 | AUTOSAR Administration | Initial release
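In long table rows like these, the single wrong cell is easy to overlook. As a rough sketch, a word-level diff with the standard library's `difflib` can pinpoint the differing fragment. `changed_fragments` is a hypothetical helper, and the two strings are abbreviated from the first row of each table above.

```python
from difflib import SequenceMatcher


def changed_fragments(expected, actual):
    """Return (expected_fragment, actual_fragment) pairs where the word sequences differ."""
    expected_words = expected.split()
    actual_words = actual.split()
    matcher = SequenceMatcher(a=expected_words, b=actual_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append(
                (" ".join(expected_words[i1:i2]), " ".join(actual_words[j1:j2]))
            )
    return changes


# Abbreviated first row: expected text vs. text returned by partition_pdf.
expected_row = "2007-01-31 2.1.0 AUTOSAR Administration Harmonization of the document"
actual_row = "2007-01-31 2.1.0 Teton Harmonization of the document"
print(changed_fragments(expected_row, actual_row))
```

Comparing on whitespace-separated words keeps the output readable. A character-level comparison would also work but tends to produce noisier fragments.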

Screenshots
Image
