Description
Describe the bug
The text of an extracted element does not match what is written in the PDF.
I have a PDF that consists of tables. I am extracting elements for my RAG application with the partition_pdf function, using the hi_res strategy with the yolox model. The text is simple and repeats throughout the PDF, but the model seems to miss one particular spot: the actual text is "AUTOSAR Administration", while the element text returned by partition_pdf is "Teton".
To Reproduce
from unstructured.documents.elements import ListItem, NarrativeText, Table, Title
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="mypdf.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    model_name="yolox",
)

# Print each element's page, type, and text, plus type-specific details.
for i, element in enumerate(elements):
    print(f"\nElement {i+1}:")
    print(f" Page Number: {element.metadata.page_number}")
    print(f" Type: {type(element).__name__}")
    print(f" Text: {element.text}")
    if isinstance(element, Table):
        print(f" This is a Table element. \n {element.metadata.text_as_html}")
    elif isinstance(element, Title):
        print(f" This is a Title element. Category Depth: {element.metadata.category_depth}")
    elif isinstance(element, NarrativeText):
        print(" This is a Narrative Text element.")
    elif isinstance(element, ListItem):
        print(" This is a List Item element.")
Expected behavior
The table HTML should contain the rows below:
| 2007-01-31 | | 2.1.0 | AUTOSAR Administration | e Harmonization of the document with other specifications (e.g. RTE) e Introduction of a new concept to support calibration and measurement - harmonized with RTE e Description of needs of the Software Component Template toward AUTOSAR services and of the interaction of the Software Component Template and |
| 2006-05-18 | | 2.0.0 | AUTOSAR Administration | Second |
| 2005-05-09 | | 1.0.0 | | AUTOSAR Administration | Initial release |
Instead, it contains the rows below (note "Teton" in the first row):
| 2007-01-31 | | 2.1.0 | Teton | e Harmonization of the document with other specifications (e.g. RTE) e Introduction of a new concept to support calibration and measurement - harmonized with RTE e Description of needs of the Software Component Template toward AUTOSAR services and of the interaction of the Software Component Template and |
| 2006-05-18 | | 2.0.0 | AUTOSAR Administration | Second |
| 2005-05-09 | | 1.0.0 | | AUTOSAR Administration | Initial release |
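To locate every affected row without reading the whole output by eye, a small check along these lines might help. It is only a sketch: it assumes that metadata.text_as_html is plain table markup built from tr, td, and th tags, and that "AUTOSAR Administration" is expected somewhere in every row of this table. The names RowCollector and rows_missing_value are hypothetical helpers written for this illustration and are not part of unstructured.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the cell texts of an HTML table, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def rows_missing_value(table_html, expected):
    """Return (row_index, cells) for every row with no cell equal to `expected`."""
    parser = RowCollector()
    parser.feed(table_html)
    parser.close()
    return [(i, row) for i, row in enumerate(parser.rows) if expected not in row]
```

With the elements from the reproduction snippet, calling rows_missing_value(element.metadata.text_as_html, "AUTOSAR Administration") on each Table element should list the rows where the value was misread, such as the "Teton" row above. Matching on any cell rather than a fixed column index is deliberate, since the third row shows that cells can shift by one column.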
