Description
Describe the bug
This all happened on my MacBook. I created a Numbers file and put a few rows in the spreadsheet, each with 2 columns. I then copied and pasted the rows into the Pages app, where they became a table, and exported the Pages document as a Word document. The table is visible in the Word document. However, when I use unstructured to extract its contents, the table is not extracted.
To Reproduce
Here is the code I use to extract the elements; it does not extract the table.
from unstructured.partition.docx import partition_docx
import os

def extract_elements_from_docx(file_path: str):
    if not os.path.exists(file_path):
        print(f"File not found: {file_path}")
        return
    try:
        elements = partition_docx(file=file_path,
                                  infer_table_structure=True,
                                  include_page_breaks=True,
                                  content_extraction=True)
        if not elements:
            print("No elements were extracted from the document.")
            return
        print("Extracted Elements:")
        for element in elements:
            if element.type == "Table":
                print("Table Detected:")
                print(element.to_dict())
            else:
                print(f"{element.type}: {element.text}")
    except Exception as e:
        print(f"An error occurred while extracting elements: {e}")

if __name__ == "__main__":
    docx_file_path = "copied_from_numbers.docx"
    extract_elements_from_docx(docx_file_path)
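To help narrow this down, it may be worth checking how the table is actually stored inside the exported file. The following is a minimal, stdlib-only sketch and is not part of unstructured: `count_tables_in_docx` is a hypothetical helper name, and it assumes the file is a regular .docx archive whose main part is `word/document.xml`. It counts `<w:tbl>` elements that sit directly under the document body and those found anywhere in the document.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_tables_in_docx(file_path):
    """Hypothetical helper: return (top_level, total) counts of <w:tbl> elements."""
    # A .docx file is a zip archive; the main document part is word/document.xml
    with zipfile.ZipFile(file_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    body = root.find(f"{{{W_NS}}}body")
    # Tables that are direct children of <w:body>
    top_level = len(body.findall(f"{{{W_NS}}}tbl")) if body is not None else 0
    # Tables anywhere in the document, including ones nested in other containers
    total = sum(1 for _ in root.iter(f"{{{W_NS}}}tbl"))
    return top_level, total
```

Calling `count_tables_in_docx("copied_from_numbers.docx")` returns both counts. If the total is larger than the top-level count, the table is nested inside some other container rather than placed directly in the body; if both are zero, the export did not produce a Word table element at all. Either result would be useful to include in this report.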
Expected behavior
The table should be extracted.