Description
The bug exists with the following versions:
unstructured 0.16.12
unstructured-inference 0.8.1
Code:
from unstructured.partition.pdf import partition_pdf

input_path = "../input/"
output_path = "../output/"
file_path = input_path + 'attention.pdf'

chunks = partition_pdf(
    filename=file_path,
    infer_table_structure=True,  # extract tables
    strategy="hi_res",  # mandatory to infer tables
    extract_image_block_types=["Image", 'Table'],  # add 'Table' to the list to extract images of tables
    # image_output_dir_path=output_path,  # if None, images and tables will be saved as base64
    extract_image_block_to_payload=True,  # if True, will extract base64 for API usage
    chunking_strategy="by_title",  # or 'basic'
    max_characters=10000,  # defaults to 500
    combine_text_under_n_chars=2000,  # defaults to 0
    new_after_n_chars=6000,
    # extract_images_in_pdf=True,  # deprecated
)
No tables found in the chunks:
[<unstructured.documents.elements.CompositeElement at 0x7f86226bc0d0>,
<unstructured.documents.elements.CompositeElement at 0x7f86226bc2e0>,
<unstructured.documents.elements.CompositeElement at 0x7f86226bc160>,
<unstructured.documents.elements.CompositeElement at 0x7f86226bc280>,
<unstructured.documents.elements.CompositeElement at 0x7f86226bc3a0>,
<unstructured.documents.elements.CompositeElement at 0x7f86226bc3d0>,
<unstructured.documents.elements.CompositeElement at 0x7f86226bc580>,
<unstructured.documents.elements.CompositeElement at 0x7f86226bc5b0>,
<unstructured.documents.elements.CompositeElement at 0x7f8621e8e530>,
<unstructured.documents.elements.CompositeElement at 0x7f86226bc640>,
<unstructured.documents.elements.CompositeElement at 0x7f86226bc310>,
<unstructured.documents.elements.CompositeElement at 0x7f8621e8d870>]
The SAME code works well with the following versions:
unstructured 0.11.5
unstructured-inference 0.7.19
Four tables found:
[<unstructured.documents.elements.CompositeElement at 0x7fdb74e00dc0>,
<unstructured.documents.elements.CompositeElement at 0x7fdb74d35060>,
<unstructured.documents.elements.CompositeElement at 0x7fdb74e018d0>,
<unstructured.documents.elements.CompositeElement at 0x7fdb74e012a0>,
<unstructured.documents.elements.CompositeElement at 0x7fdb74e028c0>,
<unstructured.documents.elements.CompositeElement at 0x7fdb74e011e0>,
<unstructured.documents.elements.Table at 0x7fdb6c1e02e0>,
<unstructured.documents.elements.CompositeElement at 0x7fdb74ccfa00>,
<unstructured.documents.elements.CompositeElement at 0x7fdb74e03250>,
<unstructured.documents.elements.Table at 0x7fdb6c210ac0>,
<unstructured.documents.elements.CompositeElement at 0x7fdb74e024d0>,
<unstructured.documents.elements.CompositeElement at 0x7fdb74e02830>,
<unstructured.documents.elements.Table at 0x7fdb6c3f49a0>,
<unstructured.documents.elements.CompositeElement at 0x7fdb74ebda20>,
<unstructured.documents.elements.Table at 0x7fdb74d37730>,
<unstructured.documents.elements.CompositeElement at 0x7fdb74e01150>,
<unstructured.documents.elements.CompositeElement at 0x7fdb74e00be0>]
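To make the comparison between the two versions easier than reading the raw reprs, here is a minimal sketch of a helper that counts element types in the returned chunks. `summarize_chunks` is a hypothetical name, not part of unstructured. It also looks inside each chunk's `metadata.orig_elements`, on the assumption that the installed version attaches the pre-chunking elements there; if the attribute is missing, the nested count simply stays empty.

```python
from collections import Counter


def summarize_chunks(chunks):
    """Count element types among the chunks and among their original elements.

    Hypothetical helper for debugging; it only relies on class names and on
    an optional `metadata.orig_elements` attribute (an assumption about the
    installed unstructured version).
    """
    # Types of the chunks themselves, e.g. CompositeElement, Table.
    top_level = Counter(type(chunk).__name__ for chunk in chunks)

    # Types of the elements each chunk was built from, when available.
    nested = Counter()
    for chunk in chunks:
        metadata = getattr(chunk, "metadata", None)
        orig_elements = getattr(metadata, "orig_elements", None) or []
        nested.update(type(el).__name__ for el in orig_elements)

    return top_level, nested
```

Calling `top_level, nested = summarize_chunks(chunks)` on the result of each run and printing both counters should show whether `Table` is missing only at the chunk level or also among the original elements, which would suggest the tables are not being detected during partitioning at all.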