Description
Unfortunately, the latest version of unstructured no longer returns tables as separate elements when chunking is enabled, although this used to work.
I am using version 0.16.11 (the latest version I get when installing it right now).
!wget https://sgp.fas.org/crs/misc/IF10244.pdf

from unstructured.partition.pdf import partition_pdf

filename = './IF10244.pdf'
elements = partition_pdf(
    filename=filename,
    strategy='hi_res',
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",  # section-based chunking
    max_characters=4000,
    new_after_n_chars=4000,
    combine_text_under_n_chars=2000,
    mode='elements',
    image_output_dir_path='./figures',
)
len(elements)
This gives me 5 elements; earlier it returned 7 (including 2 tables).
If I remove chunking, I still get the table elements, so table detection itself works. It looks like chunking is either merging the tables into other chunks or dropping them.
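To narrow down where the tables go, a small diagnostic like the one below might help. It is only a sketch: `category_counts` and `tables_inside_chunks` are hypothetical helper names, and the second one assumes that chunks keep their source elements in `metadata.orig_elements`, which may not hold in every version or configuration.

```python
from collections import Counter


def category_counts(elements):
    """Count elements by class name, e.g. Table or CompositeElement."""
    return Counter(type(el).__name__ for el in elements)


def tables_inside_chunks(chunks):
    """Collect Table elements found in each chunk's metadata.orig_elements.

    Assumption: the chunker recorded the original elements on the chunk
    metadata. Chunks without that attribute are skipped.
    """
    found = []
    for chunk in chunks:
        metadata = getattr(chunk, "metadata", None)
        originals = getattr(metadata, "orig_elements", None) or []
        found.extend(el for el in originals if type(el).__name__ == "Table")
    return found
```

Running `category_counts(elements)` on the chunked and unchunked results shows which element types changed, and a non-empty result from `tables_inside_chunks(elements)` would suggest the tables were merged into text chunks rather than lost.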
Any tips for fixing this? I do not want to mix my tables with the text. This still worked fine in 0.16.5.
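One workaround I am considering is to partition without chunking, pull the tables out, and chunk only the remaining elements. The sketch below is untested against 0.16.11. `split_tables` and `partition_then_chunk` are hypothetical helper names, and the approach assumes that table elements report `category == "Table"`. It also loses the position of each table relative to the surrounding text.

```python
def split_tables(elements):
    """Separate table elements from the rest, keeping order within each group."""
    tables, others = [], []
    for el in elements:
        # Assumption: table elements expose category == "Table".
        if getattr(el, "category", None) == "Table":
            tables.append(el)
        else:
            others.append(el)
    return tables, others


def partition_then_chunk(filename):
    """Partition first, then chunk only the non-table elements."""
    # Imported here so split_tables stays usable without unstructured installed.
    from unstructured.chunking.title import chunk_by_title
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(
        filename=filename,
        strategy="hi_res",
        infer_table_structure=True,
    )
    tables, others = split_tables(elements)
    chunks = chunk_by_title(
        others,
        max_characters=4000,
        new_after_n_chars=4000,
        combine_text_under_n_chars=2000,
    )
    return tables, chunks
```

Calling `tables, chunks = partition_then_chunk('./IF10244.pdf')` should then give the tables and the text chunks as two separate lists, so the tables never pass through the chunker.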