feat: add option to segregate Table elements in their own chunk/split #3827

Open
@dipanjanS

Unfortunately, the latest version of unstructured no longer extracts tables from files when chunking is enabled, although this previously worked.

I am using version 0.16.11 (the latest version I get when installing it right now).

!wget https://sgp.fas.org/crs/misc/IF10244.pdf

from unstructured.partition.pdf import partition_pdf
filename = './IF10244.pdf'
elements = partition_pdf(
    filename=filename,
    strategy='hi_res',
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",  # section-based chunking
    max_characters=4000,
    new_after_n_chars=4000,
    combine_text_under_n_chars=2000,
    mode='elements',
    image_output_dir_path='./figures',
)

len(elements)

This gives me 5 elements; earlier there used to be 7 (including 2 tables).

If I remove chunking, I still get the table elements, which means table detection works. Is chunking somehow combining the tables with the text and dropping the table elements?
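One way to make the comparison above concrete is to count elements by category, once with and once without chunking. The following is only a minimal sketch: `count_categories` is a hypothetical helper, and the stand-in objects exist only so that it runs without unstructured installed. It assumes each element exposes a `category` attribute, as unstructured elements do.

```python
from collections import Counter
from types import SimpleNamespace


def count_categories(elements):
    """Count elements by `category`, falling back to the class name."""
    return Counter(
        getattr(el, "category", type(el).__name__) for el in elements
    )


# Stand-in objects (NOT real partition_pdf output), used only to show the call.
unchunked = [
    SimpleNamespace(category=c)
    for c in ("Title", "NarrativeText", "Table", "NarrativeText", "Table")
]
chunked = [SimpleNamespace(category="CompositeElement") for _ in range(3)]

print(count_categories(unchunked))
print(count_categories(chunked))
```

With real data, the same helper could be called on the result of `partition_pdf(...)` with and without `chunking_strategy="by_title"` to see whether any `Table` entries survive chunking.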

Any tips for fixing this? I do not want my tables mixed with the text. This worked fine even in 0.16.5.
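A possible workaround, sketched here under assumptions rather than as a confirmed fix, is to partition without chunking and then chunk only the runs of non-table elements, passing each table through untouched. `chunk_keeping_tables_separate` and its `chunker` parameter are hypothetical names. The sketch assumes tables can be recognised by `category == "Table"`, and it has the side effect that a section is split wherever a table occurs.

```python
from typing import Any, Callable, Iterable, List


def chunk_keeping_tables_separate(
    elements: Iterable[Any],
    chunker: Callable[[List[Any]], List[Any]],
) -> List[Any]:
    """Chunk runs of non-table elements; emit each table as its own item."""
    chunks: List[Any] = []
    run: List[Any] = []  # consecutive non-table elements seen so far
    for el in elements:
        if getattr(el, "category", None) == "Table":
            if run:
                # Flush the text collected before this table.
                chunks.extend(chunker(run))
                run = []
            chunks.append(el)  # the table stays a separate element
        else:
            run.append(el)
    if run:
        chunks.extend(chunker(run))
    return chunks
```

For real use, `chunker` could be something like `functools.partial(chunk_by_title, max_characters=4000, new_after_n_chars=4000, combine_text_under_n_chars=2000)`, with `chunk_by_title` imported from `unstructured.chunking.title`, applied to the elements returned by `partition_pdf` called without `chunking_strategy`.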
