0.11.4

cragwolfe released this 15 Dec 01:12

8ba1bed

Enhancements

Refactor image extraction code. The image extraction code is moved from unstructured-inference to unstructured.
Refactor pdfminer code. The pdfminer code is moved from unstructured-inference to unstructured.
Improve handling of auth data for fsspec connectors. Leverage an extension of the dataclass paradigm to support a sensitive annotation for fields related to auth (e.g. passwords, tokens). Refactor all fsspec connectors to use explicit access configs rather than a generic dictionary.
Add glob support for fsspec connectors. As with the glob support in the ingest local source connector, glob filters can now be applied on all fsspec-based source connectors to limit which files are partitioned.
Define a constant for the splitter "+" used in tesseract ocr languages.
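
The sensitive annotation described in the auth-data enhancement above can be illustrated with standard dataclass field metadata. This is a minimal sketch under stated assumptions, not the unstructured implementation: the names SENSITIVE_KEY, sensitive_field, redacted_dict, and ExampleAccessConfig are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, Optional

# Hypothetical metadata key marking a field as sensitive (assumption).
SENSITIVE_KEY = "sensitive"


def sensitive_field(default: Any = None) -> Any:
    """Declare a dataclass field whose value should be masked in output."""
    return field(default=default, metadata={SENSITIVE_KEY: True})


@dataclass
class ExampleAccessConfig:
    """Hypothetical explicit access config used instead of a generic dict."""

    username: Optional[str] = None
    password: Optional[str] = sensitive_field()
    token: Optional[str] = sensitive_field()


def redacted_dict(config: Any) -> Dict[str, Any]:
    """Return the config as a dict with set sensitive values masked."""
    result: Dict[str, Any] = {}
    for f in fields(config):
        value = getattr(config, f.name)
        if f.metadata.get(SENSITIVE_KEY) and value is not None:
            value = "*******"
        result[f.name] = value
    return result


config = ExampleAccessConfig(username="alice", password="hunter2")
print(redacted_dict(config))
```

Keeping the marker in field metadata means the real value stays available to the connector, while any code that logs or serializes the config can mask it without knowing the field names in advance.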

Features

Save tables in PDFs separately as images. The "table" elements are saved as table-<pageN>-<tableN>.jpg. This filename is presented in the image_path metadata field for the Table element. This behavior is disabled by default.
Add Weaviate destination connector. Weaviate connector added to the ingest CLI. Users may now use unstructured-ingest to write partitioned data from over 20 data sources (so far) to a Weaviate object collection.
SFTP source connector. New source connector added to support downloading and partitioning files from SFTP.
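
The table-<pageN>-<tableN>.jpg naming pattern from the table-image feature above can be sketched as a small helper. The function name table_image_filename is hypothetical, and the sketch assumes both numbers are plain integers rendered without zero padding; the release notes do not say how the counters are numbered.

```python
def table_image_filename(page_number: int, table_number: int) -> str:
    """Build a filename following the table-<pageN>-<tableN>.jpg pattern."""
    # Assumption: integers are rendered as-is, with no zero padding.
    return f"table-{page_number}-{table_number}.jpg"


# Example: a name for table 1 on page 3 under the assumptions above.
print(table_image_filename(3, 1))
```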

Fixes

Fix pdf hi_res partitioning failure when pdfminer fails. Implemented logic to fall back to the "inferred_layout + OCR" if pdfminer fails in the hi_res strategy.
Fix a bug where an image can be scaled too large for tesseract. Adds a limit to prevent auto-scaling an image beyond the maximum size tesseract can handle for OCR layout detection.
Update partition_csv to handle different delimiters. CSV files containing both non-comma delimiters and commas in the data were throwing an error in Pandas. partition_csv now identifies the correct delimiter before the file is processed.
Fix partition returning cid codes in hi_res. Occasionally pdfminer can fail to decode the text in a PDF file and return cid codes as text. Now when this happens the text from OCR is used.
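
The cid fallback in the last fix can be sketched as follows. pdfminer renders glyphs it cannot decode as "(cid:<number>)", so a simple pattern check is enough to detect the failure. The helper names contains_cid_codes and choose_text are hypothetical, and this is an illustration of the idea rather than the code used in unstructured.

```python
import re

# pdfminer emits "(cid:<number>)" for glyphs it cannot map to text.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def contains_cid_codes(text: str) -> bool:
    """Return True if the text contains at least one cid code."""
    return CID_PATTERN.search(text) is not None


def choose_text(pdfminer_text: str, ocr_text: str) -> str:
    """Prefer the pdfminer text, falling back to OCR when cid codes appear."""
    if contains_cid_codes(pdfminer_text):
        return ocr_text
    return pdfminer_text


print(choose_text("(cid:72)(cid:101)(cid:108)", "Hel"))
```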
