Releases: Unstructured-IO/unstructured

0.15.10

10 Sep 12:55
71208ca

Enhancements

  • Enhance pdfminer element cleanup. Expand removal of pdfminer elements to include those inside all non-pdfminer elements, not just tables.
  • Modified analysis drawing tools to dump to files and draw from dumps. If the analysis parameter of the partition_pdf function is set to True, the layouts for Object Detection, Pdfminer Extraction, OCR, and the final layout are dumped as JSON files. The drawers now accept dict (dump) objects instead of instances of internal classes.
  • Vectorize pdfminer elements deduplication computation. Use numpy operations to compute IOU and sub-region membership instead of a simple loop. This speeds up deduplication on pages with many elements.
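
To illustrate the vectorization idea in the last item, here is a minimal sketch of pairwise IOU computed with NumPy broadcasting. The function name pairwise_iou and the (x1, y1, x2, y2) box layout are assumptions made for this example; they are not taken from the library's internals.

```python
import numpy as np


def pairwise_iou(boxes_a: np.ndarray, boxes_b: np.ndarray) -> np.ndarray:
    """Return an (N, M) matrix of IOU values for boxes given as (x1, y1, x2, y2).

    Hypothetical helper for illustration; not the library's implementation.
    """
    a = boxes_a[:, None, :]  # shape (N, 1, 4)
    b = boxes_b[None, :, :]  # shape (1, M, 4)

    # Width and height of each intersection, clipped to zero when boxes do not overlap.
    inter_w = np.clip(np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0, None)
    inter_h = np.clip(np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0, None)
    inter = inter_w * inter_h

    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    union = area_a + area_b - inter

    # Avoid division by zero for degenerate (zero-area) boxes.
    return np.divide(inter, union, out=np.zeros_like(inter, dtype=float), where=union > 0)


page_boxes = np.array([[0.0, 0.0, 2.0, 2.0]])
other_boxes = np.array([[1.0, 1.0, 3.0, 3.0], [0.0, 0.0, 2.0, 2.0], [5.0, 5.0, 6.0, 6.0]])
iou = pairwise_iou(page_boxes, other_boxes)
```

All N x M comparisons happen in a handful of array operations, which is where the speedup over a Python-level double loop comes from.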

Features

Fixes

0.15.9

30 Aug 19:13
6ba8135

Enhancements

Features

  • Add support for encoding parameter in partition_csv

0.15.8

27 Aug 15:55
4194a07

Enhancements

  • Bump unstructured.paddleocr to 2.8.1.0.

Features

  • Add MixedbreadAI embedder. Adds support for embedding via Mixedbread AI.

Fixes

  • Replace pillow-heif with pi-heif. Replaces pillow-heif with pi-heif due to more permissive licensing on the wheel for pi-heif.
  • Minify text_as_html from DOCX. Previously .metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by tabulate, which produced over-chunking and lower "semantic density" of elements. The HTML is now reduced to the minimum character count while preserving all text.
  • Fall back to filename extension-based file-type detection for unidentified OLE files. Resolves a problem where a DOC file that could not be detected as such by filetype was incorrectly identified as a MSG file.
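
The OLE fallback in the last item can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's code: guess_ole_type, its sniffed argument, and the OLE_EXTENSIONS map are hypothetical names. Only the 8-byte OLE2 compound-file signature is a fixed fact.

```python
from pathlib import Path
from typing import Optional

# Signature shared by all OLE2 compound files (DOC, XLS, PPT, MSG).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical extension map, for illustration only.
OLE_EXTENSIONS = {".doc": "DOC", ".xls": "XLS", ".ppt": "PPT", ".msg": "MSG"}


def guess_ole_type(filename: str, head: bytes, sniffed: Optional[str] = None) -> Optional[str]:
    """Label an OLE container, or return None if it is not OLE or the extension is unknown.

    `sniffed` stands in for the result of content-based detection; when that
    detection could not identify the file, the filename extension decides.
    """
    if not head.startswith(OLE_MAGIC):
        return None
    if sniffed is not None:
        return sniffed
    return OLE_EXTENSIONS.get(Path(filename).suffix.lower())


ole_head = OLE_MAGIC + b"\x00" * 8
label = guess_ole_type("report.DOC", ole_head)  # unidentified content, so the extension decides
```

The point of the design is that an unidentified OLE container is no longer assumed to be a MSG file by default.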

0.15.7

20 Aug 19:53
01dbc7b

Enhancements

Features

Fixes

  • Fix NLTK data download path to prevent nested directories. Resolved an issue where a nested "nltk_data" directory was created within the parent "nltk_data" directory when it already existed. This fix prevents errors in checking for existing downloads and loading models from NLTK data.

0.15.6

20 Aug 12:47
1f8030d

Enhancements

Features

Fixes

  • Bump to NLTK 3.9.x. Bumps to the latest nltk version to resolve a CVE.
  • Update CI for ingest-test-fixture-update-pr to resolve NLTK model download errors.
  • Synchronized text and html on TableChunk splits. When a Table element is divided during chunking to fit the chunking window, TableChunk.text corresponds exactly with the table text in TableChunk.metadata.text_as_html, .text_as_html is always parseable HTML, and the table is split on even row boundaries whenever possible.
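
One way to picture the TableChunk guarantee in the last item is to derive both the text and the HTML of each chunk from the same group of rows, breaking only between rows. The sketch below assumes a table already parsed into lists of cell strings. The names split_table, rows_to_text, and rows_to_html are hypothetical and do not reflect the library's chunking code.

```python
from html import escape
from typing import Iterator, List, Tuple

Row = List[str]


def rows_to_text(rows: List[Row]) -> str:
    """Plain-text rendering: cells joined by spaces, rows by newlines."""
    return "\n".join(" ".join(row) for row in rows)


def rows_to_html(rows: List[Row]) -> str:
    """Minimal, always well-formed HTML for the same rows."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>" for row in rows
    )
    return f"<table>{body}</table>"


def split_table(rows: List[Row], max_chars: int) -> Iterator[Tuple[str, str]]:
    """Yield (text, html) pairs whose text fits max_chars, breaking only between rows.

    Simplification: a single row longer than max_chars is emitted whole.
    """
    current: List[Row] = []
    for row in rows:
        candidate = current + [row]
        if current and len(rows_to_text(candidate)) > max_chars:
            yield rows_to_text(current), rows_to_html(current)
            current = [row]
        else:
            current = candidate
    if current:
        yield rows_to_text(current), rows_to_html(current)


chunks = list(split_table([["a", "b"], ["c", "d"], ["e", "f"]], max_chars=7))
```

Because each pair is rendered from one row group, the text of a chunk cannot drift from the text inside its HTML.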

0.15.5

16 Aug 14:35
fc26426

Enhancements

Features

Fixes

  • Revert to using unstructured.pytesseract fork. Due to the unavailability of some recent release versions of pytesseract on PyPI, the project now uses the unstructured.pytesseract fork to ensure stability and continued support.
  • Bump libreoffice version in image. Bumps the libreoffice version to 25.2.5.2 to address CVEs.
  • Downgrade NLTK dependency version for compatibility. Due to the unavailability of nltk==3.8.2 on PyPI, the NLTK dependency has been downgraded to <3.8.2. This change ensures continued functionality and compatibility.

0.15.4

14 Aug 21:18
9b778e2

Enhancements

Features

Fixes

  • Resolve an installation error with pytesseract>=0.3.12 that occurred during pip install unstructured[pdf]==0.15.3.

0.15.3

14 Aug 17:23
d6a84bd

Enhancements

Features

Fixes

  • Remove the custom index URL from extra-paddleocr.in to resolve the error in the setup.py configuration.

0.15.2

13 Aug 13:40
7437f0a

Enhancements

  • Improve directory handling when extracting image blocks. The figures directory is no longer created when the extract_image_block_to_payload parameter is set to True.

Features

  • Added per-class Object Detection metrics in the evaluation. The metrics include average precision, precision, recall, and f1-score for each class in the dataset.
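
For readers unfamiliar with these metrics, the following sketch shows how precision, recall, and f1-score follow from per-class counts of true positives, false positives, and false negatives. The class_metrics helper and the counts are made up for illustration. Average precision is omitted because it also needs ranked confidence scores.

```python
from typing import Dict


def class_metrics(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute precision, recall, and f1 for one class; empty denominators give 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up (tp, fp, fn) counts per class, for illustration only.
counts = {"Table": (8, 2, 2), "Title": (3, 1, 0)}
per_class = {name: class_metrics(*c) for name, c in counts.items()}
```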

Fixes

  • Updates NLTK data file for compatibility with nltk>=3.8.2. The NLTK data file now contains punkt_tab, making it possible to upgrade to nltk>=3.8.2. Version nltk==3.8.2 patches CVE-2024-39705.
  • Renames Astra to Astra DB. Conforms with DataStax internal naming conventions.
  • Accommodate single-column CSV files. Resolves a limitation of partition_csv() where delimiter detection would fail on a single-column CSV file (which naturally has no delimiters).
  • Accommodate image/jpg in PPTX as alias for image/jpeg. Resolves a problem partitioning PPTX files that have an invalid image/jpg (should be image/jpeg) MIME-type in the [Content_Types].xml member of the PPTX Zip archive.
  • Fixes an issue in Object Detection metrics. The issue was in preprocessing/validating the ground truth and predicted data for object detection metrics.
  • Removes dependency on unstructured.pytesseract. Unstructured forked pytesseract while waiting for code to be upstreamed. Now that the new version has been released, this fork can be removed.
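
The single-column CSV case above can be reproduced with the standard library alone. The sketch below is not how partition_csv() is implemented; detect_delimiter, its candidate list, and the comma default are assumptions chosen for the example. It shows why sniffing needs a fallback when no delimiter is present.

```python
import csv


def detect_delimiter(sample: str, candidates: str = ",;\t|", default: str = ",") -> str:
    """Sniff the delimiter from a sample, falling back to a default.

    csv.Sniffer raises csv.Error when none of the candidate delimiters
    appears consistently, which is the case for a single-column file.
    """
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return default


single_column = "name\nalice\nbob\n"
rows = list(csv.reader(single_column.splitlines(), delimiter=detect_delimiter(single_column)))
```

With the fallback in place, the single-column sample parses into one-cell rows instead of raising.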

0.15.1

05 Aug 17:36
7e88744

Enhancements

  • Improve pdfminer embedded image extraction to exclude text elements and produce more accurate bounding boxes. This results in cleaner, more precise element extraction in pdf partitioning.

Features

  • Update partition_eml and partition_msg to capture cc, bcc, and message_id fields. Cc, bcc, and message_id information is captured in element metadata for both msg and email partitioning, and Recipient elements are generated for cc and bcc when include_headers=True for email partitioning.
  • Mark ingest as deprecated. Begin sunset of ingest code in this repo, as it has been moved to a dedicated repo.
  • Add pdf_hi_res_max_pages argument for partitioning, which allows rejecting PDF files that exceed this page limit when the hi_res strategy is chosen. By default, PDF files with any number of pages are parsed.
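
As a rough picture of what capturing cc, bcc, and message_id means, the sketch below pulls those headers out of a raw message using Python's standard email package. It does not use the library's partitioners. The header_metadata helper, its dictionary keys, and the sample message are assumptions made for illustration.

```python
from email import message_from_string
from email.utils import getaddresses

# Made-up sample message, for illustration only.
RAW_EML = """\
From: alice@example.com
To: bob@example.com
Cc: carol@example.com, Dave <dave@example.com>
Bcc: erin@example.com
Message-ID: <1234@example.com>
Subject: Status

Hello.
"""


def header_metadata(raw: str) -> dict:
    """Collect cc and bcc addresses and the message id from a raw message."""
    msg = message_from_string(raw)
    return {
        "cc": [addr for _, addr in getaddresses(msg.get_all("Cc", []))],
        "bcc": [addr for _, addr in getaddresses(msg.get_all("Bcc", []))],
        "message_id": msg.get("Message-ID"),
    }


meta = header_metadata(RAW_EML)
```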

Fixes

  • Update HuggingFaceEmbeddingEncoder to use HuggingFaceEmbeddings from langchain_huggingface package instead of the deprecated version from langchain-community. This resolves the deprecation warning and ensures compatibility with future versions of langchain.
  • Update OpenAIEmbeddingEncoder to use OpenAIEmbeddings from langchain-openai package instead of the deprecated version from langchain-community. This resolves the deprecation warning and ensures compatibility with future versions of langchain.
  • Update import of Pinecone exception. Adds compatibility for pinecone-client>=5.0.0.
  • File-type detection catches non-existent file-path. detect_filetype() no longer silently falls back to detecting a file-type based on the extension when no file exists at the path provided. Instead, FileNotFoundError is raised. This provides consistent user notification of a mis-typed path rather than an unpredictable exception from a file-type specific partitioner when the file cannot be opened.
  • EML files specified as a file-path are detected correctly. Resolved a bug where an EML file submitted to partition() as a file-path was identified as TXT and partitioned using partition_text(). EML files specified by path are now identified and processed correctly, including processing any attachments.
  • A DOCX, PPTX, or XLSX file specified by path and ambiguously identified as MIME-type "application/octet-stream" is identified correctly. Resolves a shortcoming where a file specified by path immediately fell back to filename-extension based identification when misidentified as "application/octet-stream", either by asserted content type or a mis-guess by libmagic. An MS Office file misidentified in this way is now correctly identified regardless of its filename and whether it is specified by path or file-like object.
  • Textual content retrieved from a URL with gzip transport compression now partitions correctly. Resolves a bug where a textual file-type (such as Markdown) retrieved by passing a URL to partition() would raise when gzip compression was used for transport by the server.
  • A DOCX, PPTX, or XLSX content-type asserted on partition is confirmed or fixed. Resolves a bug where calling partition() with a swapped MS-Office content_type would cause the file-type to be misidentified. A DOCX, PPTX, or XLSX MIME-type received by partition() is now checked for accuracy and corrected if the file is for a different MS-Office 2007+ type.
  • DOC, PPT, XLS, and MSG files are now auto-detected correctly. Resolves a bug where DOC, PPT, and XLS files were auto-detected as MSG files under certain circumstances.
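
Several of the fixes above amount to looking inside the file rather than trusting an asserted MIME-type or the filename. For the MS-Office 2007+ formats, one plausible approach is to check which main part the Zip package contains. This is a sketch, not necessarily the library's method: sniff_ooxml_type is a hypothetical name, and MAIN_PARTS lists the conventional part locations, which a nonstandard package could place elsewhere.

```python
import io
import zipfile
from typing import BinaryIO, Optional

# Conventional main-part locations in Office Open XML packages.
MAIN_PARTS = {
    "word/document.xml": "DOCX",
    "ppt/presentation.xml": "PPTX",
    "xl/workbook.xml": "XLSX",
}


def sniff_ooxml_type(file: BinaryIO) -> Optional[str]:
    """Return DOCX, PPTX, or XLSX based on package contents, else None."""
    if not zipfile.is_zipfile(file):
        return None
    file.seek(0)
    with zipfile.ZipFile(file) as archive:
        names = set(archive.namelist())
    for part, label in MAIN_PARTS.items():
        if part in names:
            return label
    return None


# Build a tiny in-memory package that looks like a PPTX, whatever name or MIME-type it arrives with.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("[Content_Types].xml", "<Types/>")
    z.writestr("ppt/presentation.xml", "<p/>")
buf.seek(0)
detected = sniff_ooxml_type(buf)
```

A content check of this kind gives the same answer whether the file arrives by path or as a file-like object, and regardless of a swapped content_type.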