Releases: Unstructured-IO/unstructured
0.15.10
Enhancements
- Enhance `pdfminer` element cleanup. Expand removal of `pdfminer` elements to include those inside all non-`pdfminer` elements, not just `tables`.
- Modified analysis drawing tools to dump to files and draw from dumps. If the `analysis` parameter of the `partition_pdf` function is set to `True`, the layouts for Object Detection, Pdfminer Extraction, OCR, and the final layout are dumped as JSON files. The drawers now accept dict (dump) objects instead of internal class instances.
- Vectorize pdfminer elements deduplication computation. Use `numpy` operations to compute IOU and sub-region membership instead of a simple loop. This improves the speed of deduplicating elements for pages with a lot of elements.
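The vectorized IOU computation mentioned above can be sketched with broadcasting. This is only an illustration of the general technique, not the library's implementation: the function name `pairwise_iou` and the `(x1, y1, x2, y2)` box layout are assumptions made for this example.

```python
import numpy as np


def pairwise_iou(boxes_a: np.ndarray, boxes_b: np.ndarray) -> np.ndarray:
    """Return an (N, M) matrix of IOU values for boxes given as (x1, y1, x2, y2)."""
    a = boxes_a[:, None, :]  # shape (N, 1, 4), broadcasts against every box in b
    b = boxes_b[None, :, :]  # shape (1, M, 4)

    # Width and height of each pairwise overlap, clipped at zero for disjoint boxes.
    inter_w = np.clip(np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0, None)
    inter_h = np.clip(np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0, None)
    inter = inter_w * inter_h

    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    union = area_a + area_b - inter

    # Avoid division by zero for degenerate (zero-area) pairs.
    return np.divide(inter, union, out=np.zeros(inter.shape, dtype=float), where=union > 0)
```

All N x M pairs are evaluated in a handful of array operations, which is where the speedup over a nested Python loop would come from on pages with many elements.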
Features
Fixes
0.15.9
Enhancements
Features
- Add support for encoding parameter in partition_csv
0.15.8
Enhancements
- Bump unstructured.paddleocr to 2.8.1.0.
Features
- Add MixedbreadAI embedder. Adds MixedbreadAI embeddings to support embedding via Mixedbread AI.
Fixes
- Replace `pillow-heif` with `pi-heif`. Replaces `pillow-heif` with `pi-heif` due to more permissive licensing on the wheel for `pi-heif`.
- Minify text_as_html from DOCX. Previously `.metadata.text_as_html` for DOCX tables was "bloated" with whitespace and noise elements introduced by `tabulate` that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
- Fall back to filename extension-based file-type detection for unidentified OLE files. Resolves a problem where a DOC file that could not be detected as such by `filetype` was incorrectly identified as a MSG file.
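The extension-based fallback for OLE files can be illustrated roughly as follows. This is a sketch under stated assumptions, not the project's code: `guess_ole_type` and the `OLE_EXTENSIONS` map are hypothetical names. The eight-byte signature is the standard header of an OLE compound file, which DOC, PPT, XLS, and MSG files all share.

```python
from pathlib import Path
from typing import Optional

# Signature shared by all OLE compound files (DOC, PPT, XLS, MSG).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical extension map used only for this sketch.
OLE_EXTENSIONS = {".doc": "DOC", ".ppt": "PPT", ".xls": "XLS", ".msg": "MSG"}


def guess_ole_type(filename: str, head: bytes) -> Optional[str]:
    """Return a type label for an OLE container based on its extension.

    Returns None when the bytes are not an OLE container or the extension is unknown.
    """
    if not head.startswith(OLE_MAGIC):
        return None
    return OLE_EXTENSIONS.get(Path(filename).suffix.lower())
```

Because the container signature alone cannot tell the four formats apart, the extension serves as a tie-breaker when content inspection is inconclusive, rather than defaulting to one of them.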
0.15.7
Enhancements
Features
Fixes
- Fix NLTK data download path to prevent nested directories. Resolved an issue where a nested "nltk_data" directory was created within the parent "nltk_data" directory when it already existed. This fix prevents errors in checking for existing downloads and loading models from NLTK data.
0.15.6
Enhancements
Features
Fixes
- Bump to NLTK 3.9.x. Bumps to the latest `nltk` version to resolve CVE.
- Update CI for `ingest-test-fixture-update-pr` to resolve NLTK model download errors.
- Synchronized text and html on `TableChunk` splits. When a `Table` element is divided during chunking to fit the chunking window, `TableChunk.text` corresponds exactly with the table text in `TableChunk.metadata.text_as_html`, `.text_as_html` is always parseable HTML, and the table is split on even row boundaries whenever possible.
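Splitting a table on row boundaries while keeping text and HTML in sync could look something like the sketch below. It is a simplified illustration and makes several assumptions: `split_table` and `_render` are hypothetical helpers, rows are plain lists of cell strings, and the window is measured in characters of the text form only.

```python
from html import escape
from typing import Iterator, List, Tuple

Row = List[str]


def split_table(rows: List[Row], max_chars: int) -> Iterator[Tuple[str, str]]:
    """Yield (text, html) pairs, breaking only between rows."""
    chunk: List[Row] = []
    size = 0  # length of the text form of the current chunk
    for row in rows:
        row_len = len(" ".join(row))
        new_size = size + 1 + row_len if chunk else row_len
        # Start a new chunk when adding this row would overflow the window.
        if chunk and new_size > max_chars:
            yield _render(chunk)
            chunk, new_size = [], row_len
        chunk.append(row)
        size = new_size
    if chunk:
        yield _render(chunk)


def _render(rows: List[Row]) -> Tuple[str, str]:
    """Build the text and HTML forms from the same rows so they cannot drift apart."""
    text = "\n".join(" ".join(row) for row in rows)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>" for row in rows
    )
    return text, f"<table>{body}</table>"
```

A row that is longer than the window on its own is still emitted whole in this sketch, which loosely mirrors the "whenever possible" qualifier in the note above.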
0.15.5
Enhancements
Features
Fixes
- Revert to using `unstructured.pytesseract` fork. Due to the unavailability of some recent release versions of `pytesseract` on PyPI, the project now uses the `unstructured.pytesseract` fork to ensure stability and continued support.
- Bump `libreoffice` version in image. Bumps the `libreoffice` version to `25.2.5.2` to address CVEs.
- Downgrade NLTK dependency version for compatibility. Due to the unavailability of `nltk==3.8.2` on PyPI, the NLTK dependency has been downgraded to `<3.8.2`. This change ensures continued functionality and compatibility.
0.15.4
Enhancements
Features
Fixes
- Resolve an installation error with `pytesseract>=0.3.12` that occurred during `pip install unstructured[pdf]==0.15.3`.
0.15.3
Enhancements
Features
Fixes
- Remove the custom index URL from `extra-paddleocr.in` to resolve the error in the `setup.py` configuration.
0.15.2
Enhancements
- Improve directory handling when extracting image blocks. The `figures` directory is no longer created when the `extract_image_block_to_payload` parameter is set to `True`.
Features
- Added per-class Object Detection metrics in the evaluation. The metrics include average precision, precision, recall, and f1-score for each class in the dataset.
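For reference, per-class precision, recall, and f1-score follow directly from true-positive, false-positive, and false-negative counts. The sketch below shows that arithmetic only. `per_class_scores` and its input layout are assumptions for illustration, and average precision (which needs ranked predictions) is left out.

```python
from typing import Dict, Tuple


def per_class_scores(counts: Dict[str, Tuple[int, int, int]]) -> Dict[str, Dict[str, float]]:
    """Map class name -> precision/recall/f1 given (true_pos, false_pos, false_neg) counts."""
    scores: Dict[str, Dict[str, float]] = {}
    for name, (tp, fp, fn) in counts.items():
        # Guard each ratio so classes with no predictions or no ground truth score 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[name] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```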
Fixes
- Updates NLTK data file for compatibility with `nltk>=3.8.2`. The NLTK data file now contains `punkt_tab`, making it possible to upgrade to `nltk>=3.8.2`. The `nltk==3.8.2` release patches CVE-2024-39705.
- Renames Astra to Astra DB. Conforms with DataStax internal naming conventions.
- Accommodate single-column CSV files. Resolves a limitation of `partition_csv()` where delimiter detection would fail on a single-column CSV file (which naturally has no delimiters).
- Accommodate `image/jpg` in PPTX as alias for `image/jpeg`. Resolves problem partitioning PPTX files having an invalid `image/jpg` (should be `image/jpeg`) MIME-type in the `[Content_Types].xml` member of the PPTX Zip archive.
- Fixes an issue in Object Detection metrics. The issue was in preprocessing/validating the ground truth and predicted data for object detection metrics.
- Removes dependency on unstructured.pytesseract. Unstructured forked pytesseract while waiting for code to be upstreamed. Now that the new version has been released, this fork can be removed.
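The single-column CSV case can be reproduced with the standard library alone. The sketch below is an assumption about how such a fallback might look, not the code used by `partition_csv()`. `detect_delimiter` is a hypothetical helper built on `csv.Sniffer`, which raises `csv.Error` when none of the candidate delimiters appears in the sample.

```python
import csv


def detect_delimiter(sample: str, default: str = ",") -> str:
    """Sniff the delimiter of a CSV sample, falling back to `default` when none is found."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=",;|\t").delimiter
    except csv.Error:
        # No candidate delimiter found, e.g. a single-column file.
        return default
```

With a fallback in place, a file containing one value per line is simply read as a one-column table instead of failing during detection.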
0.15.1
Enhancements
- Improve `pdfminer` embedded `image` extraction to exclude text elements and produce more accurate bounding boxes. This results in cleaner, more precise element extraction in `pdf` partitioning.
Features
- Update partition_eml and partition_msg to capture cc, bcc, and message_id fields. Cc, bcc, and message_id information is captured in element metadata for both msg and email partitioning, and `Recipient` elements are generated for cc and bcc when `include_headers=True` for email partitioning.
- Mark ingest as deprecated. Begin sunset of ingest code in this repo as it's been moved to a dedicated repo.
- Add `pdf_hi_res_max_pages` argument for partitioning, which allows rejecting PDF files that exceed this page number limit when the `hi_res` strategy is chosen. By default, PDF files with an unlimited number of pages are allowed.
Fixes
- Update `HuggingFaceEmbeddingEncoder` to use `HuggingFaceEmbeddings` from the `langchain_huggingface` package instead of the deprecated version from `langchain-community`. This resolves the deprecation warning and ensures compatibility with future versions of langchain.
- Update `OpenAIEmbeddingEncoder` to use `OpenAIEmbeddings` from the `langchain-openai` package instead of the deprecated version from `langchain-community`. This resolves the deprecation warning and ensures compatibility with future versions of langchain.
- Update import of Pinecone exception. Adds compatibility for pinecone-client>=5.0.0.
- File-type detection catches non-existent file-path. `detect_filetype()` no longer silently falls back to detecting a file-type based on the extension when no file exists at the path provided. Instead, `FileNotFoundError` is raised. This provides consistent user notification of a mis-typed path rather than an unpredictable exception from a file-type specific partitioner when the file cannot be opened.
- EML files specified as a file-path are detected correctly. Resolved a bug where an EML file submitted to `partition()` as a file-path was identified as TXT and partitioned using `partition_text()`. EML files specified by path are now identified and processed correctly, including processing any attachments.
- A DOCX, PPTX, or XLSX file specified by path and ambiguously identified as MIME-type "application/octet-stream" is identified correctly. Resolves a shortcoming where a file specified by path immediately fell back to filename-extension based identification when misidentified as "application/octet-stream", either by asserted content type or a mis-guess by libmagic. An MS Office file misidentified in this way is now correctly identified regardless of its filename and whether it is specified by path or file-like object.
- Textual content retrieved from a URL with gzip transport compression now partitions correctly. Resolves a bug where a textual file-type (such as Markdown) retrieved by passing a URL to `partition()` would raise when `gzip` compression was used for transport by the server.
- A DOCX, PPTX, or XLSX content-type asserted on partition is confirmed or fixed. Resolves a bug where calling `partition()` with a swapped MS-Office `content_type` would cause the file-type to be misidentified. A DOCX, PPTX, or XLSX MIME-type received by `partition()` is now checked for accuracy and corrected if the file is for a different MS-Office 2007+ type.
- DOC, PPT, XLS, and MSG files are now auto-detected correctly. Resolves a bug where DOC, PPT, and XLS files were auto-detected as MSG files under certain circumstances.
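One way to confirm which MS-Office 2007+ type a file really is, independent of its name or asserted MIME-type, is to look inside the Zip archive. The sketch below is an illustration under assumptions, not the library's detection logic: `detect_ooxml_type` is a hypothetical helper, and the marker paths are the conventional locations of the main document part in each format.

```python
import io
import zipfile
from typing import BinaryIO, Optional

# Conventional main-part paths that distinguish the three Office Open XML formats.
OOXML_MARKERS = {
    "word/document.xml": "DOCX",
    "ppt/presentation.xml": "PPTX",
    "xl/workbook.xml": "XLSX",
}


def detect_ooxml_type(file: BinaryIO) -> Optional[str]:
    """Return "DOCX", "PPTX", or "XLSX" based on archive members, else None."""
    if not zipfile.is_zipfile(file):
        return None
    file.seek(0)
    with zipfile.ZipFile(file) as archive:
        names = set(archive.namelist())
    for marker, file_type in OOXML_MARKERS.items():
        if marker in names:
            return file_type
    return None


# Usage: build a tiny in-memory archive that mimics a PPTX and detect it.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("ppt/presentation.xml", "<presentation/>")
detected = detect_ooxml_type(buffer)
```

Inspecting the archive members gives the same answer whether the caller labelled the file as DOCX, XLSX, or "application/octet-stream", which is the behavior the notes above describe.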