Releases: Unstructured-IO/unstructured
0.7.7
Enhancements
- Adds functionality to replace the MIME encodings for eml files with one of the common encodings if a unicode error occurs
- Adds missed file-like object handling in detect_file_encoding
- Adds functionality to extract charset info from eml files
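
The charset extraction and encoding fallback described in these entries can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the library's implementation: COMMON_ENCODINGS, charset_from_eml, and decode_with_fallback are hypothetical names, and the list of fallback encodings is assumed.

```python
from email import message_from_bytes
from typing import List, Optional, Tuple

# Assumed fallback list; the library's actual list may differ.
# iso-8859-1 accepts any byte sequence, so it goes last.
COMMON_ENCODINGS: List[str] = ["utf-8", "cp1252", "iso-8859-1"]


def charset_from_eml(raw: bytes) -> Optional[str]:
    """Return the charset declared in the Content-Type header, if any."""
    msg = message_from_bytes(raw)
    return msg.get_content_charset()


def decode_with_fallback(
    data: bytes, declared: Optional[str] = None
) -> Tuple[str, str]:
    """Try the declared encoding first, then each common encoding in turn."""
    candidates = ([declared] if declared else []) + COMMON_ENCODINGS
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: wrong encoding; LookupError: unknown name.
            continue
    raise ValueError("could not decode data with any candidate encoding")
```

The function returns both the decoded text and the encoding that succeeded, so a caller can record which fallback was used.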
Features
- Added coordinate system class to track coordinate types and convert to different coordinate
Fixes
- Adds an html_assemble_articles kwarg to partition_html to enable users to control whether content outside of <article> tags is captured when <article> tags are present.
- Check for the xml attribute on element before looking for pagebreaks in partition_docx.
0.7.6
Enhancements
- Convert fast strategy to ocr_only for images
- Adds support for page numbers in .docx and .doc when user or renderer created page breaks are present.
- Adds retry logic for the unstructured-ingest Biomed connector
Features
- Provides users with the ability to extract additional metadata via regex.
- Updates partition_docx to include headers and footers in the output.
- Create partition_tsv and associated tests. Make additional changes to detect_filetype.
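
As an illustration of extracting additional metadata via regex, the sketch below maps user-supplied field names to patterns and records every match with its offsets. The function name extract_regex_metadata and the shape of the returned dictionary are assumptions made for this example, not the library's confirmed interface.

```python
import re
from typing import Dict, List


def extract_regex_metadata(
    text: str, patterns: Dict[str, str]
) -> Dict[str, List[dict]]:
    """Hypothetical helper: collect all matches for each named pattern."""
    results: Dict[str, List[dict]] = {}
    for name, pattern in patterns.items():
        results[name] = [
            {"text": m.group(), "start": m.start(), "end": m.end()}
            for m in re.finditer(pattern, text)
        ]
    return results
```

Keeping the start and end offsets alongside the matched text lets downstream code locate each match in the element without re-running the pattern.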
Fixes
- Remove fake api key in test partition_via_api since we now require valid/empty api keys
- Page number defaults to None instead of 1 when page number is not present in the metadata. A page number of None indicates that page numbers are not being tracked for the document or that page numbers do not apply to the element in question.
- Fixes an issue with some pptx files. Assume pptx shapes are found in top left position of slide in case the shape.top and shape.left attributes are None.
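
The change to the page number default means that consumers should treat None as "not tracked" and not as page 1. A minimal sketch of that handling follows; the Metadata dataclass here is a hypothetical stand-in for illustration, not the library's ElementMetadata class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metadata:
    """Hypothetical stand-in for an element's metadata."""

    filename: Optional[str] = None
    page_number: Optional[int] = None  # None: not tracked or not applicable


def describe_page(meta: Metadata) -> str:
    """Distinguish a missing page number from a real page 1."""
    if meta.page_number is None:
        return "page not tracked"
    return f"page {meta.page_number}"
```

Checking with `is None` instead of truthiness keeps the test explicit and avoids conflating a missing value with any falsy number.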
0.7.5
Enhancements
- Adds functionality to sort elements in partition_pdf for fast strategy
- Adds ingest tests with --fast strategy on PDF documents
- Adds --api-key to unstructured-ingest
Features
- Adds partition_rst for processed ReStructured Text documents.
Fixes
- Adds handling for emails that do not have a datetime to extract.
- Adds pdf2image package as core requirement of unstructured (with no extras)
0.7.4
Enhancements
- Allows passing kwargs to request data field for partition_via_api and partition_multiple_via_api
- Enable MIME type detection if libmagic is not available
- Adds handling for empty files in detect_filetype and partition.
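
One way to detect MIME types when libmagic is unavailable is to fall back to extension-based guessing from the standard library. The sketch below assumes the optional python-magic package as the libmagic binding; detect_mime_type and mime_from_extension are hypothetical names, and returning None for an empty file is an assumption made for illustration.

```python
import mimetypes
import os
from typing import Optional

try:
    import magic  # python-magic; needs the libmagic system library
except ImportError:
    magic = None


def mime_from_extension(filename: str) -> Optional[str]:
    """Guess the MIME type from the file extension only."""
    mime_type, _ = mimetypes.guess_type(filename)
    return mime_type


def detect_mime_type(filename: str) -> Optional[str]:
    """Use libmagic when present, otherwise fall back to the extension."""
    if os.path.getsize(filename) == 0:
        return None  # assumed convention here: empty files have no type
    if magic is not None and hasattr(magic, "from_file"):
        return magic.from_file(filename, mime=True)
    return mime_from_extension(filename)
```

The extension fallback is less reliable than content sniffing, so a caller may want to treat its result as a hint.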
Features
Fixes
- Resolve grpcio import issue on weaviate.schema.validate_schema for python 3.9 and 3.10
- Remove building detectron2 from source in Dockerfile
0.7.3
Enhancements
- Update IngestDoc abstractions and add data source metadata in ElementMetadata
Features
Fixes
- Pass strategy parameter down from partition for partition_image
- Filetype detection if a CSV has a text/plain MIME type
- convert_office_doc no longer prints file conversion info messages to stdout.
- partition_via_api reflects the actual filetype for the file processed in the API.
0.7.2
Enhancements
- Adds an optional encoding kwarg to elements_to_json and elements_from_json
- Bump version of base image to use new stable version of tesseract
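
The effect of an encoding kwarg on a JSON round trip can be sketched as follows. The helpers write_elements_json and read_elements_json are hypothetical and operate on plain dictionaries; they illustrate the idea and are not the library's elements_to_json and elements_from_json.

```python
import json
from typing import List


def write_elements_json(
    elements: List[dict], filename: str, encoding: str = "utf-8"
) -> None:
    """Serialize element dictionaries using the requested file encoding."""
    with open(filename, "w", encoding=encoding) as f:
        # ensure_ascii=False keeps non-ASCII text as-is, so the file
        # encoding determines how it is stored on disk.
        json.dump(elements, f, indent=4, ensure_ascii=False)


def read_elements_json(filename: str, encoding: str = "utf-8") -> List[dict]:
    """Read the dictionaries back; the encoding must match the writer's."""
    with open(filename, "r", encoding=encoding) as f:
        return json.load(f)
```

Reading with a different encoding than the one used for writing will generally fail or produce garbled text, which is why both helpers take the same parameter.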
Features
Fixes
- Update the read_txt_file utility function to keep using spooled_to_bytes_io_if_needed for xml
- Add functionality to the read_txt_file utility function to handle file-like object from URL
- Remove the unused parameter encoding from partition_pdf
- Change auto.py to have a None default for encoding
- Add functionality to try other common encodings for html and xml files if an error related to the encoding is raised and the user has not specified an encoding.
- Adds benchmark test with test docs in example-docs
- Re-enable test_upload_label_studio_data_with_sdk
- File detection now detects code files as plain text
- Adds tabulate explicitly to dependencies
- Fixes an issue in metadata.page_number of pptx files
- Adds showing help if no parameters passed
0.7.1
Enhancements
Features
- Add stage_for_weaviate to stage unstructured outputs for upload to Weaviate, along with a helper function for defining a class to use in Weaviate schemas.
- Builds from the Unstructured base image, which is built off of Rocky Linux 8.7; this resolves almost all CVEs in the image.
Fixes
0.7.0
Enhancements
- Installing detectron2 from source is no longer required when using the local-inference extra.
- Updates .pptx parsing to include text in tables.
Features
Fixes
- Fixes an issue in _add_element_metadata that caused all elements to have page_number=1 in the element metadata.
- Adds .log as a file extension for TXT files.
- Adds functionality to try other common encodings for email (.eml) files if an error related to the encoding is raised and the user has not specified an encoding.
- Allow passed encoding to be used in the replace_mime_encodings
- Fixes page metadata for partition_html when include_metadata=False
- A ValueError now raises if file_filename is not specified when you use partition_via_api with a file-like object.
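
The file_filename check in the last entry exists because a file-like object carries no reliable name for the server to infer a file type from. The validation can be sketched as below; prepare_upload is a hypothetical helper written for this example and does not reflect how partition_via_api is implemented.

```python
import io
from typing import IO, Optional, Tuple


def prepare_upload(
    file: IO[bytes], file_filename: Optional[str] = None
) -> Tuple[str, bytes]:
    """Hypothetical helper: pair a file-like object's bytes with a name."""
    if file_filename is None:
        raise ValueError(
            "file_filename must be specified when passing a file-like object"
        )
    return file_filename, file.read()


# Usage: the name travels with the payload so the type can be detected.
name, payload = prepare_upload(io.BytesIO(b"hello"), file_filename="note.txt")
```

Raising early on the client side gives a clearer error than letting the request fail remotely.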
0.6.11
Enhancements
- Supports epub tests since pandoc is updated in base image