Releases: Unstructured-IO/unstructured
0.8.6
0.8.5
Enhancements
- Add parameter `skip_infer_table_types` to enable (skip) table extraction for other doc types
- Adds optional Unstructured API unit tests in CI
- Tracks last modified date for all document types.
Features
Fixes
- NLTK now only gets downloaded if necessary.
- Handling for empty tables in Word Documents and PowerPoints.
0.8.4
Enhancements
- Additional tests and refactor of JSON detection.
- Update functionality to retrieve image metadata from a page for `document_to_element_list`
- Links are now tracked in `partition_html` output.
- Set the file's current position to the beginning after reading the file in `convert_to_bytes`
- Add `min_partition` kwarg that combines elements below a specified threshold and modifies splitting of strings longer than max partition so words are not split.
- Add slide notes to pptx
- Add `--encoding` directive to ingest
- Improve JSON detection by `detect_filetype`
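The file-position change above can be illustrated with a minimal sketch. The helper name `read_bytes_and_rewind` is hypothetical and is not the library's `convert_to_bytes`; it only shows why rewinding matters when the same file object is read again later.

```python
import io
from typing import BinaryIO


def read_bytes_and_rewind(file: BinaryIO) -> bytes:
    """Read the remaining bytes, then move the position back to the start.

    Hypothetical helper for illustration only.
    """
    data = file.read()
    file.seek(0)  # without this, a later read() would return b""
    return data


stream = io.BytesIO(b"hello world")
first = read_bytes_and_rewind(stream)
second = stream.read()  # still sees the full content because of the rewind
```

Without the `seek(0)` call, `second` would be empty, which is the kind of surprise the enhancement is meant to avoid.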
Features
- Adds Outlook connector
- Add support for dpi parameter in inference library
- Adds OneDrive connector.
- Add Confluence connector for ingest CLI to pull the body text from all documents from all spaces in a Confluence domain.
Fixes
- Fixes issue with email partitioning where From field was being assigned the To field value.
- Use the `image_metadata` property of the `PageLayout` instance to get the page image info in the `document_to_element_list`
- Add functionality to write images to computer storage temporarily instead of keeping them in memory for `ocr_only` strategy
- Add functionality to convert a PDF in small chunks of pages at a time for `ocr_only` strategy
- Adds `.txt`, `.text`, and `.tab` to list of extensions to check if file has a `text/plain` MIME type.
- Enables filters to be passed to `partition_doc` so it doesn't error with LibreOffice7.
- Removed old error message that's superseded by `requires_dependencies`.
- Removes using `hi_res` as the default strategy value for `partition_via_api` and `partition_multiple_via_api`
0.8.1
Enhancements
- Add support for Python 3.11
Features
Fixes
- Fixed `auto` strategy detecting a scanned document as having extractable text and using the `fast` strategy, resulting in no output.
- Fix list detection in MS Word documents.
- Don't instantiate an element with a coordinate system when there isn't a way to get its location data.
0.8.0
Enhancements
- Allow model used for hi res pdf partition strategy to be chosen when called.
- Updated inference package
Features
- Add metadata_filename parameter across all partition functions
Fixes
- Adjust encoding recognition threshold value in `detect_file_encoding`
- Fix KeyError when `isd_to_elements` doesn't find a type
- Fix _output_filename for local connector, allowing single files to be written correctly to the disk
- Fix for cases where an invalid encoding is extracted from an email header.
BREAKING CHANGES
- Information about an element's location is no longer returned as top-level attributes of an element. Instead, it is returned in the `coordinates` attribute of the element's metadata.
0.7.12
Enhancements
- Adds `include_metadata` kwarg to `partition_doc`, `partition_docx`, `partition_email`, `partition_epub`, `partition_json`, `partition_msg`, `partition_odt`, `partition_org`, `partition_pdf`, `partition_ppt`, `partition_pptx`, `partition_rst`, and `partition_rtf`
Features
- Adds Dropbox connector
Fixes
- Fix tests that call unstructured-api by passing through an api-key
- Fixed page breaks being given (incorrect) page numbers
- Fix skipping download on ingest when a source document exists locally
0.7.11
Enhancements
- More deterministic element ordering when using `hi_res` PDF parsing strategy (from unstructured-inference bump to 0.5.4)
- Make large model available (from unstructured-inference bump to 0.5.3)
- Combine inferred elements with extracted elements (from unstructured-inference bump to 0.5.2)
- `partition_email` and `partition_msg` will now process attachments if `process_attachments=True` and an attachment partitioning function is passed through with `attachment_partitioner=partition`.
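A minimal sketch of the attachment option described above, assuming `unstructured` is installed. The wrapper name `partition_email_with_attachments` and the file name in the usage note are hypothetical; the two keyword arguments are the ones named in the release note.

```python
def partition_email_with_attachments(filename: str):
    """Partition an email together with its attachments.

    Hypothetical wrapper for illustration; requires the `unstructured` package.
    """
    # Imports are deferred so this sketch can be loaded without the library.
    from unstructured.partition.auto import partition
    from unstructured.partition.email import partition_email

    return partition_email(
        filename=filename,
        process_attachments=True,          # opt in to attachment handling
        attachment_partitioner=partition,  # function used on each attachment
    )
```

Calling `partition_email_with_attachments("message.eml")` on a local `.eml` file should return the elements of the message body followed by those of any attachments, assuming the behavior matches the note above.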
Features
Fixes
- Fix tests that call unstructured-api by passing through an api-key
- Fixed page breaks being given (incorrect) page numbers
- Fix skipping download on ingest when a source document exists locally
0.7.10
Enhancements
- Adds a `max_partition` parameter to `partition_text`, `partition_pdf`, `partition_email`, `partition_msg` and `partition_xml` that sets a limit for the size of an individual document element. Defaults to `1500` for everything except `partition_xml`, which has a default value of `None`.
- DRY connector refactor
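The effect of a size limit such as `max_partition` can be sketched in plain Python. This is not the library's implementation: the function name `split_on_words` is hypothetical, and splitting on word boundaries is an assumption borrowed from the 0.8.4 note about not splitting words.

```python
from typing import List, Optional


def split_on_words(text: str, max_partition: Optional[int] = 1500) -> List[str]:
    """Split text into chunks of at most max_partition characters.

    Hypothetical sketch: None disables splitting, and a single word longer
    than the limit is kept whole rather than cut.
    """
    if max_partition is None or len(text) <= max_partition:
        return [text]
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_partition or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


chunks = split_on_words("aaa bbb ccc ddd", max_partition=7)
```

Passing `max_partition=None` returns the text as a single chunk, mirroring the default described for `partition_xml`.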
Features
- `hi_res` model for pdfs and images is selectable via environment variable.
Fixes
- CSV check now ignores escaped commas.
- Fix for filetype exploration util when file content does not have a comma.
- Adds negative lookahead to bullet pattern to avoid detecting plain text line breaks like `-------` as list items.
- Fix pre tag parsing for `partition_html`
- Fix lookup error for annotated Arabic and Hebrew encodings
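The negative-lookahead fix for bullets can be illustrated with a small regular expression. The pattern below is a hypothetical stand-in, not the library's actual bullet pattern; it only shows how a lookahead keeps a run of dashes from being read as a list item.

```python
import re

# Hypothetical pattern: a dash, asterisk, or bullet at the start of a line,
# followed by whitespace and text. The (?![-*]) lookahead rejects lines
# where the marker is immediately repeated, such as "-------".
BULLET_RE = re.compile(r"^\s*[-*•](?![-*])\s+\S")


def is_bullet(line: str) -> bool:
    """Return True if the line looks like a list item under this sketch."""
    return BULLET_RE.match(line) is not None
```

With this pattern, `is_bullet("- first item")` is true while `is_bullet("-------")` is false.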
0.7.9
0.7.8
Enhancements
Features
- Adds Google Cloud Service connector
Fixes
- Updates the `parse_email` for `partition_eml` so that `unstructured-api` passes the smoke tests
- `partition_email` now works if there is no message content
- Updates the `"fast"` strategy for `partition_pdf` so that it's able to recursively
- Adds recursive functionality to all fsspec connectors
- Adds generic --recursive ingest flag