Releases: Unstructured-IO/unstructured

0.8.6

28 Jul 06:47
84db9c4

Enhancements

Features

Fixes

  • Remove debug print lines and non-functional code

0.8.5

27 Jul 18:34
d46c1c2

Enhancements

  • Add parameter skip_infer_table_types to enable (skip) table extraction for other doc types
  • Adds optional Unstructured API unit tests in CI
  • Tracks last modified date for all document types.

Features

Fixes

  • NLTK now only gets downloaded if necessary.
  • Handling for empty tables in Word Documents and PowerPoints.

0.8.4

26 Jul 18:09
1e2d531

Enhancements

  • Additional tests and refactor of JSON detection.
  • Update functionality to retrieve image metadata from a page for document_to_element_list
  • Links are now tracked in partition_html output.
  • Set the file's current position to the beginning after reading the file in convert_to_bytes
  • Add min_partition kwarg that combines elements below a specified threshold and modifies splitting of strings longer than max_partition so words are not split.
  • Add slide notes to pptx
  • Add --encoding directive to ingest
  • Improve JSON detection in detect_filetype
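
The min_partition and max_partition behaviour described in the list above can be pictured with a small pure-Python sketch. This is not the library's implementation: split_on_words and combine_short are hypothetical names, and counting the thresholds in characters is an assumption.

```python
from typing import List


def split_on_words(text: str, max_partition: int) -> List[str]:
    """Split text into chunks of up to max_partition characters without cutting words.

    A single word longer than max_partition is kept whole rather than split.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_partition or not current:
            # Keep growing the chunk; an overlong single word starts its own chunk.
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def combine_short(parts: List[str], min_partition: int) -> List[str]:
    """Merge consecutive parts until each merged part reaches min_partition characters."""
    combined: List[str] = []
    buffer = ""
    for part in parts:
        buffer = f"{buffer} {part}" if buffer else part
        if len(buffer) >= min_partition:
            combined.append(buffer)
            buffer = ""
    if buffer:
        # A trailing remainder shorter than min_partition is kept as-is.
        combined.append(buffer)
    return combined


chunks = split_on_words("the quick brown fox jumps", max_partition=10)
merged = combine_short(["a", "bb", "ccccc", "d"], min_partition=3)
```

Splitting on whitespace keeps every word intact, at the cost of letting a single very long token exceed the limit.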

Features

  • Adds Outlook connector
  • Add support for dpi parameter in inference library
  • Adds OneDrive connector.
  • Add Confluence connector for ingest CLI to pull the body text from all documents from all spaces in a Confluence domain.

Fixes

  • Fixes issue with email partitioning where From field was being assigned the To field value.
  • Use the image_metadata property of the PageLayout instance to get the page image info in the document_to_element_list
  • Add functionality to write images to computer storage temporarily instead of keeping them in memory for ocr_only strategy
  • Add functionality to convert a PDF in small chunks of pages at a time for ocr_only strategy
  • Adds .txt, .text, and .tab to list of extensions to check if file
    has a text/plain MIME type.
  • Enables filters to be passed to partition_doc so it doesn't error with LibreOffice 7.
  • Removed old error message that's superseded by requires_dependencies.
  • Removes using hi_res as the default strategy value for partition_via_api and partition_multiple_via_api

0.8.1

11 Jul 14:35

Enhancements

  • Add support for Python 3.11

Features

Fixes

  • Fixed auto strategy detecting a scanned document as having extractable text and using the fast strategy, resulting in no output.
  • Fix list detection in MS Word documents.
  • Don't instantiate an element with a coordinate system when there isn't a way to get its location data.

0.8.0

07 Jul 15:41
5e11501

Enhancements

  • Allow the model used for the hi_res PDF partition strategy to be chosen when called.
  • Updated inference package

Features

  • Add metadata_filename parameter across all partition functions

Fixes

  • Adjust encoding recognition threshold value in detect_file_encoding
  • Fix KeyError when isd_to_elements doesn't find a type
  • Fix _output_filename for local connector, allowing single files to be written correctly to the disk
  • Fix for cases where an invalid encoding is extracted from an email header.

BREAKING CHANGES

  • Information about an element's location is no longer returned as top-level attributes of an element. Instead, it is returned in the coordinates attribute of the element's metadata.
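
For code that has to run against versions on both sides of this change, one option is a small accessor that checks the new location first. A minimal sketch: get_coordinates is a hypothetical helper, the top-level attribute name coordinates for older versions is an assumption, and the SimpleNamespace objects are stand-ins for real elements so the example runs without the library installed.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_coordinates(element: Any) -> Optional[Any]:
    """Return coordinates from element.metadata (0.8.0+) or the top level (older)."""
    metadata = getattr(element, "metadata", None)
    coordinates = getattr(metadata, "coordinates", None)
    if coordinates is not None:
        return coordinates
    # Fall back to the assumed pre-0.8.0 layout: a top-level attribute.
    return getattr(element, "coordinates", None)


# Stand-in objects that mimic the two layouts.
new_style = SimpleNamespace(metadata=SimpleNamespace(coordinates=((0, 0), (0, 10))))
old_style = SimpleNamespace(coordinates=((0, 0), (0, 10)), metadata=SimpleNamespace())

print(get_coordinates(new_style))
print(get_coordinates(old_style))
```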

0.7.12

01 Jul 02:32
6249e15

Enhancements

  • Adds include_metadata kwarg to partition_doc, partition_docx, partition_email, partition_epub, partition_json, partition_msg, partition_odt, partition_org, partition_pdf, partition_ppt, partition_pptx, partition_rst, and partition_rtf

Features

  • Adds Dropbox connector

Fixes

  • Fix tests that call unstructured-api by passing through an api-key
  • Fixed page breaks being given (incorrect) page numbers
  • Fix skipping download on ingest when a source document exists locally

0.7.11

30 Jun 01:42
350bb1d

Enhancements

  • More deterministic element ordering when using hi_res PDF parsing strategy (from unstructured-inference bump to 0.5.4)
  • Make large model available (from unstructured-inference bump to 0.5.3)
  • Combine inferred elements with extracted elements (from unstructured-inference bump to 0.5.2)
  • partition_email and partition_msg will now process attachments if process_attachments=True
    and an attachment partitioning function is passed through with attachment_partitioner=partition.
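
The attachment handling above amounts to walking a message's attachments and handing each one to a partitioning callable. The sketch below shows that idea using only Python's standard email package. It is not the library's implementation, and partition_attachments and toy_partitioner are hypothetical names.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from typing import Callable, List


def partition_attachments(
    raw: bytes,
    attachment_partitioner: Callable[[str], List[str]],
) -> List[str]:
    """Run attachment_partitioner over every text attachment in a raw email."""
    message = message_from_bytes(raw, policy=policy.default)
    elements: List[str] = []
    for part in message.iter_attachments():
        if part.get_content_maintype() == "text":
            elements.extend(attachment_partitioner(part.get_content()))
    return elements


def toy_partitioner(text: str) -> List[str]:
    """Stand-in partitioner: one element per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Build a small message with one text attachment to exercise the sketch.
msg = EmailMessage()
msg["From"] = "sender@example.com"
msg["To"] = "recipient@example.com"
msg["Subject"] = "Report"
msg.set_content("See attached.")
msg.add_attachment("First line\nSecond line\n", filename="notes.txt")

elements = partition_attachments(msg.as_bytes(), toy_partitioner)
```

Passing the partitioner in as a callable keeps the email code independent of whichever function ends up handling the attachment's file type.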

Features

Fixes

  • Fix tests that call unstructured-api by passing through an api-key
  • Fixed page breaks being given (incorrect) page numbers
  • Fix skipping download on ingest when a source document exists locally

0.7.10

28 Jun 19:27
44411ec

Enhancements

  • Adds a max_partition parameter to partition_text, partition_pdf, partition_email,
    partition_msg and partition_xml that sets a limit for the size of an individual
    document element. Defaults to 1500 for everything except partition_xml, which has
    a default value of None.
  • DRY connector refactor

Features

  • hi_res model for pdfs and images is selectable via environment variable.

Fixes

  • CSV check now ignores escaped commas.
  • Fix for filetype exploration util when file content does not have a comma.
  • Adds negative lookahead to bullet pattern to avoid detecting plain text line
    breaks like ------- as list items.
  • Fix pre tag parsing for partition_html
  • Fix lookup error for annotated Arabic and Hebrew encodings

0.7.9

26 Jun 21:54
95f02f2

Enhancements

  • Improvements to string check for leaf nodes in partition_xml.
  • Adds --partition-ocr-languages to unstructured-ingest.

Features

  • Adds partition_org for processing Org Mode documents.

Fixes

0.7.8

23 Jun 02:23
5f5da65

Enhancements

Features

  • Adds Google Cloud Storage connector

Fixes

  • Updates the parse_email for partition_eml so that unstructured-api passes the smoke tests
  • partition_email now works if there is no message content
  • Updates the "fast" strategy for partition_pdf so that it's able to recursively
  • Adds recursive functionality to all fsspec connectors
  • Adds generic --recursive ingest flag