0.5.4

@MthwRobinson released this 14 Mar 15:54
e43cb0e

Enhancements

  • Adds a biomedical literature connector for the ingest CLI.
  • Add FsspecConnector to easily integrate any existing fsspec filesystem as a connector.
  • Rename s3_connector.py to s3.py for readability and consistency with the
    rest of the connectors.
  • S3Connector now relies on s3fs instead of boto3 and inherits
    from FsspecConnector.
  • Adds an UNSTRUCTURED_LANGUAGE_CHECKS environment variable to control whether
    language-specific checks such as vocabulary and POS tagging are applied. Set to "true" for
    higher-resolution partitioning and "false" for faster processing.
  • Improves detect_filetype warning to include filename when provided.
  • Adds a "fast" strategy for partitioning PDFs with PDFMiner. Also falls back to the "fast"
    strategy if detectron2 is not available.
  • Starts the deprecation life cycle for the unstructured-ingest --s3-url option, which will
    be replaced by --remote-url.
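
The environment variable and the PDF strategy fallback described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the helper names `language_checks_enabled` and `choose_pdf_strategy` are hypothetical, the default of `False` for an unset variable is an assumption, and `"hi_res"` is assumed to be the name of the detectron2-backed strategy (only `"fast"` is named in these notes).

```python
import importlib.util
import os


def language_checks_enabled(default: bool = False) -> bool:
    """Hypothetical helper: read UNSTRUCTURED_LANGUAGE_CHECKS as a boolean.

    The default used when the variable is unset is an assumption.
    """
    raw = os.environ.get("UNSTRUCTURED_LANGUAGE_CHECKS")
    if raw is None:
        return default
    return raw.strip().lower() == "true"


def choose_pdf_strategy(requested: str = "hi_res") -> str:
    """Hypothetical helper: fall back to "fast" when detectron2 is missing.

    "hi_res" is an assumed name for the detectron2-backed strategy.
    """
    if requested == "fast":
        return "fast"
    # find_spec returns None when the top-level package is not installed.
    has_detectron2 = importlib.util.find_spec("detectron2") is not None
    return requested if has_detectron2 else "fast"
```

With a pattern like this, a caller that asks for the model-based strategy on a machine without detectron2 still gets output through the PDFMiner path rather than an import error.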

Features

  • Add AzureBlobStorageConnector, based on its fsspec implementation and inheriting
    from FsspecConnector.
  • Add partition_epub for partitioning e-books in EPUB3 format.

Fixes

  • Fixes processing for text files with message/rfc822 MIME type.
  • Opens XML files in read-only mode when reading contents to construct an XMLDocument.
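
As an illustration of the read-only fix above, the sketch below parses an XML file through a handle opened for reading only. It uses the standard library's `xml.etree.ElementTree` purely as a stand-in; the parser used by XMLDocument is not stated in these notes, and `read_xml_root` is a hypothetical helper name.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def read_xml_root(path: str) -> ET.Element:
    """Parse an XML file opened read-only and return its root element."""
    # "rb" requests read access only, so this handle cannot modify the file.
    with open(path, "rb") as f:
        return ET.parse(f).getroot()


# Usage: write a small sample file, read it back, then clean up.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as tmp:
    tmp.write("<doc><p>hello</p></doc>")
    sample_path = tmp.name

root = read_xml_root(sample_path)
first_paragraph = root.find("p").text
os.remove(sample_path)
```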