Skip to content

0.6.7

Compare
Choose a tag to compare
@MthwRobinson MthwRobinson released this 19 May 17:31
· 1327 commits to main since this release
046af73

0.6.7

Enhancements

  • Deprecate --s3-url in favor of --remote-url in CLI
  • Refactor out non-connector-specific config variables
  • Add file_directory to metadata
  • Add page_name to metadata. Currently used for the sheet name in XLSX documents.
  • Added a --partition-strategy parameter to unstructured-ingest so that users can specify
    partition strategy in CLI. For example, --partition-strategy fast.
  • Added metadata for filetype.
  • Add Discord connector to pull messages from a list of channels
  • Refactor unstructured/file-utils/filetype.py to better utilise hashmap to return mime type.
  • Add local declaration of DOCX_MIME_TYPES and XLSX_MIME_TYPES for test_filetype.py.

Features

  • Add partition_xml for XML files.
  • Add partition_xlsx for Microsoft Excel documents.

Fixes

  • Supports hml filetype for partition as a variation of html filetype.
  • Makes pytesseract a function level import in partition_pdf so you can use the "fast"
    or "hi_res" strategies if pytesseract is not installed. Also adds the
    required_dependencies decorator for the "hi_res" and "ocr_only" strategies.
  • Fix to ensure filename is tracked in metadata for docx tables.