0.6.7
Enhancements
Deprecate --s3-url in favor of --remote-url in CLI
Refactor out non-connector-specific config variables
Add file_directory to metadata
Add page_name to metadata. Currently used for the sheet name in XLSX documents.
Add a --partition-strategy parameter to unstructured-ingest so that users can specify the partition strategy from the CLI, for example --partition-strategy fast.
Add filetype to metadata.
Add Discord connector to pull messages from a list of channels
Refactor unstructured/file_utils/filetype.py to make better use of a hashmap when returning the MIME type.
Add local declaration of DOCX_MIME_TYPES and XLSX_MIME_TYPES for test_filetype.py.
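The hashmap refactor mentioned above can be illustrated with a minimal sketch. It is not the library's actual code: the `FileType` members, the two dictionaries, and both helper functions are hypothetical names chosen for illustration, and only a handful of MIME types are included.

```python
from enum import Enum
from typing import Optional


class FileType(Enum):
    # Hypothetical subset of file types, for illustration only.
    DOCX = "docx"
    XLSX = "xlsx"
    XML = "xml"
    HTML = "html"


# A single dictionary lookup replaces a chain of if/elif comparisons.
MIME_TO_FILETYPE = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": FileType.DOCX,
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": FileType.XLSX,
    "application/xml": FileType.XML,
    "text/xml": FileType.XML,
    "text/html": FileType.HTML,
}

# Inverse map from file type to one canonical MIME type.
# setdefault keeps the first MIME type listed for each file type.
FILETYPE_TO_MIME = {}
for _mime, _filetype in MIME_TO_FILETYPE.items():
    FILETYPE_TO_MIME.setdefault(_filetype, _mime)


def filetype_from_mime(mime_type: str) -> Optional[FileType]:
    """Return the file type for a MIME type, or None if it is unknown."""
    return MIME_TO_FILETYPE.get(mime_type)


def mime_from_filetype(filetype: FileType) -> Optional[str]:
    """Return the canonical MIME type for a file type, or None if unmapped."""
    return FILETYPE_TO_MIME.get(filetype)
```

Keeping the mapping in one place also makes it easy for tests to declare their own expected MIME type lists and compare against it.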
Features
Add partition_xml for XML files.
Add partition_xlsx for Microsoft Excel documents.
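A short sketch of how the two new partition functions might be called. It assumes they are importable from `unstructured.partition.xml` and `unstructured.partition.xlsx` and accept a `filename` keyword; the `partition_by_extension` helper is hypothetical and exists only to show the calls side by side.

```python
from pathlib import Path


def partition_by_extension(path):
    """Hypothetical helper: route a file to partition_xml or partition_xlsx."""
    suffix = Path(path).suffix.lower()
    if suffix == ".xml":
        # Imported here so the sketch loads even without unstructured installed.
        from unstructured.partition.xml import partition_xml

        return partition_xml(filename=str(path))
    if suffix == ".xlsx":
        from unstructured.partition.xlsx import partition_xlsx

        return partition_xlsx(filename=str(path))
    raise ValueError(f"Unsupported extension for this sketch: {suffix!r}")
```

For an XLSX file, the returned elements should carry the sheet name in the page_name metadata field described under Enhancements.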
Fixes
Support the hml filetype in partition as a variation of the html filetype.
Make pytesseract a function-level import in partition_pdf so you can use the "fast"
or "hi_res" strategies when pytesseract is not installed. Also add the required_dependencies decorator for the "hi_res" and "ocr_only" strategies.
Ensure filename is tracked in metadata for docx tables.
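The function-level import fix follows a common pattern for optional dependencies, sketched below under stated assumptions. The `needs_packages` decorator and the two extraction functions are hypothetical and are not the library's implementation; the only third-party call used is `pytesseract.image_to_string`, and it is reached only when the OCR path runs.

```python
import functools
import importlib.util


def needs_packages(*packages):
    """Hypothetical decorator: raise at call time if a top-level package is missing."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # find_spec returns None when a top-level package cannot be found.
            missing = [p for p in packages if importlib.util.find_spec(p) is None]
            if missing:
                raise ImportError(
                    f"{func.__name__} requires missing packages: {', '.join(missing)}"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


def extract_fast(text):
    # This path never touches the optional dependency.
    return text.split()


@needs_packages("pytesseract")
def extract_ocr(image_path):
    # Function-level import: evaluated only when this strategy is used,
    # so importing the module itself never fails.
    import pytesseract

    return pytesseract.image_to_string(image_path)
```

With this layout, importing the module and calling `extract_fast` work without pytesseract, while `extract_ocr` fails with a clear ImportError instead of breaking every strategy at import time.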