Releases: Unstructured-IO/unstructured

0.6.9

24 May 22:31
c82bad1

Enhancements

  • The fast strategy for PDFs now keeps element bounding box data.
  • Refactored setup.py.

Features

Fixes

  • Adds functionality to try other common encodings if an error related to the encoding is raised and the user has not specified an encoding.
  • Adds additional MIME types for CSV
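The encoding fallback described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the function name decode_with_fallback and the list of candidate encodings are assumptions, since the release note does not say which encodings are tried.

```python
from typing import Optional, Tuple

# Hypothetical candidate list; the encodings the library actually tries are
# not stated in the release note.
COMMON_ENCODINGS = ["utf-8", "cp1252", "latin-1"]


def decode_with_fallback(data: bytes, encoding: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, encoding_used)."""
    if encoding is not None:
        # The caller chose an encoding explicitly, so respect it and let any
        # decoding error surface instead of guessing.
        return data.decode(encoding), encoding
    for candidate in COMMON_ENCODINGS:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError:
            continue  # try the next common encoding
    raise ValueError("could not decode data with any candidate encoding")
```

The fallback only runs when no encoding was specified, which matches the condition in the note. Because latin-1 maps every byte to a character, keeping it last means the loop always ends with some result, possibly an imperfect one.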

0.6.8

19 May 19:58
21c821d

Enhancements

Features

  • Add partition_csv for CSV files.

Fixes

0.6.7

19 May 17:31
046af73

Enhancements

  • Deprecate --s3-url in favor of --remote-url in CLI
  • Refactor out non-connector-specific config variables
  • Add file_directory to metadata
  • Add page_name to metadata. Currently used for the sheet name in XLSX documents.
  • Added a --partition-strategy parameter to unstructured-ingest so that users can specify
    partition strategy in CLI. For example, --partition-strategy fast.
  • Added metadata for filetype.
  • Add Discord connector to pull messages from a list of channels
  • Refactor unstructured/file_utils/filetype.py to make better use of a hash map to return the MIME type.
  • Add local declaration of DOCX_MIME_TYPES and XLSX_MIME_TYPES for test_filetype.py.
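As an illustration of the hash map refactor mentioned above, a dictionary lookup can replace a chain of comparisons. The sketch below is an assumption about the general shape only: the FileType enum, the MIME_TO_FILETYPE table, and filetype_from_mime are hypothetical names, and the direction of the lookup (MIME type to file type) is a guess.

```python
from enum import Enum
from typing import Optional


class FileType(Enum):
    # A small, illustrative subset of file types.
    CSV = "csv"
    DOCX = "docx"
    XLSX = "xlsx"
    UNKNOWN = "unknown"


# One dictionary lookup replaces a chain of if/elif comparisons.
MIME_TO_FILETYPE = {
    "text/csv": FileType.CSV,
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": FileType.DOCX,
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": FileType.XLSX,
}


def filetype_from_mime(mime_type: Optional[str]) -> FileType:
    if mime_type is None:
        return FileType.UNKNOWN
    # Unknown MIME types fall back to UNKNOWN instead of raising KeyError.
    return MIME_TO_FILETYPE.get(mime_type.lower(), FileType.UNKNOWN)
```

Adding a new type then means adding one dictionary entry rather than another branch.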

Features

  • Add partition_xml for XML files.
  • Add partition_xlsx for Microsoft Excel documents.

Fixes

  • Supports hml filetype for partition as a variation of html filetype.
  • Makes pytesseract a function level import in partition_pdf so you can use the "fast"
    or "hi_res" strategies if pytesseract is not installed. Also adds the
    required_dependencies decorator for the "hi_res" and "ocr_only" strategies.
  • Fix to ensure filename is tracked in metadata for docx tables.
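The function-level import and dependency decorator described above follow a common pattern, sketched here with the standard library only. The decorator name needs_modules and both decorated functions are hypothetical and are not the library's own code.

```python
import functools
import importlib.util
from typing import Callable, Sequence


def needs_modules(modules: Sequence[str]) -> Callable:
    """Hypothetical decorator: fail with a clear message if a module is missing."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Check at call time, so importing this file never fails.
            missing = [m for m in modules if importlib.util.find_spec(m) is None]
            if missing:
                raise ImportError(f"{func.__name__} requires: {', '.join(missing)}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@needs_modules(["json"])
def parse_with_json(text: str):
    # Function-level import: the module is loaded only when this path runs.
    import json

    return json.loads(text)


@needs_modules(["surely_not_an_installed_module_xyz"])
def ocr_like_path() -> str:
    return "unreachable when the module is missing"
```

Code paths that do not need the optional package keep working, and the paths that do need it fail with an explicit message rather than at import time.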

0.6.6

12 May 17:47
727d366

Enhancements

  • Adds an "auto" strategy that chooses the partitioning strategy based on document
    characteristics and function kwargs. This is the new default strategy for partition_pdf
    and partition_image. Users can maintain existing behavior by explicitly setting
    strategy="hi_res".
  • Added an additional trace logger for NLP debugging.
  • Add get_date method to ElementMetadata for converting the datestring to a datetime object.
  • Cleanup the filename attribute on ElementMetadata to remove the full filepath.
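To make the idea of an "auto" strategy concrete, here is a rough sketch of a strategy chooser. The decision rules, the function name choose_strategy, and its parameters are all assumptions made for illustration; the release note only says the choice depends on document characteristics and function kwargs.

```python
def choose_strategy(
    requested: str = "auto",
    is_image: bool = False,
    has_extractable_text: bool = True,
    infer_table_structure: bool = False,
) -> str:
    """Hypothetical chooser; the real decision rules may differ."""
    if requested != "auto":
        # An explicit strategy, e.g. "hi_res", is always respected.
        return requested
    if is_image or infer_table_structure:
        # Assumed: images and table extraction need the layout model.
        return "hi_res"
    if has_extractable_text:
        # Assumed: text can be read directly, so the quick path suffices.
        return "fast"
    # Assumed: no embedded text, so fall back to OCR.
    return "ocr_only"
```

Passing requested="hi_res" corresponds to the note's advice for keeping the previous behavior.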

Features

  • Added table reading as HTML with URL parsing to partition_docx
  • Added metadata field for text_as_html for docx files

Fixes

  • The file type check in file_utils now ignores decoding errors when checking for JSON and EML files.
  • partition_email was updated to more flexibly handle deviations from the RFC-2822 standard.
    The time in the metadata returns None if the time does not match RFC-2822 at all.
  • Include all metadata fields when converting to dataframe or CSV
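The tolerant date handling described for partition_email can be approximated with the standard library. This is a sketch under assumptions: email_date_to_iso is a hypothetical helper, and the library's own parsing may accept more variations than this does.

```python
from email.utils import parsedate_to_datetime
from typing import Optional


def email_date_to_iso(date_header: Optional[str]) -> Optional[str]:
    """Return an ISO 8601 string, or None if the header cannot be parsed."""
    if not date_header:
        return None
    try:
        return parsedate_to_datetime(date_header).isoformat()
    except (TypeError, ValueError):
        # Older Python versions raise TypeError and newer ones ValueError
        # when the header does not resemble an RFC 2822 date at all.
        return None
```

Returning None instead of raising lets the rest of the message be partitioned even when its Date header is malformed.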

0.6.5

10 May 04:40
b52638f

Enhancements

  • Added support for SpooledTemporaryFile file argument.
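A SpooledTemporaryFile keeps its contents in memory until a size threshold is reached and then spills to disk, so it is a common container for uploaded files. The sketch below shows one way a reader could accept it alongside ordinary binary file objects; read_text is a hypothetical helper, not a function from the library.

```python
import tempfile
from typing import IO, Union


def read_text(
    file: Union[IO[bytes], tempfile.SpooledTemporaryFile],
    encoding: str = "utf-8",
) -> str:
    # Rewind first: the caller may have just finished writing to the file.
    file.seek(0)
    data = file.read()
    return data.decode(encoding) if isinstance(data, bytes) else data


with tempfile.SpooledTemporaryFile(max_size=1024) as upload:
    upload.write(b"hello from a spooled file")
    text = read_text(upload)
```

The helper relies only on seek and read rather than on a type check against a specific file class, so it accepts any object that provides those two methods.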

Features

Fixes

0.6.4

08 May 17:57
3d3f3df

Enhancements

  • Added an "ocr_only" strategy for partition_pdf. Refactored the strategy decision
    logic into its own module.

Features

Fixes

0.6.3

04 May 20:25
392cccd

Enhancements

  • Add an "ocr_only" strategy for partition_image.

Features

  • Added partition_multiple_via_api for partitioning multiple documents in a single REST
    API call.
  • Added stage_for_baseplate function to prepare outputs for ingestion into Baseplate.
  • Added partition_odt for processing Open Office documents.

Fixes

  • Updates the grouping logic in the partition_pdf fast strategy to group together text
    in the same bounding box.
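Grouping text by bounding box can be pictured as follows. This is a simplified sketch: it assumes fragments arrive as (bounding box, text) pairs and that fragments belong together only when their boxes match exactly, neither of which is confirmed by the release note. The name group_by_bbox is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def group_by_bbox(fragments: List[Tuple[BBox, str]]) -> List[Tuple[BBox, str]]:
    """Join text fragments that share the same bounding box."""
    grouped: Dict[BBox, List[str]] = defaultdict(list)
    for bbox, text in fragments:
        grouped[bbox].append(text.strip())
    # Dictionaries preserve insertion order, so boxes keep reading order.
    return [(bbox, " ".join(parts)) for bbox, parts in grouped.items()]
```

For example, two lines of one paragraph that share a box become a single entry, while text in a different box stays separate.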

0.6.2

26 Apr 20:31

Enhancements

  • Added logic to partition_pdf for detecting copy-protected PDFs and falling back
    to the hi_res strategy when necessary.

Features

  • Add partition_via_api for partitioning documents through the hosted API.

Fixes

  • Fix how exceeds_cap_ratio handles empty text (returns True instead of False)
  • Updates detect_filetype to properly detect JSONs when the MIME type is text/plain.
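The JSON detection fix addresses a common situation: MIME sniffing reports text/plain for a file whose content is actually JSON. A minimal sketch of a content-based check is shown below. Both function names are hypothetical, and treating only objects and arrays as JSON documents is an assumption.

```python
import json


def looks_like_json(text: str) -> bool:
    try:
        parsed = json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    # Assumed: bare numbers or strings are ordinary text, not JSON documents.
    return isinstance(parsed, (dict, list))


def refine_mime_type(mime_type: str, text: str) -> str:
    """Upgrade text/plain to application/json when the content parses as JSON."""
    if mime_type == "text/plain" and looks_like_json(text):
        return "application/json"
    return mime_type
```

The content check runs only for text/plain, so files already identified as another type are left alone.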

0.6.1

21 Apr 18:49
5b6640a

Enhancements

  • Updated the table extraction parameter name to be more descriptive

Features

Fixes

0.6.0

21 Apr 17:11
dc4147d

Enhancements

  • Adds an ssl_verify kwarg to partition and partition_html to enable turning off
    SSL verification for HTTP requests. SSL verification is on by default.
  • Allows users to pass an OCR language to partition_pdf and partition_image through
    the ocr_language kwarg. ocr_language corresponds to the code for the language pack
    in Tesseract. You will need to install the relevant Tesseract language pack to use a
    given language.
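An ssl_verify style switch typically maps onto the certificate settings of the underlying HTTP client. The sketch below shows the idea with the standard library only. The helper names build_ssl_context and fetch are hypothetical, and which HTTP client the library uses is not stated in the release note.

```python
import ssl
import urllib.request


def build_ssl_context(ssl_verify: bool = True) -> ssl.SSLContext:
    # The default context verifies certificates and host names.
    context = ssl.create_default_context()
    if not ssl_verify:
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


def fetch(url: str, ssl_verify: bool = True) -> bytes:
    """Download a URL, optionally skipping certificate verification."""
    with urllib.request.urlopen(url, context=build_ssl_context(ssl_verify)) as response:
        return response.read()
```

Keeping verification on by default, as the note describes, means that skipping it is always an explicit choice, for instance for a host with a self-signed certificate.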

Features

  • Table extraction is now possible for PDFs through partition and partition_pdf.
  • Adds support for extracting attachments from .msg files

Fixes