Releases: Unstructured-IO/unstructured

0.10.4

18 Aug 21:01
dd243b4

Enhancements

  • Adds ability to reuse connections per process in unstructured-ingest
  • Pass ocr_mode in partition_pdf and set the default back to individual pages for now

Features

Fixes

0.10.2

17 Aug 06:27
dd0f582

Enhancements

  • Bump unstructured-inference==0.5.13:
    • Fix extracted image elements being included in layout merge. This addresses the issue
      where an entire-page image in a PDF was not passed to the layout model when using hi_res.

Features

Fixes

0.10.1

17 Aug 04:33
9f7bd61

Enhancements

  • Bump unstructured-inference==0.5.12:
    • fix to avoid trace for certain PDFs (0.5.12)
    • better defaults for DPI for hi_res and Chipper (0.5.11)
    • implement full-page OCR (0.5.10)

Features

Fixes

  • Fix dead links in repository README (Quick Start > Install for local development, and Learn more > Batch Processing)
  • Update document dependencies to include tesseract-lang for additional language support (required for tests to pass)

0.10.0

16 Aug 04:36
0e887cc

Enhancements

  • Update the links and emphasized_texts metadata fields

Features

Fixes

0.9.3

15 Aug 05:17
cb923b9

Enhancements

  • Pinned dependency cleanup.
  • Update partition_csv to always use soupparser_fromstring to parse html text
  • Update partition_tsv to always use soupparser_fromstring to parse html text
  • Add metadata.section to capture epub table of contents data
  • Add unique_element_ids kwarg to partition functions. If True, will use a UUID
    for element IDs instead of a SHA-256 hash.
  • Update partition_xlsx to always use soupparser_fromstring to parse html text
  • Add functionality to switch html text parser based on whether the html text contains emoji
  • Add functionality to check if a string contains any emoji characters
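The emoji check mentioned above can be pictured with a minimal sketch. This is not the library's implementation: the function name `contains_emoji` and the code point ranges are assumptions chosen for illustration, and the ranges are approximate rather than exhaustive.

```python
import re

# Approximate emoji ranges (an assumption for illustration, not an exhaustive list).
_EMOJI_PATTERN = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]"
)


def contains_emoji(text: str) -> bool:
    """Return True if any character of `text` falls in the ranges above."""
    return _EMOJI_PATTERN.search(text) is not None


if __name__ == "__main__":
    print(contains_emoji("<p>Hello 🙂</p>"))
    print(contains_emoji("<p>Hello</p>"))
```

A caller could use the boolean result to decide which HTML text parser to hand the string to. Which parser is chosen in each case is not stated in these notes.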

Features

  • Add Airtable Connector to be able to pull views/tables/bases from an Airtable organization

Fixes

  • Make Notion module discoverable
  • Fix emails with Content-Disposition: inline and Content-Disposition: attachment with no filename
  • Fix email attachment filenames which had = in the filename itself
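For readers unfamiliar with the email cases in these fixes, the sketch below uses Python's standard `email` package to read the Content-Disposition header of a message part. It only illustrates the inputs involved (a filename containing `=`, and dispositions with no filename) and does not reproduce the fix itself. The helper name `attachment_info` and the sample messages are hypothetical.

```python
from email import message_from_string


def attachment_info(raw_part: str):
    """Return (disposition, filename) for a raw message part.

    Either value is None when the corresponding header or parameter is absent.
    """
    msg = message_from_string(raw_part)
    return msg.get_content_disposition(), msg.get_filename()


# Hypothetical sample parts covering the cases named in the release notes.
WITH_EQUALS = (
    "Content-Type: text/plain\n"
    'Content-Disposition: attachment; filename="a=b.txt"\n'
    "\n"
    "body\n"
)
ATTACHMENT_NO_NAME = (
    "Content-Type: text/plain\n"
    "Content-Disposition: attachment\n"
    "\n"
    "body\n"
)
INLINE_NO_NAME = (
    "Content-Type: text/plain\n"
    "Content-Disposition: inline\n"
    "\n"
    "body\n"
)

if __name__ == "__main__":
    for part in (WITH_EQUALS, ATTACHMENT_NO_NAME, INLINE_NO_NAME):
        print(attachment_info(part))
```

Code that consumes these values needs to handle a `None` filename for both inline and attachment parts, and must not split the filename on `=`.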

0.9.2

11 Aug 02:30
6779918

Enhancements

  • Update table extraction section in API documentation to sync with change in Prod API
  • Update Notion connector to extract to html
  • Bump unstructured-inference==0.5.9:
    • better caching of models
    • another version of detectron2 available, though the default layout model is unchanged
  • Added UUID option for element_id
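The difference between hash-based and UUID-based element IDs can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function name `make_element_id`, its `unique` parameter, and the choice to return the full hex digest are all hypothetical.

```python
import hashlib
import uuid


def make_element_id(text: str, unique: bool = False) -> str:
    """Return an ID for an element with the given text.

    unique=False: SHA-256 hex digest of the text, so equal text gives equal IDs.
    unique=True:  random UUID4 string, so every call gives a different ID.
    """
    if unique:
        return str(uuid.uuid4())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(make_element_id("Hello world"))
    print(make_element_id("Hello world", unique=True))
```

Hash-based IDs are reproducible across runs but collide when two elements carry identical text. UUIDs avoid that collision at the cost of reproducibility.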

Features

  • Adds Sharepoint connector.

Fixes

  • Bump unstructured-inference==0.5.9:
    • ignores Tesseract errors where no text is extracted for tiles that indeed have no text

0.9.1

09 Aug 05:56
2a9fb05

Enhancements

  • Adds --partition-pdf-infer-table-structure to unstructured-ingest.
  • Enable partition_html to skip headers and footers with the skip_headers_and_footers flag.
  • Update partition_doc and partition_docx to track emphasized texts in the output
  • Adds post processing function filter_element_types
  • Set the default strategy for partitioning images to hi_res
  • Add page break parameter section in API documentation to sync with change in Prod API
  • Update partition_html to track emphasized texts in the output
  • Update XMLDocument._read_xml to create <p> tag element for the text enclosed in the <pre> tag
  • Add parameter include_tail_text to _construct_text to enable or skip tail text inclusion
  • Add Notion connector
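A post-processing filter over element types, like the one listed above, might look roughly like this. The sketch is self-contained and does not use the library: the `Element` classes, the function name `filter_by_type`, and its `include`/`exclude` parameters are hypothetical stand-ins, so check the library's documentation for the actual signature.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    text: str


class Title(Element):
    pass


class NarrativeText(Element):
    pass


class Footer(Element):
    pass


def filter_by_type(elements, include=None, exclude=None):
    """Keep only `include` types, or drop `exclude` types (not both at once)."""
    if include and exclude:
        raise ValueError("Pass either include or exclude, not both.")
    if include:
        return [e for e in elements if type(e) in include]
    if exclude:
        return [e for e in elements if type(e) not in exclude]
    return list(elements)


if __name__ == "__main__":
    doc = [Title("Intro"), NarrativeText("Some prose."), Footer("Page 1")]
    print(filter_by_type(doc, exclude=[Footer]))
```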

Features

Fixes

  • Remove unused _partition_via_api function
  • Fixed emoji bug in partition_xlsx.
  • Pass file_filename metadata when partitioning file object
  • Skip ingest test on missing Slack token
  • Add Dropbox variables to CI environments
  • Remove default encoding for ingest
  • Adds new element type EmailAddress for recognizing email addresses in the text
  • Simplifies min_partition logic; makes partitions falling below the min_partition
    less likely.
  • Fix bug where ingest test check for number of files fails in smoke test
  • Fix unstructured-ingest entrypoint failure

0.9.0

01 Aug 15:32
331c7fa

Enhancements

  • Dependencies are now split by document type, creating a slimmer base installation.

0.8.8

01 Aug 06:11
13d3559

Enhancements

Features

Fixes

  • Rename "date" field to "last_modified"
  • Adds Box connector

0.8.7

28 Jul 16:40
8cff756

Enhancements

  • Put back useful function split_by_paragraph

Features

Fixes

  • Fix argument order in NLTK download step