0.10.22

awalker4 released this 13 Oct 18:02
ad1b93d

Enhancements

  • Bump unstructured-inference to 0.7.3. The updated version of unstructured-inference supports a new version of the Chipper model, as well as a cleaner schema for its output classes. Support is included for new inference features such as hierarchy and ordering.
  • Expose skip_infer_table_types in ingest CLI. For each connector, a new --skip-infer-table-types parameter was added that maps to the skip_infer_table_types partition argument. This gives unstructured-ingest users more granular control, allowing them to specify the file types for which table extraction should be skipped.
  • Add flag to ingest CLI to raise error if any single doc fails in pipeline. Previously, if a single doc failed in the pipeline, the whole run halted due to the error. The flag's default is to log the error and continue with the docs that can be processed.
  • Emit hyperlink metadata for DOCX file-type. DOCX partitioner now adds metadata.links, metadata.link_texts and metadata.link_urls for elements that contain a hyperlink that points to an external resource. So-called "jump" links pointing to document internal locations (such as those found in a table-of-contents "jumping" to a chapter or section) are excluded.
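
The log-and-continue default described for the ingest CLI flag can be illustrated with a minimal sketch. This is not the unstructured-ingest implementation: `run_pipeline`, `fake_partition`, and the `raise_on_error` parameter name are hypothetical and exist only to show the two behaviors.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger("ingest_sketch")


def run_pipeline(
    docs: Iterable[str],
    process: Callable[[str], str],
    raise_on_error: bool = False,
) -> List[str]:
    """Process each doc; on failure either re-raise or log and move on."""
    results = []
    for doc in docs:
        try:
            results.append(process(doc))
        except Exception:
            if raise_on_error:
                # Fail fast: one bad doc stops the whole run.
                raise
            # Default: record the failure and keep going.
            logger.error("Failed to process %s; continuing", doc, exc_info=True)
    return results


def fake_partition(doc: str) -> str:
    # Hypothetical processing step that fails on one input.
    if doc == "bad.pdf":
        raise ValueError("cannot parse")
    return doc.upper()


processed = run_pipeline(["a.docx", "bad.pdf", "b.html"], fake_partition)
print(processed)
```

With the default, the failing document is logged and the other two are still processed; passing `raise_on_error=True` would surface the `ValueError` and stop at the first failure.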

Features

  • Add elements_to_text as a staging helper function. To get a single clean text output from unstructured for metric calculations, this function automates the process of extracting text from elements.

  • Adds permissions (RBAC) data ingestion functionality for the Sharepoint connector. Problem: Role-based access control is an important component in many data storage systems. Users may need to pass permissions (RBAC) data to downstream systems when ingesting data. Feature: Added permissions data ingestion functionality to the Sharepoint connector.
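
The idea behind a helper like elements_to_text can be sketched without the library. The `Element` class and `join_element_text` function below are hypothetical stand-ins, and the newline separator is an assumption rather than a documented detail of the real helper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Element:
    """Hypothetical stand-in for an unstructured element; only text is modeled."""
    text: str


def join_element_text(elements: Iterable[Element], separator: str = "\n") -> str:
    """Collapse a sequence of elements into one text string."""
    return separator.join(el.text for el in elements)


elements = [
    Element("Title"),
    Element("First paragraph."),
    Element("Second paragraph."),
]
clean_text = join_element_text(elements)
print(clean_text)
```

A single string like this is convenient as input to text-comparison metrics, which is the use case the release note mentions.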

Fixes

  • Fixes PDF list parsing creating duplicate list items. Previously, a bug in PDF list item parsing caused removal of other elements and duplication of the list item.
  • Fixes duplicated elements. Fixes an issue where elements were duplicated when embeddings were generated. Users can now generate embeddings for their list of Elements without duplicating or breaking the original content.
  • Fixes failure when flagging for embeddings through unstructured-ingest. Previously, adding the embedding parameter to any connector resulted in a failure at the copy stage. This resolves the issue by adding the IngestDoc to the context map in the embedding node's run method, which allows users to specify that connectors fetch embeddings without failure.
  • Fix ingest pipeline reformat nodes not discoverable. Fixes an issue where reformat nodes raised ModuleNotFoundError on import. The directory was missing the __init__.py needed to make it discoverable.
  • Fix default language in ingest CLI. Previously the default was set to English, which injected potentially incorrect information into downstream language detection libraries. Setting the default to None allows those libraries to better detect the language of the text in the doc being processed.
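
The role of __init__.py in the reformat nodes fix can be shown with a small standard-library sketch. It demonstrates one way a directory without that file goes undetected, using `pkgutil`; the release notes do not say which discovery mechanism was involved in the ingest pipeline, and the directory and file names here are hypothetical.

```python
import pkgutil
import tempfile
from pathlib import Path
from typing import List


def discoverable_packages(root: Path) -> List[str]:
    """Return names of sub-packages that pkgutil can discover under root."""
    return sorted(m.name for m in pkgutil.iter_modules([str(root)]) if m.ispkg)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical directory standing in for the reformat nodes directory.
    reformat = root / "reformat"
    reformat.mkdir()
    (reformat / "embed.py").write_text("VALUE = 1\n")

    before = discoverable_packages(root)  # no __init__.py yet
    (reformat / "__init__.py").write_text("")
    after = discoverable_packages(root)   # marker file now present

print(before, after)
```

Discovery tools that rely on this marker file skip the directory until it is added, which is consistent with the fix described above.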