0.10.22
Enhancements
- bump
unstructured-inference
to0.7.3
The updated version ofunstructured-inference
supports a new version of the Chipper model, as well as a cleaner schema for its output classes. Support is included for new inference features such as hierarchy and ordering. - Expose skip_infer_table_types in ingest CLI. For each connector a new
--skip-infer-table-types
parameter was added to map to theskip_infer_table_types
partition argument. This gives more granular control to unstructured-ingest users, allowing them to specify the file types for which we should attempt table extraction. - Add flag to ingest CLI to raise error if any single doc fails in pipeline Currently if a single doc fails in the pipeline, the whole thing halts due to the error. This flag defaults to log an error but continue with the docs it can.
- Emit hyperlink metadata for DOCX file-type. DOCX partitioner now adds
metadata.links
,metadata.link_texts
andmetadata.link_urls
for elements that contain a hyperlink that points to an external resource. So-called "jump" links pointing to document internal locations (such as those found in a table-of-contents "jumping" to a chapter or section) are excluded.
Features
-
Add
elements_to_text
as a staging helper function In order to get a single clean text output from unstructured for metric calculations, automate the process of extracting text from elements using this function. -
Adds permissions(RBAC) data ingestion functionality for the Sharepoint connector. Problem: Role based access control is an important component in many data storage systems. Users may need to pass permissions (RBAC) data to downstream systems when ingesting data. Feature: Added permissions data ingestion functionality to the Sharepoint connector.
Fixes
- Fixes PDF list parsing creating duplicate list items Previously a bug in PDF list item parsing caused removal of other elements and duplication of the list item
- Fixes duplicated elements Fixes issue where elements are duplicated when embeddings are generated. This will allow users to generate embeddings for their list of Elements without duplicating/breaking the orginal content.
- Fixes failure when flagging for embeddings through unstructured-ingest Currently adding the embedding parameter to any connector results in a failure on the copy stage. This is resolves the issue by adding the IngestDoc to the context map in the embedding node's
run
method. This allows users to specify that connectors fetch embeddings without failure. - Fix ingest pipeline reformat nodes not discoverable Fixes issue where reformat nodes raise ModuleNotFoundError on import. This was due to the directory was missing
__init__.py
in order to make it discoverable. - Fix default language in ingest CLI Previously the default was being set to english which injected potentially incorrect information to downstream language detection libraries. By setting the default to None allows those libraries to better detect what language the text is in the doc being processed.