0.12.1

@ron-unstructured ron-unstructured released this 20 Jan 00:13
· 564 commits to main since this release
c81d4e3

Enhancements

  • Allow setting image block crop padding parameter. In certain circumstances, adjusting the image block crop padding can improve image block extraction by preventing extracted image blocks from being clipped.
  • Add support for bitmap images in partition_image. Adds support for .bmp files in partition, partition_image, and detect_filetype.
  • Keep all image elements when using "hi_res" strategy. Previously, Image elements with small chunks of text were ignored unless the image block extraction parameters (extract_images_in_pdf or extract_image_block_types) were specified. Now, all image elements are kept regardless of whether the image block extraction parameters are specified.
  • Add filetype detection for .wav files.
  • Add "basic" chunking strategy. Add baseline chunking strategy that includes all shared chunking behaviors without breaking chunks on section or page boundaries.
  • Add overlap option for chunking. Add option to overlap chunks. Intra-chunk and inter-chunk overlap are requested separately. Intra-chunk overlap is applied only to the second and later chunks formed by text-splitting an oversized chunk. Inter-chunk overlap may also be specified; this applies overlap between "normal" (not-oversized) chunks.
  • Salesforce connector accepts private key path or value. The Salesforce parameter private-key-file has been renamed to private-key. The private key can be provided either as a path to a file or as the file contents.
  • Update documentation: (i) added verbiage about the free API cap limit, (ii) added a deprecation warning on Staging bricks in favor of Destination Connectors, (iii) added a warning and code examples for using the SaaS API Endpoints via the CLI versus the SDKs, (iv) fixed the formatting of the example pages, (v) added a deprecation notice on model_name in favor of hi_res_model_name, (vi) added extract_images_in_pdf usage to the partition_pdf section, (vii) reorganized and improved the documentation introduction section, and (viii) added PDF table extraction best practices.
  • Add "basic" chunking to ingest CLI. Add options to ingest CLI allowing access to the new "basic" chunking strategy and overlap options.
  • Make Elasticsearch Destination connector arguments optional. Elasticsearch Destination connector write settings are made optional and will rely on default values when not specified.
  • Normalize Salesforce artifact names. Introduced file naming pattern present in other connectors to Salesforce connector.
  • Install Kapa AI chatbot. Added Kapa.ai website widget on the documentation.
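
As a rough illustration of the chunk overlap option described above, the sketch below splits one oversized text into windows so that each later piece begins with the tail of the previous one. It is not the library's implementation: the function name split_with_overlap and its parameters are hypothetical, and it cuts on raw character offsets for simplicity.

```python
from typing import List


def split_with_overlap(text: str, max_characters: int, overlap: int) -> List[str]:
    """Split `text` into pieces of at most `max_characters` characters.

    Hypothetical helper for illustration only. Every piece after the first
    starts with the last `overlap` characters of the piece before it.
    """
    if overlap < 0 or overlap >= max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")

    pieces: List[str] = []
    step = max_characters - overlap  # how far the window advances each time
    start = 0
    while start < len(text):
        pieces.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # this window already reached the end of the text
        start += step
    return pieces


pieces = split_with_overlap("abcdefghij", max_characters=4, overlap=1)
print(pieces)
```

With overlap set to 0 the same function degenerates to a plain fixed-size split, which is one way to see that overlap only changes where each later window starts.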

Features

  • MongoDB Source Connector. New source connector added to all CLI ingest commands to support downloading/partitioning files from MongoDB.
  • Add OpenSearch source and destination connectors. OpenSearch, a fork of Elasticsearch, is a popular storage solution for uses such as search or providing intermediary caches within data pipelines. Added an OpenSearch source connector to support downloading/partitioning files. Added an OpenSearch destination connector to ingest documents from any supported source, embed them, and write the embeddings/documents into OpenSearch.

Fixes

  • Fix GCS connector converting JSON to string with single quotes. FSSpec serialization caused the JSON token to be converted to a string with single quotes. GCS requires the token in the form of a dict, so this format is now ensured.
  • Pin version of unstructured-client. Set the minimum version of unstructured-client to avoid raising a TypeError when passing api_key_auth to UnstructuredClient.
  • Fix the serialization of the Pinecone destination connector. Presence of the PineconeIndex object breaks serialization due to TypeError: cannot pickle '_thread.lock' object. This removes that object before serialization.
  • Fix the serialization of the Elasticsearch destination connector. Presence of the _client object breaks serialization due to TypeError: cannot pickle '_thread.lock' object. This removes that object before serialization.
  • Fix the serialization of the Postgres destination connector. Presence of the _client object breaks serialization due to TypeError: cannot pickle '_thread.lock' object. This removes that object before serialization.
  • Fix documentation and sample code for Chroma. The documentation was pointing to the wrong examples.
  • Fix flatten_dict to be able to flatten tuples inside dicts. Update the flatten_dict function to support flattening tuples inside dicts. This is necessary for objects like Coordinates when the object has not been written to disk and is therefore still a tuple, not yet converted to a list, at the time it is flattened.
  • Fix the serialization of the Chroma destination connector. Presence of the ChromaCollection object breaks serialization due to TypeError: cannot pickle 'module' object. This removes that object before serialization.
  • Fix fsspec connectors returning version as integer. Connector data source versions should always be string values; however, the integer checksum value was being used as the version for fsspec connectors. This casts that value to a string.
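
The serialization fixes above all follow the same idea: drop the unpicklable client object before pickling and recreate it afterwards. A minimal sketch of one common way to do that in Python is shown below. The classes FakeClient and DemoConnector are hypothetical stand-ins, not the connectors' actual code, and the real fixes may be implemented differently.

```python
import pickle
import threading


class FakeClient:
    """Hypothetical stand-in for a client that holds a thread lock."""

    def __init__(self):
        self._lock = threading.Lock()  # locks cannot be pickled


class DemoConnector:
    """Hypothetical connector that creates its client lazily."""

    def __init__(self, index_name):
        self.index_name = index_name
        self._client = None

    @property
    def client(self):
        # Create the client on first use, and again after unpickling.
        if self._client is None:
            self._client = FakeClient()
        return self._client

    def __getstate__(self):
        # Copy the instance state and drop the client so that pickle
        # never sees the lock it contains.
        state = self.__dict__.copy()
        state["_client"] = None
        return state


connector = DemoConnector("docs")
connector.client  # force client creation, as would happen during a run

# Without __getstate__, this dumps() call would raise a TypeError
# because of the lock held by the client.
restored = pickle.loads(pickle.dumps(connector))
print(restored.index_name, restored._client)
```

Because the client is rebuilt on first access, the restored object remains usable in the process that unpickles it.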