0.10.25

shreyanid released this 21 Oct 02:45

82c8adb

Enhancements

Duplicate CLI param check: Many of the options for the Click-based CLI ingest commands are added dynamically from a number of configs, so a check was added to make sure there are no duplicate entries. This prevents new configs from overwriting options that were already added.
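
The idea behind this check can be sketched in plain Python. The snippet below is a minimal illustration, not the actual unstructured implementation: the function names find_duplicate_options and check_no_duplicates and the example config names are hypothetical, and options are modeled as plain strings rather than Click Option objects.

```python
from collections import Counter


def find_duplicate_options(option_groups):
    """Return option names that appear more than once across all groups.

    `option_groups` maps a (hypothetical) config name to the list of CLI
    option names that the config contributes to a command.
    """
    counts = Counter(
        name for names in option_groups.values() for name in names
    )
    return sorted(name for name, n in counts.items() if n > 1)


def check_no_duplicates(option_groups):
    """Raise ValueError if any option name is contributed more than once."""
    duplicates = find_duplicate_options(option_groups)
    if duplicates:
        raise ValueError(f"duplicate CLI options: {', '.join(duplicates)}")


# Hypothetical configs, each adding its own options to one command.
groups = {
    "read_config": ["--download-dir", "--re-download"],
    "partition_config": ["--strategy", "--ocr-languages"],
    "chunking_config": ["--chunk-elements", "--strategy"],  # clashes
}
```

Calling check_no_duplicates(groups) on this input raises a ValueError that names --strategy, so the clash is reported when the command is built instead of one option silently replacing the other.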

Features

Table OCR refactor: Table OCR now supports pre-computed OCR data, so that OCR runs only once for the entire document. Users can specify the OCR agent (tesseract or paddle) in the OCR_AGENT environment variable for OCRing the entire document.
Adds accuracy function: Accuracy scoring was originally an option under calculate_edit_distance. To make it easier to call, it is now a wrapper around the original function that calls edit_distance and returns the result as "score".
Adds HuggingFaceEmbeddingEncoder: The HuggingFace Embedding Encoder uses a local embedding model as opposed to using an API.
Add AWS bedrock embedding connector: unstructured.embed.bedrock now provides a connector that uses AWS bedrock's titan-embed-text model to generate embeddings for elements. This feature requires a valid AWS bedrock setup and an internet connection to run.
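
To illustrate the accuracy entry above, here is a minimal sketch of a score-returning wrapper around an edit-distance function. It is not the library's code: the names edit_distance and accuracy_score, and the normalization by the longer string's length, are assumptions chosen for illustration.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def accuracy_score(output: str, reference: str) -> float:
    """Hypothetical wrapper: map the distance to a score in [0, 1]."""
    longest = max(len(output), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are treated as a perfect match
    return 1.0 - edit_distance(output, reference) / longest
```

Keeping the distance computation in one function and exposing the score through a thin wrapper means callers who only want a score do not need to pass an extra option.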

Fixes

Import PDFResourceManager more directly: We were importing PDFResourceManager from pdfminer.converter, which was causing an error for some users. We changed to import from the actual location of PDFResourceManager, which is pdfminer.pdfinterp.
Fix language detection of elements with empty strings: This resolves a warning message that langdetect raised when language detection was attempted on an empty string. Language detection is now skipped for empty strings.
Fix chunks breaking on regex-metadata matches. Fixes "over-chunking" when regex_metadata was used, where every element that contained a regex-match would start a new chunk.
Fix regex-metadata match offsets not adjusted within chunk. Fixes incorrect regex-metadata match start/stop offset in chunks where multiple elements are combined.
Map source CLI command configs when destination set: Due to how the source connector is dynamically called when the destination connector is set via the CLI, the configs were being set incorrectly, causing the source connector to break. The configs were fixed and updated to take into account Fsspec-specific connectors.
Fix metrics folder not discoverable: Fixes an issue where the unstructured/metrics folder is not discoverable on PyPI by adding an __init__.py file under the folder.
Fix a bug when partition_pdf gets model_name=None: In API usage the model_name value is None, so the cast function in partition_pdf would return None and lead to an attribute error. Now we use the str function to explicitly convert the content to a string, so it is guaranteed to have startswith and other string methods as attributes.
Fix HTML partition failure on tables without a tbody tag: HTML tables may sometimes contain only headers without a body (tbody tag).
Fix out-of-order sequencing of split chunks. Fixes behavior where "split" chunks were inserted at the beginning of the chunk sequence. This would produce a chunk sequence like [5a, 5b, 3a, 3b, 1, 2, 4] when sections 3 and 5 exceeded max_characters.
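
The intended ordering in the last fix can be shown with a small sketch. It assumes a simplified model in which each section is a plain string and oversized sections are cut into fixed-width pieces; the function names split_text and chunk_sections are hypothetical and do not mirror the library's chunker.

```python
def split_text(text, max_characters):
    """Cut `text` into consecutive pieces of at most `max_characters`."""
    return [
        text[i:i + max_characters]
        for i in range(0, len(text), max_characters)
    ]


def chunk_sections(sections, max_characters):
    """Emit chunks in document order, splitting oversized sections in place."""
    chunks = []
    for section in sections:
        if len(section) > max_characters:
            # Append the pieces where the section occurs, rather than
            # collecting them separately and placing them at the front.
            chunks.extend(split_text(section, max_characters))
        else:
            chunks.append(section)
    return chunks


# Sections 3 and 5 exceed the limit of 4 characters.
sections = ["1", "2", "3aaa3bbb", "4", "5aaa5bbb"]
ordered = chunk_sections(sections, max_characters=4)
```

Here ordered keeps the pieces of sections 3 and 5 at their original positions, which corresponds to the sequence [1, 2, 3a, 3b, 4, 5a, 5b] instead of the out-of-order one described above.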
