CHANGELOG.md
Lines changed: 29 additions & 28 deletions
Original file line number
Diff line number
Diff line change
@@ -1,3 +1,9 @@
1
+
## 0.16.2-dev
2
+
3
+
### Features
4
+
5
+
* **Whitespace-invariant CCT distance metric.** CCT Levenshtein distance for strings is computed with standardized whitespace by default.
6
+
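The whitespace-invariant distance entry above can be illustrated with a minimal sketch. This is not the library's implementation: the function names `standardize_whitespace`, `levenshtein`, and `whitespace_invariant_distance` are hypothetical, and "standardized whitespace" is assumed here to mean collapsing every run of whitespace to a single space and trimming the ends.

```python
import re


def standardize_whitespace(text: str) -> str:
    """Collapse each run of whitespace to one space and trim both ends (assumed rule)."""
    return re.sub(r"\s+", " ", text).strip()


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,          # deletion
                    current[j - 1] + 1,       # insertion
                    previous[j - 1] + cost,   # substitution or match
                )
            )
        previous = current
    return previous[-1]


def whitespace_invariant_distance(a: str, b: str) -> int:
    """Hypothetical helper: edit distance after whitespace standardization."""
    return levenshtein(standardize_whitespace(a), standardize_whitespace(b))
```

With this sketch, two strings that differ only in spacing or line breaks compare as identical, while the plain `levenshtein` call still counts each extra whitespace character as an edit.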
1
7
## 0.16.1
2
8
3
9
### Enhancements
@@ -308,7 +314,6 @@
308
314
### Features
309
315
310
316
* **Expose conversion functions for tables** Adds public functions to convert tables from HTML to the Deckerd format and back
311
-
312
317
* **Adds Kafka Source and Destination** New source and destination connector added to all CLI ingest commands to support reading from and writing to Kafka streams. Also supports Confluent Kafka.
313
318
314
319
### Fixes
@@ -355,7 +360,7 @@
355
360
356
361
* **Move logger error to debug level when PDFminer fails to extract text**, including the error message for Invalid dictionary construct.
357
362
* **Add support for Pinecone serverless** Adds Pinecone serverless to the connector tests. Pinecone
358
-
serverless works with versions >=0.14.2, but hadn't been tested until now.
363
+
serverless works with versions >=0.14.2, but hadn't been tested until now.
359
364
360
365
### Features
361
366
@@ -438,6 +443,7 @@
438
443
* **Add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR** configuration parameters to control temporary storage.
439
444
440
445
### Features
446
+
441
447
* **Add form extraction basics (document elements and placeholder code in partition)**. This lays the groundwork for future form extraction. Form extraction models are not currently available in the library. Any attempt to use this functionality raises a `NotImplementedError`.
442
448
443
449
### Fixes
@@ -615,8 +621,8 @@
615
621
### Enhancements
616
622
617
623
### Features
618
-
* Add `date_from_file_object` parameter to partition. If True and the file is provided via the `file` parameter, partition infers the last modified date from the `file`'s content. If False, last modified metadata will be `None`.
619
624
625
+
* Add `date_from_file_object` parameter to partition. If True and the file is provided via the `file` parameter, partition infers the last modified date from the `file`'s content. If False, last modified metadata will be `None`.
620
626
* **Header and footer detection for fast strategy** `partition_pdf` with `fast` strategy now
621
627
detects elements that are in the top or bottom 5 percent of the page as headers and footers.
622
628
* **Add parent_element to overlapping case output** Adds parent_element to the output for `identify_overlapping_or_nesting_case` and `catch_overlapping_and_nested_bboxes` functions.
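The header and footer detection entry above describes a purely positional rule: elements in the top or bottom 5 percent of the page. A minimal sketch of such a rule follows. It is an illustration only, not the code used by `partition_pdf`: the `Box` type and `classify_by_position` function are hypothetical, coordinates are assumed to be measured downward from the top of the page, and an element is assumed to qualify only when it lies entirely inside the band.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical element: vertical extent measured down from the page top."""
    text: str
    top: float
    bottom: float


def classify_by_position(box: Box, page_height: float, margin: float = 0.05) -> str:
    """Label an element by whether it sits fully inside the top or bottom band."""
    if box.bottom <= page_height * margin:
        return "Header"
    if box.top >= page_height * (1 - margin):
        return "Footer"
    return "Body"
```

For a page 1000 units tall, the bands are the first 50 and the last 50 units, so a running title near the top edge is labeled `Header` and a page number near the bottom edge is labeled `Footer`.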
@@ -635,7 +641,6 @@
635
641
* **Rename `OpenAiEmbeddingConfig` to `OpenAIEmbeddingConfig`.**
636
642
* **Fix partition_json() doesn't chunk.** The `@add_chunking_strategy` decorator was missing from `partition_json()`, so pre-partitioned documents serialized to JSON were not chunked when a chunking strategy was specified.
637
643
638
-
639
644
## 0.12.4
640
645
641
646
### Enhancements
@@ -664,7 +669,6 @@
664
669
* **Add title to Vectara upload - was not separated out from initial connector**
665
670
* **Change OpenSearch port to fix potential conflict with Elasticsearch in ingest test**
666
671
667
-
668
672
## 0.12.3
669
673
670
674
### Enhancements
@@ -717,6 +721,7 @@
717
721
* **Install Kapa AI chatbot.** Added Kapa.ai website widget to the documentation.
718
722
719
723
### Features
724
+
720
725
* **MongoDB Source Connector.** New source connector added to all CLI ingest commands to support downloading/partitioning files from MongoDB.
721
726
* **Add OpenSearch source and destination connectors.** OpenSearch, a fork of Elasticsearch, is a popular storage solution for various functionality such as search, or providing intermediary caches within data pipelines. Feature: Added OpenSearch source connector to support downloading/partitioning files. Added OpenSearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into OpenSearch.
722
727
@@ -964,8 +969,8 @@
964
969
* **Update `ocr_only` strategy in `partition_pdf()`** Adds the functionality to get accurate coordinate data when partitioning PDFs and Images with the `ocr_only` strategy.
965
970
966
971
### Fixes
967
-
* **Fixed SharePoint permissions for the fetching to be opt-in** Problem: SharePoint permissions were being fetched even when no related CLI params were provided, which raised an error because the values for those keys did not exist. Fix: Keys are now read with the .get() method, and the "skip-check" now checks individual CLI params rather than the existence of a config object.
968
972
973
+
* **Fixed SharePoint permissions for the fetching to be opt-in** Problem: SharePoint permissions were being fetched even when no related CLI params were provided, which raised an error because the values for those keys did not exist. Fix: Keys are now read with the .get() method, and the "skip-check" now checks individual CLI params rather than the existence of a config object.
969
974
* **Fixes issue where tables from markdown documents were being treated as text** Problem: Tables from markdown documents were being treated as text, and not being extracted as tables. Solution: Enable the `tables` extension when instantiating the `python-markdown` object. Importance: This will allow users to extract structured data from tables in markdown documents.
970
975
* **Fix wrong logger for paddle info** Replace the logger from unstructured-inference with the logger from unstructured for paddle_ocr.py module.
971
976
* **Fix ingest pipeline to be able to use chunking and embedding together** Problem: When ingest pipeline was using chunking and embedding together, embedding outputs were empty and the outputs of chunking couldn't be re-read into memory and be forwarded to embeddings. Fix: Added CompositeElement type to TYPE_TO_TEXT_ELEMENT_MAP to be able to process CompositeElements with unstructured.staging.base.isd_to_elements
@@ -1018,7 +1023,7 @@
1018
1023
### Features
1019
1024
1020
1025
* **Table OCR refactor** Support Table OCR with pre-computed OCR data to ensure we only do one OCR for the entire document. User can specify
1021
-
ocr agent tesseract/paddle in environment variable `OCR_AGENT` for OCRing the entire document.
1026
+
ocr agent tesseract/paddle in environment variable `OCR_AGENT` for OCRing the entire document.
1022
1027
* **Adds accuracy function** The accuracy scoring was originally an option under `calculate_edit_distance`. For easier calling, it is now a wrapper around the original function that calls edit_distance and returns the result as "score".
1023
1028
* **Adds HuggingFaceEmbeddingEncoder** The HuggingFace Embedding Encoder uses a local embedding model as opposed to using an API.
1024
1029
* **Add AWS bedrock embedding connector** `unstructured.embed.bedrock` now provides a connector to use AWS bedrock's `titan-embed-text` model to generate embeddings for elements. This feature requires a valid AWS bedrock setup and an internet connection to run.
@@ -1049,7 +1054,7 @@ ocr agent tesseract/paddle in environment variable `OCR_AGENT` for OCRing the en
1049
1054
### Fixes
1050
1055
1051
1056
* **Fix paddle model file not discoverable** Fixes issue where ocr_models/paddle_ocr.py file is not discoverable on PyPI by adding
1052
-
an `__init__.py` file under the folder.
1057
+
an `__init__.py` file under the folder.
1053
1058
* **Chipper v2 Fixes** Includes fix for a memory leak and rare last-element bbox fix. (unstructured-inference==0.7.7)
1054
1059
* **Fix image resizing issue** Includes fix related to resizing images in the tables pipeline. (unstructured-inference==0.7.6)
1055
1060
@@ -1111,12 +1116,13 @@ an `__init__.py` file under the folder.
1111
1116
* **Applies `max_characters=<n>` argument to all element types in `add_chunking_strategy` decorator** Previously this argument was only utilized in chunking Table elements and now applies to all partitioned elements if `add_chunking_strategy` decorator is utilized, further preparing the elements for downstream processing.
1112
1117
* **Add common retry strategy utilities for unstructured-ingest** Dynamic retry strategy with exponential backoff added to Notion source connector.
1113
1118
*
1119
+
1114
1120
### Features
1115
1121
1116
1122
* **Adds `bag_of_words` and `percent_missing_text` functions** These count the word frequencies in two input texts and calculate the percentage of text missing relative to the source document.
1117
1123
* **Adds `edit_distance` calculation metrics** To benchmark the cleaned text extracted with unstructured, `edit_distance` (`Levenshtein distance`) is included.
1118
1124
* **Adds detection_origin field to metadata** Problem: Currently there isn't an easy way to find out how an element was created. With this change that information is added. Importance: With this information the developers and users are now able to know how an element was created to make decisions on how to use it. In order to use this feature
1119
-
setting UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true is needed.
1125
+
setting UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true is needed.
1120
1126
* **Adds a function that calculates frequency of the element type and its depth** To capture the accuracy of element type extraction, this function counts the occurrences of each unique element type with its depth for use in element metrics.
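The word-frequency and missing-text metrics named in the features above can be sketched as follows. This is an illustration under stated assumptions, not the library's `bag_of_words` or `percent_missing_text`: the names `word_counts` and `percent_missing` are hypothetical, words are assumed to be lowercase whitespace-separated tokens, and the result is returned as a fraction between 0 and 1 instead of a percentage.

```python
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count lowercase, whitespace-separated tokens (assumed tokenization)."""
    return Counter(text.lower().split())


def percent_missing(output: str, source: str) -> float:
    """Fraction of source word occurrences that do not appear in the output."""
    source_counts = word_counts(source)
    total = sum(source_counts.values())
    if total == 0:
        return 0.0
    # Counter subtraction keeps only positive counts, i.e. words still missing.
    missing = source_counts - word_counts(output)
    return sum(missing.values()) / total
```

For the source text "the cat sat on the mat" and the output "the cat", four of the six source word occurrences are missing, so the sketch returns 4/6.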
1121
1127
1122
1128
### Fixes
@@ -1126,11 +1132,10 @@ setting UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true is needed.
1126
1132
* **Fixes category_depth None value for Title elements** Problem: `Title` elements from `chipper` get `category_depth`= None even when `Headline` and/or `Subheadline` elements are present in the same page. Fix: all `Title` elements with `category_depth` = None should be set to have a depth of 0 instead iff there are `Headline` and/or `Subheadline` element-types present. Importance: `Title` elements should be equivalent to HTML `H1` when nested headings are present; otherwise, `category_depth` metadata can be ambiguous across elements in a page.
1127
1133
* **Tweak `xy-cut` ordering output to be more column friendly** This results in the order of elements more closely reflecting natural reading order which benefits downstream applications. While element ordering from `xy-cut` is usually mostly correct when ordering multi-column documents, sometimes elements from a RHS column will appear before elements in a LHS column. Fix: add swapped `xy-cut` ordering by sorting by X coordinate first and then Y coordinate.
1128
1134
* **Fixes badly initialized Formula** Problem: YoloX contains new types of elements; when loading a document that contains formulas, a new element of that class
1129
-
should be generated, however the Formula class inherits from Element instead of Text. After this change the element is correctly created with the correct class
1130
-
allowing the document to be loaded. Fix: Change parent class for Formula to Text. Importance: Crucial to be able to load documents that contain formulas.
1135
+
should be generated, however the Formula class inherits from Element instead of Text. After this change the element is correctly created with the correct class
1136
+
allowing the document to be loaded. Fix: Change parent class for Formula to Text. Importance: Crucial to be able to load documents that contain formulas.
1131
1137
* **Fixes pdf uri error** An error was raised when a URI of type `GoToR`, which refers to PDF resources outside the document itself, was detected, since no condition caught that case. The fix initializes the URI before any condition check.
1132
1138
1133
-
1134
1139
## 0.10.19
1135
1140
1136
1141
### Enhancements
@@ -1207,7 +1212,6 @@ allowing the document to be loaded. Fix: Change parent class for Formula to Text
1207
1212
1208
1213
## 0.10.15
1209
1214
1210
-
1211
1215
### Enhancements
1212
1216
1213
1217
* **Support for better element categories from the next-generation image-to-text model ("chipper").** Previously, not all of the classifications from Chipper were being mapped to proper `unstructured` element categories so the consumer of the library would see many `UncategorizedText` elements. This fixes the issue, improving the granularity of the element categories outputs for better downstream processing and chunking. The mapping update is:
@@ -1281,7 +1285,6 @@ allowing the document to be loaded. Fix: Change parent class for Formula to Text
1281
1285
* Add Jira Connector to be able to pull issues from a Jira organization
1282
1286
* Add `clean_ligatures` function to expand ligatures in text
1283
1287
1284
-
1285
1288
### Fixes
1286
1289
1287
1290
* `partition_html` breaks on `<br>` elements.
@@ -1299,14 +1302,12 @@ allowing the document to be loaded. Fix: Change parent class for Formula to Text
1299
1302
* Support for yolox_quantized layout detection model (0.5.20)
1300
1303
* YoloX element types added
1301
1304
1302
-
1303
1305
### Features
1304
1306
1305
1307
* Add Salesforce Connector to be able to pull Account, Case, Campaign, EmailMessage, Lead