Commit f816b56

change version
1 parent 042f9e0 commit f816b56

2 files changed

+30
-29
lines changed

CHANGELOG.md

Lines changed: 29 additions & 28 deletions
Original file line number | Diff line number | Diff line change
@@ -1,3 +1,9 @@
1+
## 0.16.2-dev
2+
3+
### Features
4+
5+
* **Whitespace-invariant CCT distance metric.** CCT Levenshtein distance for strings is by default computed with standardized whitespaces.
6+
17
## 0.16.1
28

39
### Enhancements
@@ -308,7 +314,6 @@
308314
### Features
309315

310316
* **Expose conversion functions for tables** Adds public functions to convert tables from HTML to the Deckerd format and back
311-
312317
* **Adds Kafka Source and Destination** New source and destination connector added to all CLI ingest commands to support reading from and writing to Kafka streams. Also supports Confluent Kafka.
313318

314319
### Fixes
@@ -355,7 +360,7 @@
355360

356361
* **Move logger error to debug level when PDFminer fails to extract text** which includes error message for Invalid dictionary construct.
357362
* **Add support for Pinecone serverless** Adds Pinecone serverless to the connector tests. Pinecone
358-
serverless will work version versions >=0.14.2, but hadn't been tested until now.
363+
serverless will work version versions >=0.14.2, but hadn't been tested until now.
359364

360365
### Features
361366

@@ -438,6 +443,7 @@
438443
* **Add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR** configuration parameteres to control temporary storage.
439444

440445
### Features
446+
441447
* **Add form extraction basics (document elements and placeholder code in partition)**. This is to lay the ground work for the future. Form extraction models are not currently available in the library. An attempt to use this functionality will end in a `NotImplementedError`.
442448

443449
### Fixes
@@ -615,8 +621,8 @@
615621
### Enhancements
616622

617623
### Features
618-
* Add `date_from_file_object` parameter to partition. If True and if file is provided via `file` parameter it will cause partition to infer last modified date from `file`'s content. If False, last modified metadata will be `None`.
619624

625+
* Add `date_from_file_object` parameter to partition. If True and if file is provided via `file` parameter it will cause partition to infer last modified date from `file`'s content. If False, last modified metadata will be `None`.
620626
* **Header and footer detection for fast strategy** `partition_pdf` with `fast` strategy now
621627
detects elements that are in the top or bottom 5 percent of the page as headers and footers.
622628
* **Add parent_element to overlapping case output** Adds parent_element to the output for `identify_overlapping_or_nesting_case` and `catch_overlapping_and_nested_bboxes` functions.
@@ -635,7 +641,6 @@
635641
* **Rename `OpenAiEmbeddingConfig` to `OpenAIEmbeddingConfig`.**
636642
* **Fix partition_json() doesn't chunk.** The `@add_chunking_strategy` decorator was missing from `partition_json()` such that pre-partitioned documents serialized to JSON did not chunk when a chunking-strategy was specified.
637643

638-
639644
## 0.12.4
640645

641646
### Enhancements
@@ -664,7 +669,6 @@
664669
* **Add title to Vectara upload - was not separated out from initial connector **
665670
* **Fix change OpenSearch port to fix potential conflict with Elasticsearch in ingest test **
666671

667-
668672
## 0.12.3
669673

670674
### Enhancements
@@ -717,6 +721,7 @@
717721
* **Install Kapa AI chatbot.** Added Kapa.ai website widget on the documentation.
718722

719723
### Features
724+
720725
* **MongoDB Source Connector.** New source connector added to all CLI ingest commands to support downloading/partitioning files from MongoDB.
721726
* **Add OpenSearch source and destination connectors.** OpenSearch, a fork of Elasticsearch, is a popular storage solution for various functionality such as search, or providing intermediary caches within data pipelines. Feature: Added OpenSearch source connector to support downloading/partitioning files. Added OpenSearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into OpenSearch.
722727

@@ -964,8 +969,8 @@
964969
* **Update `ocr_only` strategy in `partition_pdf()`** Adds the functionality to get accurate coordinate data when partitioning PDFs and Images with the `ocr_only` strategy.
965970

966971
### Fixes
967-
* **Fixed SharePoint permissions for the fetching to be opt-in** Problem: Sharepoint permissions were trying to be fetched even when no reletad cli params were provided, and this gave an error due to values for those keys not existing. Fix: Updated getting keys to be with .get() method and changed the "skip-check" to check individual cli params rather than checking the existance of a config object.
968972

973+
* **Fixed SharePoint permissions for the fetching to be opt-in** Problem: Sharepoint permissions were trying to be fetched even when no reletad cli params were provided, and this gave an error due to values for those keys not existing. Fix: Updated getting keys to be with .get() method and changed the "skip-check" to check individual cli params rather than checking the existance of a config object.
969974
* **Fixes issue where tables from markdown documents were being treated as text** Problem: Tables from markdown documents were being treated as text, and not being extracted as tables. Solution: Enable the `tables` extension when instantiating the `python-markdown` object. Importance: This will allow users to extract structured data from tables in markdown documents.
970975
* **Fix wrong logger for paddle info** Replace the logger from unstructured-inference with the logger from unstructured for paddle_ocr.py module.
971976
* **Fix ingest pipeline to be able to use chunking and embedding together** Problem: When ingest pipeline was using chunking and embedding together, embedding outputs were empty and the outputs of chunking couldn't be re-read into memory and be forwarded to embeddings. Fix: Added CompositeElement type to TYPE_TO_TEXT_ELEMENT_MAP to be able to process CompositeElements with unstructured.staging.base.isd_to_elements
@@ -1018,7 +1023,7 @@
10181023
### Features
10191024

10201025
* **Table OCR refactor** support Table OCR with pre-computed OCR data to ensure we only do one OCR for entrie document. User can specify
1021-
ocr agent tesseract/paddle in environment variable `OCR_AGENT` for OCRing the entire document.
1026+
ocr agent tesseract/paddle in environment variable `OCR_AGENT` for OCRing the entire document.
10221027
* **Adds accuracy function** The accuracy scoring was originally an option under `calculate_edit_distance`. For easy function call, it is now a wrapper around the original function that calls edit_distance and return as "score".
10231028
* **Adds HuggingFaceEmbeddingEncoder** The HuggingFace Embedding Encoder uses a local embedding model as opposed to using an API.
10241029
* **Add AWS bedrock embedding connector** `unstructured.embed.bedrock` now provides a connector to use AWS bedrock's `titan-embed-text` model to generate embeddings for elements. This features requires valid AWS bedrock setup and an internet connectionto run.
@@ -1049,7 +1054,7 @@ ocr agent tesseract/paddle in environment variable `OCR_AGENT` for OCRing the en
10491054
### Fixes
10501055

10511056
* **Fix paddle model file not discoverable** Fixes issue where ocr_models/paddle_ocr.py file is not discoverable on PyPI by adding
1052-
an `__init__.py` file under the folder.
1057+
an `__init__.py` file under the folder.
10531058
* **Chipper v2 Fixes** Includes fix for a memory leak and rare last-element bbox fix. (unstructured-inference==0.7.7)
10541059
* **Fix image resizing issue** Includes fix related to resizing images in the tables pipeline. (unstructured-inference==0.7.6)
10551060

@@ -1111,12 +1116,13 @@ an `__init__.py` file under the folder.
11111116
* **Applies `max_characters=<n>` argument to all element types in `add_chunking_strategy` decorator** Previously this argument was only utilized in chunking Table elements and now applies to all partitioned elements if `add_chunking_strategy` decorator is utilized, further preparing the elements for downstream processing.
11121117
* **Add common retry strategy utilities for unstructured-ingest** Dynamic retry strategy with exponential backoff added to Notion source connector.
11131118
*
1119+
11141120
### Features
11151121

11161122
* **Adds `bag_of_words` and `percent_missing_text` functions** In order to count the word frequencies in two input texts and calculate the percentage of text missing relative to the source document.
11171123
* **Adds `edit_distance` calculation metrics** In order to benchmark the cleaned, extracted text with unstructured, `edit_distance` (`Levenshtein distance`) is included.
11181124
* **Adds detection_origin field to metadata** Problem: Currently isn't an easy way to find out how an element was created. With this change that information is added. Importance: With this information the developers and users are now able to know how an element was created to make decisions on how to use it. In order tu use this feature
1119-
setting UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true is needed.
1125+
setting UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true is needed.
11201126
* **Adds a function that calculates frequency of the element type and its depth** To capture the accuracy of element type extraction, this function counts the occurrences of each unique element type with its depth for use in element metrics.
11211127

11221128
### Fixes
@@ -1126,11 +1132,10 @@ setting UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true is needed.
11261132
* **Fixes category_depth None value for Title elements** Problem: `Title` elements from `chipper` get `category_depth`= None even when `Headline` and/or `Subheadline` elements are present in the same page. Fix: all `Title` elements with `category_depth` = None should be set to have a depth of 0 instead iff there are `Headline` and/or `Subheadline` element-types present. Importance: `Title` elements should be equivalent html `H1` when nested headings are present; otherwise, `category_depth` metadata can result ambiguous within elements in a page.
11271133
* **Tweak `xy-cut` ordering output to be more column friendly** This results in the order of elements more closely reflecting natural reading order which benefits downstream applications. While element ordering from `xy-cut` is usually mostly correct when ordering multi-column documents, sometimes elements from a RHS column will appear before elements in a LHS column. Fix: add swapped `xy-cut` ordering by sorting by X coordinate first and then Y coordinate.
11281134
* **Fixes badly initialized Formula** Problem: YoloX contain new types of elements, when loading a document that contain formulas a new element of that class
1129-
should be generated, however the Formula class inherits from Element instead of Text. After this change the element is correctly created with the correct class
1130-
allowing the document to be loaded. Fix: Change parent class for Formula to Text. Importance: Crucial to be able to load documents that contain formulas.
1135+
should be generated, however the Formula class inherits from Element instead of Text. After this change the element is correctly created with the correct class
1136+
allowing the document to be loaded. Fix: Change parent class for Formula to Text. Importance: Crucial to be able to load documents that contain formulas.
11311137
* **Fixes pdf uri error** An error was encountered when URI type of `GoToR` which refers to pdf resources outside of its own was detected since no condition catches such case. The code is fixing the issue by initialize URI before any condition check.
11321138

1133-
11341139
## 0.10.19
11351140

11361141
### Enhancements
@@ -1207,7 +1212,6 @@ allowing the document to be loaded. Fix: Change parent class for Formula to Text
12071212

12081213
## 0.10.15
12091214

1210-
12111215
### Enhancements
12121216

12131217
* **Support for better element categories from the next-generation image-to-text model ("chipper").** Previously, not all of the classifications from Chipper were being mapped to proper `unstructured` element categories so the consumer of the library would see many `UncategorizedText` elements. This fixes the issue, improving the granularity of the element categories outputs for better downstream processing and chunking. The mapping update is:
@@ -1281,7 +1285,6 @@ allowing the document to be loaded. Fix: Change parent class for Formula to Text
12811285
* Add Jira Connector to be able to pull issues from a Jira organization
12821286
* Add `clean_ligatures` function to expand ligatures in text
12831287

1284-
12851288
### Fixes
12861289

12871290
* `partition_html` breaks on `<br>` elements.
@@ -1299,14 +1302,12 @@ allowing the document to be loaded. Fix: Change parent class for Formula to Text
12991302
* Support for yolox_quantized layout detection model (0.5.20)
13001303
* YoloX element types added
13011304

1302-
13031305
### Features
13041306

13051307
* Add Salesforce Connector to be able to pull Account, Case, Campaign, EmailMessage, Lead
13061308

13071309
### Fixes
13081310

1309-
13101311
* Bump unstructured-inference
13111312
* Avoid divide-by-zero errors swith `safe_division` (0.5.21)
13121313

@@ -1427,22 +1428,26 @@ allowing the document to be loaded. Fix: Change parent class for Formula to Text
14271428
* Adds ability to reuse connections per process in unstructured-ingest
14281429

14291430
### Features
1431+
14301432
* Add delta table connector
14311433

14321434
### Fixes
14331435

14341436
## 0.10.4
1437+
14351438
* Pass ocr_mode in partition_pdf and set the default back to individual pages for now
14361439
* Add diagrams and descriptions for ingest design in the ingest README
14371440

14381441
### Features
1442+
14391443
* Supports multipage TIFF image partitioning
14401444

14411445
### Fixes
14421446

14431447
## 0.10.2
14441448

14451449
### Enhancements
1450+
14461451
* Bump unstructured-inference==0.5.13:
14471452
- Fix extracted image elements being included in layout merge, addresses the issue
14481453
where an entire-page image in a PDF was not passed to the layout model when using hi_res.
@@ -1454,6 +1459,7 @@ allowing the document to be loaded. Fix: Change parent class for Formula to Text
14541459
## 0.10.1
14551460

14561461
### Enhancements
1462+
14571463
* Bump unstructured-inference==0.5.12:
14581464
- fix to avoid trace for certain PDF's (0.5.12)
14591465
- better defaults for DPI for hi_res and Chipper (0.5.11)
@@ -1505,7 +1511,6 @@ allowing the document to be loaded. Fix: Change parent class for Formula to Text
15051511

15061512
## 0.9.2
15071513

1508-
15091514
### Enhancements
15101515

15111516
* Update table extraction section in API documentation to sync with change in Prod API
@@ -1684,7 +1689,6 @@ allowing the document to be loaded. Fix: Change parent class for Formula to Text
16841689
* Adjust encoding recognition threshold value in `detect_file_encoding`
16851690
* Fix KeyError when `isd_to_elements` doesn't find a type
16861691
* Fix `_output_filename` for local connector, allowing single files to be written correctly to the disk
1687-
16881692
* Fix for cases where an invalid encoding is extracted from an email header.
16891693

16901694
### BREAKING CHANGES
@@ -1696,6 +1700,7 @@ allowing the document to be loaded. Fix: Change parent class for Formula to Text
16961700
### Enhancements
16971701

16981702
* Adds `include_metadata` kwarg to `partition_doc`, `partition_docx`, `partition_email`, `partition_epub`, `partition_json`, `partition_msg`, `partition_odt`, `partition_org`, `partition_pdf`, `partition_ppt`, `partition_pptx`, `partition_rst`, and `partition_rtf`
1703+
16991704
### Features
17001705

17011706
* Add Elasticsearch connector for ingest cli to pull specific fields from all documents in an index.
@@ -1930,10 +1935,8 @@ allowing the document to be loaded. Fix: Change parent class for Formula to Text
19301935

19311936
### Features
19321937

1933-
19341938
### Fixes
19351939

1936-
19371940
## 0.6.10
19381941

19391942
### Enhancements
@@ -2030,7 +2033,6 @@ allowing the document to be loaded. Fix: Change parent class for Formula to Text
20302033

20312034
### Fixes
20322035

2033-
20342036
## 0.6.4
20352037

20362038
### Enhancements
@@ -2067,7 +2069,6 @@ allowing the document to be loaded. Fix: Change parent class for Formula to Text
20672069
* Added logic to `partition_pdf` for detecting copy protected PDFs and falling back
20682070
to the hi res strategy when necessary.
20692071

2070-
20712072
### Features
20722073

20732074
* Add `partition_via_api` for partitioning documents through the hosted API.
@@ -2138,8 +2139,8 @@ allowing the document to be loaded. Fix: Change parent class for Formula to Text
21382139
* Added method to utils to allow date time format validation
21392140

21402141
### Features
2141-
* Add Slack connector to pull messages for a specific channel
21422142

2143+
* Add Slack connector to pull messages for a specific channel
21432144
* Add --partition-by-api parameter to unstructured-ingest
21442145
* Added `partition_rtf` for processing rich text files.
21452146
* `partition` now accepts a `url` kwarg in addition to `file` and `filename`.
@@ -2269,7 +2270,7 @@ allowing the document to be loaded. Fix: Change parent class for Formula to Text
22692270
### Features
22702271

22712272
* Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting
2272-
from `FsspecConnector`
2273+
from `FsspecConnector`
22732274
* Add `partition_epub` for partitioning e-books in EPUB3 format.
22742275

22752276
### Fixes
@@ -2302,16 +2303,16 @@ from `FsspecConnector`
23022303

23032304
* Fully move from printing to logging.
23042305
* `unstructured-ingest` now uses a default `--download_dir` of `$HOME/.cache/unstructured/ingest`
2305-
rather than a "tmp-ingest-" dir in the working directory.
2306+
rather than a "tmp-ingest-" dir in the working directory.
23062307

23072308
### Features
23082309

23092310
### Fixes
23102311

23112312
* `setup_ubuntu.sh` no longer fails in some contexts by interpreting
2312-
`DEBIAN_FRONTEND=noninteractive` as a command
2313+
`DEBIAN_FRONTEND=noninteractive` as a command
23132314
* `unstructured-ingest` no longer re-downloads files when --preserve-downloads
2314-
is used without --download-dir.
2315+
is used without --download-dir.
23152316
* Fixed an issue that was causing text to be skipped in some HTML documents.
23162317

23172318
## 0.5.1
@@ -2488,7 +2489,7 @@ is used without --download-dir.
24882489
* Add ability to extract document metadata from `.docx`, `.xlsx`, and `.jpg` files.
24892490
* Helper functions for identifying and extracting phone numbers
24902491
* Add new function `extract_attachment_info` that extracts and decodes the attachment
2491-
of an email.
2492+
of an email.
24922493
* Staging brick to convert a list of `Element`s to a `pandas` dataframe.
24932494
* Add plain text functionality to `partition_email`
24942495

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1 @@
1-
__version__ = "0.16.1" # pragma: no cover
1+
__version__ = "0.16.2-dev" # pragma: no cover
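
The 0.16.2-dev changelog entry added in this commit says the CCT Levenshtein distance is computed with standardized whitespace by default. The implementation itself is not part of this diff, so the following is only a minimal sketch of that idea, assuming "standardized" means collapsing each run of whitespace to a single space and trimming the ends. The names `normalize_whitespace`, `levenshtein`, `whitespace_invariant_distance`, and the `standardize_whitespace` flag are hypothetical and are not taken from the library.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and trim the ends.

    Assumption: this is what "standardized whitespace" means in the changelog.
    """
    return " ".join(text.split())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def whitespace_invariant_distance(
    a: str, b: str, standardize_whitespace: bool = True
) -> int:
    """Hypothetical wrapper: normalize whitespace first unless told not to."""
    if standardize_whitespace:
        a, b = normalize_whitespace(a), normalize_whitespace(b)
    return levenshtein(a, b)
```

With the flag left at its default, two strings that differ only in spacing or line breaks compare as identical. Passing `standardize_whitespace=False` makes every extra whitespace character count as an edit.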
