Releases · Unstructured-IO/unstructured

08 Jul 08:13

ds-filipknefel

0.18.4

f078cd9

0.18.4 Latest


What's Changed

fix(partition, csv): increase csv field limit by @ds-filipknefel in #4046
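For context, Python's standard csv module rejects any field longer than its field size limit, which defaults to 131072 characters. The release note does not say how #4046 raises the limit, so the following is only a minimal stdlib sketch of the mechanism, not the library's code; `read_rows` and the sizes used are hypothetical.

```python
import csv
import io

# One field deliberately larger than the stdlib default limit (131072 characters).
big_field = "x" * 200_000
data = f"id,text\n1,{big_field}\n"


def read_rows(text: str) -> list:
    """Parse CSV text into a list of rows (hypothetical helper)."""
    return list(csv.reader(io.StringIO(text)))


# Calling field_size_limit() with no argument returns the current limit.
previous_limit = csv.field_size_limit()

# With the default limit, the oversized field makes the reader raise csv.Error.
csv.field_size_limit(131072)
try:
    read_rows(data)
    hit_limit = False
except csv.Error:
    hit_limit = True

# Raising the limit lets the same data parse.
csv.field_size_limit(1_000_000)
rows = read_rows(data)

# Restore whatever limit was in place before this sketch ran.
csv.field_size_limit(previous_limit)
```

The limit is process-wide state in the csv module, which is why the sketch saves and restores the previous value.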

Full Changelog: 0.18.3...0.18.4

Contributors

ds-filipknefel


05 Jul 19:34

awalker4

0.18.3

8a9abdd

0.18.3

What's Changed

chore: bump pillow to address a CVE by @awalker4 in #4045

Full Changelog: 0.18.2...0.18.3

Contributors

awalker4


01 Jul 23:42

badGarnet

0.18.2

d7dfda9

0.18.2

What's Changed

fix [NEX-49] : Fix TypeError for empty HTML content by @yuming-long in #4032
fix: add try/except wrap over row.cells to failproof tc grid_offset by @Klaijan in #4033
fix: xml processing not escaped by @jiajun-unstructured in #4034
fix: update md to reads umlauts on non-utf-8 files by @Klaijan in #4037
bump version by @Klaijan in #4038
fix: fix header and footer not parsed as Header/Footer types by @badGarnet in #4041
bump version to make a release by @badGarnet in #4042

Full Changelog: 0.18.1...0.18.2

Contributors

badGarnet, Klaijan, and 2 other contributors


24 Jun 23:52

ryannikolaidis

0.18.1

3f87946

0.18.1

Enhancements

Features

Add DocumentData element type. This is helpful in scenarios where there is large data that does not make sense to represent across each element in the document.

Fixes

The encoding property of the _CsvPartitioningContext is now properly used.


13 Jun 02:43

PastelStorm

0.17.11-dev1

5e43e36

0.17.11-dev1 Pre-release


What's Changed

Matches prefix to verify presence of DOCX,PPTX,XLSX files instead of standard file names by @srisudarsan in #3959
manual trigger of workflows to publish new image and new vers tag in … by @luke-kucing in #3965
chore: deprecate stage_for_label_studio by @qued in #3968
build: remove test and dev deps from docker image by @qued in #3969
feat: convenience unstructured-get-json.sh update by @cragwolfe in #3971
chore: allow changing default output dir for unstructured-get-json.sh by @cragwolfe in #3973
chore: add html path to ingest-test-fixtures-update-pr by @cragwolfe in #3977
fix: hi_res PDF parsing: only uncategorized text for extracted elements by @cragwolfe in #3975
Fix sort_page_element. ensures that sorting is stable and not random. by @pprados in #3978
Update pdfminer_utils.py by @Nathan-GoSupply in #3974
fix cve by @potter-potter in #3989
fix: Add missing diffstat command to test_json_to_html CI job by @mpolomdeepsense in #3992
fix: failing build by @mpolomdeepsense in #3993
fix: properly handle the case when an element's text is None by @badGarnet in #3995
fix: Fix for Pillow error when extracting PNG images by @awalker4 in #3998
fix: throw validation error when json is passed with invalid unstructured json by @jordan-homan in #4002
Replace Serverless API to Platform announcement on README page by @ron-unstructured in #4003
fix: resolve warnings of logger library by @emmanuel-ferdman in #3999
chore: script to verify unstructured image outbound connectivity by @cragwolfe in #4008
resolve CVEs and HF issue by @luke-kucing in #4009
Feat/bump inference by @badGarnet in #4013
Bump requests to address CVEs by @PastelStorm in #4015
Drop Python 3.9 support due to dependency conflicts by @PastelStorm in #4017
Remove IDs from HTML code by @plutasnyy in #4012
fix chucking text None type has no attribute stripe by @yuming-long in #4018
recompile on arm64 to get minimum reqs by @badGarnet in #4020

New Contributors

@srisudarsan made their first contribution in #3959
@Nathan-GoSupply made their first contribution in #3974
@jordan-homan made their first contribution in #4002
@emmanuel-ferdman made their first contribution in #3999
@PastelStorm made their first contribution in #4015

Full Changelog: 0.17.2...0.17.11-dev1

Contributors

pprados, badGarnet, and 14 other contributors


20 Mar 16:52

ajjimeno

0.17.2

0fa5174

0.17.2

Enhancements

Add image_url of images in html partitioner. <img> tags with non-data content now include a new image_url metadata field with the content of the src attribute.
Use lxml instead of bs4 to parse hOCR data. lxml is much faster than bs4 given that the hOCR data format is regular (guaranteed because it is programmatically generated).
Bump numpy to >2, and upgrade paddlepaddle, unstructured-paddleocr, and onnx so they are compatible with numpy>2.
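To illustrate the image_url enhancement, here is a toy stdlib sketch of the rule it describes: an <img> whose src is an ordinary URL contributes that URL, while a src holding inline data: content does not. This is not the html partitioner's implementation; `ImgSrcCollector` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect src values of <img> tags, skipping inline data: URIs (toy example)."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Only non-data sources are treated as an image URL.
        if src and not src.startswith("data:"):
            self.image_urls.append(src)


html = (
    '<div>'
    '<img src="https://example.com/a.png" alt="A">'
    '<img src="data:image/png;base64,iVBORw0KGgo=" alt="B">'
    '</div>'
)

collector = ImgSrcCollector()
collector.feed(html)
```

After feeding the sample markup, `collector.image_urls` holds only the https source; the data: image is skipped.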

Fixes

Fix an image in a <div> or <span> tag with no text being categorized as "UncategorizedText" with no .text; it is now annotated as Image.

What's Changed

feat: support extracting image url in html by @ryannikolaidis in #3955
feat: use lxml instead of bs4 to parse hOCR data by @badGarnet in #3960
Feat/bump numpy to 2 by @badGarnet in #3961
Image within div or span with no text is annotated as Image by @ajjimeno in #3962

Full Changelog: 0.17.0...0.17.2

Contributors

badGarnet, ryannikolaidis, and ajjimeno


12 Mar 15:57

badGarnet

0.17.0

2dceac3

0.17.0

What's Changed

feat: include images when partitioning html by @ryannikolaidis in #3945
fix: pass extract image args to all partitioners by @ryannikolaidis in #3950
feat: allow passing down of ocr agent and table agent by @badGarnet in #3954
Feat/remove reference of PageLayout.elements by @badGarnet in #3943

Full Changelog: 0.16.25...0.17.0

Contributors

badGarnet and ryannikolaidis


07 Mar 11:17

plutasnyy

0.16.25

74b0647

0.16.25

Enhancements

Features

Fixes

Fixes filetype detection for JSON passed as byte streams. Detection now prioritizes magic mimetype prediction over the file extension.
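The ordering described in this fix (content-based prediction first, file extension second) can be sketched with the standard library alone. This is an illustration of the idea, not the library's detection code: `sniff_content` is a hypothetical stand-in for the magic-based prediction, and `detect_mime` and `EXTENSION_TYPES` are likewise invented for the example.

```python
import json
from pathlib import Path
from typing import Optional

# Hypothetical extension table used only as the fallback.
EXTENSION_TYPES = {
    ".json": "application/json",
    ".txt": "text/plain",
    ".csv": "text/csv",
}


def sniff_content(data: bytes) -> Optional[str]:
    """Guess a MIME type from the bytes themselves (stand-in for a magic lookup)."""
    try:
        parsed = json.loads(data.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    # Only treat objects and arrays as JSON documents.
    return "application/json" if isinstance(parsed, (dict, list)) else None


def detect_mime(data: bytes, filename: Optional[str] = None) -> str:
    """Prefer the content-based guess; fall back to the file extension."""
    guessed = sniff_content(data)
    if guessed is not None:
        return guessed
    if filename:
        extension = Path(filename).suffix.lower()
        if extension in EXTENSION_TYPES:
            return EXTENSION_TYPES[extension]
    return "application/octet-stream"


# JSON bytes carrying a misleading name are still detected as JSON.
json_result = detect_mime(b'[{"type": "Title"}]', "upload.txt")
text_result = detect_mime(b"hello", "notes.txt")
unknown_result = detect_mime(b"\xff\xfe", None)
```

With this ordering, a byte stream that has a wrong or missing file name is still classified by what it contains.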


07 Mar 11:17

plutasnyy

0.16.24

961c8d5

0.16.24

Enhancements

Support dynamic partitioner file type registration. Use create_file_type to create a new file type that can be handled in unstructured, and register_partitioner to register your own partitioner for any file type.
extract_image_block_types now also works for CamelCase element type names. Previously, NarrativeText and similar CamelCase element types could not be extracted using this parameter in partition. Now figures for those elements can be extracted like Image and Table elements.
Use a block matrix to reduce peak memory usage for pdf/image partition.

Features

Add JSON elements to HTML converter - Converts JSON elements file into an HTML file.

Fixes


20 Feb 13:31

plutasnyy

0.16.23

0df50fe

0.16.23

Enhancements

Features

Fixes

Fixes detect_filetype when a SpooledTemporaryFile is passed. Previously, a random name would get assigned to the file and the function raised an error.

