Releases: Unstructured-IO/unstructured
0.18.4
What's Changed
- fix(partition, csv): increase csv field limit by @ds-filipknefel in #4046
Full Changelog: 0.18.3...0.18.4
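For readers unfamiliar with the CSV field limit mentioned above: Python's standard `csv` module rejects any single field longer than a configurable limit (131072 characters by default in CPython). The sketch below shows the general technique of raising that limit around a parse. It is not taken from #4046; the helper name `read_rows` and the chosen limit values are assumptions made for illustration.

```python
import csv
import io

# Remember the interpreter-wide limit so it can be compared against later.
DEFAULT_LIMIT = csv.field_size_limit()


def read_rows(text: str, field_limit: int) -> list[list[str]]:
    """Parse CSV text with a temporary per-field size limit.

    Hypothetical helper: the limit is global to the csv module, so the
    previous value is restored once parsing finishes.
    """
    previous = csv.field_size_limit(field_limit)
    try:
        return list(csv.reader(io.StringIO(text)))
    finally:
        csv.field_size_limit(previous)


# A single cell larger than the default limit.
big_cell = "x" * 200_000
rows = read_rows(f"id,body\n1,{big_cell}\n", field_limit=1_000_000)
```

With a limit below the cell size, the same call raises `csv.Error` instead of returning rows.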
0.18.3
0.18.2
What's Changed
- fix [NEX-49] : Fix TypeError for empty HTML content by @yuming-long in #4032
- fix: add try/except wrap over row.cells to failproof tc grid_offset by @Klaijan in #4033
- fix: xml processing not escaped by @jiajun-unstructured in #4034
- fix: update md to reads umlauts on non-utf-8 files by @Klaijan in #4037
- bump version by @Klaijan in #4038
- fix: fix header and footer not parsed as Header/Footer types by @badGarnet in #4041
- bump version to make a release by @badGarnet in #4042
Full Changelog: 0.18.1...0.18.2
0.18.1
Enhancements
Features
- Add DocumentData element type. This is helpful in scenarios where there is large data that does not make sense to represent across each element in the document.
Fixes
- The encoding property of the _CsvPartitioningContext is now properly used.
0.17.11-dev1
What's Changed
- Matches prefix to verify presence of DOCX,PPTX,XLSX files instead of standard file names by @srisudarsan in #3959
- manual trigger of workflows to publish new image and new vers tag in … by @luke-kucing in #3965
- chore: deprecate stage_for_label_studio by @qued in #3968
- build: remove test and dev deps from docker image by @qued in #3969
- feat: convenience unstructured-get-json.sh update by @cragwolfe in #3971
- chore: allow changing default output dir for unstructured-get-json.sh by @cragwolfe in #3973
- chore: add html path to ingest-test-fixtures-update-pr by @cragwolfe in #3977
- fix: hi_res PDF parsing: only uncategorized text for extracted elements by @cragwolfe in #3975
- Fix sort_page_element. ensures that sorting is stable and not random. by @pprados in #3978
- Update pdfminer_utils.py by @Nathan-GoSupply in #3974
- fix cve by @potter-potter in #3989
- fix: Add missing diffstat command to test_json_to_html CI job by @mpolomdeepsense in #3992
- fix: failing build by @mpolomdeepsense in #3993
- fix: properly handle the case when an element's text is None by @badGarnet in #3995
- fix: Fix for Pillow error when extracting PNG images by @awalker4 in #3998
- fix: throw validation error when json is passed with invalid unstructured json by @jordan-homan in #4002
- Replace Serverless API to Platform announcement on README page by @ron-unstructured in #4003
- fix: resolve warnings of logger library by @emmanuel-ferdman in #3999
- chore: script to verify unstructured image outbound connectivity by @cragwolfe in #4008
- resolve CVEs and HF issue by @luke-kucing in #4009
- Feat/bump inference by @badGarnet in #4013
- Bump requests to address CVEs by @PastelStorm in #4015
- Drop Python 3.9 support due to dependency conflicts by @PastelStorm in #4017
- Remove IDs from HTML code by @plutasnyy in #4012
- fix chunking text None type has no attribute strip by @yuming-long in #4018
- recompile on arm64 to get minimum reqs by @badGarnet in #4020
New Contributors
- @srisudarsan made their first contribution in #3959
- @Nathan-GoSupply made their first contribution in #3974
- @jordan-homan made their first contribution in #4002
- @emmanuel-ferdman made their first contribution in #3999
- @PastelStorm made their first contribution in #4015
Full Changelog: 0.17.2...0.17.11-dev1
0.17.2
Enhancements
- Add image_url of images in html partitioner. <img> tags with non-data content include a new image_url metadata field with the content of the src attribute.
- Use lxml instead of bs4 to parse hOCR data. lxml is much faster than bs4 given the hOCR data format is regular (guaranteed because it is programmatically generated).
- Bump numpy to >2, and upgrade paddlepaddle, unstructured-paddleocr, and onnx so they are compatible with numpy>2.
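To illustrate the image_url behavior described above, here is a minimal sketch using only the standard library's `html.parser`. It is not the unstructured implementation. It assumes that "non-data content" means a src that is not an inline `data:` URI, and the names `ImageUrlCollector` and `collect_image_urls` are hypothetical.

```python
from html.parser import HTMLParser


class ImageUrlCollector(HTMLParser):
    """Collect the src of each <img> tag, skipping inline data: URIs."""

    def __init__(self) -> None:
        super().__init__()
        self.image_urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Assumption: only non-data sources count as an image URL.
        if src and not src.startswith("data:"):
            self.image_urls.append(src)


def collect_image_urls(html: str) -> list[str]:
    """Return the URL-like src values found in the given HTML string."""
    parser = ImageUrlCollector()
    parser.feed(html)
    parser.close()
    return parser.image_urls


sample = (
    '<p>Logo: <img src="https://example.com/logo.png" alt="logo"></p>'
    '<img src="data:image/png;base64,AAAA">'
)
urls = collect_image_urls(sample)
```

In this sketch only the first image contributes a URL; the base64-embedded image is skipped.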
Fixes
- Fix Image in a tag is "UncategorizedText" with no .text
What's Changed
- feat: support extracting image url in html by @ryannikolaidis in #3955
- feat: use lxml instead of bs4 to parse hOCR data by @badGarnet in #3960
- Feat/bump numpy to 2 by @badGarnet in #3961
- Image within div or span with no text is annotated as Image by @ajjimeno in #3962
Full Changelog: 0.17.0...0.17.2
0.17.0
What's Changed
- feat: include images when partitioning html by @ryannikolaidis in #3945
- fix: pass extract image args to all partitioners by @ryannikolaidis in #3950
- feat: allow passing down of ocr agent and table agent by @badGarnet in #3954
- Feat/remove reference of PageLayout.elements by @badGarnet in #3943
Full Changelog: 0.16.25...0.17.0
0.16.25
0.16.24
Enhancements
- Support dynamic partitioner file type registration. Use create_file_type to create a new file type that can be handled in unstructured, and register_partitioner to register your own partitioner for any file type.
- extract_image_block_types now also works for CamelCase element type names. Previously, NarrativeText and similar CamelCase element types could not be extracted using this parameter in partition. Now figures for those elements can be extracted like Image and Table elements.
- Use block matrix to reduce peak memory usage for pdf/image partition.
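The idea behind dynamic partitioner registration can be sketched as a generic registry pattern. The code below is illustrative only: it does not use or reproduce the signatures of create_file_type or register_partitioner, and every name in it (`register_for_extension`, `partition_by_name`, the `.foo` extension) is hypothetical.

```python
from pathlib import Path
from typing import Callable

# Hypothetical registry: file extension -> function that splits text into elements.
Partitioner = Callable[[str], list[str]]
_PARTITIONERS: dict[str, Partitioner] = {}


def register_for_extension(extension: str) -> Callable[[Partitioner], Partitioner]:
    """Decorator that records a partitioner for the given file extension."""

    def decorator(func: Partitioner) -> Partitioner:
        _PARTITIONERS[extension.lower()] = func
        return func

    return decorator


@register_for_extension(".foo")
def partition_foo(text: str) -> list[str]:
    """Toy partitioner: one element per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def partition_by_name(filename: str, text: str) -> list[str]:
    """Dispatch to whichever partitioner is registered for the file's extension."""
    ext = Path(filename).suffix.lower()
    try:
        partitioner = _PARTITIONERS[ext]
    except KeyError:
        raise ValueError(f"no partitioner registered for {ext!r}") from None
    return partitioner(text)


elements = partition_by_name("notes.FOO", "first line\n\nsecond line\n")
```

Registering through a decorator keeps the dispatch code unchanged when a new file type is added, which is the benefit the enhancement describes.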
Features
- Add JSON elements to HTML converter - Converts JSON elements file into an HTML file.