Releases: Unstructured-IO/unstructured
0.17.2
Enhancements
- Add image_url of images in html partitioner. <img> tags with non-data content include a new image_url metadata field with the content of the src attribute.
- Use lxml instead of bs4 to parse hOCR data. lxml is much faster than bs4 given that the hOCR data format is regular (guaranteed because it is programmatically generated).
- Bump numpy to >2, and upgrade paddlepaddle, unstructured-paddleocr, and onnx so they are compatible with numpy>2.
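To illustrate what "non-data content" means for the image_url enhancement, here is a minimal sketch using only the Python standard library. It is not the library's implementation: the class ImageUrlCollector and the function collect_image_urls are hypothetical names, and the sample HTML is made up. The sketch records the src of an <img> tag only when it is not an inline data: URI.

```python
from html.parser import HTMLParser


class ImageUrlCollector(HTMLParser):
    """Hypothetical helper: collect src values of <img> tags, skipping data: URIs."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Inline base64 images (data: URIs) carry no URL worth recording.
        if src and not src.strip().lower().startswith("data:"):
            self.image_urls.append(src)


def collect_image_urls(html):
    """Return the src of every <img> tag whose content is not an inline data URI."""
    collector = ImageUrlCollector()
    collector.feed(html)
    return collector.image_urls


# Made-up sample: one linked image and one inline image.
sample = (
    '<div><img src="https://example.com/cat.png" alt="cat">'
    '<img src="data:image/png;base64,iVBORw0KGgo=" alt="inline"></div>'
)
print(collect_image_urls(sample))
```

In the library itself, the value is exposed as the image_url field on the element metadata, as described in the entry above.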
Fixes
- Fix an image inside a div or span tag with no text being classified as "UncategorizedText" with no .text. Such images are now annotated as Image.
What's Changed
- feat: support extracting image url in html by @ryannikolaidis in #3955
- feat: use lxml instead of bs4 to parse hOCR data by @badGarnet in #3960
- Feat/bump numpy to 2 by @badGarnet in #3961
- Image within div or span with no text is annotated as Image by @ajjimeno in #3962
Full Changelog: 0.17.0...0.17.2
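The hOCR change in 0.17.2 relies on hOCR being regular, machine-generated markup, which suits a tree-based XML parser. The sketch below shows the idea under stated assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml (the two offer a largely compatible element API), the hOCR snippet is made up, and parse_hocr_words is a hypothetical helper, not a function from the library.

```python
import xml.etree.ElementTree as ET

# Made-up hOCR fragment: words are spans with class "ocrx_word" and a
# title attribute of the form "bbox x0 y0 x1 y1; x_wconf NN".
HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="ocr_page" title="bbox 0 0 200 100">
      <span class="ocrx_word" title="bbox 10 20 50 40; x_wconf 96">Hello</span>
      <span class="ocrx_word" title="bbox 60 20 110 40; x_wconf 91">world</span>
    </div>
  </body>
</html>"""


def parse_hocr_words(hocr):
    """Return a list of (text, (x0, y0, x1, y1)) for every ocrx_word span."""
    root = ET.fromstring(hocr)
    words = []
    for el in root.iter():
        if el.get("class") != "ocrx_word":
            continue
        # Split the title into named fields, e.g. {"bbox": [...], "x_wconf": [...]}.
        fields = {}
        for part in el.get("title", "").split(";"):
            tokens = part.split()
            if tokens:
                fields[tokens[0]] = tokens[1:]
        if "bbox" not in fields:
            continue
        bbox = tuple(int(value) for value in fields["bbox"])
        words.append(((el.text or "").strip(), bbox))
    return words


print(parse_hocr_words(HOCR))
```

Because the input is generated programmatically, a strict parser of this kind does not need the tolerance for malformed markup that a general HTML parser provides.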
0.17.0
What's Changed
- feat: include images when partitioning html by @ryannikolaidis in #3945
- fix: pass extract image args to all partitioners by @ryannikolaidis in #3950
- feat: allow passing down of ocr agent and table agent by @badGarnet in #3954
- Feat/remove reference of PageLayout.elements by @badGarnet in #3943
Full Changelog: 0.16.25...0.17.0
0.16.25
0.16.24
Enhancements
- Support dynamic partitioner file type registration. Use create_file_type to create a new file type that can be handled in unstructured, and register_partitioner to register your own partitioner for any file type.
- extract_image_block_types now also works for CamelCase element type names. Previously, NarrativeText and similar CamelCase element types could not be extracted using this parameter in partition. Now figures for those elements can be extracted like Image and Table elements.
- Use block matrix to reduce peak memory usage for pdf/image partition.
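The general pattern behind dynamic partitioner registration can be sketched as a decorator-based registry. This is a generic illustration using only the standard library, not the library's code: register_for_extension, partition_text, partition_foo and the .foo extension are all hypothetical, and the real create_file_type and register_partitioner signatures are not shown here.

```python
from pathlib import Path

# Hypothetical registry: maps a file extension to a partitioner function.
_PARTITIONERS = {}


def register_for_extension(extension):
    """Decorator that registers a function as the handler for an extension."""

    def decorator(func):
        _PARTITIONERS[extension.lower()] = func
        return func

    return decorator


@register_for_extension(".foo")
def partition_foo(text):
    # Toy partitioner: one element per non-empty line.
    return [line.strip() for line in text.splitlines() if line.strip()]


def partition_text(filename, text):
    """Dispatch to whichever partitioner is registered for the file's extension."""
    extension = Path(filename).suffix.lower()
    try:
        handler = _PARTITIONERS[extension]
    except KeyError:
        raise ValueError(f"no partitioner registered for {extension!r}") from None
    return handler(text)


print(partition_text("notes.foo", "first\n\nsecond\n"))
```

The point of the pattern is that new file types can be added by user code at import time, without modifying the dispatching function.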
Features
- Add JSON elements to HTML converter - Converts JSON elements file into an HTML file.
Fixes
0.16.23
0.16.22
0.16.21
Enhancements
- Use password to load PDF with all modes.
- Use vectorized logic to merge inferred and extracted layouts. The layout merging logic is refactored with the new LayoutElements data structure and the numpy library, improving compute performance and making the logic clearer.
- Add PDF Miner configuration. PDF Miner can now be configured via the pdfminer_line_overlap, pdfminer_word_margin, pdfminer_line_margin and pdfminer_char_margin parameters added to the partition method.
Features
Fixes
- Fix file type detection for NDJSON files. NDJSON files were being detected as JSON due to having the same mime-type.
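Since the mime-type alone cannot separate the two formats, content has to be inspected. The release notes do not say how the fix does this, so the following is only one plausible heuristic, with a hypothetical function name: treat the input as NDJSON when it has at least two non-empty lines and each of them parses as a standalone JSON value.

```python
import json


def looks_like_ndjson(text):
    """Heuristic (an assumption, not the library's logic): every non-empty
    line is a standalone JSON value, and there are at least two such lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        return False
    try:
        for line in lines:
            json.loads(line)
    except json.JSONDecodeError:
        return False
    return True


print(looks_like_ndjson('{"a": 1}\n{"a": 2}\n'))
print(looks_like_ndjson('{\n  "a": 1\n}'))
```

A pretty-printed JSON document fails the check because its individual lines (for example a lone opening brace) are not valid JSON on their own.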
0.16.20
Enhancements
Features
Fixes
- Fix a security issue where rst and org files could read files in the local filesystem. Certain filetypes could 'include' or 'import' local files into their content, allowing partitioning of arbitrary files from the local filesystem. Partitioning of these files is now sandboxed.
0.16.19
Enhancements
Features
Fixes
- Fix a bug where table extraction was skipped when it should not have been. Pages containing only a table, or starting with a table, missed table extraction. The routing logic is now fixed.
- Correct deprecated ruff invocation in make tidy. This future-proofs it and avoids surprises if someone happens to upgrade Ruff.
- Remove upper bound constraint on python version in setup.py. Python 3.13 is not yet officially supported, but users are now allowed to try it.
- Fix the removal of HTML elements from inside table cells in html partition v=2.0. The HTML partitioner now correctly preserves HTML elements inside table cells.
0.16.17
Enhancements
- Refactor the VoyageAI integration to use the voyageai package directly, allowing extra features.
Features
Fixes
- Fix a bug where build_layout_elements_from_cor_regions incorrectly joins texts in the wrong order.
Full Changelog: 0.16.16...0.16.17