Enhancements
- Add image_url of images in html partitioner: <img> tags with non-data content include a new image_url metadata field with the content of the src attribute.
- Use lxml instead of bs4 to parse hOCR data. lxml is much faster than bs4 given the hOCR data format is regular (guaranteed because it is programmatically generated).
- Bump numpy to >2, and upgrade paddlepaddle, unstructured-paddleocr, and onnx so they are compatible with numpy>2.
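To illustrate why a regular, machine-generated hOCR document can be read with a plain tree parser, here is a minimal sketch. It is not the library's implementation: the hOCR fragment, the `parse_hocr_words` helper, and the returned dictionary shape are all hypothetical, and the standard library's `xml.etree.ElementTree` is used as a stand-in so the snippet has no dependencies. `lxml.etree` exposes a largely compatible API, so swapping the import for `from lxml import etree as ET` should work for this snippet.

```python
import re
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

# Hypothetical, minimal hOCR fragment; real OCR output carries more attributes.
HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="ocr_page" title="bbox 0 0 200 100">
      <span class="ocrx_word" title="bbox 10 20 60 40; x_wconf 96">Hello</span>
      <span class="ocrx_word" title="bbox 70 20 130 40; x_wconf 91">world</span>
    </div>
  </body>
</html>"""

XHTML = "{http://www.w3.org/1999/xhtml}"
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_hocr_words(hocr: str) -> list[dict]:
    """Return the text and bounding box of every ocrx_word span (hypothetical helper)."""
    root = ET.fromstring(hocr)
    words = []
    for span in root.iter(f"{XHTML}span"):
        if span.get("class") != "ocrx_word":
            continue
        # The bbox lives in the title attribute as "bbox x0 y0 x1 y1".
        match = BBOX.search(span.get("title", ""))
        if match is None:
            continue
        words.append(
            {
                "text": (span.text or "").strip(),
                "bbox": tuple(int(value) for value in match.groups()),
            }
        )
    return words


words = parse_hocr_words(HOCR)
```

Because the markup is generated programmatically, the sketch can rely on well-formed XML and a fixed `title` layout rather than the lenient recovery a general HTML parser provides.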
Fixes
- Fix an image inside a div or span tag with no text being classified as "UncategorizedText" with no .text; it is now annotated as Image.
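The image-related changes in this release can be pictured with a small standalone sketch. It does not call the library. It uses the standard library's `html.parser` to show the rule described for the image_url field, reading "non-data content" as a `src` that is not an inline `data:` URI, which is an assumption. The `ImageUrlCollector` class and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class ImageUrlCollector(HTMLParser):
    """Collect src values from <img> tags whose src is not a data: URI (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.image_urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        # Assumption: inline data: URIs are excluded; only linked images get an image_url.
        if src and not src.startswith("data:"):
            self.image_urls.append(src)


# Images wrapped in a div and a span, with no surrounding text.
html = (
    '<div><img src="https://example.com/cat.png" alt="cat"></div>'
    '<span><img src="data:image/png;base64,iVBORw0KGgo="></span>'
)

collector = ImageUrlCollector()
collector.feed(html)
image_urls = collector.image_urls
```

Here only the linked image contributes a URL; the base64-encoded image is skipped because its content is already inline.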
What's Changed
- feat: support extracting image url in html by @ryannikolaidis in #3955
- feat: use lxml instead of bs4 to parse hOCR data by @badGarnet in #3960
- Feat/bump numpy to 2 by @badGarnet in #3961
- Image within div or span with no text is annotated as Image by @ajjimeno in #3962
Full Changelog: 0.17.0...0.17.2