0.17.2

@ajjimeno ajjimeno released this 20 Mar 16:52
· 9 commits to main since this release
0fa5174

Enhancements

  • Add image_url of images in html partitioner. <img> tags with non-data content now include a new image_url metadata field containing the content of the src attribute.

  • Use lxml instead of bs4 to parse hOCR data. lxml is much faster than bs4, given that the hOCR data format is regular (guaranteed because it is programmatically generated).

  • Bump numpy to >2, and upgrade paddlepaddle, unstructured-paddleocr, and onnx so they are compatible with numpy>2.
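
The image_url rule from the first enhancement can be illustrated with a small standalone sketch. This is not the partitioner's code: it uses only the standard-library html.parser, the names ImageUrlCollector and collect_image_urls are hypothetical, and it assumes "non-data content" means a src that is not an inline data: URI.

```python
from html.parser import HTMLParser


class ImageUrlCollector(HTMLParser):
    """Collect src values of <img> tags whose src is not an inline data: URI."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Assumption: "non-data content" means the src is not a data: URI.
        if src and not src.strip().lower().startswith("data:"):
            self.image_urls.append(src)


def collect_image_urls(html):
    """Return the src of every <img> in html that points at a URL."""
    parser = ImageUrlCollector()
    parser.feed(html)
    return parser.image_urls


sample = (
    '<div><img src="https://example.com/cat.png" alt="cat">'
    '<img src="data:image/png;base64,iVBORw0KGgo=" alt="inline"></div>'
)
print(collect_image_urls(sample))  # ['https://example.com/cat.png']
```

In the library itself the value is exposed as the image_url metadata field on the resulting element, as described above; the sketch only shows which <img> tags would qualify under the stated assumption.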

Fixes

  • Fix Image in a tag is "UncategorizedText" with no .text

What's Changed

Full Changelog: 0.17.0...0.17.2