feat: capture images/figures #3606

Open
@joelgwebber

Describe the bug
Perhaps on the cusp between bug and feature. When parsing HTML pages, I was surprised to find that any sub-tree wrapped in a <figure> is silently removed by partition_html(). This affects just about every Wikipedia article, where <figure>s often contain useful image URLs and text descriptions.

I haven't dug in much further, but from a quick look at the code, the same silent removal may also apply to other, less common element types.

To Reproduce

from unstructured.partition.html import partition_html

elems = partition_html(url="https://en.wikipedia.org/wiki/Neo-Riemannian_theory")


def find(text: str) -> None:
    """Print the first partitioned element whose text contains `text`."""
    for elem in elems:
        if text in elem.text:
            print("found it:\n", elem)
            return
    print("nope")


find("loose collection of ideas")  # found: this is in the opening paragraph
find("minor as upside down major")  # not found: this text sits inside a <figure>

Expected behavior
I would expect the <figure> contents to be included by default, or at least for there to be an option controlling which elements are skipped.
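
In the meantime, here is a rough sketch of a possible workaround that pulls figure contents out separately, using only the standard library's html.parser. It is not part of unstructured and says nothing about how partition_html() works internally; the names FigureCollector and extract_figures, and the sample HTML, are made up for illustration. It only collects <img> src values and <figcaption> text from each top-level <figure>.

```python
from html.parser import HTMLParser


class FigureCollector(HTMLParser):
    """Hypothetical helper: collect image URLs and caption text inside <figure> elements."""

    def __init__(self):
        super().__init__()
        self.figures = []         # one dict per top-level <figure>
        self._depth = 0           # nesting depth of currently open <figure> tags
        self._in_caption = False  # True while inside a <figcaption>

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self._depth += 1
            if self._depth == 1:
                self.figures.append({"images": [], "caption": ""})
        elif self._depth and tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.figures[-1]["images"].append(src)
        elif self._depth and tag == "figcaption":
            self._in_caption = True

    def handle_endtag(self, tag):
        if tag == "figure" and self._depth:
            self._depth -= 1
        elif tag == "figcaption":
            self._in_caption = False

    def handle_data(self, data):
        # Caption text may arrive in several chunks when it contains inline tags.
        if self._depth and self._in_caption:
            self.figures[-1]["caption"] += data


def extract_figures(html: str) -> list:
    """Return a list of {"images": [...], "caption": "..."} dicts, one per figure."""
    parser = FigureCollector()
    parser.feed(html)
    parser.close()
    for fig in parser.figures:
        # Collapse runs of whitespace left over from the markup's indentation.
        fig["caption"] = " ".join(fig["caption"].split())
    return parser.figures


# Made-up sample markup, loosely modeled on a Wikipedia figure.
sample = """
<p>A loose collection of ideas.</p>
<figure>
  <img src="/img/tonnetz.png" alt="Tonnetz">
  <figcaption>Minor as <i>upside down</i> major.</figcaption>
</figure>
"""

figures = extract_figures(sample)
for fig in figures:
    print(fig["images"], fig["caption"])
```

The results could be merged with the output of partition_html() by hand, but that loses the figure's position in the document, which is one reason built-in support would be preferable.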

Environment Info
I don't have a local build going yet, but I promise it's a trivial repro in any environment.
