feat: capture images/figures

**Describe the bug**
Perhaps on the cusp between bug and feature. When parsing HTML pages, I was surprised to find that any sub-tree wrapped in a `<figure>` is silently removed by `partition_html()`. This affects just about every Wikipedia article, since they often carry useful image URLs and text descriptions in `<figure>`s.

I haven't dug in much further, but from a quick look at the code, this may also apply to other, less common element types.

**To Reproduce**
```python
from unstructured.partition.html import partition_html

elems = partition_html(url="https://en.wikipedia.org/wiki/Neo-Riemannian_theory")


def find(text: str) -> None:
    """Print the first partitioned element whose text contains `text`."""
    for elem in elems:
        if text in elem.text:
            print("found it:\n", elem)
            return
    print("nope")


find("loose collection of ideas")  # finds this in the initial paragraph
find("minor as upside down major")  # can't find this because it's buried in a figure
```

**Expected behavior**
The `<figure>` contents should either be captured by default, or there should be an option controlling which elements are skipped.
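
For illustration, here is a minimal sketch of the kind of content one might expect to recover from a `<figure>`: the image URLs and the caption text. It uses only the standard library's `html.parser` and is not how `unstructured` works internally; `FigureCollector`, `extract_figures`, and the sample markup are hypothetical names and data made up for this example.

```python
from html.parser import HTMLParser


class FigureCollector(HTMLParser):
    """Hypothetical helper: collect image URLs and text found inside <figure>."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth of currently open <figure> tags
        self.figures = []   # one entry per top-level <figure>

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self.depth += 1
            if self.depth == 1:
                self.figures.append({"images": [], "text": []})
        elif self.depth and tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.figures[-1]["images"].append(src)

    def handle_endtag(self, tag):
        if tag == "figure" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that appears inside a figure.
        if self.depth and data.strip():
            self.figures[-1]["text"].append(data.strip())


def extract_figures(html: str):
    """Return a list of {"images": [...], "text": "..."} dicts, one per figure."""
    parser = FigureCollector()
    parser.feed(html)
    parser.close()
    return [
        {"images": fig["images"], "text": " ".join(fig["text"])}
        for fig in parser.figures
    ]


# Made-up sample markup, loosely modeled on a captioned image.
sample = """
<p>Intro paragraph.</p>
<figure>
  <img src="https://example.org/tonnetz.png">
  <figcaption>Minor as <i>upside down</i> major</figcaption>
</figure>
"""

for fig in extract_figures(sample):
    print(fig["images"], fig["text"])
```

Nested figures are folded into their outermost `<figure>` here, which is a simplification chosen for brevity rather than a recommendation.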

**Environment Info**
I don't have a local build going yet, but I promise it's a trivial repro in any environment.


feat: capture images/figures #3606
