feat(html): list-item adopts single child block element as its text #3499

Open
@SeanIsYoung

Describe the bug
I'm using loose lists in Markdown (each item is separated by a blank line), and the HTML parser fails to identify the list. Depending on the context, it categorises the elements as either Title or NarrativeText.

Comparing the loose and tight lists, the difference seems to be the paragraph tag that wraps each loose list item, but I'm not sure exactly how that affects the parser.

I can mostly work around this by preprocessing the Markdown text to remove any blank lines that precede a list item, or by removing the paragraph tags from the HTML. But I'm not sure whether either approach will run into edge cases at some point, so it would be nice if the HTML parser could handle this.
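For illustration, the second workaround (removing the paragraph tags from the HTML) might look roughly like the sketch below. The helper name `unwrap_single_paragraph_items` is hypothetical and not part of unstructured. It assumes the HTML has plain `<li>` and `<p>` tags without attributes, which is what a loose list is expected to render to, and it only unwraps items whose sole child is a single paragraph.

```python
import re

# Matches an <li> whose only child is one <p>...</p>. The tempered pattern
# (?:(?!</?p>).)* refuses to cross another <p> or </p>, so list items that
# contain several paragraphs are left untouched.
# Assumption: tags carry no attributes (e.g. no <li class="...">).
_SINGLE_P_ITEM = re.compile(r"<li>\s*<p>((?:(?!</?p>).)*)</p>\s*</li>", re.DOTALL)


def unwrap_single_paragraph_items(html: str) -> str:
    """Turn <li><p>text</p></li> into <li>text</li> (hypothetical helper)."""
    return _SINGLE_P_ITEM.sub(r"<li>\1</li>", html)


# Hand-written sample shaped like the HTML a loose list renders to.
loose_html = (
    "<ol>\n"
    "<li>\n<p>list item one.</p>\n</li>\n"
    "<li>\n<p>list item two.</p>\n</li>\n"
    "</ol>"
)

print(unwrap_single_paragraph_items(loose_html))
```

The cleaned string could then be passed to `partition_html(text=...)`. A regex is a blunt tool for HTML, so this is only a stopgap: nested markup, attributes, or multi-paragraph items are exactly the edge cases mentioned above.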

To Reproduce

import markdown
from unstructured.partition.md import partition_md
from unstructured.partition.html import partition_html

loose_list = """
1. list item one.

2. list item two.

3. list item three.
"""

tight_list = """
1. list item one.
2. list item two.
3. list item three.
"""

print("markdown_loose:")
elements = partition_md(text=loose_list)
for el in elements:
    print(f"{el.category}: {el}")

print("\nmarkdown_tight:")
elements = partition_md(text=tight_list)
for el in elements:
    print(f"{el.category}: {el}")

print("\nhtml_loose:")
html = markdown.markdown(loose_list, extensions=["tables"])
print(html)

print("\nhtml_tight:")
html = markdown.markdown(tight_list, extensions=["tables"])
print(html)

Expected behavior
Both the tight and the loose lists should result in the ListItem element.

Environment Info

unstructured/scripts/collect_env.py:5: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  import pkg_resources
OS version:  Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python version:  3.11.9
unstructured version:  0.15.1
unstructured-inference version:  0.7.36
pytesseract version:  0.3.10
Torch version:  2.3.1.post300
Detectron2 version:  0.6
PaddleOCR is not installed
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice version:  LibreOffice 24.2.5.2 420(Build:2)
