Description
Describe the bug
I'm using loose lists in markdown (each item is separated by a blank line), and the html parser fails to identify the list. Depending on the context, it categorises the elements as either Title or NarrativeText.
Comparing the loose and tight lists, it seems to be related to the paragraph tag that wraps each item of a loose list, but I'm not sure exactly how that affects the parser.
I can mostly work around this by parsing the markdown text first and removing any blank lines that precede a list item, or by removing the paragraph tags from the html. But I'm not sure whether either of those will run into edge cases at some point, so it would be nice if the html parser could handle this.
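For reference, the second workaround (removing the paragraph tags from the html) could look roughly like the sketch below. The function name `unwrap_loose_list_items` is made up for illustration and is not part of unstructured. It assumes the html has the shape python-markdown produces for loose lists, with each `<li>` containing exactly one `<p>`, and it deliberately leaves list items with several paragraphs untouched.

```python
import re

# Matches an <li> whose only child is a single <p>...</p>.
# The tempered group refuses to cross another <p> or </p>, so list items
# that contain several paragraphs are not matched and stay unchanged.
_LOOSE_ITEM = re.compile(
    r"<li>\s*<p>((?:(?!</?p[\s>]).)*?)</p>\s*</li>",
    re.DOTALL,
)


def unwrap_loose_list_items(html: str) -> str:
    """Turn <li><p>text</p></li> into <li>text</li> (hypothetical helper)."""
    return _LOOSE_ITEM.sub(r"<li>\1</li>", html)


loose_html = (
    "<ol>\n"
    "<li>\n<p>list item one.</p>\n</li>\n"
    "<li>\n<p>list item two.</p>\n</li>\n"
    "</ol>"
)
print(unwrap_loose_list_items(loose_html))
```

The result could then be passed to partition_html(text=...). A regex is a blunt tool for html, so this is only a stopgap; attributes on the `<li>` tag or nested markup may need a real html parser.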
To Reproduce
import markdown
from unstructured.partition.md import partition_md
from unstructured.partition.html import partition_html
loose_list = """
1. list item one.

2. list item two.

3. list item three.
"""
tight_list = """
1. list item one.
2. list item two.
3. list item three.
"""
print("markdown_loose:")
elements = partition_md(text=loose_list)
for el in elements:
    print(f"{el.category}: {el}")
print("\nmarkdown_tight:")
elements = partition_md(text=tight_list)
for el in elements:
    print(f"{el.category}: {el}")
print("\nhtml_loose:")
html = markdown.markdown(loose_list, extensions=["tables"])
print(html)
print("\nhtml_tight:")
html = markdown.markdown(tight_list, extensions=["tables"])
print(html)
Expected behavior
Both the tight and the loose lists should result in ListItem elements.
Environment Info
OS version: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python version: 3.11.9
unstructured version: 0.15.1
unstructured-inference version: 0.7.36
pytesseract version: 0.3.10
Torch version: 2.3.1.post300
Detectron2 version: 0.6
PaddleOCR is not installed
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice version: LibreOffice 24.2.5.2 420(Build:2)