Description
I noticed that when the lines of a list block in a Markdown document are very short, the list is identified as a Title instead of a NarrativeText.
This prevents chunking by title from working properly afterwards, as I expect all content under a Markdown header to become a CompositeText to be indexed with an embedding function.
It is reproducible with the following Python code:
from unstructured.partition.md import partition_md
md_text = """
# header 1
## header 2
My list
- item 1
- item 2
- item 3
- item 4
- item 5
## header 3
"""
elements = partition_md(text=md_text)
for el in elements:
    # Print only the elements that were classified as Title
    if el.category == "Title":
        print("{}: {}".format(el.category, el.text))
As soon as any line of the list block (including the intro line) has more characters, the list block is properly recognized as a NarrativeText.
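To make the short-line versus long-line comparison easier to check, here is a minimal sketch that counts element categories for two variants of the same document. The helper names `category_counts` and `compare_variants` are hypothetical and not part of unstructured; only `partition_md(text=...)` and `el.category` come from the reproduction above.

```python
from collections import Counter


def category_counts(elements):
    """Count elements per category.

    Works on any objects exposing a `.category` attribute, such as the
    elements returned by partition_md.
    """
    return Counter(el.category for el in elements)


def compare_variants(short_md, long_md):
    """Partition two Markdown strings and return their category counts.

    Hypothetical helper: pass the short-line document and a variant where
    one list line is longer, then compare the two counters.
    """
    # Imported lazily so category_counts stays usable without unstructured.
    from unstructured.partition.md import partition_md

    return (
        category_counts(partition_md(text=short_md)),
        category_counts(partition_md(text=long_md)),
    )
```

Calling `compare_variants(md_text, longer_md_text)`, where `longer_md_text` is a copy of `md_text` with one list line lengthened, should show whether the Title count changes between the two variants.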
Package version:
unstructured 0.14.5