List block in a partitioned Markdown doc identified as a Title element under special conditions  #3280

Open
@nickphilip

I noticed that when every line of a list block in a Markdown document is very short, the list is classified as a Title instead of a NarrativeText.

This prevents chunking by title from working properly afterwards, because I expect all content under a Markdown header to be combined into a CompositeElement that can then be indexed with an embedding function.

It is reproducible with the following Python code:

from unstructured.partition.md import partition_md

md_text = """
# header 1
## header 2
My list
- item 1
- item 2
- item 3
- item 4
- item 5
## header 3
"""

elements = partition_md(text=md_text)

for el in elements:
    if el.category == "Title":
        print(f"{el.category}: {el.text}")

As soon as any line of the list block (including the intro line) is longer, the list block is correctly recognized as a NarrativeText.
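Until the classification itself is fixed, one possible workaround is to post-process the partitioned output and demote any Title whose text does not correspond to an actual Markdown header line in the source. The sketch below is only an illustration under stated assumptions: `markdown_header_texts` and `demote_false_titles` are hypothetical helpers, not part of unstructured, and the function works on plain `(category, text)` pairs so that it runs without the library. The sample pairs are made up to mirror the behaviour described above; they are not captured output from `partition_md`.

```python
import re

# Matches ATX-style Markdown headers ("# title" through "###### title")
# and captures the header text without trailing whitespace.
HEADER_RE = re.compile(r"^#{1,6}[ \t]+(.*\S)", re.MULTILINE)


def markdown_header_texts(md_text):
    """Return the set of header texts found in the Markdown source."""
    return {match.group(1) for match in HEADER_RE.finditer(md_text)}


def demote_false_titles(pairs, md_text):
    """Relabel "Title" entries that are not real Markdown headers.

    `pairs` is a list of (category, text) tuples. Hypothetical helper:
    with real elements you would build the pairs from el.category and
    el.text, as in the reproduction script.
    """
    headers = markdown_header_texts(md_text)
    fixed = []
    for category, text in pairs:
        if category == "Title" and text.strip() not in headers:
            category = "NarrativeText"
        fixed.append((category, text))
    return fixed


md_text = """
# header 1
## header 2
My list
- item 1
- item 2
## header 3
"""

# Illustrative input only, shaped like the misclassification in this issue.
sample_pairs = [
    ("Title", "header 1"),
    ("Title", "header 2"),
    ("Title", "My list"),
    ("Title", "item 1"),
    ("Title", "header 3"),
]

for category, text in demote_false_titles(sample_pairs, md_text):
    print(f"{category}: {text}")
```

This only relabels categories by comparing text, so a short list item whose text happens to equal a header would still be kept as a Title. It also returns new pairs rather than modifying element objects, so it would have to be adapted before its result could be passed to the chunking step.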

Package version:
unstructured 0.14.5

Labels: bug (Something isn't working), markdown (Related to partitioning Markdown documents)
