bug: UnstructuredFileLoader fails on Markdown files (.md) – AttributeError in partition_md #3935

Open
@kaustubh-darekar

Description

I’m encountering an issue when using UnstructuredFileLoader to process a Markdown (.md) file.

The loader throws an AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing' when calling partition_md internally.

Steps to Reproduce:

1. Install unstructured and langchain_community:
pip install unstructured langchain-community

2. Run the following code (the file used, sparql-language-ref.md, is attached):

from langchain_community.document_loaders import UnstructuredFileLoader
file_path = "sparql-language-ref.md"
loader = UnstructuredFileLoader(file_path, mode="elements", autodetect_encoding=True)
pages = loader.load()

3. Observed Error:
AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing'

Expected Behavior:
• The Markdown file should be successfully loaded and parsed into elements.
• If the file has processing instructions, they should be ignored or handled gracefully without causing a crash.
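Until processing instructions are handled gracefully, one possible workaround is to strip them from the Markdown text before loading. The sketch below is not part of unstructured or langchain_community. It assumes the crash is triggered by `<?...>` sequences in the file, which the error message suggests but this report does not confirm. The names `PI_PATTERN`, `strip_processing_instructions`, and `write_cleaned_copy` are hypothetical helpers.

```python
import re
import tempfile
from pathlib import Path

# Matches "<?" up to the next ">", e.g. <?xml version="1.0"?>.
# Assumption: sequences like this are what end up as lxml
# _ProcessingInstruction nodes once the Markdown is rendered to HTML.
PI_PATTERN = re.compile(r"<\?[^>]*>")


def strip_processing_instructions(text: str) -> str:
    """Return text with every <?...> sequence removed."""
    return PI_PATTERN.sub("", text)


def write_cleaned_copy(src_path: str) -> str:
    """Write a cleaned copy of a Markdown file and return the new path."""
    text = Path(src_path).read_text(encoding="utf-8", errors="replace")
    cleaned = strip_processing_instructions(text)
    tmp = tempfile.NamedTemporaryFile(
        "w", suffix=".md", delete=False, encoding="utf-8"
    )
    with tmp:  # closes the file; delete=False keeps it on disk
        tmp.write(cleaned)
    return tmp.name
```

The path returned by `write_cleaned_copy("sparql-language-ref.md")` can then be passed to UnstructuredFileLoader in place of the original file. Note that the pattern also removes matching sequences inside code blocks, so the loaded text may differ slightly from the source.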

Actual Behavior:
• The process crashes with an AttributeError in partition_md, specifically at:
while q and q[0].is_phrasing:
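To illustrate what "handled gracefully" could look like at that line, here is a minimal, self-contained sketch of a defensive check. Only the `while q and q[0].is_phrasing` condition comes from the traceback. The classes `FakeElement` and `FakeProcessingInstruction` and the function `take_phrasing_run` are hypothetical stand-ins, not unstructured's actual code or its eventual fix.

```python
from collections import deque


class FakeElement:
    """Hypothetical stand-in for a parsed HTML element."""

    def __init__(self, name, is_phrasing):
        self.name = name
        self.is_phrasing = is_phrasing


class FakeProcessingInstruction:
    """Hypothetical stand-in for a node that lacks `is_phrasing`."""

    def __init__(self, name):
        self.name = name


def take_phrasing_run(q):
    """Pop and return the leading run of phrasing nodes from q.

    Nodes without an `is_phrasing` attribute are treated as
    non-phrasing instead of raising AttributeError.
    """
    run = []
    while q and getattr(q[0], "is_phrasing", False):
        run.append(q.popleft())
    return run
```

With a queue such as `deque([FakeElement("b", True), FakeProcessingInstruction("xml")])`, the loop stops at the processing instruction rather than crashing, leaving the caller free to skip or discard that node.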

Environment Details:
• unstructured version: 0.16.23
• langchain_community version: 0.3.18
• Python Version: 3.10
• OS: Ubuntu 22.04 (WSL/Cloud-based environment)
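For anyone trying to reproduce this, the installed versions can be collected with the standard library. This is a small sketch; `installed_version` is a hypothetical helper, and lxml is included only because the error message names it.

```python
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed version of a package, or a placeholder."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


# Distribution names as used with pip.
for name in ("unstructured", "langchain-community", "lxml"):
    print(f"{name}: {installed_version(name)}")
```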

Would appreciate any insights or a workaround for this issue! Thanks! 🙌
