Description
I’m encountering an issue when using UnstructuredFileLoader to process a Markdown (.md) file.
The loader throws an AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing' when calling partition_md internally.
Steps to Reproduce:
1. Install unstructured and langchain_community:
pip install unstructured langchain-community
2. Run the following code (the input file, sparql-language-ref.md, is attached):
from langchain_community.document_loaders import UnstructuredFileLoader
file_path = "sparql-language-ref.md"
loader = UnstructuredFileLoader(file_path, mode="elements", autodetect_encoding=True)
pages = loader.load()
3. Observed Error:
AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing'
Expected Behavior:
• The Markdown file should be successfully loaded and parsed into elements.
• If the file has processing instructions, they should be ignored or handled gracefully without causing a crash.
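To illustrate the "ignored" option, here is a minimal pre-processing sketch. It assumes the crash is triggered by single-line <?...?> sequences in the Markdown source, which I have not confirmed is the only trigger. strip_processing_instructions is a hypothetical helper of my own, not part of unstructured or langchain_community.

```python
import re

# Assumption: processing instructions appear as single-line <?...?> sequences.
# The pattern refuses to cross '<', '>' or newlines so it cannot swallow
# unrelated text such as SPARQL variables (?x) elsewhere in the file.
_PI_PATTERN = re.compile(r"<\?[^<>\n]*\?>")


def strip_processing_instructions(text: str) -> str:
    """Return text with every <?...?> sequence removed (hypothetical helper)."""
    return _PI_PATTERN.sub("", text)


sample = 'Intro\n<?xml version="1.0"?>\nSELECT ?x WHERE { ?x ?p ?o }'
cleaned = strip_processing_instructions(sample)
```

The cleaned text could then be written to a temporary .md file and passed to UnstructuredFileLoader as in the reproduction code. Note that this also removes processing instructions that appear inside code examples, which may not be desirable for every document.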
Actual Behavior:
• The process crashes with an AttributeError in partition_md, specifically at:
while q and q[0].is_phrasing:
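For discussion, this is the kind of guard I would expect around that loop. It is only a sketch with stand-in classes: FakeElement, FakeProcessingInstruction, and take_phrasing are hypothetical names, and I am assuming q is a queue of parsed nodes, which I have not verified against the unstructured source.

```python
from collections import deque


class FakeElement:
    """Stand-in for a parsed element that exposes is_phrasing."""

    def __init__(self, is_phrasing: bool):
        self.is_phrasing = is_phrasing


class FakeProcessingInstruction:
    """Stand-in for lxml's _ProcessingInstruction: it has no is_phrasing."""


def take_phrasing(q: deque) -> list:
    """Pop leading phrasing nodes, discarding nodes that lack is_phrasing."""
    taken = []
    while q:
        node = q[0]
        if not hasattr(node, "is_phrasing"):
            q.popleft()  # skip processing instructions instead of crashing
            continue
        if not node.is_phrasing:
            break  # first non-phrasing element ends the run
        taken.append(q.popleft())
    return taken


q = deque([FakeElement(True), FakeProcessingInstruction(),
           FakeElement(True), FakeElement(False)])
phrasing_nodes = take_phrasing(q)
```

With this guard the processing instruction is dropped silently and the two phrasing elements around it are still collected.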
Environment Details:
• unstructured Version: 0.16.23
• langchain_community Version: 0.3.18
• Python Version: 3.10
• OS: Ubuntu 22.04 (WSL/Cloud-based environment)
Would appreciate any insights or a workaround for this issue! Thanks! 🙌