Describe the bug
HTML and XML code blocks in markdown are not parsed properly.
Results:
HTML Example
```html
Hello, World!
This is a simple HTML example.
```
XML Example
xml <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
```xml
```
```xml
```
- HTML tags are not preserved.
- XML code is malformed. The blank lines may erase the context.
- The `<?xml version='1.0' encoding='UTF-8'?>` line breaks the parser:
```
Traceback (most recent call last):
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/test.py", line 14, in <module>
    elems = partition_html(
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/documents/elements.py", line 605, in wrapper
    elements = func(*args, **kwargs)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 706, in wrapper
    elements = func(*args, **kwargs)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 662, in wrapper
    elements = func(*args, **kwargs)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/partition.py", line 103, in partition_html
    elements = list(
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/lang.py", line 475, in apply_lang_metadata
    elements = list(elements)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/partition.py", line 222, in iter_elements
    yield from cls(opts)._iter_elements()
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/partition.py", line 229, in _iter_elements
    for e in self._main.iter_elements():
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/parser.py", line 361, in iter_elements
    yield from self._element_from_text_or_tail(block_item.tail or "", q)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/parser.py", line 377, in _element_from_text_or_tail
    for node in self._iter_text_segments(text, q):
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/parser.py", line 421, in _iter_text_segments
    while q and q[0].is_phrasing:
AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing'
```
To Reproduce
## HTML Example
```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Sample HTML</title>
</head>
<body>
<h1>Hello, World!</h1>
<p>This is a simple HTML example.</p>
</body>
</html>
```
## XML Example
```xml
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
```
```xml
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
```
```xml
<?xml version='1.0' encoding='UTF-8'?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
```
Expected behavior
The content of code blocks should be preserved exactly as written.
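To make the expectation concrete: once the Markdown has been converted to HTML, a code block's content needs to arrive as escaped text inside a `<pre><code>` element, so that an HTML parser sees text rather than live tags or an XML declaration. The sketch below illustrates that idea using only the standard library. `FENCE_RE` and `escape_fenced_blocks` are hypothetical names made up for this example, and this is not how unstructured or Python-Markdown implement it.

```python
import html
import re

# Opening fence (three backticks plus optional language tag), body, closing fence.
FENCE_RE = re.compile(r"^`{3}(\w*)[ \t]*\n(.*?)^`{3}[ \t]*$", re.MULTILINE | re.DOTALL)


def escape_fenced_blocks(md_text: str) -> str:
    """Turn each fenced block into an escaped <pre><code> element (illustrative only)."""

    def _to_pre(match: re.Match) -> str:
        lang, body = match.group(1), match.group(2)
        cls = f' class="language-{lang}"' if lang else ""
        # Escaping <, > and & keeps tags and declarations from being parsed as markup.
        return f"<pre><code{cls}>{html.escape(body, quote=False)}</code></pre>"

    return FENCE_RE.sub(_to_pre, md_text)


fence = "`" * 3  # built at runtime so this example contains no nested fence
xml_body = "<?xml version='1.0' encoding='UTF-8'?>\n<note>\n\n<to>Tove</to>\n</note>\n"
converted = escape_fenced_blocks(f"## XML Example\n\n{fence}xml\n{xml_body}{fence}\n")
print(converted)
```

Unescaping the text of the resulting `<code>` element gives back the original block, including the XML declaration and the blank line.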
Environment Info
0.15.7
Additional context
Since Markdown is first converted to HTML, adding `extensions=['fenced_code']`
to the Markdown parser call solves the issue. A better option would be to make the extensions list a configurable parameter.
Relevant code: unstructured/unstructured/partition/md.py, line 109 in f440eb4
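The suggested fix can be tried directly against Python-Markdown. This is a minimal sketch, assuming the `markdown` package is installed. `md_to_html` is a hypothetical helper written for this example, not the code in md.py, and its `extensions` parameter only illustrates what a configurable list could look like.

```python
from typing import List, Optional

import markdown


def md_to_html(text: str, extensions: Optional[List[str]] = None) -> str:
    """Convert Markdown to HTML with a caller-supplied extension list (hypothetical helper)."""
    return markdown.markdown(text, extensions=extensions or [])


fence = "`" * 3  # built at runtime so this example contains no nested fence
sample = f"## HTML Example\n\n{fence}html\n<h1>Hello, World!</h1>\n{fence}\n"

# With fenced_code enabled, the block becomes <pre><code> and its tags are
# HTML-escaped, so a downstream HTML parser sees them as text.
html_out = md_to_html(sample, extensions=["fenced_code"])
print(html_out)
```

Python-Markdown only recognizes fenced blocks when this extension is enabled, which is consistent with the results reported above.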