Commit a920e55

natygyoon and cragwolfe authored
fix: remove comments when parsing XML or HTML (#210)
* Update xml.py remove comments while parsing
* change logged in CHANGLOG and editted version
* make tidy
* editted version
* new version 0.4.8-dev1
* editted version
* Update CHANGELOG.md

Co-authored-by: cragwolfe <[email protected]>

---------

Co-authored-by: cragwolfe <[email protected]>
1 parent 962de78 commit a920e55

3 files changed: +10 -2 lines changed

CHANGELOG.md

+4
@@ -1,3 +1,7 @@
+## 0.4.8
+
+* Modified XML and HTML parsers not to load comments.
+
 ## 0.4.7
 
 * Added the ability to pull an HTML document from a url in `partition_html`.

unstructured/__version__.py

+1-1
@@ -1 +1 @@
-__version__ = "0.4.7"  # pragma: no cover
+__version__ = "0.4.8"  # pragma: no cover

unstructured/documents/xml.py

+5-1
@@ -32,7 +32,11 @@ def __init__(
         are using a stylesheet, you likely want the XMLParser.
         """
         if not parser:
-            parser = etree.XMLParser() if stylesheet else etree.HTMLParser()
+            parser = (
+                etree.XMLParser(remove_comments=True)
+                if stylesheet
+                else etree.HTMLParser(remove_comments=True)
+            )

         self.stylesheet = stylesheet
         self.parser = parser
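
For reviewers who want to see the effect of `remove_comments=True` in isolation, here is a minimal sketch using lxml directly rather than the project's own classes. The sample documents and the `count_comments` helper are illustrative assumptions and are not part of this commit.

```python
from lxml import etree

# Hypothetical sample inputs, each containing exactly one comment.
HTML_DOC = b"<html><body><!-- internal note --><p>Visible text</p></body></html>"
XML_DOC = b"<root><!-- internal note --><item>Visible text</item></root>"


def count_comments(document: bytes, parser) -> int:
    """Parse the document with the given parser and count comment nodes."""
    root = etree.fromstring(document, parser)
    # Passing etree.Comment to iter() restricts iteration to comment nodes.
    return sum(1 for _ in root.iter(etree.Comment))


# Parsers as constructed before this change: comments stay in the tree.
html_default = count_comments(HTML_DOC, etree.HTMLParser())
xml_default = count_comments(XML_DOC, etree.XMLParser())

# Parsers as constructed after this change: comments are dropped at parse time.
html_stripped = count_comments(HTML_DOC, etree.HTMLParser(remove_comments=True))
xml_stripped = count_comments(XML_DOC, etree.XMLParser(remove_comments=True))

print(html_default, html_stripped, xml_default, xml_stripped)
```

Note that the diff only changes the default parser. A caller that passes its own `parser` argument still gets whatever comment handling that parser was configured with.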
