Describe the bug
The HtmlTable.from_html_text() method drops text content that follows <br/> tags when normalizing HTML tables. This causes loss of important content in table cells that contain line breaks.
To Reproduce
from unstructured.common.html_table import HtmlTable
html_text = """
<table>
<tr>
<td>This is 1st line.<br/>2nd line.<br/>3rd line.</td>
</tr>
</table>
"""
table = HtmlTable.from_html_text(html_text)
print(table.html)
Output:
<table><tr><td>This is 1st line.<br/><br/></td></tr></table>
Expected Output:
<table><tr><td>This is 1st line.<br/>2nd line.<br/>3rd line.</td></tr></table>
Expected behavior
The text content following <br/> tags should be preserved during HTML normalization. Currently, the tail text of <br/> elements is being removed, which results in loss of content.
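To illustrate why the content disappears, here is a minimal sketch of the text/tail model. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra dependencies; that substitution is my assumption for illustration, and the two libraries share the same text/tail model. The text after each <br/> is not part of the cell's own text but is stored as the tail of the <br/> element, so clearing every tail leaves only the first line.

```python
import xml.etree.ElementTree as ET

# Stand-in for a single parsed table cell (ElementTree, not lxml, for portability).
cell = ET.fromstring("<td>This is 1st line.<br/>2nd line.<br/>3rd line.</td>")

# Only the text before the first child belongs to the <td> itself.
print(repr(cell.text))

# Text after each <br/> is stored on that <br/> element as its tail.
tails = [br.tail for br in cell.iter("br")]
print(tails)

# Clearing every tail, as described above, drops the 2nd and 3rd lines.
for e in cell.iter():
    e.tail = None
remaining = "".join(cell.itertext())
print(remaining)
```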
Screenshots
No screenshots.
Environment Info
- unstructured version: 0.16.17
- Python version: 3.11
- OS: macOS
Additional context
The issue could likely be resolved by modifying the from_html_text() method to preserve the tail text of <br/> tags while normalizing whitespace, for example:
class HtmlTable:
    ...
    @classmethod
    def from_html_text(cls, html_text: str) -> "HtmlTable":
        ...
        # -- normalize br tag tail text
        if e.tag == "br":
            if e.tail:
                e.tail = " ".join(e.tail.split())
        else:
            # -- remove tails for non-br elements
            if e.tail:
                e.tail = None
        ...
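The proposal above can also be tried in isolation. The following is a rough, self-contained sketch, not the library's implementation: normalize_table is a hypothetical helper name, and it uses the standard library's xml.etree.ElementTree instead of lxml, so it only accepts well-formed markup and may serialize the empty tag as <br /> rather than <br/>.

```python
import xml.etree.ElementTree as ET


def normalize_table(html_text: str) -> str:
    """Hypothetical helper: compact a table while keeping text after <br/>."""
    table = ET.fromstring(html_text)
    for e in table.iter():
        # Collapse whitespace runs in the element's own text.
        if e.text:
            e.text = " ".join(e.text.split())
        if e.tag == "br":
            # The tail of a <br/> is cell content: normalize it but keep it.
            if e.tail:
                e.tail = " ".join(e.tail.split())
        else:
            # Tails of other elements are treated as indentation and dropped.
            e.tail = None
    return ET.tostring(table, encoding="unicode")


html_text = """
<table>
<tr>
<td>This is 1st line.<br/>2nd line.<br/>3rd line.</td>
</tr>
</table>
"""
result = normalize_table(html_text)
print(result)
```

With the reproduction input from this report, the printed table keeps "2nd line." and "3rd line." inside the cell. Dropping tails on every other element would still lose text that follows inline tags such as <b>, which is outside the scope of this report.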