Describe the bug
The HtmlTable.from_html_text() method drops text content that follows <br/> tags when normalizing HTML tables. This causes loss of important content in table cells that contain line breaks.
To Reproduce
from unstructured.common.html_table import HtmlTable
html_text = """
<table>
<tr>
<td>This is 1st line.<br/>2nd line.<br/>3rd line.</td>
</tr>
</table>
"""
table = HtmlTable.from_html_text(html_text)
print(table.html)
Output:
<table><tr><td>This is 1st line.<br/><br/></td></tr></table>
Expected Output:
<table><tr><td>This is 1st line.<br/>2nd line.<br/>3rd line.</td></tr></table>
Expected behavior
The text content following <br/> tags should be preserved during HTML normalization. Currently, the tail text of <br/> elements is being removed, which results in loss of content.
Screenshots
No screenshots.
Environment Info
- unstructured version: 0.16.17
- Python version: 3.11
- OS: macOS
Additional context
One possible fix is to modify from_html_text() so that it preserves the tail text of <br/> elements (with whitespace normalized) and only removes the tail text of other elements:
class HtmlTable:
    ...
    @classmethod
    def from_html_text(cls, html_text: str) -> 'CustomHtmlTable':
        ...
        # -- normalize br tag tail text
        if e.tag == "br":
            if e.tail:
                e.tail = " ".join(e.tail.split())
        else:
            # -- remove tails for non-br elements
            if e.tail:
                e.tail = None
        ...
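The idea can be tried in isolation with the sketch below. It uses the standard library's xml.etree.ElementTree as a stand-in parser, on the assumption that its .text/.tail model matches what from_html_text() works with. normalize_tails is a hypothetical helper name, not part of unstructured, and the sketch only touches tail text, not the rest of the normalization.

```python
import xml.etree.ElementTree as ET


def normalize_tails(root: ET.Element) -> None:
    """Keep (whitespace-collapsed) tail text after <br>; drop it elsewhere.

    Hypothetical helper for illustration only, not the library's code.
    """
    for e in root.iter():
        if e.tag == "br":
            if e.tail:
                # -- text after <br/> is real cell content: collapse whitespace, keep it
                e.tail = " ".join(e.tail.split())
        elif e.tail:
            # -- tails of other elements are treated as formatting and removed
            e.tail = None


html_text = """
<table>
  <tr>
    <td>This is 1st line.<br/>2nd line.<br/>
      3rd   line.</td>
  </tr>
</table>
"""

root = ET.fromstring(html_text.strip())
normalize_tails(root)

# Tail text of each <br> element after normalization
br_tails = [br.tail for br in root.iter("br")]
print(br_tails)
```

With this handling, both pieces of text that follow a <br/> survive, while the inter-element whitespace after <td> and <tr> is still removed.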