bug/br tag tail text loss #3899

Open
@K-Oxon

Describe the bug
The HtmlTable.from_html_text() method drops text content that follows <br/> tags when normalizing HTML tables. This causes loss of important content in table cells that contain line breaks.

To Reproduce

from unstructured.common.html_table import HtmlTable
html_text = """
<table>
<tr>
<td>This is 1st line.<br/>2nd line.<br/>3rd line.</td>
</tr>
</table>
"""
table = HtmlTable.from_html_text(html_text)
print(table.html)

Output:

<table><tr><td>This is 1st line.<br/><br/></td></tr></table>

Expected Output:

<table><tr><td>This is 1st line.<br/>2nd line.<br/>3rd line.</td></tr></table>

Expected behavior

The text content following <br/> tags should be preserved during HTML normalization. Currently, the tail text of <br/> elements is being removed, which results in loss of content.
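For readers unfamiliar with the text/tail model, here is a minimal sketch of why the content disappears. It uses the standard library's xml.etree.ElementTree as a stand-in (an assumption made so the snippet runs without extra dependencies; the library's own parser may behave differently in details). Text that follows a child element is stored on that child's `tail`, not on the parent's `text`.

```python
import xml.etree.ElementTree as ET

# The cell from the reproduction above; it happens to be well-formed XML.
cell = ET.fromstring("<td>This is 1st line.<br/>2nd line.<br/>3rd line.</td>")

# Only the text before the first <br/> lives on the <td> itself.
print(cell.text)  # This is 1st line.

# The remaining lines are the tails of the two <br/> elements.
br_tails = [br.tail for br in cell.iter("br")]
print(br_tails)  # ['2nd line.', '3rd line.']

# Clearing every tail therefore discards the 2nd and 3rd lines.
for e in cell.iter():
    e.tail = None
stripped = ET.tostring(cell, encoding="unicode")
print(stripped)
```

After the loop, the serialized cell no longer contains "2nd line." or "3rd line.", which matches the output reported above.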

Screenshots
No screenshots.

Environment Info

  • unstructured version: 0.16.17
  • Python version: 3.11
  • OS: MacOS

Additional context
One possible fix is to modify from_html_text() so that it preserves the tail text of <br/> elements, normalizing its whitespace instead of discarding it:

class HtmlTable:
    ...
    @classmethod
    def from_html_text(cls, html_text: str) -> "HtmlTable":
            ...
            # -- keep text that follows a <br/>, normalizing its whitespace
            if e.tag == "br":
                if e.tail:
                    e.tail = " ".join(e.tail.split())
            # -- remove tails for non-br elements
            elif e.tail:
                e.tail = None
            ...
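The proposal above can be tried in isolation with a self-contained sketch. `normalize_table` is a hypothetical helper written for this illustration, not part of unstructured, and it uses the standard library's xml.etree.ElementTree rather than the library's own parser, so it only approximates what from_html_text() does.

```python
import xml.etree.ElementTree as ET


def normalize_table(html_text: str) -> str:
    """Hypothetical stand-in for the normalization step (not the library's code)."""
    table = ET.fromstring(html_text.strip())
    for e in table.iter():
        # Collapse runs of whitespace inside element text.
        if e.text:
            e.text = " ".join(e.text.split())
        if e.tag == "br":
            # Keep the text that follows a <br/>, with whitespace normalized.
            if e.tail:
                e.tail = " ".join(e.tail.split())
        else:
            # Tails of all other elements are dropped, as in the proposal.
            e.tail = None
    return ET.tostring(table, encoding="unicode")


html_text = """
<table>
<tr>
<td>This is 1st line.<br/>2nd line.<br/>3rd line.</td>
</tr>
</table>
"""
result = normalize_table(html_text)
print(result)
```

With this change the 2nd and 3rd lines survive normalization. ElementTree writes empty elements as `<br />` (with a space), so the printed string differs from the expected output in that detail only. Note that this sketch, like the proposal, still drops tail text after any other inline element inside a cell.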
