Large HTML documents cannot be partitioned using the partition_html function. #4289

@ratzrattillo

Description

Partitioning a large HTML document with partition_html returns an empty result. The cause is that the HTML parser is created without the huge_tree option in:

https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/html/parser.py#L929

Fix:
Pass huge_tree=True when constructing the parser: etree.HTMLParser(remove_comments=True, huge_tree=True)
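
A minimal sketch of the proposed change, outside the library. The helper names make_html_parser and parse_html and the sample document are hypothetical and only illustrate the parser construction; they are not taken from unstructured's code. The sample is deliberately tiny, so it shows the call shape rather than reproducing the failure on a large file.

```python
from lxml import etree


def make_html_parser(huge_tree: bool = True) -> etree.HTMLParser:
    """Build an lxml HTML parser (hypothetical helper).

    huge_tree=True asks libxml2 to lift its built-in limits on tree depth
    and text length, which is the change proposed in this issue.
    """
    return etree.HTMLParser(remove_comments=True, huge_tree=huge_tree)


def parse_html(html_text: str, huge_tree: bool = True):
    """Parse an HTML string and return the root element (hypothetical helper)."""
    parser = make_html_parser(huge_tree=huge_tree)
    return etree.fromstring(html_text.encode("utf-8"), parser)


# Small stand-in document; a real reproduction would need a very large file.
sample = "<html><body><!-- dropped --><p>Hello</p><p>World</p></body></html>"
root = parse_html(sample)
paragraphs = [p.text for p in root.iter("p")]
```

Note that lxml documents huge_tree as disabling security restrictions, so it may be worth making it configurable rather than always on when the input is untrusted.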

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working)
