feat/chunk_elements #3921

Open
@Jimmy-web169

Description

When I call chunk_elements on my List[Element], a Table element is always combined with other elements into a CompositeElement. The same thing happens whenever the length of text_as_html exceeds the default maximum number of characters: the result is again a CompositeElement.

According to the official unstructured documentation for the chunk_elements function:

A single element that exceeds the hard-max is isolated (never combined with another element) and then divided into two or more chunks using text splitting.
A Table element is always isolated and never combined with another element. If a Table is oversized (exceeding the hard-max), it is divided into two or more TableChunk elements using text splitting.

Based on this, I expected that a Table element would never be combined with other elements into a CompositeElement.

One approach is to refactor the will_fit() method as follows:

def will_fit(self, element: Element) -> bool:
    # -- if the new element is a Table, it can only fit in an empty pre-chunk --
    if isinstance(element, Table):
        return len(self._elements) == 0

    # -- if the pre-chunk already contains a Table, no additional element should fit --
    if any(isinstance(e, Table) for e in self._elements):
        return False

    # -- an empty pre-chunk will accept any element (including an oversized element) --
    if len(self._elements) == 0:
        return True

    # -- a pre-chunk that already exceeds the soft-max is considered "full" --
    if self._text_length > self._opts.soft_max:
        return False

    # -- don't add an element if it would increase total size beyond the hard-max --
    return not self._remaining_space < len(element.text)

With this change, a Table fits only in an empty pre-chunk, and no other element fits in a pre-chunk that already contains a Table, so a Table is never combined with another element.
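To illustrate the intended behavior, here is a minimal standalone sketch of the same rule. It does not use the unstructured library: the Element, Table, and PreChunk classes, the group() helper, and the two-character separator are simplified, hypothetical stand-ins chosen only so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import List

SEPARATOR_LEN = 2  # assumption: elements in a pre-chunk are joined by a 2-char separator


@dataclass
class Element:
    """Hypothetical stand-in for a document element; only the text matters here."""
    text: str


class Table(Element):
    """Hypothetical stand-in for a table element."""


@dataclass
class PreChunk:
    """Simplified accumulator that applies the proposed will_fit() rule."""
    hard_max: int = 500
    soft_max: int = 500
    elements: List[Element] = field(default_factory=list)

    @property
    def text_length(self) -> int:
        # Combined text length, including separators between elements.
        n = len(self.elements)
        return sum(len(e.text) for e in self.elements) + SEPARATOR_LEN * max(n - 1, 0)

    @property
    def remaining_space(self) -> int:
        # Space left before the hard-max, reserving room for one more separator.
        return self.hard_max - self.text_length - SEPARATOR_LEN

    def will_fit(self, element: Element) -> bool:
        # A Table only fits in an empty pre-chunk.
        if isinstance(element, Table):
            return not self.elements
        # Nothing else fits once the pre-chunk holds a Table.
        if any(isinstance(e, Table) for e in self.elements):
            return False
        # An empty pre-chunk accepts any element, even an oversized one.
        if not self.elements:
            return True
        # A pre-chunk past the soft-max is considered full.
        if self.text_length > self.soft_max:
            return False
        # Otherwise the element fits only if it stays within the hard-max.
        return self.remaining_space >= len(element.text)


def group(elements: List[Element], hard_max: int = 500, soft_max: int = 500) -> List[List[Element]]:
    """Greedily pack elements into pre-chunks, starting a new one when will_fit() is False."""
    groups: List[List[Element]] = []
    current = PreChunk(hard_max, soft_max)
    for el in elements:
        if not current.will_fit(el):
            groups.append(current.elements)
            current = PreChunk(hard_max, soft_max)
        current.elements.append(el)
    if current.elements:
        groups.append(current.elements)
    return groups


sample = [Element("Intro"), Table("a | b"), Element("Outro"), Element("More")]
kinds = [[type(e).__name__ for e in g] for g in group(sample)]
print(kinds)  # [['Element'], ['Table'], ['Element', 'Element']]
```

In this sketch the Table ends up alone in its own group, while the two short elements after it are still combined as usual.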

If you have any suggestions, or if I have misunderstood something, please let me know.
Thanks!

Labels: enhancement