Description
When I call chunk_elements on my List[Element], a Table element is always combined with other elements, producing a CompositeElement. The same thing happens whenever the text_as_html length exceeds the default maximum value: the result is again a CompositeElement.
According to the official unstructured documentation for the chunk_elements function:
A single element that exceeds the hard-max is isolated (never combined with another element) and then divided into two or more chunks using text splitting.
A Table element is always isolated and never combined with another element. If a Table is oversized (exceeding the hard-max), it is divided into two or more TableChunk elements using text splitting.
Based on this, I expected that a Table element would never be combined with other elements into a CompositeElement.
One approach is to refactor the will_fit() method as follows:
def will_fit(self, element: Element) -> bool:
    # -- if the new element is a Table, it can only fit in an empty pre-chunk --
    if isinstance(element, Table):
        return len(self._elements) == 0
    # -- if the pre-chunk already contains a Table, no additional element should fit --
    if any(isinstance(e, Table) for e in self._elements):
        return False
    # -- an empty pre-chunk will accept any element (including an oversized element) --
    if len(self._elements) == 0:
        return True
    # -- a pre-chunk that already exceeds the soft-max is considered "full" --
    if self._text_length > self._opts.soft_max:
        return False
    # -- don't add an element if it would increase total size beyond the hard-max --
    return not self._remaining_space < len(element.text)
With this change, a Table element fits only in an empty pre-chunk, and a pre-chunk that already holds a Table accepts no further elements, so a Table is always isolated.
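To check the proposed logic in isolation, here is a minimal sketch that does not import unstructured at all. The `Element`, `Table`, and `PreChunkBuilder` classes and the `group()` helper are hypothetical stand-ins I wrote for illustration, the limits are arbitrary, and text length is simplified to a plain sum (no separators), so this only demonstrates the will_fit() rules, not the library's real pre-chunker.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """Hypothetical stand-in for unstructured's Element; only `text` is modeled."""
    text: str


class Table(Element):
    """Hypothetical stand-in for unstructured's Table element."""


@dataclass
class PreChunkBuilder:
    """Simplified stand-in for the pre-chunk builder; limits are assumed values."""
    soft_max: int = 50
    hard_max: int = 100
    elements: List[Element] = field(default_factory=list)

    @property
    def _text_length(self) -> int:
        # Simplification: ignores the separators the real library accounts for.
        return sum(len(e.text) for e in self.elements)

    @property
    def _remaining_space(self) -> int:
        return self.hard_max - self._text_length

    def will_fit(self, element: Element) -> bool:
        # A Table only fits in an empty pre-chunk.
        if isinstance(element, Table):
            return len(self.elements) == 0
        # A pre-chunk that already holds a Table accepts nothing else.
        if any(isinstance(e, Table) for e in self.elements):
            return False
        # An empty pre-chunk accepts any element, even an oversized one.
        if len(self.elements) == 0:
            return True
        # A pre-chunk past the soft-max is considered full.
        if self._text_length > self.soft_max:
            return False
        # Otherwise the element must fit under the hard-max.
        return self._remaining_space >= len(element.text)


def group(elements: List[Element], **limits: int) -> List[List[Element]]:
    """Greedily pack elements, starting a new pre-chunk when will_fit() is False."""
    groups: List[List[Element]] = []
    builder = PreChunkBuilder(**limits)
    for element in elements:
        if not builder.will_fit(element):
            groups.append(builder.elements)
            builder = PreChunkBuilder(**limits)
        builder.elements.append(element)
    if builder.elements:
        groups.append(builder.elements)
    return groups


sample = [Element("Intro paragraph."), Table("a | b"), Element("Closing paragraph.")]
print([[type(e).__name__ for e in g] for g in group(sample)])
# → [['Element'], ['Table'], ['Element']]
```

In this sketch the Table ends up alone in its own group, while two short text elements with no Table between them would still be combined, which is the behavior I expected from the documentation.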
If you have any suggestions, or if I have misunderstood something, please don't hesitate to let me know.
Thanks!