feat: repeat row headings in each table chunk #3778

Open
@hardchor

Describe the bug
When chunking text that contains tables (using the by_title strategy), tables are split into chunks row by row if max_characters is set sufficiently low. That's great, and it aligns with the best practice of ideally putting each row in its own chunk. However, each row chunk then loses all context for the data in that row.
Since that context can usually be found in the table header (typically the first row), I am currently going through all rows manually and prepending the table header. I can provide code if needed, but it's not the prettiest solution, since I essentially have to parse the text_as_html output and then stitch it back together.

P.S.: I also couldn't get it to produce TableChunk elements, but maybe that's not intended behaviour in this case?

To Reproduce
Run ingestion on any document that contains a table, then chunk it using the by_title strategy with a sufficiently small max_characters value.

Expected behavior

  1. If the table header would end up in a chunk of its own, no chunk is produced for it.
  2. Each subsequent table-row chunk is prefixed with the table header.

Example of a row chunk with the header repeated:
<table>
  <thead>
    <tr>
      <th>property1</th>
      <th>property2</th>
      <th>property3</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>value1</td>
      <td>value2</td>
      <td>value3</td>
    </tr>
  </tbody>
</table>
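
The expected behavior above can be sketched as a post-processing step over a table's HTML. This is only a minimal illustration using the Python standard library, not how the library chunks tables internally. The function name `split_table_with_header` is hypothetical, and the sketch assumes the table markup is well-formed enough to parse as XML and that the header is the row inside `<thead>` (or, failing that, the first row).

```python
import xml.etree.ElementTree as ET


def split_table_with_header(table_html: str) -> list[str]:
    """Return one <table> string per body row, each repeating the header row.

    Hypothetical helper: assumes `table_html` is a single well-formed <table>.
    """
    table = ET.fromstring(table_html)

    # Drop whitespace-only text between tags so the output is compact.
    for el in table.iter():
        if el.text is not None and not el.text.strip():
            el.text = None
        if el.tail is not None and not el.tail.strip():
            el.tail = None

    rows = list(table.iter("tr"))
    if not rows:
        return []

    # Assumption: the header is the row in <thead>, else the first row.
    thead = table.find("thead")
    header = thead.find("tr") if thead is not None else None
    if header is None:
        header = rows[0]

    chunks = []
    for row in rows:
        if row is header:
            # The header never becomes a chunk of its own (expectation 1).
            continue
        chunk = ET.Element("table")
        ET.SubElement(chunk, "thead").append(header)
        ET.SubElement(chunk, "tbody").append(row)
        # Each row chunk carries the header with it (expectation 2).
        chunks.append(ET.tostring(chunk, encoding="unicode"))
    return chunks
```

Calling it on the table above would yield a single chunk containing both the `property1..3` header row and the `value1..3` data row. A table consisting only of a header row yields no chunks.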

Metadata

    Labels

    chunking (Related to element chunking.), enhancement (New feature or request)
