Commit a861ed8

feat(chunk): split tables on even row boundaries (#3504)
**Summary**

Use a more sophisticated algorithm for splitting oversized `Table` elements into `TableChunk` elements during chunking, so that element text and HTML are "synchronized" and the HTML is always parseable.

**Additional Context**

Table splitting now has the following characteristics:

- `TableChunk.metadata.text_as_html` is always a parseable HTML `<table>` subtree.
- `TableChunk.text` is always the text in the HTML version of the table fragment in `.metadata.text_as_html`. Text and HTML are "synchronized".
- The table is divided at a whole-row boundary whenever possible.
- A row is broken at an even-cell boundary when a single row is larger than the chunking window.
- A cell is broken at an even-word boundary when a single cell is larger than the chunking window.
- `.text_as_html` is "minified", removing all extraneous whitespace and unneeded elements or attributes. This maximizes the semantic "density" of each chunk.
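The whole-row case described in this commit message can be illustrated with a minimal sketch. This is not the code from this commit: the `Row` alias, `rows_to_text`, `rows_to_html`, and `split_on_rows` are hypothetical names, the table is assumed to be already parsed into lists of cell strings, and the window is measured in characters of text. The cell-boundary and word-boundary fallbacks for oversized rows are left out, so an oversized row is simply emitted on its own.

```python
from html import escape
from typing import Iterator, List, Tuple

Row = List[str]  # hypothetical stand-in for a parsed table row: one string per cell


def rows_to_text(rows: List[Row]) -> str:
    """Join all non-empty cell texts with single spaces."""
    return " ".join(cell for row in rows for cell in row if cell)


def rows_to_html(rows: List[Row]) -> str:
    """Render rows as a minified <table>: no whitespace between tags, no attributes."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


def split_on_rows(rows: List[Row], max_chars: int) -> Iterator[Tuple[str, str]]:
    """Yield (text, html) pairs, each built from the same whole rows.

    Rows accumulate until adding the next one would push the text past
    `max_chars`; the pending rows are then emitted as one fragment.
    """
    pending: List[Row] = []
    for row in rows:
        candidate = pending + [row]
        if pending and len(rows_to_text(candidate)) > max_chars:
            yield rows_to_text(pending), rows_to_html(pending)
            pending = [row]
        else:
            pending = candidate
    if pending:
        yield rows_to_text(pending), rows_to_html(pending)
```

Because each fragment's text and HTML are generated from the same list of rows, they cannot drift apart, and each HTML fragment is a complete `<table>` subtree rather than a slice of a longer string.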
1 parent 99f72d6 commit a861ed8

21 files changed, +998 -135 lines changed

Diff for: CHANGELOG.md

+2 -1
@@ -1,4 +1,4 @@
-## 0.15.6-dev0
+## 0.15.6-dev1
 
 ### Enhancements
 
@@ -7,6 +7,7 @@
 ### Fixes
 
 * **Update CI for `ingest-test-fixture-update-pr` to resolve NLTK model download errors.**
+* **Synchronized text and html on `TableChunk` splits.** When a `Table` element is divided during chunking to fit the chunking window, `TableChunk.text` corresponds exactly with the table text in `TableChunk.metadata.text_as_html`, `.text_as_html` is always parseable HTML, and the table is split on even row boundaries whenever possible.
 
 
 ## 0.15.5
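
The "synchronized" guarantee in the new changelog entry can be spot-checked from the outside. The sketch below is an assumption-laden illustration using only the Python standard library, not the library's own test: `html_table_text` and `is_synchronized` are hypothetical helpers, and whitespace is normalized on both sides before comparison because the exact separator the library uses between cells is not stated here. `HTMLParser` is lenient, so this checks text correspondence only and does not prove the HTML is well-formed.

```python
from html.parser import HTMLParser
from typing import List


class _TextCollector(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring all tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        words = data.split()
        if words:  # skip whitespace-only text nodes
            self.parts.append(" ".join(words))


def html_table_text(html: str) -> str:
    """Return the visible text of `html` with whitespace collapsed to single spaces."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def is_synchronized(text: str, text_as_html: str) -> bool:
    """True when `text` matches the text carried by `text_as_html`, modulo whitespace."""
    return " ".join(text.split()) == html_table_text(text_as_html)
```

In use, one would pass each chunk's `.text` and `.metadata.text_as_html` to `is_synchronized` and expect `True` for every `TableChunk`.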
