Commit 82c8adb

scanny and shreyanid authored
fix: split-chunks appear out-of-order (#1824)
**Executive Summary.** Code inspection in preparation for adding the chunk-overlap feature revealed a bug causing split-chunks to be inserted out-of-order. For example, elements like this:

```
Text("One" + 400 chars)
Text("Two" + 400 chars)
Text("Three" + 600 chars)
Text("Four" + 400 chars)
Text("Five" + 600 chars)
```

should produce chunks:

```
CompositeElement("One ...")            # (400 chars)
CompositeElement("Two ...")            # (400 chars)
CompositeElement("Three ...")          # (500 chars)
CompositeElement("rest of Three ...")  # (100 chars)
CompositeElement("Four ...")           # (400 chars)
CompositeElement("Five ...")           # (500 chars)
CompositeElement("rest of Five ...")   # (100 chars)
```

but produced this instead:

```
CompositeElement("Five ...")           # (500 chars)
CompositeElement("rest of Five ...")   # (100 chars)
CompositeElement("Three ...")          # (500 chars)
CompositeElement("rest of Three ...")  # (100 chars)
CompositeElement("One ...")            # (400 chars)
CompositeElement("Two ...")            # (400 chars)
CompositeElement("Four ...")           # (400 chars)
```

This PR fixes that behavior, which was introduced on Oct 9 this year in commit f98d5e6 when adding chunk splitting.

**Technical Summary**

The essential transformation of chunking is:

```
elements          sections              chunks
List[Element] -> List[List[Element]] -> List[CompositeElement]
```

1. The _sectioner_ (`_split_elements_by_title_and_table()`) _groups_ semantically-related elements into _sections_ (`List[Element]`). In the best case, a section is a title (heading) and the text that follows it (until the next title). A heading and its text is often referred to as a _section_ in publishing parlance, hence the name.

2. The _chunker_ (`chunk_by_title()` currently) does two things:

   1. First it _consolidates_ the elements of each section into a single `CompositeElement` object (a "chunk"). This includes both joining the element text into a single string and consolidating the metadata of the section elements.
   2. Then, if necessary, it _splits_ the chunk into two or more `CompositeElement` objects when the consolidated text is too long to fit in the specified window (`max_characters`).

Chunk splitting is only required when a single element (like a big paragraph) has text longer than the specified window. Otherwise a section and the chunk that derives from it reflect an even element boundary.

`chunk_by_title()` was elaborated in commit f98d5e6 to add this "chunk-splitting" behavior. At the time there was some notion of wanting to "split from the end backward" such that any small remainder chunk would appear first and could possibly be combined with a small prior chunk. To accomplish this, split chunks were _inserted_ at the beginning of the list instead of _appended_ to the end.

The `chunked_elements` variable (`List[CompositeElement]`) holds the sequence of chunks that result from the chunking operation and is the returned value for `chunk_by_title()`. The "split-from-the-end" chunks were inserted at the beginning of this list. Because it is the "all-chunks-in-document" list, not a sublist just for the chunk being split, the insertion moved the split pieces ahead of every chunk produced so far, which is what produces the out-of-order behavior.

Further, the "split-from-the-end" behavior can produce no benefit because chunks are never combined. Only _elements_ are combined (across semantic boundaries into a single section when a section is small), and sectioning occurs _prior_ to chunking.

The fix is to rework the chunk-splitting passage into a straightforward iterative algorithm that works both when a chunk must be split and when it need not be. This algorithm is also very easily extended to implement split-chunk-overlap, which is coming up in an immediately following PR.

```python
# -- split chunk into CompositeElement objects maxlen or smaller --
text_len = len(text)
start = 0
remaining = text_len

while remaining > 0:
    # -- `end` never runs past the end of the text --
    end = min(start + max_characters, text_len)
    chunked_elements.append(CompositeElement(text=text[start:end], metadata=metadata))
    # -- next piece starts where this one ended --
    start = end
    remaining = text_len - end
```

**Forensic analysis**

The out-of-order-chunks behavior was introduced in commit 4ea7168 on 10/09/2023 in the same PR in which chunk-splitting was introduced.

---------

Co-authored-by: Shreya Nidadavolu <[email protected]>
Co-authored-by: shreyanid <[email protected]>
1 parent ce40cdc commit 82c8adb

4 files changed: +32 -23 lines changed

Diff for: CHANGELOG.md

+3 -3

@@ -1,4 +1,4 @@
-## 0.10.25-dev9
+## 0.10.25

 ### Enhancements

@@ -19,10 +19,10 @@ ocr agent tesseract/paddle in environment variable `OCR_AGENT` for OCRing the en
 * **Fix chunks breaking on regex-metadata matches.** Fixes "over-chunking" when `regex_metadata` was used, where every element that contained a regex-match would start a new chunk.
 * **Fix regex-metadata match offsets not adjusted within chunk.** Fixes incorrect regex-metadata match start/stop offset in chunks where multiple elements are combined.
 * **Map source cli command configs when destination set** Due to how the source connector is dynamically called when the destination connector is set via the CLI, the configs were being set incorrectoy, causing the source connector to break. The configs were fixed and updated to take into account Fsspec-specific connectors.
-* **Fix metrics folder not discoverable** Fixes issue where unstructured/metrics folder is not discoverable on PyPI by adding
-an `__init__.py` file under the folder.
+* **Fix metrics folder not discoverable** Fixes issue where unstructured/metrics folder is not discoverable on PyPI by adding an `__init__.py` file under the folder.
 * **Fix a bug when `parition_pdf` get `model_name=None`** In API usage the `model_name` value is `None` and the `cast` function in `partition_pdf` would return `None` and lead to attribution error. Now we use `str` function to explicit convert the content to string so it is garanteed to have `starts_with` and other string functions as attributes
 * **Fix html partition fail on tables without `tbody` tag** HTML tables may sometimes just contain headers without body (`tbody` tag)
+* **Fix out-of-order sequencing of split chunks.** Fixes behavior where "split" chunks were inserted at the beginning of the chunk sequence. This would produce a chunk sequence like [5a, 5b, 3a, 3b, 1, 2, 4] when sections 3 and 5 exceeded `max_characters`.

 ## 0.10.24
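The `[5a, 5b, 3a, 3b, 1, 2, 4]` ordering described in the changelog entry can be reproduced with a minimal sketch that uses plain strings in place of `CompositeElement` objects. The function names `split_prepending` and `split_appending` are hypothetical and exist only for this illustration; the loop bodies mirror the removed and added code in `unstructured/chunking/title.py`.

```python
from typing import List


def split_prepending(sections: List[str], max_characters: int) -> List[str]:
    """Pre-fix behavior: split pieces are inserted at the front of the shared list."""
    chunks: List[str] = []
    for text in sections:
        if len(text) > max_characters:
            # -- carve pieces off the end, prepending each to the whole-document list --
            while len(text) > 0:
                if len(text) <= max_characters:
                    piece, text = text, ""
                else:
                    piece, text = text[-max_characters:], text[:-max_characters]
                chunks.insert(0, piece)
        else:
            chunks.append(text)
    return chunks


def split_appending(sections: List[str], max_characters: int) -> List[str]:
    """Post-fix behavior: every piece is appended, so document order is preserved."""
    chunks: List[str] = []
    for text in sections:
        text_len = len(text)
        start = 0
        remaining = text_len
        while remaining > 0:
            end = min(start + max_characters, text_len)
            chunks.append(text[start:end])
            start = end
            remaining = text_len - end
    return chunks


# -- sections 3 and 5 are one character longer than the 5-character window --
sections = ["aaaa", "bbbb", "cccccC", "dddd", "eeeeeE"]
before_fix = split_prepending(sections, max_characters=5)
after_fix = split_appending(sections, max_characters=5)
print(before_fix)  # ['e', 'eeeeE', 'c', 'ccccC', 'aaaa', 'bbbb', 'dddd']
print(after_fix)   # ['aaaa', 'bbbb', 'ccccc', 'C', 'dddd', 'eeeee', 'E']
```

Note that in the pre-fix output the pieces of each split section stay in text order relative to each other; it is their position relative to the other sections that is wrong.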

Diff for: test_unstructured/chunking/test_title.py

+18

@@ -23,6 +23,24 @@
 from unstructured.partition.html import partition_html


+def test_it_splits_a_large_section_into_multiple_chunks():
+    elements: List[Element] = [
+        Title("Introduction"),
+        Text(
+            "Lorem ipsum dolor sit amet consectetur adipiscing elit. In rhoncus ipsum sed lectus"
+            " porta volutpat."
+        ),
+    ]
+
+    chunks = chunk_by_title(elements, combine_text_under_n_chars=50, max_characters=50)
+
+    assert chunks == [
+        CompositeElement("Introduction"),
+        CompositeElement("Lorem ipsum dolor sit amet consectetur adipiscing "),
+        CompositeElement("elit. In rhoncus ipsum sed lectus porta volutpat."),
+    ]
+
+
 def test_split_elements_by_title_and_table():
     elements: List[Element] = [
         Title("A Great Day"),
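The new test pins one concrete split. As a complement, the following sketch checks two properties one would expect of the splitting loop for any window size: the pieces concatenate back to the original text, and no piece exceeds `max_characters`. It operates on plain strings rather than `CompositeElement` objects, and `split_text` is a hypothetical stand-in for the loop inside `chunk_by_title()`, not a function in the library.

```python
from typing import List


def split_text(text: str, max_characters: int) -> List[str]:
    """Split `text` into consecutive pieces of at most `max_characters` each."""
    pieces: List[str] = []
    text_len = len(text)
    start = 0
    remaining = text_len
    while remaining > 0:
        end = min(start + max_characters, text_len)
        pieces.append(text[start:end])
        start = end
        remaining = text_len - end
    return pieces


# -- same 99-character paragraph used in the test above --
text = (
    "Lorem ipsum dolor sit amet consectetur adipiscing elit. In rhoncus ipsum sed lectus"
    " porta volutpat."
)

for max_characters in (1, 7, 50, 99, 200):
    pieces = split_text(text, max_characters)
    # -- nothing is lost, duplicated, or reordered --
    assert "".join(pieces) == text
    # -- every piece is non-empty and fits the window --
    assert all(0 < len(piece) <= max_characters for piece in pieces)

halves = split_text(text, 50)
```

With a window of 50, `halves` holds the same two strings the test expects for the paragraph (50 and 49 characters).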

Diff for: unstructured/__version__.py

+1 -1

@@ -1 +1 @@
-__version__ = "0.10.25-dev9"  # pragma: no cover
+__version__ = "0.10.25"  # pragma: no cover

Diff for: unstructured/chunking/title.py

+10 -19

@@ -152,25 +152,16 @@ def chunk_by_title(
                 chunk_matches.extend(matches)
                 chunk_regex_metadata[regex_name] = chunk_matches

-        # Check if text exceeds max_characters
-        if len(text) > max_characters:
-            # Chunk the text from the end to the beginning
-            while len(text) > 0:
-                if len(text) <= max_characters:
-                    # If the remaining text is shorter than max_characters
-                    # create a chunk from the beginning
-                    chunk_text = text
-                    text = ""
-                else:
-                    # Otherwise, create a chunk from the end
-                    chunk_text = text[-max_characters:]
-                    text = text[:-max_characters]
-
-                # Prepend the chunk to the beginning of the list
-                chunked_elements.insert(0, CompositeElement(text=chunk_text, metadata=metadata))
-        else:
-            # If it doesn't exceed, create a single CompositeElement
-            chunked_elements.append(CompositeElement(text=text, metadata=metadata))
+        # -- split chunk into CompositeElements objects maxlen or smaller --
+        text_len = len(text)
+        start = 0
+        remaining = text_len
+
+        while remaining > 0:
+            end = min(start + max_characters, text_len)
+            chunked_elements.append(CompositeElement(text=text[start:end], metadata=metadata))
+            start = end
+            remaining = text_len - end

     return chunked_elements
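The commit message notes that this loop is easily extended to split-chunk overlap in a follow-up PR. A minimal sketch of what that extension might look like on plain strings follows. The function name `split_with_overlap`, the `overlap` parameter, and the argument validation are assumptions made for illustration; the follow-up PR may differ in all of them.

```python
from typing import List


def split_with_overlap(text: str, max_characters: int, overlap: int = 0) -> List[str]:
    """Split `text` into windows of at most `max_characters`.

    Each window after the first repeats the last `overlap` characters of the
    previous one. Hypothetical helper, not part of the library.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        # -- overlap >= max_characters would stop the window from advancing --
        raise ValueError("overlap must be >= 0 and smaller than max_characters")

    pieces: List[str] = []
    text_len = len(text)
    start = 0
    remaining = text_len
    while remaining > 0:
        end = min(start + max_characters, text_len)
        pieces.append(text[start:end])
        # -- back up so the next window re-reads the tail of this one --
        start = end - overlap
        remaining = text_len - end
    return pieces


no_overlap = split_with_overlap("abcdefghij", max_characters=4)
with_overlap = split_with_overlap("abcdefghij", max_characters=4, overlap=1)
print(no_overlap)    # ['abcd', 'efgh', 'ij']
print(with_overlap)  # ['abcd', 'defg', 'ghij']
```

With `overlap=0` the function reduces to the loop added in this commit. The loop terminates because `end` grows by at least `max_characters - overlap` on every pass until it reaches `text_len`.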