Description
Problem
The current text-splitting behavior breaks a chunk in the middle of a sentence whenever the text exceeds the maximum character count. It is nice that it won't break up words, because it uses text_splitting_separators, which default to "\n" and " ", but I would like it to avoid breaking up sentences whenever it can.
Example
string = "This is a test string. This is the second sentence. Here is the third."
If max_characters = 30 (the 30th character is the "s" in "is" in the second sentence):
CURRENT TEXT-SPLITTING
string_1 = "This is a test string. This is"
string_2 = "the second sentence. Here is"
string_3 = "the third."
But I would like:
PREFERRED TEXT-SPLITTING
string_1 = "This is a test string."
string_2 = "This is the second sentence."
string_3 = "Here is the third."
This kind of splitting is also in line with the idea of unstructured.io -- trying to maintain sections and semantic meaning as best as possible in chunks, constrained by various parameters. It is clearly better in this case to text-split in this way.
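To make the difference concrete, here is a minimal, self-contained sketch of separator-preference splitting. It is not the library's implementation: `split_text` and its parameters are hypothetical names used only for illustration, and the sketch assumes that a non-whitespace separator such as "." stays attached to the end of the chunk it closes.

```python
def split_text(text, max_characters, separators=("\n", " ")):
    """Split text into chunks of at most max_characters.

    Hypothetical helper for illustration only. Separators are tried in
    order of preference; the last occurrence that still fits wins.
    """
    chunks = []
    remaining = text.strip()
    while len(remaining) > max_characters:
        for sep in separators:
            # Non-whitespace part of the separator (e.g. ".") stays in the chunk.
            kept = sep.rstrip()
            # Largest search end such that the chunk, including `kept`, fits.
            end = max_characters - len(kept) + len(sep)
            idx = remaining.rfind(sep, 0, end)
            if idx != -1:
                cut = idx + len(sep)
                break
        else:
            # No separator fits: fall back to a hard split at the limit.
            cut = max_characters
        chunks.append(remaining[:cut].strip())
        remaining = remaining[cut:].strip()
    if remaining:
        chunks.append(remaining)
    return chunks


text = "This is a test string. This is the second sentence. Here is the third."

# Default separators reproduce the mid-sentence breaks described above.
print(split_text(text, 30))
# ['This is a test string. This is', 'the second sentence. Here is', 'the third.']

# Preferring "." over " " keeps each sentence whole.
print(split_text(text, 30, separators=("\n", ".", " ")))
# ['This is a test string.', 'This is the second sentence.', 'Here is the third.']
```

One caveat of this sketch: a bare "." also matches inside things like "3.14" or "unstructured.io", so a real default would probably need something more careful than a single character.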
Solution
This text_splitting_separators argument is already a keyword option in the ChunkingOptions class:
@lazyproperty
def text_splitting_separators(self) -> tuple[str, ...]:
    """Sequence of text-splitting target strings to be used in order of preference."""
    text_splitting_separators_arg = self._kwargs.get("text_splitting_separators")
    return (
        ("\n", " ")
        if text_splitting_separators_arg is None
        else tuple(text_splitting_separators_arg)
    )
So I would like this option to be passable through the chunk_by_title and/or chunk_elements functions. Essentially, add text_splitting_separators as an argument to the chunking functions, defaulting to ["\n", " "].
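Roughly, the pass-through I have in mind could look like the sketch below. Everything here is a stand-in written for this issue, not the library's actual code: `StandInChunkingOptions` and `chunk_elements_sketch` are hypothetical names, and `functools.cached_property` is used in place of the project's `lazyproperty`.

```python
from __future__ import annotations

from functools import cached_property
from typing import Any, Iterable, Optional


class StandInChunkingOptions:
    """Hypothetical stand-in that mirrors the kwargs lookup shown above."""

    def __init__(self, **kwargs: Any):
        self._kwargs = kwargs

    @cached_property
    def text_splitting_separators(self) -> tuple[str, ...]:
        arg = self._kwargs.get("text_splitting_separators")
        # Fall back to the current default when the caller passes nothing.
        return ("\n", " ") if arg is None else tuple(arg)


def chunk_elements_sketch(
    elements: list[str],
    *,
    max_characters: int = 500,
    text_splitting_separators: Optional[Iterable[str]] = None,
) -> StandInChunkingOptions:
    """Hypothetical signature: forward the new keyword into the options object."""
    # A real chunker would go on to split `elements`; this sketch only
    # shows how the argument would reach the options object.
    return StandInChunkingOptions(
        max_characters=max_characters,
        text_splitting_separators=text_splitting_separators,
    )


# Omitting the argument keeps today's behavior.
default_opts = chunk_elements_sketch(["some text"])
print(default_opts.text_splitting_separators)

# Passing it overrides the separators, in order of preference.
custom_opts = chunk_elements_sketch(
    ["some text"], text_splitting_separators=["\n", ".", " "]
)
print(custom_opts.text_splitting_separators)
```

Because the keyword defaults to None, existing callers would see no change in behavior.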
I understand if you do not want to expose this much to the user, who in many cases may not care about this level of detail. If that is the case, is there a way to support it "softly", where it is not directly exposed but I can still pass it as an argument? Or some other way to let me change this argument without forking the project? Maybe the default should be ["\n", ".", " "]?
@scanny seems like you are the person to talk to about this