feat(chunk): split on sentence boundaries #3484

Open
@jgen1

Problem

The current text-splitting behavior cuts a sentence in the middle whenever a chunk would exceed the maximum character count. It is nice that it won't break up words, because it uses text_splitting_separators, which default to "\n" and " ", but I would like it to avoid breaking up sentences when it can.

Example

string = "This is a test string. This is the second sentence. Here is the third."

With max_characters = 30 (the 30th character is the "s" in "is" in the second sentence):
CURRENT TEXT-SPLITTING
string_1 = "This is a test string. This is"
string_2 = "the second sentence. Here is"
string_3 = "the third."

But I would like:
PREFERRED TEXT-SPLITTING
string_1 = "This is a test string."
string_2 = "This is the second sentence."
string_3 = "Here is the third."

This kind of splitting is also in line with the idea behind unstructured.io: maintain sections and semantic meaning as well as possible within chunks, subject to the configured constraints. In this case, splitting on sentence boundaries keeps every sentence intact, which is clearly better.
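To make the difference concrete, here is a minimal, self-contained sketch of splitting by separators in order of preference. It is not the library's implementation; `split_text` and `_find_cut` are hypothetical names, and the rule that whitespace separators are dropped while a "." stays with its chunk is an assumption made for illustration.

```python
def _find_cut(text, max_characters, separators):
    """Return (head_end, tail_start) for the best cut inside the size limit."""
    for sep in separators:  # separators are tried in order of preference
        # Assumption: non-whitespace separator characters (e.g. ".") stay in
        # the chunk, while whitespace separators are dropped.
        keep = len(sep.rstrip())
        # Last occurrence whose kept part still ends within max_characters.
        idx = text.rfind(sep, 0, max_characters - keep + len(sep))
        if idx != -1 and idx + keep > 0:
            return idx + keep, idx + len(sep)
    # No separator fits: fall back to a hard cut at the limit.
    return max_characters, max_characters


def split_text(text, max_characters, separators=("\n", " ")):
    """Greedily split text into chunks of at most max_characters."""
    chunks = []
    remaining = text.strip()
    while len(remaining) > max_characters:
        head_end, tail_start = _find_cut(remaining, max_characters, separators)
        chunks.append(remaining[:head_end].rstrip())
        remaining = remaining[tail_start:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks


text = "This is a test string. This is the second sentence. Here is the third."
word_chunks = split_text(text, 30)                         # splits on words only
sentence_chunks = split_text(text, 30, ("\n", ".", " "))   # prefers sentence ends
```

With the word-only separators the sketch reproduces the three mid-sentence chunks from the example, and with "." added ahead of " " it returns one sentence per chunk. Note that a bare "." would also match decimal points and abbreviations, so a real default might need something more careful.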

Solution

This text_splitting_separators argument is already a keyword option in the ChunkingOptions class:

    @lazyproperty
    def text_splitting_separators(self) -> tuple[str, ...]:
        """Sequence of text-splitting target strings to be used in order of preference."""
        text_splitting_separators_arg = self._kwargs.get("text_splitting_separators")
        return (
            ("\n", " ")
            if text_splitting_separators_arg is None
            else tuple(text_splitting_separators_arg)
        )

So I would like this to be an option that can be passed through the chunk_by_title and/or chunk_elements functions. Essentially, add text_splitting_separators as an argument to the chunking functions, defaulting to ["\n", " "].
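A rough sketch of what that pass-through could look like follows. Everything in it is a stand-in written for this issue: `SketchChunkingOptions` mimics only the property quoted above, `functools.cached_property` substitutes for the project's `lazyproperty`, and `make_options` is a hypothetical entry point, not an existing function.

```python
from __future__ import annotations

from functools import cached_property
from typing import Any, Iterable, Optional


class SketchChunkingOptions:
    """Stand-in for the options class; NOT the library's real class."""

    def __init__(self, **kwargs: Any) -> None:
        self._kwargs = kwargs

    @cached_property
    def text_splitting_separators(self) -> tuple[str, ...]:
        """Separators in order of preference, defaulting to newline then space."""
        arg = self._kwargs.get("text_splitting_separators")
        return ("\n", " ") if arg is None else tuple(arg)


def make_options(
    *,
    max_characters: Optional[int] = None,
    text_splitting_separators: Optional[Iterable[str]] = None,
) -> SketchChunkingOptions:
    """Hypothetical chunker entry point that forwards the new keyword argument."""
    return SketchChunkingOptions(
        max_characters=max_characters,
        text_splitting_separators=text_splitting_separators,
    )


default_opts = make_options(max_characters=30)
custom_opts = make_options(
    max_characters=30, text_splitting_separators=["\n", ".", " "]
)
```

Leaving the argument as None keeps the current ("\n", " ") behavior, so callers who do not care about this level of detail would see no change.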

I understand if you do not want to expose this much to users, many of whom may not care about this level of detail. If that is the case, is there a way to give it "soft" support, where it is not directly exposed but I can still pass it as an argument? Or is there some other way to let me set this argument without forking the project? Maybe the default should be ["\n", ".", " "]?

@scanny, it seems like you are the person to talk to about this.

Labels: chunking (related to element chunking), enhancement (new feature or request)