What units to use for threshold amount? #26171

mvirag2000 · 2024-09-06T21:36:29Z

URL

https://python.langchain.com/v0.2/docs/how_to/semantic-chunker/

Checklist

I added a very descriptive title to this issue.
I included a link to the documentation page I am referring to (if applicable).

Issue with current documentation:

It seems that the units for breakpoint_threshold_type = "percentile" are out of a hundred, i.e., 85.0 rather than 0.85. The docs do not say this, and the units are also unclear for the other threshold types, "gradient" and "interquartile."
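To illustrate why the scale matters, here is a minimal pure-Python sketch. It assumes the percentile-style threshold follows the usual 0-100 convention reported above. The helper names `percentile` and `breakpoints` and the sample distances are made up for this illustration and are not taken from the LangChain source.

```python
def percentile(values, q):
    """Linear-interpolation percentile; q is on a 0-100 scale (assumed convention)."""
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def breakpoints(distances, q):
    """Indices where the distance between neighbouring sentences exceeds the threshold."""
    threshold = percentile(distances, q)
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical distances between consecutive sentence embeddings.
distances = [0.10, 0.12, 0.08, 0.55, 0.11, 0.09, 0.60, 0.13, 0.10, 0.07]

print(breakpoints(distances, 85.0))  # only the two large jumps are split points
print(breakpoints(distances, 0.85))  # nearly every sentence boundary is a split point
```

With these sample values, passing 0.85 instead of 85.0 puts the threshold near the minimum distance, so almost every boundary becomes a breakpoint. That would be consistent with the single-word chunks described below.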

Idea or request for content:

Also, Semantic Chunker really needs a min and max chunk size. I am getting chunks of a single word, and chunks that exceed the OpenAI limit. Thanks for all the great work on LangChain.

tibor-reiss · 2024-09-15T09:11:02Z

@mvirag2000 What do you think about the linked PR?
Re your idea/request: I only introduced min_chunk_size, because the maximum chunk size can be controlled by tuning breakpoint_threshold_amount to a reasonable value.
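For readers who want to see what a minimum chunk size means in practice, here is a rough sketch of one way to enforce it as a post-processing step. It is not the implementation in the linked PR: the function `merge_short_chunks` is hypothetical, and it measures size in characters, which is an assumption.

```python
def merge_short_chunks(chunks, min_chunk_size):
    """Merge chunks shorter than min_chunk_size (in characters) into the following text.

    Hypothetical helper for illustration only; not part of LangChain.
    """
    merged = []
    buffer = ""
    for chunk in chunks:
        # Accumulate text until the buffer is long enough to stand alone.
        buffer = f"{buffer} {chunk}" if buffer else chunk
        if len(buffer) >= min_chunk_size:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # A trailing short remainder is attached to the last chunk, if there is one.
        if merged:
            merged[-1] = f"{merged[-1]} {buffer}"
        else:
            merged.append(buffer)
    return merged


chunks = [
    "Hi.",
    "The first real chunk has plenty of text.",
    "Yes.",
    "The second real chunk is also long enough.",
]
result = merge_short_chunks(chunks, min_chunk_size=20)
for chunk in result:
    print(repr(chunk))
```

This only handles the lower bound. An upper bound would still depend on the threshold setting, as noted above, or on a separate splitting pass.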

dosubot bot added the 🤖:docs label Sep 6, 2024

tibor-reiss linked a pull request Sep 12, 2024 that will close this issue

docs[experimental]: Make docs clearer and add min_chunk_size #26398

Open
