Replies: 1 comment
You can overlap the chunks in cases where the document exceeds the context size of the embedding model, so neighbouring chunks still share some context. P.S.: if you want the full conversation on this, refer to our Discord (it's meant for other developers).
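As a rough illustration of what overlapping chunks could look like, here is a minimal sketch over a flat list of token ids. The `overlapping_chunks` helper and the `chunk_size`/`overlap` values are illustrative, not the project's actual defaults.

```python
def overlapping_chunks(token_ids, chunk_size=1000, overlap=200):
    """Yield chunks of `chunk_size` tokens, each sharing `overlap` tokens
    with the previous chunk so context is not lost at chunk boundaries.

    Illustrative sketch only; sizes are hypothetical, not library defaults.
    """
    stride = chunk_size - overlap
    for start in range(0, len(token_ids), stride):
        chunk = token_ids[start:start + chunk_size]
        if chunk:
            yield chunk
        # Stop once the final chunk has reached the end of the document,
        # so we don't emit trailing chunks fully contained in the previous one.
        if start + chunk_size >= len(token_ids):
            break
```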
I was just wondering exactly how the late chunking method handles documents that are larger than the batch size. Say I have a 16K-token document and it gets split into sixteen 1K-token chunks. If I'm using an embedding model with an 8192-token context size, the default 1000-token chunk size, and a batch size of 8, how does the embedding proceed?
If it simply embeds the first 8 chunks and then pools the token embeddings into the eight chunk vectors, then the later chunks in that batch will lack context from later in the document. Or is there a sliding window happening here, where the vectors are only pooled and stored for the center chunks? Maybe including more context from earlier in the document is actually preferred, given how humans tend to write sequentially? I don't know what the best method is, but I'd like to know what the current method is, lol
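For concreteness, here is a minimal sketch of the first reading described above: each window of `batch_size * chunk_size` tokens (8 * 1000 = 8000, which fits an 8192-token context) is encoded in one pass, and the token embeddings are mean-pooled per chunk, so chunks only share context with other chunks in the same window. The `encode_tokens` helper and all sizes are hypothetical, not the library's confirmed implementation.

```python
import numpy as np

def late_chunk_per_window(token_ids, encode_tokens, chunk_size=1000, batch_size=8):
    """Encode a long document one window at a time and mean-pool token
    embeddings into chunk vectors.

    `encode_tokens` is assumed to return one embedding per input token,
    shape (num_tokens, dim). Chunks inside a window see each other's
    context, but nothing from later windows.
    """
    window = chunk_size * batch_size
    chunk_vectors = []
    for w_start in range(0, len(token_ids), window):
        window_ids = token_ids[w_start:w_start + window]
        token_embs = encode_tokens(window_ids)          # (len(window_ids), dim)
        for c_start in range(0, len(window_ids), chunk_size):
            span = token_embs[c_start:c_start + chunk_size]
            chunk_vectors.append(span.mean(axis=0))     # one vector per 1K-token chunk
    return np.stack(chunk_vectors)
```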