Conversation
@@ -42,90 +41,164 @@ export abstract class AbstractEmbeddingsModel extends AbstractModel {
  }

  async chunkText(content: string) {
After a bunch of research into how to chunk text for search, I've moved this from a simple chunker to a "sliding window" algorithm that attempts to retain sentence boundaries where possible.
-1 A Recursive Character-Level Chunking algo (this one) also retains sentence boundaries &, where appropriate, merges multiple sentences, paragraphs, sections, etc., which means fewer chunks and less repeated data. Why not both? Wouldn't it make more sense to add an overlap parameter?
If you really don't like the recursive approach, we should build a separate chunker & let the data dictate which one to use, because from what i've seen, a recursive char chunker typically outperforms a naive sliding window, but a recursive chunk with a sliding window isn't a bad option!
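For what it's worth, the "why not both?" option — a sentence-aware sliding window with a configurable overlap — can be sketched roughly like this. All names here are illustrative, and it sizes by characters rather than model tokens, so this is not the actual chunkText implementation:

```typescript
// Illustrative sketch: sentence-aware sliding-window chunker with overlap.
// chunkSize/overlap are in characters/sentences; real code would count model tokens.
const splitSentences = (text: string): string[] =>
  text.match(/[^.!?\n]+[.!?]?\s*/g) ?? [text]

export const chunkText = (content: string, chunkSize = 200, overlap = 1): string[] => {
  const sentences = splitSentences(content)
  const chunks: string[] = []
  let start = 0
  while (start < sentences.length) {
    let end = start
    let size = 0
    // Greedily pack whole sentences until the chunk is full
    while (end < sentences.length && size + sentences[end].length <= chunkSize) {
      size += sentences[end].length
      end++
    }
    if (end === start) end++ // a single sentence longer than chunkSize
    chunks.push(sentences.slice(start, end).join('').trim())
    if (end >= sentences.length) break
    // Slide back `overlap` sentences so adjacent chunks share context
    start = Math.max(end - overlap, start + 1)
  }
  return chunks
}
```

The overlap parameter is the knob suggested above: overlap = 0 degenerates to a plain recursive-ish packer, larger values trade repeated data for more context per chunk.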
Ideally, we want to chunk based on what we know of the data. If it's markdown, we want to use a chunker that performs well on markdown. I'd prefer to merge this as is and make that a separate, targeted enhancement
Do we have some benchmark data we can test this out with? Getting rid of our recursive chunker seems like a step backwards, especially if the new one is going to create more chunks due to overlaps. The recursive chunker handles section headings and paragraph breaks! The new one does not.
That said... why squabble over chunking when we can just do away with most of it??
Qwen3 uses less memory, outputs the same 1024 dimension size, and has 32768 max tokens. That's 64x more tokens than a 512-token BERT-style context.
https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
CREATE TABLE IF NOT EXISTS ${sql.id(this.tableName)} (
  "id" INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  "embedText" TEXT,
  "tsv" tsvector,
Postgres-native full-text search
import type {DataLoaderInstance} from '../../server/dataloader/RootDataLoader'

export const createTextFromPage = async (pageId: number, dataLoader: DataLoaderInstance) => {
  const page = await dataLoader.get('pagesWithContent').load(pageId)
I had to create a new loader because the default loader doesn't return plaintext content. It might have been better to rename the existing loader to pagesNoPlaintext or similar to make it clearer what that loader does...
packages/embedder/indexing/page.ts
const {title, plaintextContent} = page
const parts = [
  `Title: ${title || 'Untitled'}`,
After doing some reading, adding some structured text to be embedded seems to be a best practice...
// Cannot use summarization strategy if generation model has same context length as embedding model
// We must split the text & not tokens because BERT tokenizer is not trained for linebreaks e.g. \n\n
// Delete existing chunks for this metadataID to prevent stale data
await pg
Since Pages are going to be embedded over and over again, we had to put a hook in there someplace to remove old chunks. This seemed to be the right place
-1 It seems very expensive to start over every time. I think we can do better!
Let's say a page has 1000 chunks.
We don't want to re-embed every single chunk on every keystroke.
After we re-tokenize, let's compare the embedText and only embed if it's changed.
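The suggested optimization could look something like this sketch — hashChunk and diffChunks are hypothetical names, and the real code would compare against hashes stored alongside the existing chunk rows rather than an in-memory set:

```typescript
// Hypothetical sketch: hash each chunk's embedText and only re-embed
// chunks whose hash changed since the last indexing run.
import {createHash} from 'node:crypto'

export const hashChunk = (text: string): string =>
  createHash('sha256').update(text).digest('hex')

// Returns only the chunks that actually need a new embedding
export const diffChunks = (prevHashes: Set<string>, nextChunks: string[]): string[] =>
  nextChunks.filter((chunk) => !prevHashes.has(hashChunk(chunk)))
```

On a keystroke that touches one paragraph of a 1000-chunk page, this re-embeds one chunk instead of all 1000.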
const BATCH_SIZE = 100

const embedderReIndex: MutationResolvers['embedderReIndex'] = async (_source, {orgIds}) => {
It's not clear that we should keep this mutation after releasing to GA, but it sure is handy for development.
const errors: (JobQueueError | undefined)[] = []

for (let i = 0; i < chunks.length; i++) {
  // Pause embedding if a search is active, prioritizing search over embedding:
This is a key change: rather than embedding all of these items in parallel, I decided to prioritize latency over throughput now that users will be hitting the embedder to do semantic search. I believe this is the right thing to do.
Now, the embedder will prioritize executing search even over embedding new/updated objects, by using this simple redis priority lock
const priority = options?.priority || 'low'
const retries = options?.retries

if (priority === 'high') {
I've introduced the concept of "high" and "low" priority embeddings. "high" priority are user searches. "low" are catching up on embedding new/updated objects
-1 doing it this way means we cannot run multiple replicas in the future since we're keeping a count of high priority embeddings on 1 replica instead of in a distributed state.
-1 The need to have a priority on top of the priority that is assigned to the item in the job queue tells me the system isn't working. What do we need to change about getEmbedderPriority(0) in order to make it fit your use case?
getEmbedderPriority(0) was to pick the model, not the priority in the embedding queue.
The embedder replicas were picking from a queue (the job table), but they were processing chunks in blocks. Those blocks sometimes take a long time to complete, blocking an embedder for a long while and timing out user requests. We might need to move to queuing chunks instead of objects of multiple chunks.
I'll need to think on this some more...
I think the problem is that your search function is cutting the line. It needs that query embedded NOW, which I totally get! ...but the webserver should never call a model directly. If it does, that cuts the line, which seems great at first, but what happens when 2 search queries come in, both high priority? We're back to where we were without any queue at all.
My .02:

- create a new workflow called something like embedAndReply which, after embedding, replies via pubsub & that resolves a promise
- modify getEmbedderPriority to handle priorities that are higher than 0 (or use 0 for search and make everything that was 0 a 1)

So, a new search comes in, the query gets put at the front of the line & when the job completes, it resolves the promise.
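The proposed embedAndReply flow might look roughly like this, with an in-process EventEmitter standing in for Redis pubsub — every identifier here is hypothetical:

```typescript
// Sketch of the suggested request/reply workflow. The webserver never
// calls the model directly; it enqueues a top-priority job and awaits
// the worker's pubsub reply.
import {EventEmitter} from 'node:events'

const bus = new EventEmitter() // stand-in for a Redis pubsub channel

// Worker side: after embedding, publish the result for this requestId
export const completeJob = (requestId: string, embedding: number[]) => {
  bus.emit(`embedComplete:${requestId}`, embedding)
}

// Webserver side: resolve a promise when the reply arrives. The real
// code would also insert the job into the queue at the front of the line.
export const embedAndReply = (requestId: string): Promise<number[]> =>
  new Promise((resolve) => {
    bus.once(`embedComplete:${requestId}`, resolve)
  })
```

Because every search goes through the same queue, two simultaneous high-priority searches are still ordered instead of racing each other at the model.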
- Changed from global redis lock to embedder semaphore
const job = (await getJob(false)) || (await getJob(true))
if (!job) {
  // queue is empty, so sleep for a short while (prioritize latency)
  await sleep(250)
hitting pg with 2 getJob queries (true, false) every 250ms is ambitious. This is already the most expensive query we have & this change would run it 40x more.
Perhaps instead we subscribe to a channel runNow in redis.
And instead of sleep, we have Promise.race(sleep, wakeUp)
And on redis message for that channel, we resolve wakeUp, if it exists.
And we publish to runNow whenever we add an item to the queue.
I'm sure there's something more elegant, but the goal here is to not tank our DB performance for the whole app when the queue runs dry.
await sleep(ms('10s'))
return this.next()

while (!this.done) {
0 curious why this loop was needed vs. just calling this.next()?
.expression(({selectFrom}) =>
  selectFrom('EmbeddingsMetadata')
    .select(({ref}) => [
.expression((eb: any) =>
-1 don't take away our type safety!
.limit(searchLimit)
.execute()

// RRF Aggregation (Chunk -> Document)
can you move this to a helper function & possibly reuse for lexical & semantic?
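A sketch of what such a shared helper could look like — rrfAggregate is a hypothetical name, using the standard Reciprocal Rank Fusion formula weight / (k + rank) and summing chunk scores per document, so both the lexical and semantic rankings can feed it:

```typescript
// Hypothetical RRF aggregation helper, reusable for lexical & semantic
// rankings. rank is 1-based (1 = best match in that ranking).
type Ranked = {documentId: number; rank: number}

export const rrfAggregate = (
  rankings: {weight: number; rows: Ranked[]}[],
  k = 60
): Map<number, number> => {
  const scores = new Map<number, number>()
  for (const {weight, rows} of rankings) {
    for (const {documentId, rank} of rows) {
      // Each ranking contributes weight / (k + rank); sum per document
      const prev = scores.get(documentId) ?? 0
      scores.set(documentId, prev + weight / (k + rank))
    }
  }
  return scores
}
```

Note this is consistent with the MAX_RRF_SCORE bound quoted below: a document ranked first in every ranking scores (keywordWeight + vectorWeight) * 1/(k + 1).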
addOrg: rateLimit({perMinute: 2, perHour: 5}),
addTeam: rateLimit({perMinute: 15, perHour: 50}),
createImposterToken: isSuperUser,
embedderReIndex: isSuperUser,
-1 search needs some validation and/or a rateLimiter on it
Let's do that as a fast follow-up. I'll create an issue for it
const MAX_RRF_SCORE = (keywordWeight + vectorWeight) * (1 / (k + 1))

const results = metadata
-1 break this into a helper function. each objectType should have its own function
results.sort((a, b) => b.score.relevance - a.score.relevance)

// Re-rank (Business Rules)
const reranked = applyBusinessRules(results as any, {query, currentUserId: userId || undefined})
-1 waaaay too much casting as any used throughout. this is totally fine for exploratory coding & creating proof of concepts, but when code has to be shared & maintained, it's a bunch of landmines. I see that you went through the trouble of turning this function into a generic, but then that's not used?
await conn.disconnect()
unlock()

const userId = (context as any).userId

'e.embedText'
])
.where((eb) => buildFilters(eb))
.where(sql<boolean>`e."tsv" @@ plainto_tsquery('english', ${query})`)
+1 websearch_to_tsquery may be more appropriate? It'll allow you to quote phrases when one word should follow another, etc.
Just some notes to myself:
Gameplan:

@mattkrick closing this now!
Description
Implements search backend behind a feature flag. This PR is largely for discussion and testing (esp. on staging), but ultimately should be safe to merge.
Demo
Loom: https://www.loom.com/share/6ea18954e7314cc8a2fa8fd2c6b9d6f6
Testing scenarios
[Please list all the testing scenarios a reviewer has to check before approving the PR]
Scenario A
Scenario B
Final checklist