
feat: search backend behind feature flag #12422

Closed

jordanh wants to merge 3 commits into master from feat/search-backend

Conversation

jordanh (Contributor) commented Dec 9, 2025

Description

Implements search backend behind a feature flag. This PR is largely for discussion and testing (esp. on staging); but ultimately should be safe to merge.

Demo

Loom: https://www.loom.com/share/6ea18954e7314cc8a2fa8fd2c6b9d6f6

Testing scenarios

[Please list all the testing scenarios a reviewer has to check before approving the PR]

  • Scenario A

    • Step 1
    • Step 2...
  • Scenario B

    • Step 1
    • Step 2....

Final checklist

  • I checked the code review guidelines
  • I have added Metrics Representative as reviewer(s) if my PR involves metrics/data/analytics related changes
  • I have performed a self-review of my code, the same way I'd do it for any other team member
  • I have tested all cases I listed in the testing scenarios and I haven't found any issues or regressions
  • Whenever I took a non-obvious choice I added a comment explaining why I did it this way
  • I added the label Skip Maintainer Review (indicating the PR only requires a reviewer review and can be merged right after it's approved) if the PR introduces only minor changes, does not contain any architectural changes, does not introduce any new patterns, and I think one review is sufficient
  • PR title is human readable and could be used in changelog

@@ -42,90 +41,164 @@ export abstract class AbstractEmbeddingsModel extends AbstractModel {
}

async chunkText(content: string) {
Contributor Author:

After a bunch of research into how to chunk text for search, I've moved this from a simple chunker to a "sliding window" algorithm that attempts to retain sentence boundaries, where able.

Member:

-1 a Recursive Character Level Chunking algo (this one) also retains sentence boundaries & where appropriate it merges multiple sentences, paragraphs, sections, etc. which means fewer chunks and less repeated data. Why not both? Wouldn't it make more sense to add an overlap parameter?

If you really don't like the recursive approach, we should build a separate chunker & let the data dictate which one to use, because from what i've seen, a recursive char chunker typically outperforms a naive sliding window, but a recursive chunk with a sliding window isn't a bad option!
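For discussion, here's a minimal sketch of what a sliding-window chunker with the suggested overlap parameter could look like. `maxChars` and `overlapChars` are illustrative names, not the PR's actual API:

```typescript
// Illustrative sketch only: sliding-window chunking that snaps to sentence
// boundaries and carries a configurable overlap between adjacent chunks.
export const chunkText = (
  content: string,
  maxChars = 1000,
  overlapChars = 200
): string[] => {
  // Naive sentence split; a production chunker would use a real tokenizer
  const sentences = content.match(/[^.!?]+[.!?]*\s*/g) ?? [content]
  const chunks: string[] = []
  let current = ''
  for (const sentence of sentences) {
    if (current && current.length + sentence.length > maxChars) {
      chunks.push(current.trim())
      // Carry the tail of the previous chunk forward as the overlap window
      current = current.slice(-overlapChars)
    }
    current += sentence
  }
  if (current.trim()) chunks.push(current.trim())
  return chunks
}
```

With `overlapChars = 0` this degenerates to plain sentence packing; raising it trades storage (more repeated text) for recall at chunk boundaries.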

Contributor Author:

Ideally, we want to chunk based on what we know of the data. If it's markdown, we want to use a chunker that performs well on markdown. I'd prefer to merge this as is and make that a separate, targeted enhancement.

Member:

Do we have some benchmark data we can test this out with? Getting rid of our recursive chunker seems like a step backwards, especially if the new one is going to create more chunks due to overlaps. The recursive chunker handles section headings and paragraph breaks! The new one does not.

That said... why squabble over chunking when we can just do away with most of it??

Qwen3 uses less memory, outputs the same 1024 dimension size, and has 32768 max tokens. That's 64x more tokens
https://huggingface.co/Qwen/Qwen3-Embedding-0.6B

CREATE TABLE IF NOT EXISTS ${sql.id(this.tableName)} (
"id" INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
"embedText" TEXT,
"tsv" tsvector,
Contributor Author:

Postgres-native full-text search

import type {DataLoaderInstance} from '../../server/dataloader/RootDataLoader'

export const createTextFromPage = async (pageId: number, dataLoader: DataLoaderInstance) => {
const page = await dataLoader.get('pagesWithContent').load(pageId)
Contributor Author:

I had to create a new loader because the default loader doesn't return plaintext content. I think it might be better to have renamed the existing loader to pagesNoPlaintext or something to make it clearer what that loader does...


const {title, plaintextContent} = page
const parts = [
`Title: ${title || 'Untitled'}`,
Contributor Author:

After doing some reading, adding some structured text to be embedded seems to be a best practice...
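A minimal sketch of that structured-text idea, assuming a hypothetical `buildEmbedText` helper; the field labels are illustrative, not the PR's exact format:

```typescript
// Illustrative sketch: prefix the raw page content with labeled fields so
// the embedding captures metadata alongside the body text.
interface PageContent {
  title: string | null
  plaintextContent: string
}

export const buildEmbedText = ({title, plaintextContent}: PageContent) => {
  const parts = [
    `Title: ${title || 'Untitled'}`,
    `Content: ${plaintextContent}`
  ]
  return parts.join('\n')
}
```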

// Cannot use summarization strategy if generation model has same context length as embedding model
// We must split the text & not tokens because BERT tokenizer is not trained for linebreaks e.g. \n\n
// Delete existing chunks for this metadataID to prevent stale data
await pg
Contributor Author:

Since Pages are going to be embedded over and over again, we had to put a hook in there someplace to remove old chunks. This seemed to be the right place

Member:

-1 It seems very expensive to start over every time. I think we can do better!
Let's say a page has 1000 chunks.
We don't want to re-embed every single chunk on every keystroke.
After we re-tokenize, let's compare the embedText and only embed if it's changed.
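The compare-before-re-embed idea could be sketched like this, with a hypothetical `diffChunks` helper (the `ChunkRow` shape is an assumption, not a table in this PR) that hashes each chunk and returns only the chunk numbers whose text changed:

```typescript
import {createHash} from 'node:crypto'

// Illustrative sketch: hash each chunk's text so only changed chunks get
// re-embedded instead of deleting and re-embedding everything.
interface ChunkRow {
  chunkNumber: number
  embedText: string
}

const hashChunk = (text: string) =>
  createHash('sha256').update(text).digest('hex')

// Returns the chunk numbers that need (re-)embedding
export const diffChunks = (existing: ChunkRow[], next: string[]): number[] => {
  const prior = new Map<number, string>()
  existing.forEach((c) => prior.set(c.chunkNumber, hashChunk(c.embedText)))
  return next
    .map((text, chunkNumber) => ({chunkNumber, hash: hashChunk(text)}))
    .filter(({chunkNumber, hash}) => prior.get(chunkNumber) !== hash)
    .map(({chunkNumber}) => chunkNumber)
}
```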

@github-actions github-actions bot added size/xl and removed size/xl labels Dec 9, 2025
@jordanh jordanh force-pushed the feat/search-backend branch from 003afb2 to a804f48 Compare December 11, 2025 15:00

const BATCH_SIZE = 100

const embedderReIndex: MutationResolvers['embedderReIndex'] = async (_source, {orgIds}) => {
Contributor Author:

It's not clear that we should keep this mutation after releasing to GA, but it sure is handy for development.

Member:

private schema?

const errors: (JobQueueError | undefined)[] = []

for (let i = 0; i < chunks.length; i++) {
// Pause embedding if a search is active, prioritizing search over embedding:
Contributor Author:

This is a key change: rather than embedding all of these items in parallel, I decided to prioritize latency over throughput now that users will be hitting the embedder to do semantic search. I believe this is the right thing to do.

Now, the embedder will prioritize executing search even over embedding new/updated objects, by using this simple redis priority lock
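A minimal in-process sketch of that priority gate. The real implementation uses redis so the state is shared across processes; the counter below stands in for that purely for illustration:

```typescript
// Illustrative sketch: embedding work yields to active searches.
export class SearchPriorityGate {
  private activeSearches = 0
  private waiters: (() => void)[] = []

  // Wrap each user search so the embedder knows one is in flight
  async withSearch<T>(fn: () => Promise<T>): Promise<T> {
    this.activeSearches++
    try {
      return await fn()
    } finally {
      if (--this.activeSearches === 0) {
        this.waiters.forEach((resolve) => resolve())
        this.waiters = []
      }
    }
  }

  // Called by the embedder between chunks; resolves once no search is active
  waitForIdle(): Promise<void> {
    if (this.activeSearches === 0) return Promise.resolve()
    return new Promise((resolve) => this.waiters.push(resolve))
  }
}
```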

@jordanh jordanh force-pushed the feat/search-backend branch from 13f6ee4 to 16d41d4 Compare December 11, 2025 16:50
@jordanh jordanh changed the title Feat: search backend (WIP) feat: search backend (WIP) Dec 11, 2025
@jordanh jordanh changed the title feat: search backend (WIP) feat: search backend behind feature flag (WIP) Dec 11, 2025
const priority = options?.priority || 'low'
const retries = options?.retries

if (priority === 'high') {
Contributor Author:

I've introduced the concept of "high" and "low" priority embeddings. "high" priority are user searches. "low" are catching up on embedding new/updated objects

Member:

-1 doing it this way means we cannot run multiple replicas in the future since we're keeping a count of high priority embeddings on 1 replica instead of in a distributed state.

-1 The need to have a priority on top of the priority that is assigned to the item in the job queue tells me the system isn't working. What do we need to change about getEmbedderPriority(0) in order to make it fit your use case?

Contributor Author:

getEmbedderPriority(0) was to pick the model, not the priority in embedding queue.

I'll have to think about this a bit more. The embedder replicas were picking from a queue (the job table), but they were processing chunks in blocks. Those blocks sometimes take a long time to complete, blocking an embedder for a long while (timing out user requests). We might need to move to queuing chunks instead of objects of multiple chunks.

Member:

I think the problem is that your search function is cutting the line. It needs that query embedded NOW, which I totally get! ...but the webserver should never call a model directly. If it does, that cuts the line, which seems great at first, but what happens when 2 search queries come in, both high priority? We're back to where we were without any queue at all.

My .02:

  • create a new workflow called something like embedAndReply which, after embedding, replies via pubsub & that resolves a promise
  • modify getEmbedderPriority to handle priorities that are higher than 0 (or use 0 for search and make everything that was 0 a 1)

So, a new search comes in, the query gets put at the front of the line & when the job completes, it resolves the promise.
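That embedAndReply flow could be sketched like this, with an EventEmitter standing in for redis pubsub and a plain array standing in for the job queue; all names here are hypothetical:

```typescript
import {EventEmitter} from 'node:events'
import {randomUUID} from 'node:crypto'

// Illustrative sketch: the webserver never calls the model directly; it
// enqueues a job and awaits a reply published on a per-request channel.
const pubsub = new EventEmitter()

type Job = {channel: string; text: string}
const queue: Job[] = []
const enqueue = (job: Job) => queue.push(job)

// Webserver side: enqueue the query, resolve when the reply arrives
export const embedAndReply = (text: string): Promise<number[]> => {
  const channel = `embedReply:${randomUUID()}`
  return new Promise((resolve) => {
    pubsub.once(channel, resolve)
    enqueue({channel, text})
  })
}

// Embedder side: after embedding, publish the vector back on the channel
export const processNextJob = (embed: (text: string) => number[]) => {
  const job = queue.shift()
  if (!job) return
  pubsub.emit(job.channel, embed(job.text))
}
```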

- Changed from global redis lock to embedder semaphore
@jordanh jordanh force-pushed the feat/search-backend branch from 6b43e1f to 00fd9c6 Compare December 11, 2025 17:47
@jordanh jordanh changed the title feat: search backend behind feature flag (WIP) feat: search backend behind feature flag Dec 11, 2025
const job = (await getJob(false)) || (await getJob(true))
if (!job) {
// queue is empty, so sleep for a short while (prioritize latency)
await sleep(250)
Member:

hitting pg with 2 getJob queries (true, false) every 250ms is ambitious. This is already the most expensive query we have & this change would run it 40x more.
Perhaps instead we subscribe to a channel runNow in redis.
And instead of sleep, we have Promise.race(sleep, wakeUp)
And on redis message for that channel, we resolve wakeUp, if it exists.
And we publish to runNow whenever we add an item to the queue.
I'm sure there's something more elegant, but the goal here is to not tank our DB performance for the whole app when the queue runs dry.
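The Promise.race(sleep, wakeUp) idea sketched minimally; a module-level resolver stands in for the redis runNow subscription, which is an assumption for illustration only:

```typescript
// Illustrative sketch: the idle loop sleeps, but a runNow message wakes it
// immediately instead of waiting out the full interval.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

let wakeUp: (() => void) | undefined

// Called by the redis message handler for the runNow channel
export const onRunNow = () => {
  wakeUp?.()
  wakeUp = undefined
}

// Idle loop: wait up to `ms`, but wake early if a job is enqueued
export const idle = (ms: number) =>
  Promise.race([
    sleep(ms),
    new Promise<void>((resolve) => {
      wakeUp = resolve
    })
  ])
```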

await sleep(ms('10s'))
return this.next()

while (!this.done) {
Member:

0 curious why this loop was needed vs. just calling this.next()?


.expression(({selectFrom}) =>
selectFrom('EmbeddingsMetadata')
.select(({ref}) => [
.expression((eb: any) =>
Member:

-1 don't take away our type safety!

.limit(searchLimit)
.execute()

// RRF Aggregation (Chunk -> Document)
Member:

can you move this to a helper function & possibly reuse for lexical & semantic?
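A hypothetical shape for that shared helper, fusing ranked chunk lists from the lexical and semantic passes with reciprocal rank fusion; the types are illustrative, not the PR's actual ones:

```typescript
// Illustrative sketch: per-ranking RRF contributions (weight / (k + rank))
// summed per document, reusable for both lexical and semantic results.
interface RankedChunk {
  documentId: number
}

export const rrfFuse = (
  rankings: {chunks: RankedChunk[]; weight: number}[],
  k = 60
): Map<number, number> => {
  const scores = new Map<number, number>()
  for (const {chunks, weight} of rankings) {
    chunks.forEach(({documentId}, idx) => {
      const rank = idx + 1
      const prev = scores.get(documentId) ?? 0
      scores.set(documentId, prev + weight / (k + rank))
    })
  }
  return scores
}
```

This matches the MAX_RRF_SCORE formula further down, where a document ranked first in every list scores (keywordWeight + vectorWeight) * (1 / (k + 1)).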

addOrg: rateLimit({perMinute: 2, perHour: 5}),
addTeam: rateLimit({perMinute: 15, perHour: 50}),
createImposterToken: isSuperUser,
embedderReIndex: isSuperUser,
Member:

-1 search needs some validation and/or a rateLimiter on it

Contributor Author:

Let's do that as a fast follow-on. I'll create an issue for it.


const MAX_RRF_SCORE = (keywordWeight + vectorWeight) * (1 / (k + 1))

const results = metadata
Member:

-1 break this into a helper function. each objectType should have its own function

results.sort((a, b) => b.score.relevance - a.score.relevance)

// Re-rank (Business Rules)
const reranked = applyBusinessRules(results as any, {query, currentUserId: userId || undefined})
Member:

-1 waaaay too much casting as any used throughout. this is totally fine for exploratory coding & creating proof of concepts, but when code has to be shared & maintained, it's a bunch of landmines. I see that you went through the trouble of turning this function into a generic, but then that's not used?

await conn.disconnect()
unlock()

const userId = (context as any).userId
Member:

-1 no as any

'e.embedText'
])
.where((eb) => buildFilters(eb))
.where(sql<boolean>`e."tsv" @@ plainto_tsquery('english', ${query})`)
Member:

+1 websearch_to_tsquery may be more appropriate? it'll allow you to quote phrases when 1 word should follow another, etc.

mattkrick (Member) commented Jan 9, 2026

Just some notes to myself:

  • Putting userId on EmbeddingsMetadata doesn't help us because Page.userId is only the person who created the page, not the people who can access it.
  • EmbeddingsMetadata for pages isn't necessary. We should be able to grab all that data from the Page table itself
  • If we can first get all the Pages a user has access to, then we can perform a cosine similarity from there. We just need a pageId column on the Embeddings_n table.
  • Part of me wonders if we should have 1 embeddings table per object type
  • discussion topics could generate their text and language and store them on ReflectionGroup
  • meeting templates could do the same thing
  • pages already have their full text persisted at all times
  • I think an EmbeddingsPages_modelId table makes the most sense. This is because when you perform a search, you're probably searching for a document. Sure, if you want to search across documents and discussion topics, we can perform 2 searches simultaneously, but then we can use different models, the indexed vectors will take up less RAM, and we don't need to have a polymorphic objectType='page'/pageId column. It also simplifies development because each objectType will be a different query & we just patch them together without the need for really complex queries that use RBAC for some objectTypes and teamId for others.

Gameplan:

  • remove plaintextContent from Page
  • Add pageId to the JobQueue
  • put a unique constraint on pageId + model on the JobQueue. that way we attempt to enqueue updates all we want, but only the first will get added. that first will then fetch the latest value
  • always write to embedText on the embedding table. This means we don't have a fullText, plaintext version of the page anywhere. I think that's OK because we don't use it anywhere except debugging
  • This also means we only need to join PageAccess and Embeddings_qwen3! This is a huge win for performance
  • we can also put an updatedAt on the Embeddings_ table. This could be a good sanity check where we don't want to embed something that was updatedAt recently, but that would require hitting the DB & we don't really know what to compare it to. Page.updatedAt is always going to be >= Embeddings_.updatedAt.

jordanh (Contributor Author) commented Jan 27, 2026

@mattkrick closing this now!

@jordanh jordanh closed this Jan 27, 2026