Conversation
@@ -42,90 +41,164 @@ export abstract class AbstractEmbeddingsModel extends AbstractModel {
  }

  async chunkText(content: string) {
After a bunch of research into how to chunk text for search, I've moved this from a simple chunker to a "sliding window" algorithm that attempts to retain sentence boundaries where possible.
-1 A Recursive Character-Level Chunking algo (this one) also retains sentence boundaries &, where appropriate, merges multiple sentences, paragraphs, sections, etc., which means fewer chunks and less repeated data. Why not both? Wouldn't it make more sense to add an overlap parameter?
If you really don't like the recursive approach, we should build a separate chunker & let the data dictate which one to use, because from what i've seen, a recursive char chunker typically outperforms a naive sliding window, but a recursive chunk with a sliding window isn't a bad option!
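For what it's worth, the "why not both?" option — a sentence-aware sliding window with a configurable overlap — can be sketched roughly like this. All names here are illustrative, and it sizes by characters rather than model tokens, so this is not the actual chunkText implementation:

```typescript
// Illustrative sketch: sentence-aware sliding-window chunker with overlap.
// chunkSize/overlap are in characters/sentences; real code would count model tokens.
const splitSentences = (text: string): string[] =>
  text.match(/[^.!?\n]+[.!?]?\s*/g) ?? [text]

export const chunkText = (content: string, chunkSize = 200, overlap = 1): string[] => {
  const sentences = splitSentences(content)
  const chunks: string[] = []
  let start = 0
  while (start < sentences.length) {
    let end = start
    let size = 0
    // Greedily pack whole sentences until the chunk is full
    while (end < sentences.length && size + sentences[end].length <= chunkSize) {
      size += sentences[end].length
      end++
    }
    if (end === start) end++ // a single sentence longer than chunkSize
    chunks.push(sentences.slice(start, end).join('').trim())
    if (end >= sentences.length) break
    // Slide back `overlap` sentences so adjacent chunks share context
    start = Math.max(end - overlap, start + 1)
  }
  return chunks
}
```

The overlap parameter is the knob suggested above: overlap = 0 degenerates to a plain recursive-ish packer, larger values trade repeated data for more context per chunk.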
Ideally, we want to chunk based on what we know of the data. If it's markdown, we want to use a chunker that performs well on markdown. I'd prefer to merge this as is and make that a separate, targeted enhancement
Do we have some benchmark data we can test this out with? Getting rid of our recursive chunker seems like a step backwards, especially if the new one is going to create more chunks due to overlaps. The recursive chunker handles section headings and paragraph breaks! The new one does not.
That said... why squabble over chunking when we can just do away with most of it??
Qwen3 uses less memory, outputs the same 1024 dimension size, and has 32768 max tokens. That's 64x more tokens than a 512-token BERT-style context.
https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
CREATE TABLE IF NOT EXISTS ${sql.id(this.tableName)} (
  "id" INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  "embedText" TEXT,
  "tsv" tsvector,
Postgres-native full-text search
import type {DataLoaderInstance} from '../../server/dataloader/RootDataLoader'

export const createTextFromPage = async (pageId: number, dataLoader: DataLoaderInstance) => {
  const page = await dataLoader.get('pagesWithContent').load(pageId)
I had to create a new loader because the default loader doesn't return plaintext content. It might have been better to rename the existing loader to pagesNoPlaintext or similar to make it clearer what that loader does...
packages/embedder/indexing/page.ts
const {title, plaintextContent} = page
const parts = [
  `Title: ${title || 'Untitled'}`,
After doing some reading, adding some structured text to be embedded seems to be a best practice...
// Cannot use summarization strategy if generation model has same context length as embedding model
// We must split the text & not tokens because BERT tokenizer is not trained for linebreaks e.g. \n\n
// Delete existing chunks for this metadataID to prevent stale data
await pg
Since Pages are going to be embedded over and over again, we had to put a hook in there someplace to remove old chunks. This seemed to be the right place
-1 It seems very expensive to start over every time. I think we can do better!
Let's say a page has 1000 chunks.
We don't want to re-embed every single chunk on every keystroke.
After we re-tokenize, let's compare the embedText and only embed if it's changed.
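The suggested optimization could look something like this sketch — hashChunk and diffChunks are hypothetical names, and the real code would compare against hashes stored alongside the existing chunk rows rather than an in-memory set:

```typescript
// Hypothetical sketch: hash each chunk's embedText and only re-embed
// chunks whose hash changed since the last indexing run.
import {createHash} from 'node:crypto'

export const hashChunk = (text: string): string =>
  createHash('sha256').update(text).digest('hex')

// Returns only the chunks that actually need a new embedding
export const diffChunks = (prevHashes: Set<string>, nextChunks: string[]): string[] =>
  nextChunks.filter((chunk) => !prevHashes.has(hashChunk(chunk)))
```

On a keystroke that touches one paragraph of a 1000-chunk page, this re-embeds one chunk instead of all 1000.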
const BATCH_SIZE = 100

const embedderReIndex: MutationResolvers['embedderReIndex'] = async (_source, {orgIds}) => {
It's not clear that we should keep this mutation after releasing to GA, but it sure is handy for development.
const errors: (JobQueueError | undefined)[] = []

for (let i = 0; i < chunks.length; i++) {
  // Pause embedding if a search is active, prioritizing search over embedding:
This is a key change: rather than embedding all of these items in parallel, I decided to prioritize latency over throughput now that users will be hitting the embedder to do semantic search. I believe this is the right thing to do.
Now, the embedder will prioritize executing search even over embedding new/updated objects, by using this simple redis priority lock
const priority = options?.priority || 'low'
const retries = options?.retries

if (priority === 'high') {
I've introduced the concept of "high" and "low" priority embeddings. "high" priority are user searches. "low" are catching up on embedding new/updated objects
-1 doing it this way means we cannot run multiple replicas in the future since we're keeping a count of high priority embeddings on 1 replica instead of in a distributed state.
-1 The need to have a priority on top of the priority that is assigned to the item in the job queue tells me the system isn't working. What do we need to change about getEmbedderPriority(0) in order to make it fit your use case?
getEmbedderPriority(0) was to pick the model, not the priority in the embedding queue.
The embedder replicas were picking from a queue (the job table), but they were processing chunks in blocks. Those blocks sometimes take a long time to complete, blocking an embedder for a long while and timing out user requests. We might need to move to queuing chunks instead of objects of multiple chunks.
I'll need to think on this some more...
I think the problem is that your search function is cutting the line. It needs that query embedded NOW, which I totally get! ...but the webserver should never call a model directly. If it does, that cuts the line, which seems great at first, but what happens when 2 search queries come in, both high priority? We're back to where we were without any queue at all.
My .02:

- create a new workflow called something like embedAndReply which, after embedding, replies via pubsub & that resolves a promise
- modify getEmbedderPriority to handle priorities that are higher than 0 (or use 0 for search and make everything that was 0 a 1)

So, a new search comes in, the query gets put at the front of the line & when the job completes, it resolves the promise.
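The proposed embedAndReply flow might look roughly like this, with an in-process EventEmitter standing in for Redis pubsub — every identifier here is hypothetical:

```typescript
// Sketch of the suggested request/reply workflow. The webserver never
// calls the model directly; it enqueues a top-priority job and awaits
// the worker's pubsub reply.
import {EventEmitter} from 'node:events'

const bus = new EventEmitter() // stand-in for a Redis pubsub channel

// Worker side: after embedding, publish the result for this requestId
export const completeJob = (requestId: string, embedding: number[]) => {
  bus.emit(`embedComplete:${requestId}`, embedding)
}

// Webserver side: resolve a promise when the reply arrives. The real
// code would also insert the job into the queue at the front of the line.
export const embedAndReply = (requestId: string): Promise<number[]> =>
  new Promise((resolve) => {
    bus.once(`embedComplete:${requestId}`, resolve)
  })
```

Because every search goes through the same queue, two simultaneous high-priority searches are still ordered instead of racing each other at the model.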
- Changed from global redis lock to embedder semaphore
const job = (await getJob(false)) || (await getJob(true))
if (!job) {
  // queue is empty, so sleep for a short while (prioritize latency)
  await sleep(250)
hitting pg with 2 getJob queries (true, false) every 250ms is ambitious. This is already the most expensive query we have & this change would run it 40x more.
Perhaps instead we subscribe to a channel runNow in redis.
And instead of sleep, we have Promise.race(sleep, wakeUp)
And on redis message for that channel, we resolve wakeUp, if it exists.
And we publish to runNow whenever we add an item to the queue.
I'm sure there's something more elegant, but the goal here is to not tank our DB performance for the whole app when the queue runs dry.
await sleep(ms('10s'))
return this.next()

while (!this.done) {
0 curious why this loop was needed vs. just calling this.next()?
.expression(({selectFrom}) =>
  selectFrom('EmbeddingsMetadata')
    .select(({ref}) => [
.expression((eb: any) =>
-1 don't take away our type safety!
.limit(searchLimit)
.execute()

// RRF Aggregation (Chunk -> Document)
can you move this to a helper function & possibly reuse for lexical & semantic?
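A sketch of what such a shared helper could look like — rrfAggregate is a hypothetical name, using the standard Reciprocal Rank Fusion formula weight / (k + rank) and summing chunk scores per document, so both the lexical and semantic rankings can feed it:

```typescript
// Hypothetical RRF aggregation helper, reusable for lexical & semantic
// rankings. rank is 1-based (1 = best match in that ranking).
type Ranked = {documentId: number; rank: number}

export const rrfAggregate = (
  rankings: {weight: number; rows: Ranked[]}[],
  k = 60
): Map<number, number> => {
  const scores = new Map<number, number>()
  for (const {weight, rows} of rankings) {
    for (const {documentId, rank} of rows) {
      // Each ranking contributes weight / (k + rank); sum per document
      const prev = scores.get(documentId) ?? 0
      scores.set(documentId, prev + weight / (k + rank))
    }
  }
  return scores
}
```

Note this is consistent with the MAX_RRF_SCORE bound quoted below: a document ranked first in every ranking scores (keywordWeight + vectorWeight) * 1/(k + 1).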
addOrg: rateLimit({perMinute: 2, perHour: 5}),
addTeam: rateLimit({perMinute: 15, perHour: 50}),
createImposterToken: isSuperUser,
embedderReIndex: isSuperUser,
-1 search needs some validation and/or a rateLimiter on it
Let's do that as a fast follow-up. I'll create an issue for it
const MAX_RRF_SCORE = (keywordWeight + vectorWeight) * (1 / (k + 1))

const results = metadata
-1 break this into a helper function. each objectType should have its own function
results.sort((a, b) => b.score.relevance - a.score.relevance)

// Re-rank (Business Rules)
const reranked = applyBusinessRules(results as any, {query, currentUserId: userId || undefined})
-1 waaaay too much casting as any used throughout. this is totally fine for exploratory coding & creating proof of concepts, but when code has to be shared & maintained, it's a bunch of landmines. I see that you went through the trouble of turning this function into a generic, but then that's not used?
await conn.disconnect()
unlock()

const userId = (context as any).userId

'e.embedText'
])
.where((eb) => buildFilters(eb))
.where(sql<boolean>`e."tsv" @@ plainto_tsquery('english', ${query})`)
+1 websearch_to_tsquery may be more appropriate? It'll allow you to quote phrases when one word should follow another, etc.
Just some notes to myself:
Gameplan:

@mattkrick closing this now!
Description
Implements search backend behind a feature flag. This PR is largely for discussion and testing (esp. on staging), but ultimately should be safe to merge.
Demo
Loom: https://www.loom.com/share/6ea18954e7314cc8a2fa8fd2c6b9d6f6
Testing scenarios
[Please list all the testing scenarios a reviewer has to check before approving the PR]
Scenario A
Scenario B
Final checklist