Skip to content

feat: full-text thread search#8288

Open
FedeCuci wants to merge 2 commits into
janhq:mainfrom
FedeCuci:feat/full-text-thread-search
Open

feat: full-text thread search#8288
FedeCuci wants to merge 2 commits into
janhq:mainfrom
FedeCuci:feat/full-text-thread-search

Conversation

@FedeCuci
Copy link
Copy Markdown

@FedeCuci FedeCuci commented Jun 6, 2026

Describe Your Changes

Previously, the search bar only searched through chat titles. This meant if you remembered discussing something in a conversation but didn't remember the title, you couldn't find it. Now the search indexes all message content across all your chats, so searching for a topic like "python async" will find any thread where that topic was discussed even if the title is something generic like "Code Help". More specifically:

  • Lazy-build full-text search corpus of message content when search dialog opens
  • Incremental invalidation on message add/update/delete for near real-time updates
  • Loading spinner shown while building index to keep user informed
  • Strict substring matching for title and content search (no fuzzy false positives)
  • Debounced search input for smooth UI performance

Self Checklist

  • [x ] Added relevant comments, esp in complex areas
  • Updated docs (for bug fixes / features)
  • Created issues for follow-up changes or refactoring needed

…cator

- Lazy-build full-text search corpus of message content when search dialog opens
- Incremental invalidation on message add/update/delete for near real-time updates
- Loading spinner shown while building index to keep user informed
- Strict substring matching for title and content search (no fuzzy false positives)
- Debounced search input for smooth UI performance
@tokamak-pm
Copy link
Copy Markdown

tokamak-pm Bot commented Jun 7, 2026

PR Review: feat: full-text thread search

Summary

This PR adds full-text search across message content in the thread search dialog. Previously, search was title-only (via fuzzy matching with fzf). Now a ThreadSearchIndex singleton lazily loads all message content into an in-memory corpus and supports substring matching across both titles and message bodies. Key additions:

  • New web-app/src/lib/search-index.ts -- ThreadSearchIndex class (singleton) with lazy build, batched fetching, incremental invalidation, and substring search.
  • SearchDialog.tsx -- Integrates the new index with debounced search, loading spinner, and content snippet display.
  • useMessages.ts / useThreads.ts -- Hooks into add/update/delete/clear operations to invalidate or evict index entries.

4 files changed, +358 / -17 lines, 1 commit.


Detailed Findings

1. Correctness Issues

hasPendingWork does not detect new threads (Bug)

The hasPendingWork getter only checks if the index is null, if there are stale thread IDs, or if there are deleted thread IDs. It does not check whether there are threads in the threads record that are not yet in the index (i.e., newly created threads after initial build). This means when a user creates a new thread, adds messages, and reopens the search dialog, the index will report hasPendingWork = false and skip building, so the new thread will never be searchable until a full invalidation occurs.

The doBuild() method does handle new threads correctly (it checks isNew), but hasPendingWork gates whether build() is called at all from the useEffect in SearchDialog.tsx. This is a functional bug.

Suggested fix: hasPendingWork should accept (or store) the thread record so it can compare against entriesByThreadId.keys(), or the useEffect should always call build() and let doBuild() decide if work is needed.

Fallback logic can show stale/duplicate results

In searchResults useMemo:

if (fullTextResults.length > 0) {
  filteredThreads = fullTextResults.map((r) => r.thread)
} else {
  filteredThreads = getFilteredThreads(searchQuery)
}

The fallback uses getFilteredThreads (fuzzy fzf title search). But the condition checks fullTextResults.length > 0, not whether the index is ready. If the index is built but the search term matches zero content entries, it falls back to fuzzy title search -- which behaves differently than the strict substring matching used by the index for titles. A user searching "xyz" might get a fuzzy false-positive from fzf on the title "xylophone buzz" via the fallback, but not when the index is populated. This inconsistency is confusing.

Suggestion: Use indexReady to decide the branch, not the result count.

2. Concurrency / Race Condition

build() deduplication may skip needed rebuilds

If build() is already running (the guard if (this.buildPromise) return this.buildPromise fires), and a new thread is added or a thread is invalidated during that build, the pending invalidation will not be picked up until a subsequent build is triggered. But the useEffect in SearchDialog only triggers on [open, threads] -- if the dialog is already open and no new thread is created, there is no trigger to re-run build() after the invalidation.

This is mitigated by the fact that indexBuilding flips to false and the debounced search useEffect re-runs, but it only calls index.search(), not index.build(). So stale data from mid-build invalidations may persist until the dialog is closed and reopened.

3. Memory / Performance

No upper bound on corpus size

Each thread's content is capped at 5,000 chars, which is good. But there's no limit on the total number of threads indexed. A power user with 1,000+ threads would have the entire corpus in memory. For most users this is fine, but it would be worth documenting the scaling characteristics or adding a thread count cap.

rebuildIndex() is a no-op

The method rebuildIndex() has a comment saying "just mark index as ready" but does nothing. It appears to be scaffolding for a future optimization (perhaps building a trie or inverted index). If it's not needed now, removing it would reduce confusion.

4. UX Concerns

Snippet only shown for content-only matches

In the search method:

snippet: contentMatch && !titleMatch
  ? extractSnippet(entry.contentText, term)
  : undefined

When matchSource === 'both', no snippet is shown. Users would benefit from seeing the content snippet even when the title also matches, since it provides context about what was discussed. Consider showing the snippet for all content matches.

"No results found" may flash during index build

If the index is still building (indexBuilding === true) and the user types a query that has no title match, the "No results found" empty state will show, even though results may appear once indexing completes. Consider showing a different message (e.g., "Still indexing...") when indexBuilding is true and there are no results yet.

5. Code Quality

  • Good: Clean separation of concerns -- index logic is isolated in search-index.ts, hooks only call lightweight invalidation methods.
  • Good: Promise.allSettled for resilient batched fetching.
  • Good: Debounced search (100ms) to prevent jank.
  • Good: XML/think-tag stripping from message content.
  • Minor: The extractTextFromContent function uses any cast (m.content as any). A proper type narrowing or type guard would be safer.
  • Minor: The comment in search() says "fuzzy matching" for titles but the implementation is strict substring (includes). The comment is misleading.

6. Testing

No tests are included. The search-index.ts module is pure logic with no DOM dependencies -- it is very testable. At minimum, unit tests for extractTextFromContent, extractSnippet, and ThreadSearchIndex.search() should be added. The invalidation and build lifecycle logic is also a good candidate for integration-style tests.


Recommendation: fix needed

The hasPendingWork bug is a functional issue that would cause newly created threads to be unsearchable until all messages are cleared or the app is restarted. The fallback-to-fuzzy inconsistency could also confuse users. These should be fixed before merge. Adding tests for the new search-index module is also strongly recommended.

Nice feature overall -- the architecture (singleton index, lazy build, incremental invalidation) is well thought out and the code is clean.

- hasPendingWork is now a method taking threads so newly created threads
  are detected and indexed without requiring a full invalidation
- build() loops until no pending work remains, so invalidations that
  arrive mid-build are not silently dropped (race condition fix)
- fallback-to-fzf now gates on indexReady rather than result count,
  preventing fuzzy false positives once the index is built
- snippet is shown for all content matches including matchSource 'both'
- 'Still indexing…' empty state prevents 'No results' flash during build
- removed dead rebuildIndex() no-op
- MAX_INDEXED_THREADS=2000 cap with documented scaling characteristics
- extractTextFromContent now uses proper ThreadContent[] type (no `any`)
- add 22 unit tests for search-index module

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@FedeCuci
Copy link
Copy Markdown
Author

FedeCuci commented Jun 7, 2026

Fixes applied

hasPendingWork not detecting new threads (bug)
Changed from a boolean getter to hasPendingWork(threads: Record<string, Thread>). It now compares the eligible thread set against indexed entries, so newly created threads are detected and trigger a build immediately.

Fallback-to-fuzzy inconsistency
The result branch in searchResults now gates on indexReady instead of fullTextResults.length > 0. Once the index is ready, its strict-substring empty set is trusted — no more silent fallback to fuzzy fzf matching that could return false positives.

Mid-build race condition
build() now runs a do…while (hasPendingWork) loop internally. Invalidations or new threads that arrive while a build is in flight are picked up on the next iteration instead of being silently dropped until the dialog is reopened.

No upper bound on corpus size
Added MAX_INDEXED_THREADS = 2000. A shared eligibleThreads() helper (used by both doBuild and hasPendingWork) caps the corpus to the most-recently-updated threads. Both paths use the same eligible set so capped-out threads don't appear perpetually "new" (which would have caused an infinite build loop).

rebuildIndex() no-op removed
Deleted the dead scaffolding method.

Snippet for all content matches
snippet is now produced whenever the term appears in the message body, including when matchSource === 'both'.

"No results" flash during build
Added a distinct "Still indexing…" empty state that shows when indexBuilding is true and there are no results yet. The "No results found" state only renders once building completes.

any cast removed
extractTextFromContent now takes ThreadContent[] | undefined and matches on ContentType.Text from @janhq/core. buildEntryForThread passes m.content directly without casting.

Misleading "fuzzy matching" comment
Updated the search() doc comment to say strict substring matching for both titles and content.


Tests added

Added src/lib/__tests__/search-index.test.ts with 22 unit tests covering:

  • extractTextFromContent (empty input, joining parts, stripping think blocks, non-text parts)
  • extractSnippet (absent term, case-insensitivity, ellipsis trimming)
  • ThreadSearchIndex.search() (title-only, content-only, both, strict substring, sort order, pre-build empty)
  • hasPendingWork lifecycle (before build, after build, new thread regression, stale/eviction, full invalidate)
  • Mid-build race condition (invalidation arriving while a fetch is in flight)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant