Improve browser search with BM25 scoring and field boosting#440
Improve browser search with BM25 scoring and field boosting#440kevinschaper wants to merge 2 commits intomainfrom
Conversation
Replace boolean-only search with MiniSearch (BM25+) for relevance-ranked results. Name matches are now heavily boosted via field weights and a boostDocument callback for prefix/exact matching. Adds 31 search ranking tests using Node's built-in test runner, integrated into `just test` and GitHub Actions CI. Closes #439 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Claude finished @kevinschaper's task in 2m 5s —— View job PR Review: Improve browser search with BM25 scoring and field boosting
This is a well-motivated and cleanly implemented improvement. The BM25+ scoring via MiniSearch significantly improves search relevance, and the Issues1. Version mismatch between CDN and npm (minor bug)
<script src="https://cdn.jsdelivr.net/npm/minisearch@7.1.1/dist/umd/index.min.js"></script>vs "node_modules/minisearch": { "version": "7.2.0" }Recommend aligning both to the same version. Fix this → 2. Missing Subresource Integrity (SRI) hash on CDN script (security hygiene) The CDN script tag has no <script src="https://cdn.jsdelivr.net/npm/minisearch@7.1.1/dist/umd/index.min.js"></script>Consider adding 3. The tokenizer at tokenize: (text) => text.toLowerCase().split(/[\s\-_/,;:()]+/).filter(t => t.length > 1),Querying 4.
This is fine for CI (Node.js is installed before Observations
The function extracts JSON from const jsonStr = dataCode
.replace(/^window\.searchData\s*=\s*/, '')
.replace(/;\s*window\.dispatchEvent\(.*\)\s*;?\s*$/, '')
.trim();This depends on the exact format of
Strengths
Overall this is a solid improvement to the browser UX and is ready to merge after resolving the version mismatch. |
|
Thanks @kevinschaper ! |
|
@dragon-ai-agent please make an issue in linkml/linkml-browser, to see if we want to propagate this back up. Also analyze implications for both speed and size of file footprints distributed via github pages |
Follow-up: upstream linkml-browser issue + footprint/speed analysisResponding to @cmungall's request to file an upstream issue and analyse implications. Why I couldn't file the issue directlyThe Draft issue for
|
| Aspect | Detail |
|---|---|
| Index build | MiniSearch addAll() on ≤ 200 records: < 5 ms. Comparable to current manual tokenisation loop. |
| Search latency | BM25+ scoring over ≤ 200 records: < 1 ms per keystroke — imperceptible. |
| Large KBs (1k–10k records) | MiniSearch's inverted index scales well; CDN approach remains viable. Benchmark before switching to bundled if needed. |
| Initial load | MiniSearch is fetched in parallel with other assets. With HTTP/2 this adds negligible latency to pages that already load data.js. |
File-footprint implications for GitHub Pages deployments
| Asset | Before | After |
|---|---|---|
index.html (template) |
Varies per KB (~31 KB in dismech) | +~2 KB of JS wiring logic |
| CDN script tag | — | Adds minisearch@7.1.1/dist/umd/index.min.js: ~23 KB raw, ~6 KB gzipped from jsDelivr (not served from GitHub Pages) |
data.js / data.json |
Dominant cost (e.g., ~987 KB for dismech's 55 entries) | Unchanged — MiniSearch builds its index in-browser at runtime |
schema.js |
~3 KB | + ~300 bytes for fieldBoosts object |
Net effect on GitHub Pages bandwidth: the only change is ~6 KB (gzipped) transferred from jsDelivr on first load, cached by the browser thereafter. The distributed repository payload is unchanged. The CDN resource is covered by jsDelivr's SLA and cache headers — not GitHub's bandwidth.
Trade-off to consider: CDN dependency vs. self-hosted bundle. Options:
- CDN (current dismech approach): Zero increase in repo/GitHub-Pages payload; requires network access; CDN version pinned by semver tag.
- Vendored in
src/linkml_browser/static/: Increases template repo by ~23 KB; full offline support; no external dependency. Recommended if linkml-browser is used in air-gapped environments. - npm bundled into
index.html: Inlines MiniSearch at build time; largest single-file payload increase (~23 KB) but maximum portability.
Acceptance criteria
- Searching a concept's exact name returns it as the top result.
- A
Relevancesort option is auto-selected when a query is active. -
fieldBoostsare configurable per-schema in the generatedschema.json. - Existing facet filters continue to work alongside MiniSearch results.
- At least one search-ranking test is added (can use the dismech test suite as a template).
- CDN vs. bundle strategy decision is documented.
Summary of implications (dismech-specific)
For this PR specifically (dismech's GitHub Pages deployment at current scale of ~55 disorders):
- Speed: The MiniSearch BM25+ index builds in < 2 ms and queries in < 1 ms. This is a strict improvement over the O(records × tokens) substring scan.
- Payload: The only new bytes transferred are the ~6 KB (gzipped) MiniSearch CDN resource, cached after first visit. The
data.js(~987 KB) andindex.html(~31 KB) are essentially unchanged in size. - Scalability: The approach handles the KB growing to thousands of entries without further changes.
- Reliability: The CDN dependency (jsDelivr with version pin) is low-risk; search gracefully degrades to no results if the CDN is unreachable (same UX as page load failure). Vendoring is trivial if offline support is needed.
Use MiniSearch's built-in match data to show which fields matched and highlight matching terms in descriptions and tags. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Claude finished @kevinschaper's task in 2m 28s —— View job PR Review: Improve browser search with BM25 scoring and field boosting
This PR successfully delivers relevance-ranked search to the disorder browser. The new commit (search result highlighting) is a solid UX improvement that makes it immediately clear why a result was returned. Below is an assessment across the requested dimensions. Biological and clinical validity of search designThe field boost weights in
This is appropriate. A query for "BRCA" ranking a BRCA-named disorder above disorders that merely mention BRCA in their gene list correctly matches clinical intuition. The Issues1. CDN version mismatch — still unresolved from the previous review
<script src="https://cdn.jsdelivr.net/npm/minisearch@7.1.1/dist/umd/index.min.js"></script>The 31 search ranking tests (including the new highlighting logic) run against 7.2.0; the browser serves 7.1.1. This should be aligned before merging. Fix this → 2. Minor display artifact in
In practice, biomedical searches are unlikely to produce these tokens, but it is worth noting. A more robust approach would apply the regex to the original unescaped text, then escape the segments individually. 3. No test coverage for the new highlighting feature The Observations
Tokenizer
CI path filter is correct — Strengths of the new highlighting commit
SummaryThe core BM25+ search improvement and the new highlighting feature are functionally solid and clinically well-motivated. The one unresolved issue from the prior review — the CDN/npm version mismatch (7.1.1 vs 7.2.0) — should be fixed before merging. Everything else is minor. Once that version pin is aligned, this is ready to merge. |
Improve browser search with BM25 scoring and field boosting
Problem
Searching "Parkinson" in the disorder browser returned Parkinson's Disease as the ~5th result. The existing search was boolean-only (match/no match) with alphabetical sorting — no relevance scoring, no field weighting, and no prefix match advantage.
Solution
Replace the custom search implementation with MiniSearch (~6KB gzipped), a client-side full-text search library that provides BM25+ scoring out of the box.
Search improvements:
namematches are weighted 10x,genes/subtypes4-5x,description3x, other fields 1-2x (configurable inschema.js)boostDocumentcallback — exact name matches get 50x, starts-with gets 10x, contains gets 3xResult: "Parkinson" → Parkinson's Disease is now always the #1 result.
Search ranking tests
Added 31 tests (9 suites) using Node's built-in
node:testrunner that exercise the exact same MiniSearch configuration against the realapp/data.jsdataset:CI integration
just testnow includestest-searchalongside the existing Python test suitenpm ciwhenapp/,tests/js/, orpackage.jsonchangesFiles changed
app/index.htmlapp/schema.jsfieldBoostsconfigurationtests/js/search_ranking.test.mjspackage.json/package-lock.jsonjustfiletest-searchtotestrecipeproject.justfiletest-searchrecipe.github/workflows/main.yamlapppath filter, expanded test trigger.gitignorenode_modules/