AI Search: add suggested topics and keyword suggestions#6180
AI Search: add suggested topics and keyword suggestions#6180
Conversation
Add "Browse topic" and "Try also" keyword suggestions to the search page. Topics are fetched via semantic search (Vectorize) from /api/ai-search/topics. Keywords are generated via vocabulary-based LLM rewriting from /api/ai-search/rewrite. - SearchSuggestedTopics: shows related topic pills below the search bar - SearchSuggestedKeywords: shows keyword suggestions for query refinement - Topics and rewrite API endpoints (Cloudflare Workers) - Vectorize indexing scripts for topics and vocabulary - Baker script for generating topics.json from database - AI and Vectorize bindings in wrangler config Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Quick links (staging server):
Login:
Archive:SVG tester:Number of differences (graphers): 0 ✅ Edited: 2026-03-04 09:15:22 UTC |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 376bf5ec72
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Parse query parameter | ||
| const query = url.searchParams.get(COMMON_SEARCH_PARAMS.QUERY) || "" | ||
|
|
There was a problem hiding this comment.
Reject empty queries before topic embedding search
This endpoint accepts missing/empty q and still proceeds to generate embeddings and query Vectorize, so requests like /api/ai-search/topics trigger paid AI work for effectively non-specific input. Because this route is public, that can create avoidable cost/load and noisy results; it should return a 400 for empty queries (as the rewrite endpoint already does).
Useful? React with 👍 / 👎.
| const limit = Math.min( | ||
| Math.max( | ||
| 1, | ||
| parseInt( | ||
| url.searchParams.get("limit") || DEFAULT_LIMIT.toString() | ||
| ) |
There was a problem hiding this comment.
Validate numeric limit before passing it to Vectorize
If limit is non-numeric (for example ?limit=abc), parseInt(...) yields NaN and the current Math.max/Math.min clamp preserves NaN; this then flows into topK for VECTORIZE_TOPICS.query(...), which can fail at runtime and return a 500 instead of a client error. Guarding with Number.isFinite (or defaulting to DEFAULT_LIMIT) would prevent request-triggered failures.
Useful? React with 👍 / 👎.
| const limit = Math.min( | ||
| Math.max( | ||
| 1, | ||
| parseInt( | ||
| url.searchParams.get("limit") || DEFAULT_LIMIT.toString() | ||
| ) |
There was a problem hiding this comment.
Sanitize rewrite limit to avoid NaN-driven keyword output
The same parseInt + clamp pattern allows limit=abc to become NaN, so this handler can send malformed constraints (e.g. 1-NaN) into LLM prompts and then slice with NaN, producing empty or unstable keyword responses while still returning 200. Rejecting invalid numeric limits (or falling back to the default) keeps behavior deterministic.
Useful? React with 👍 / 👎.
- Rename rewrite/ endpoint to keywords/ with clearer param names (source=semantic|llm, llm_filter=true|false) - Align topics and keywords endpoints to use same source param values - Add LLM filtering to topics endpoint (default on) to improve relevance - Replace static topics.json with cached Datasette fetch on cold start - Fetch vocabulary data from URL instead of local JSON file - Lowercase keyword suggestions in frontend Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Topics are now fetched from Datasette at runtime (cached in worker), so the R2 upload script and its utility are no longer needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
No longer needed since topics.json was replaced with runtime Datasette fetch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| JSON.stringify({ | ||
| error: "Failed to index topics", | ||
| details: error instanceof Error ? error.message : String(error), | ||
| }), |
Check warning
Code scanning / CodeQL
Information exposure through a stack trace Medium
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 11 days ago
In general, to fix this issue we should avoid including raw exception data (messages, stack traces, stringified errors) in responses sent to clients. Instead, log the detailed error information on the server and return a generic, user-safe error message and possibly a non-sensitive error code or correlation ID.
For this specific code, the best minimal fix without changing functionality is:
- Keep
console.error("Error indexing topics:", error)so developers still see full error details in logs. - Change the HTTP response body in the
catchblock to no longer includeerror.messageorString(error). - Replace the
detailsfield with a generic message (or remove it) that does not depend on the caught error, ensuring no stack-trace–related information flows to the client.
Concretely, in functions/scripts/indexTopics.ts, lines 226–229 should be updated so that details is either removed or set to a static string. No new imports or helpers are required.
| @@ -225,7 +225,7 @@ | ||
| return new Response( | ||
| JSON.stringify({ | ||
| error: "Failed to index topics", | ||
| details: error instanceof Error ? error.message : String(error), | ||
| details: "An internal error occurred while processing the request.", | ||
| }), | ||
| { | ||
| status: 500, |
| JSON.stringify({ | ||
| error: "Failed to index vocabulary", | ||
| details: error instanceof Error ? error.message : String(error), | ||
| }), |
Check warning
Code scanning / CodeQL
Information exposure through a stack trace Medium
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 10 days ago
To fix the problem, the response body for error cases should avoid including information derived directly from the caught error. Instead, send a generic, user-safe error message, and keep the detailed information only in server-side logs (which the code is already writing via console.error). The core behavior of the endpoint (indexing vocabulary and returning success information) should remain unchanged; only the shape of the error response needs adjustment.
Concretely, in functions/scripts/indexVocabulary.ts, within handleRequest’s catch block (lines 247–259), remove the details field that exposes error.message / String(error) to the client. Optionally, replace it with a constant generic hint such as "Try again later or contact support." that does not depend on the caught error. No new imports or helper methods are required; console.error already ensures the detailed error is available in logs.
| @@ -250,7 +250,6 @@ | ||
| return new Response( | ||
| JSON.stringify({ | ||
| error: "Failed to index vocabulary", | ||
| details: error instanceof Error ? error.message : String(error), | ||
| }), | ||
| { | ||
| status: 500, |
When search returns no results, nudge users toward the AI-suggested topic pills shown above the results area. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove topic search endpoints, suggested topics/keywords UI components, indexing scripts, and vocabulary data that were moved to PR #6180. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
improve Resolve focus placeholders when looking up reference queries The resolved/unresolved query-string handling was incorrect when a focus parameter was present: reference lookups used the unresolved query string instead of the resolved one. This caused mismatches when comparing graphs that used focus placeholders. To fix this, added a helper to resolve focus placeholders (resolveQueryStrForLookup) that creates a minimal grapher state to compute the resolved query string, and updated verification to use the resolved query string. Also corrected diff line counting to separately compute added/deleted lines and use the maximum as the changed-lines count. Use unresolved query string for SVG records When resolving focus placeholders required loading grapher data, generating a minimal grapher was complex and unnecessary for lookup purposes. Instead, keep and use the original/unresolved query string for SVG record keys and export lookups. This simplifies rendering and lookup logic by removing resolveQueryStrForLookup and related resolution steps, and ensures chart keys are created from the provided query string directly. Include query string when opening SVG in report When opening an SVG from the HTML report, the query string on the record was not being included in the generated links or displayed title, causing opened charts to miss context provided by URL parameters. This change reads queryStr from the svgRecord and appends it to viewId for full links, uses the combined value for the copy-to-clipboard button, and shows the query string in the section title so the correct URL and label are preserved when viewing or copying. Include resolvedQueryStr in svg records and UI Add resolvedQueryStr to the SvgRecord type and propagate it through utils, parsing, and view generation so the compare view and HTML report can use the resolved query string (with actual entity names) where available. This change makes the displayed and copyable URLs use resolvedQueryStr if present, falling back to the original queryStr, improving clarity when placeholders (e.g. <firstSeries>) have been replaced.
Fixing: Set is not a function or its return value is not iterable
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
/api/ai-search/topicsand/api/ai-search/keywordsCloudflare Workers endpointssource=semantic|llmandllm_filter=true|falsefor tuningTest plan
/api/ai-search/topics?q=climatereturns relevant topics/api/ai-search/keywords?q=how rich are countriesreturns keyword suggestions🤖 Generated with Claude Code