Skip to content

AI Search: add suggested topics and keyword suggestions#6180

Open
Marigold wants to merge 18 commits intomasterfrom
ai-search-suggested-topics-keywords
Open

AI Search: add suggested topics and keyword suggestions#6180
Marigold wants to merge 18 commits intomasterfrom
ai-search-suggested-topics-keywords

Conversation

@Marigold
Copy link
Contributor

@Marigold Marigold commented Mar 4, 2026

Summary

  • Add "Browse topic" pills that suggest related topics below the search bar using semantic search via Cloudflare Vectorize, with LLM filtering to improve relevance
  • Add "Try also" keyword suggestions via vocabulary-based semantic search with optional LLM filtering
  • Add /api/ai-search/topics and /api/ai-search/keywords Cloudflare Workers endpoints
  • Both endpoints support source=semantic|llm and llm_filter=true|false for tuning
  • Topics list fetched dynamically from Datasette and cached in worker memory (no static JSON)
  • Vectorize indexing scripts for topics and vocabulary data
  • Update empty state to nudge users toward suggested topics when no results found
  • Configure Workers AI and Vectorize bindings in wrangler

Test plan

  • Deploy to staging and verify topic suggestions appear when typing a search query
  • Verify keyword suggestions appear below topic suggestions
  • Test /api/ai-search/topics?q=climate returns relevant topics
  • Test /api/ai-search/keywords?q=how rich are countries returns keyword suggestions
  • Verify LLM filtering removes irrelevant topics (e.g. "Farm Size" for "how tall are people")
  • Verify clicking a topic pill navigates to the topic page
  • Verify clicking a keyword suggestion updates the search query
  • Test empty state shows "Try exploring a suggested topic above" message

🤖 Generated with Claude Code

Add "Browse topic" and "Try also" keyword suggestions to the search page.
Topics are fetched via semantic search (Vectorize) from /api/ai-search/topics.
Keywords are generated via vocabulary-based LLM rewriting from /api/ai-search/rewrite.

- SearchSuggestedTopics: shows related topic pills below the search bar
- SearchSuggestedKeywords: shows keyword suggestions for query refinement
- Topics and rewrite API endpoints (Cloudflare Workers)
- Vectorize indexing scripts for topics and vocabulary
- Baker script for generating topics.json from database
- AI and Vectorize bindings in wrangler config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@owidbot
Copy link
Contributor

owidbot commented Mar 4, 2026

Quick links (staging server):

Site Dev Site Preview Admin Wizard Docs

Login: ssh owid@staging-site-ai-search-suggested-topics-k

Archive:
SVG tester:

Number of differences (graphers): 0 ✅
Number of differences (grapher views): skipped
Number of differences (mdims): skipped
Number of differences (explorers): skipped

Edited: 2026-03-04 09:15:22 UTC
Execution time: 1.31 seconds

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 376bf5ec72

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +208 to +210
// Parse query parameter
const query = url.searchParams.get(COMMON_SEARCH_PARAMS.QUERY) || ""

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject empty queries before topic embedding search

This endpoint accepts missing/empty q and still proceeds to generate embeddings and query Vectorize, so requests like /api/ai-search/topics trigger paid AI work for effectively non-specific input. Because this route is public, that can create avoidable cost/load and noisy results; it should return a 400 for empty queries (as the rewrite endpoint already does).

Useful? React with 👍 / 👎.

Comment on lines +232 to +237
const limit = Math.min(
Math.max(
1,
parseInt(
url.searchParams.get("limit") || DEFAULT_LIMIT.toString()
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate numeric limit before passing it to Vectorize

If limit is non-numeric (for example ?limit=abc), parseInt(...) yields NaN and the current Math.max/Math.min clamp preserves NaN; this then flows into topK for VECTORIZE_TOPICS.query(...), which can fail at runtime and return a 500 instead of a client error. Guarding with Number.isFinite (or defaulting to DEFAULT_LIMIT) would prevent request-triggered failures.

Useful? React with 👍 / 👎.

Comment on lines +245 to +250
const limit = Math.min(
Math.max(
1,
parseInt(
url.searchParams.get("limit") || DEFAULT_LIMIT.toString()
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Sanitize rewrite limit to avoid NaN-driven keyword output

The same parseInt + clamp pattern allows limit=abc to become NaN, so this handler can send malformed constraints (e.g. 1-NaN) into LLM prompts and then slice with NaN, producing empty or unstable keyword responses while still returning 200. Rejecting invalid numeric limits (or falling back to the default) keeps behavior deterministic.

Useful? React with 👍 / 👎.

Marigold and others added 4 commits March 4, 2026 11:16
- Rename rewrite/ endpoint to keywords/ with clearer param names
  (source=semantic|llm, llm_filter=true|false)
- Align topics and keywords endpoints to use same source param values
- Add LLM filtering to topics endpoint (default on) to improve relevance
- Replace static topics.json with cached Datasette fetch on cold start
- Fetch vocabulary data from URL instead of local JSON file
- Lowercase keyword suggestions in frontend

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Topics are now fetched from Datasette at runtime (cached in worker),
so the R2 upload script and its utility are no longer needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
No longer needed since topics.json was replaced with runtime Datasette fetch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment on lines +226 to +229
JSON.stringify({
error: "Failed to index topics",
details: error instanceof Error ? error.message : String(error),
}),

Check warning

Code scanning / CodeQL

Information exposure through a stack trace Medium

This information exposed to the user depends on
stack trace information
.

Copilot Autofix

AI 11 days ago

In general, to fix this issue we should avoid including raw exception data (messages, stack traces, stringified errors) in responses sent to clients. Instead, log the detailed error information on the server and return a generic, user-safe error message and possibly a non-sensitive error code or correlation ID.

For this specific code, the best minimal fix without changing functionality is:

  • Keep console.error("Error indexing topics:", error) so developers still see full error details in logs.
  • Change the HTTP response body in the catch block to no longer include error.message or String(error).
  • Replace the details field with a generic message (or remove it) that does not depend on the caught error, ensuring no stack-trace–related information flows to the client.

Concretely, in functions/scripts/indexTopics.ts, lines 226–229 should be updated so that details is either removed or set to a static string. No new imports or helpers are required.

Suggested changeset 1
functions/scripts/indexTopics.ts

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/functions/scripts/indexTopics.ts b/functions/scripts/indexTopics.ts
--- a/functions/scripts/indexTopics.ts
+++ b/functions/scripts/indexTopics.ts
@@ -225,7 +225,7 @@
         return new Response(
             JSON.stringify({
                 error: "Failed to index topics",
-                details: error instanceof Error ? error.message : String(error),
+                details: "An internal error occurred while processing the request.",
             }),
             {
                 status: 500,
EOF
@@ -225,7 +225,7 @@
return new Response(
JSON.stringify({
error: "Failed to index topics",
details: error instanceof Error ? error.message : String(error),
details: "An internal error occurred while processing the request.",
}),
{
status: 500,
Copilot is powered by AI and may make mistakes. Always verify output.
Comment on lines +258 to +261
JSON.stringify({
error: "Failed to index vocabulary",
details: error instanceof Error ? error.message : String(error),
}),

Check warning

Code scanning / CodeQL

Information exposure through a stack trace Medium

This information exposed to the user depends on
stack trace information
.

Copilot Autofix

AI 10 days ago

To fix the problem, the response body for error cases should avoid including information derived directly from the caught error. Instead, send a generic, user-safe error message, and keep the detailed information only in server-side logs (which the code is already writing via console.error). The core behavior of the endpoint (indexing vocabulary and returning success information) should remain unchanged; only the shape of the error response needs adjustment.

Concretely, in functions/scripts/indexVocabulary.ts, within handleRequest’s catch block (lines 247–259), remove the details field that exposes error.message / String(error) to the client. Optionally, replace it with a constant generic hint such as "Try again later or contact support." that does not depend on the caught error. No new imports or helper methods are required; console.error already ensures the detailed error is available in logs.

Suggested changeset 1
functions/scripts/indexVocabulary.ts

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/functions/scripts/indexVocabulary.ts b/functions/scripts/indexVocabulary.ts
--- a/functions/scripts/indexVocabulary.ts
+++ b/functions/scripts/indexVocabulary.ts
@@ -250,7 +250,6 @@
         return new Response(
             JSON.stringify({
                 error: "Failed to index vocabulary",
-                details: error instanceof Error ? error.message : String(error),
             }),
             {
                 status: 500,
EOF
@@ -250,7 +250,6 @@
return new Response(
JSON.stringify({
error: "Failed to index vocabulary",
details: error instanceof Error ? error.message : String(error),
}),
{
status: 500,
Copilot is powered by AI and may make mistakes. Always verify output.
Marigold and others added 2 commits March 4, 2026 11:26
When search returns no results, nudge users toward the AI-suggested
topic pills shown above the results area.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Marigold added a commit that referenced this pull request Mar 4, 2026
Remove topic search endpoints, suggested topics/keywords UI components,
indexing scripts, and vocabulary data that were moved to PR #6180.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sophiamersmann and others added 10 commits March 4, 2026 15:29
improve
Resolve focus placeholders when looking up reference queries

The resolved/unresolved query-string handling was incorrect when a focus parameter was present: reference lookups used the unresolved query string instead of the resolved one. This caused mismatches when comparing graphs that used focus placeholders. To fix this, added a helper to resolve focus placeholders (resolveQueryStrForLookup) that creates a minimal grapher state to compute the resolved query string, and updated verification to use the resolved query string. Also corrected diff line counting to separately compute added/deleted lines and use the maximum as the changed-lines count.
Use unresolved query string for SVG records

When resolving focus placeholders required loading grapher data, generating a minimal grapher was complex and unnecessary for lookup purposes. Instead, keep and use the original/unresolved query string for SVG record keys and export lookups. This simplifies rendering and lookup logic by removing resolveQueryStrForLookup and related resolution steps, and ensures chart keys are created from the provided query string directly.
Include query string when opening SVG in report

When opening an SVG from the HTML report, the query string on the record was not being included in the generated links or displayed title, causing opened charts to miss context provided by URL parameters. This change reads queryStr from the svgRecord and appends it to viewId for full links, uses the combined value for the copy-to-clipboard button, and shows the query string in the section title so the correct URL and label are preserved when viewing or copying.
Include resolvedQueryStr in svg records and UI

Add resolvedQueryStr to the SvgRecord type and propagate it through utils, parsing, and view generation so the compare view and HTML report can use the resolved query string (with actual entity names) where available. This change makes the displayed and copyable URLs use resolvedQueryStr if present, falling back to the original queryStr, improving clarity when placeholders (e.g. <firstSeries>) have been replaced.
Fixing: Set is not a function or its return value is not iterable
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants