✨🌐 Added language-specific handling to search for en, fr, de #23122

Open
wants to merge 8 commits into
base: main

cathysarisky
Member

@cathysarisky cathysarisky commented May 1, 2025

no issue

  • With the update to Flexsearch 0.8, we have new presets!
  • There are stemming presets for en, fr, and de that help match different forms of the same word.
  • In English, 'running' is indexed and searched as 'run', so searching for either 'run' or 'running' finds both.
  • In French, 'rapidement' is likewise converted to 'rapide'.
  • Other languages could be supported if someone contributes a new preset to flexsearch.
  • Note: The CJK preset is not being used because it is deeply destructive to non-CJK text.
  • Some sites might have mixed content -- the impact of the 'wrong' stemming (except for CJK) looks fairly small.
  • details here: https://github.com/nextapps-de/flexsearch/tree/master/src/lang
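
To make the stemming idea above concrete, here is a minimal sketch. It uses a toy suffix stripper written for this illustration, not the Flexsearch English preset, and the names `toyStem`, `buildIndex`, and `search` are hypothetical. It only shows why indexing and querying through the same stemmer lets 'run' and 'running' find the same documents.

```javascript
// Toy suffix stripper: an illustration only, NOT the Flexsearch English preset.
function toyStem(word) {
    const term = word.toLowerCase();
    for (const suffix of ['ing', 'ed', 's']) {
        // Only strip when at least three characters would remain.
        if (term.endsWith(suffix) && term.length - suffix.length >= 3) {
            let stem = term.slice(0, -suffix.length);
            const last = stem[stem.length - 1];
            // Collapse a doubled final consonant left by stripping ("runn" -> "run").
            if (last === stem[stem.length - 2] && !'aeiou'.includes(last)) {
                stem = stem.slice(0, -1);
            }
            return stem;
        }
    }
    return term;
}

// Index each document under the stemmed form of its words.
function buildIndex(docs) {
    const index = new Map();
    for (const {id, text} of docs) {
        for (const word of text.split(/\s+/).filter(Boolean)) {
            const key = toyStem(word);
            if (!index.has(key)) {
                index.set(key, new Set());
            }
            index.get(key).add(id);
        }
    }
    return index;
}

// Stem the query the same way before looking it up.
function search(index, query) {
    return [...(index.get(toyStem(query)) || [])].sort();
}

const index = buildIndex([
    {id: 'a', text: 'I run daily'},
    {id: 'b', text: 'Running is fun'}
]);
```

With this toy index, `search(index, 'run')` and `search(index, 'running')` both return the ids of both documents, because query and content pass through the same stemmer.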

Contributor

coderabbitai bot commented May 1, 2025

Walkthrough

This change updates the search indexing functionality to support locale-aware text processing. The SearchIndex class constructor now accepts a new locale parameter in addition to the existing adminUrl, apiKey, and dir parameters. The encoder for Flexsearch is no longer statically defined; instead, a new chooseEncoder function dynamically selects an encoder preset based on the provided locale, supporting English, French, and German, with a fallback for other locales using a CJK codepoint preset. The App component in sodo-search is updated to pass the current language as the locale property when creating a SearchIndex instance. Additionally, new tests were added to verify language-specific stemming behavior for English, German, and unsupported locales. No changes were made to the signatures of exported or public entities other than the SearchIndex constructor.
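
The shape described in the walkthrough can be sketched roughly as follows. This is not the code from the PR: the preset objects are hypothetical placeholders for the presets imported from Flexsearch, and `choosePreset` and `SearchIndexSketch` are names invented for this illustration. It shows only the locale-based selection with a fallback and the extra constructor option.

```javascript
// Hypothetical stand-ins for the Flexsearch language presets; the real ones
// are imported from the flexsearch package and have a richer shape.
const presets = {
    en: {name: 'english-preset'},
    fr: {name: 'french-preset'},
    de: {name: 'german-preset'}
};
const fallbackPreset = {name: 'fallback-preset'};

// Pick a preset for a locale, falling back when the locale is unsupported.
function choosePreset(locale) {
    switch (locale) {
    case 'en':
        return presets.en;
    case 'fr':
        return presets.fr;
    case 'de':
        return presets.de;
    default:
        return fallbackPreset;
    }
}

// Minimal constructor shape: locale arrives alongside the existing options.
class SearchIndexSketch {
    constructor({adminUrl, apiKey, dir, locale}) {
        this.adminUrl = adminUrl;
        this.apiKey = apiKey;
        this.dir = dir;
        this.preset = choosePreset(locale);
    }
}
```

Any locale outside the three supported codes, including a missing one, ends up on the `default` branch.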

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 64948a6 and f5349aa.

📒 Files selected for processing (3)
  • apps/sodo-search/src/search-index.js (6 hunks)
  • apps/sodo-search/src/search-index.test.js (1 hunks)
  • apps/sodo-search/vite.config.js (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • apps/sodo-search/vite.config.js
🚧 Files skipped from review as they are similar to previous changes (2)
  • apps/sodo-search/src/search-index.js
  • apps/sodo-search/src/search-index.test.js
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: Unit tests (Node 22.13.1)
  • GitHub Check: Unit tests (Node 20.11.1)
@cathysarisky cathysarisky changed the title flexsearch stemmer setup for ci -- again ✨ Added language-specific handling to search (en, fr, de) May 1, 2025
@cathysarisky cathysarisky marked this pull request as ready for review May 1, 2025 03:29
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
apps/sodo-search/src/search-index.js (1)

57-83: Consider adding documentation for supported languages

Since the PR mentions that Flexsearch currently only supports these three languages for stemming but additional ones could be added, it might be helpful to add a comment documenting the currently supported languages and how to add more in the future.

+// Currently supports locale-specific stemming for English (en), French (fr), and German (de).
+// To add support for additional languages:
+// 1. Import the language preset from flexsearch/lang/<language-code>
+// 2. Add a case for the language code in the switch statement below
 const chooseEncoder = (locale) => {
     switch (locale) {
     case 'en':
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d571068 and 597a4a6.

📒 Files selected for processing (2)
  • apps/sodo-search/src/App.js (1 hunks)
  • apps/sodo-search/src/search-index.js (5 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
apps/sodo-search/src/search-index.js (1)
ghost/core/core/frontend/src/ghost-stats/ghost-stats.js (1)
  • locale (145-145)
🔇 Additional comments (5)
apps/sodo-search/src/App.js (1)

19-20: LGTM: Locale parameter addition for SearchIndex

The locale parameter is now correctly passed to the SearchIndex constructor, enabling language-specific search functionality. This change aligns perfectly with the PR objectives of adding language-specific stemming support.

apps/sodo-search/src/search-index.js (4)

3-5: LGTM: Language preset imports

Appropriate language presets are imported from Flexsearch to support the locale-specific stemming for English, French, and German.


57-83: LGTM: Well-implemented locale-specific encoder selection

The chooseEncoder function provides a clean implementation for selecting language-specific encoders based on the locale parameter. The function correctly handles the three supported languages (English, French, German) with a sensible fallback for other locales.

This implementation enables the behavior described in the PR objectives where, for example, searching for "informational" in English would now return results containing "information" due to language-specific stemming.


85-85: LGTM: Updated constructor signature

The SearchIndex constructor now accepts a locale parameter, which is destructured from the options object. This is consistent with how the other parameters are handled.


101-101: LGTM: Encoder configuration for all indexes

The change correctly applies the locale-specific encoder to all three indexes (posts, authors, tags), ensuring consistent search behavior across all content types.

Also applies to: 111-111, 121-121
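
As a rough illustration of the point above, the sketch below builds the three indexes from one encoder value so they cannot drift apart. `createIndexStub` is a hypothetical factory standing in for the real index constructor, and `buildIndexes` is a name chosen for this example, not something taken from the PR.

```javascript
// Hypothetical factory standing in for the real index constructor.
function createIndexStub(options) {
    return {name: options.name, encoder: options.encoder};
}

// Build the posts, authors, and tags indexes with one shared encoder.
function buildIndexes(encoder, createIndex = createIndexStub) {
    const indexes = {};
    for (const name of ['posts', 'authors', 'tags']) {
        indexes[name] = createIndex({name, encoder});
    }
    return indexes;
}

// Placeholder object for a real encoder instance.
const sharedEncoder = {locale: 'de'};
const indexes = buildIndexes(sharedEncoder);
```

Choosing the encoder once and passing it in keeps the three content types consistent if the selection logic changes later.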

 export default class SearchIndex {
-    constructor({adminUrl, apiKey, dir}) {
+    constructor({adminUrl, apiKey, dir, locale}) {
💡 Verification agent

🧩 Analysis chain

Verify the locale value format

Ensure that the locale value passed from App.js is compatible with the switch statement in chooseEncoder. Currently, it expects two-letter language codes ('en', 'fr', 'de'), but browser-provided locale strings might include region codes (e.g., 'en-US', 'fr-FR').


🏁 Script executed:

#!/bin/bash
# Check for any locale handling code that might normalize locale values
rg -A 2 -B 2 "locale.*=|=.*locale" --glob "*.js" --glob "*.ts"

Length of output: 45616


I’d like to inspect the chooseEncoder implementation and constructor in apps/sodo-search/src/search-index.js to confirm there’s no locale normalization before the switch:


🏁 Script executed:

#!/bin/bash
# Show the chooseEncoder function and constructor in search-index.js
rg -n "chooseEncoder" -A10 -B2 apps/sodo-search/src/search-index.js

Length of output: 1427


Normalize locale to two-letter codes before choosing an encoder

The locale passed into chooseEncoder comes directly from the script tag (e.g. "en-US") but the switch only matches exact two-letter codes ('en', 'fr', etc.). You’ll need to strip off any region subtags (and lowercase) so you always pass a supported code:

• File: apps/sodo-search/src/search-index.js
  – Constructor (lines ~98–102) currently does:
      this.postsIndex = new Flexsearch.Document({ …, encoder: chooseEncoder(locale) });
  – chooseEncoder (lines 57–67) only handles exact matches like 'en' and 'fr'.

Suggested change:

   constructor({adminUrl, apiKey, dir, locale}) {
-    const encoder = chooseEncoder(locale);
+    const lang = (locale || '').split(/[-_]/)[0].toLowerCase();
+    const encoder = chooseEncoder(lang);

    this.postsIndex = new Flexsearch.Document({
      …,
      encoder
    });
    // repeat for authorsIndex, tagsIndex

This ensures "en-US", "fr-FR", etc. fall back to your two-letter presets.
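
The normalization suggested above can be pulled into a small helper. This is a minimal sketch, assuming the locale arrives as a string such as "en-US" or "fr_FR" and may be missing. `normalizeLocale` is a hypothetical name, and returning `null` for a missing value is a choice made for this example so that the caller's default branch handles it.

```javascript
// Reduce a locale tag such as "en-US" or "fr_FR" to its lowercase language code.
// Returns null when no usable locale is given, so the caller can fall back.
function normalizeLocale(locale) {
    if (typeof locale !== 'string' || locale.trim() === '') {
        return null;
    }
    return locale.trim().split(/[-_]/)[0].toLowerCase();
}
```

Guarding against a non-string value matters here because calling `split` on `undefined` would throw when the option is omitted.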

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
apps/sodo-search/src/search-index.test.js (5)

472-474: Remove extra blank lines

There are excessive blank lines in these sections that violate your ESLint rules.

 await searchIndex.init();

-

-        
         let searchResults = searchIndex.search('running');
 expect(searchResults.posts[0].url).toEqual('http://localhost/ghost/dem-mann/');

-
-        
 });

Also applies to: 486-489

🧰 Tools
🪛 ESLint

[error] 472-474: More than 1 blank line not allowed.

(no-multiple-empty-lines)


539-541: Remove blank lines before closing brace

These extra blank lines violate your ESLint rules.

 expect(searchResults.posts.length).toEqual(2);

-
-
 });
🧰 Tools
🪛 ESLint

[error] 539-541: Block must not be padded by blank lines.

(padded-blocks)


591-593: Remove blank lines before closing brace

These extra blank lines violate your ESLint rules.

 expect(searchResults.posts.length).toEqual(1);

-
-
 });
🧰 Tools
🪛 ESLint

[error] 591-593: Block must not be padded by blank lines.

(padded-blocks)


432-593: Consider adding a French stemming test

The PR mentions support for French (fr) locale, but there's no test that specifically verifies French stemming behavior. Adding a test case for French would provide more complete coverage of the feature.

Consider adding a test similar to the English and German ones but with French-specific stemming examples, such as "informationnel"/"information" or other common French suffixes.
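
One way to frame such a test is as a table of query/indexed word pairs that should share a stem. The sketch below is self-contained and uses a toy French suffix stripper written for this illustration, not the Flexsearch French preset. `toyFrenchStem`, `frenchCases`, and `failingCases` are hypothetical names. A real test would run the pairs through SearchIndex with the fr locale instead, and the actual preset may stem these words differently.

```javascript
// Toy French suffix stripper: an illustration only, NOT the Flexsearch French preset.
function toyFrenchStem(word) {
    const term = word.toLowerCase();
    for (const suffix of ['nel', 'ment']) {
        // Only strip when at least four characters would remain.
        if (term.endsWith(suffix) && term.length - suffix.length >= 4) {
            return term.slice(0, -suffix.length);
        }
    }
    return term;
}

// Each case pairs a query with an indexed word that should share a stem.
const frenchCases = [
    {query: 'rapidement', indexed: 'rapide'},
    {query: 'informationnel', indexed: 'information'}
];

// Return the cases where the query and the indexed word stem differently.
function failingCases(stem, cases) {
    return cases.filter(({query, indexed}) => stem(query) !== stem(indexed));
}
```

Passing a different `stem` function to `failingCases` makes it easy to compare behaviour across locales with the same table.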

🧰 Tools
🪛 ESLint

[error] 472-474: More than 1 blank line not allowed.

(no-multiple-empty-lines)


[error] 486-489: Block must not be padded by blank lines.

(padded-blocks)


[error] 488-489: More than 1 blank line not allowed.

(no-multiple-empty-lines)


[error] 539-541: Block must not be padded by blank lines.

(padded-blocks)


[error] 591-593: Block must not be padded by blank lines.

(padded-blocks)


432-434: Improve test suite documentation

The comment is helpful but could be more detailed to explain the specific stemming differences being tested.

-    // These tests illustrate differences in stemming between languages en and de.
+    // These tests illustrate differences in language-specific stemming:
+    // 1. In English: suffixes like "-ing" are removed, so "running" and "run" match the same documents
+    // 2. In German: noun case forms like "des Mannes" and "dem Mann" are stemmed to the same root
+    // 3. With unsupported locales: no language-specific stemming occurs
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 597a4a6 and 64948a6.

📒 Files selected for processing (1)
  • apps/sodo-search/src/search-index.test.js (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Setup
🔇 Additional comments (3)
apps/sodo-search/src/search-index.test.js (3)

435-489: Well-structured test for English-specific stemming

This test effectively demonstrates how English stemming treats words differently than other languages, showing how "running" and "run" both return the same results due to English suffix removal, while German words maintain their distinct forms.


490-541: Effective test for German-specific stemming

This test nicely demonstrates how German stemming works differently from English, particularly with noun case forms like "des Mannes" and "dem Mann" which are correctly stemmed to the same root in German locale but not in English.


542-593: Good fallback test for unsupported locales

This test properly verifies the fallback behavior when an unsupported locale is specified, showing that custom stemming doesn't occur in this case.

@cathysarisky cathysarisky marked this pull request as draft May 1, 2025 17:22
@cathysarisky cathysarisky marked this pull request as ready for review May 1, 2025 22:39
@cathysarisky cathysarisky changed the title ✨ Added language-specific handling to search (en, fr, de) ✨🌐 Added language-specific handling to search for en, fr, de (#23122) May 1, 2025
@cathysarisky cathysarisky changed the title ✨🌐 Added language-specific handling to search for en, fr, de (#23122) ✨🌐 Added language-specific handling to search for en, fr, de May 1, 2025