III-5123 Word breaker ngram by grubolsch · Pull Request #409 · cultuurnet/udb3-search-service

grubolsch · 2026-02-13T15:02:13Z

Added

This PR implements a decompounder feature for the /offers search endpoint that allows users to search for parts of compound words and still find relevant results.
It uses a feature flag called "word_breaker" to toggle the behavior, so switching it on or off does not require a reindex every time.

HOWEVER: A reindex is needed the first time this is deployed.

Why N-Gram Instead of Hyphenation Decompounder?

Initially, the ticket said to use the hyphenation_decompounder token filter.
However, this filter does not function reliably in ES 5.3.3, or at least, I couldn't get it to work.
I will make another PR with the code I had for this filter. Feel free to look at it; maybe somebody else can fix it.

The strange thing is that ES recognises the config for the hyphenation_decompounder and throws an error when invalid options are presented. The filter just doesn't seem to do anything.

However, I cannot find the documentation for this decompounder in the docs for this version: https://www.elastic.co/guide/en/elasticsearch/reference/5.3/analysis-tokenizers.html

Both ChatGPT and Claude tell me this is a known Lucene 6.x limitation (which ES 5.3.3 uses) that was fixed in ES 7.10+.

Downside of using N-Gram

The downside of the N-Gram strategy is that it generates a lot more tokens, because it does not understand language. This will increase the RAM and disk space usage of ES.
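To get a feel for the token growth, here is a rough back-of-the-envelope sketch in Python. It assumes the tokenizer is applied to a single word with min_gram 3 and max_gram 20 (the values from this PR) and ignores any other analyzer settings or filters; the function name is made up for illustration.

```python
def ngram_token_count(word_length: int, min_gram: int = 3, max_gram: int = 20) -> int:
    """Number of n-grams emitted for one word of the given length."""
    longest = min(max_gram, word_length)
    # For each gram size k there are (word_length - k + 1) possible start positions.
    return sum(word_length - k + 1 for k in range(min_gram, longest + 1))


word = "parkbegraafplaats"
print(len(word), ngram_token_count(len(word)))  # prints "17 120"
```

So a single 17-character word that a word-based tokenizer would index as one token turns into 120 tokens under these assumptions.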

What's next?

I suggest we try this on test and monitor how much the indexes grow, and then decide whether we want to do this or just implement this feature after the ES upgrade.

Ticket: https://jira.uitdatabank.be/browse/III-5123

- Update OfferSearchServiceProvider to read word_breaker flag from config
- Pass flag through OfferSearchControllerFactory to query builder
- Enables dynamic switching between standard and N-Gram tokenized search fields

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

- Add analyzer_decompounder.json with ngram tokenizer (min_gram: 3, max_gram: 20)
- Create CreateDecompounderAnalyzer operation class to register analyzer template
- N-Gram tokenizer generates all overlapping substrings, enabling partial word matching
- Supports compound word searching (e.g., 'begraaf' finds 'parkbegraafplaats')

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
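The partial matching described above can be illustrated with a small Python simulation. This is only a sketch of what an n-gram tokenizer with min_gram 3 and max_gram 20 produces for one word; it is not the Elasticsearch implementation, and the `ngrams` helper is a hypothetical name.

```python
def ngrams(text: str, min_gram: int = 3, max_gram: int = 20) -> list:
    """Return every substring of text whose length is between min_gram and max_gram."""
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end > len(text):
                break  # longer grams from this start position do not fit either
            tokens.append(text[start:end])
    return tokens


indexed = set(ngrams("parkbegraafplaats"))
print("begraaf" in indexed)  # True: the compound part is one of the indexed tokens
print("plaats" in indexed)   # True
print("pk" in indexed)       # False: shorter than min_gram and not a substring
```

Because every substring of 3 to 20 characters is indexed, a search term such as 'begraaf' matches without the analyzer knowing anything about Dutch compounds.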

- Create CreateDecompounderAnalyzerCommand to initialize analyzer template
- Register 'decompounder-analyzer:create' command in CommandServiceProvider
- Command can be invoked via: php bin/app.php decompounder-analyzer:create

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

- Add .decompounder field variants to all text fields in event mapping
- Add .decompounder field variants to all text fields in place mapping
- Fields: name.{lang}, description.{lang}, address.*.{field}, location.name.{lang}, organizer.name.{lang}
- Decompounder variants use N-Gram tokenizer for compound word matching
- Standard fields remain unchanged for backward compatibility

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
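As a rough idea of what such a field variant could look like, the Python sketch below builds a mapping fragment using Elasticsearch multi-fields. It assumes the .decompounder variants are implemented as sub-fields of the existing text fields; the analyzer name and the language list are placeholders, not taken from the actual mapping files.

```python
import json

# Hypothetical analyzer name; the real one is defined in analyzer_decompounder.json.
DECOMPOUNDER_ANALYZER = "decompounder_analyzer"


def text_field_with_decompounder(analyzer: str = DECOMPOUNDER_ANALYZER) -> dict:
    """A text field that keeps its standard analysis and adds a .decompounder sub-field."""
    return {
        "type": "text",
        "fields": {
            "decompounder": {"type": "text", "analyzer": analyzer},
        },
    }


# Assumed language list, for illustration only.
name_mapping = {
    "properties": {lang: text_field_with_decompounder() for lang in ("nl", "fr", "de", "en")}
}
print(json.dumps(name_mapping, indent=2))
```

With a mapping of this shape, name.nl keeps its current behavior while name.nl.decompounder is analyzed with the N-Gram analyzer.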

- Update OfferPredefinedQueryStringFields to accept useWordBreaker parameter
- When enabled, returns .decompounder field variants instead of standard fields
- Update ElasticSearchOfferQueryBuilder constructor to accept and pass useWordBreaker
- Update OfferSearchControllerFactory to inject word_breaker flag from config
- Enables dynamic field selection based on feature flag

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
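The flag-based field selection can be sketched as follows. This is a Python illustration of the idea only, not the PHP code in OfferPredefinedQueryStringFields; the function name and the field subset are assumptions.

```python
# Illustrative subset of the text fields listed in this PR.
BASE_FIELDS = ["name", "description", "location.name", "organizer.name"]


def query_string_fields(langs, use_word_breaker=False):
    """Return the fields to search, switching to .decompounder variants when the flag is on."""
    suffix = ".decompounder" if use_word_breaker else ""
    return [f"{field}.{lang}{suffix}" for field in BASE_FIELDS for lang in langs]


print(query_string_fields(["nl"]))
# ['name.nl', 'description.nl', 'location.name.nl', 'organizer.name.nl']
print(query_string_fields(["nl"], use_word_breaker=True)[0])
# name.nl.decompounder
```

Since both field variants exist in the index after the initial reindex, flipping the flag only changes which fields the query targets.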

- Add volume mount for hyphenation patterns directory to Elasticsearch container
- Mounts ./docker/elasticsearch/config/analysis/hyphenation_patterns to container path
- Allows Elasticsearch to access pattern files and wordlists for decompounder analyzer
- Required for N-Gram analyzer template initialization

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

Koen Eelen and others added 6 commits February 13, 2026 15:47

grubolsch marked this pull request as ready for review February 13, 2026 15:15

grubolsch requested review from JonasVHG, bertramakers and lucwollants as code owners February 13, 2026 15:15

Small fixes

667a926

grubolsch wants to merge 7 commits into main from III-5123/word-breaker-ngram
