III-5123 Word breaker ngram#409

Open
grubolsch wants to merge 7 commits into main from III-5123/word-breaker-ngram

@grubolsch grubolsch commented Feb 13, 2026

Added

This PR implements a decompounder feature for the /offers search endpoint that allows users to search for parts of compound words and still find relevant results.
It uses a feature flag called "word_breaker" to toggle the behavior without requiring a reindex every time.

HOWEVER: This will need a reindex on deployment the first time.

Why N-Gram Instead of Hyphenation Decompounder?

Initially, the ticket said to use the hyphenation_decompounder token filter.
However, this filter does not function reliably in ES 5.3.3, or at least I couldn't get it to work.
I will make another PR with the code I had for this filter. Feel free to look at it; maybe somebody else can fix it.

The strange thing is that ES recognises the config for the hyphenation_decompounder and throws an error when invalid options are presented. Valid options are accepted, but they just don't seem to do anything.

However, I cannot find the documentation for this decompounder in the docs for this version: https://www.elastic.co/guide/en/elasticsearch/reference/5.3/analysis-tokenizers.html

Both ChatGPT and Claude tell me this is a known Lucene 6.x limitation (which ES 5.3.3 uses) that was fixed in ES 7.10+.

Downside of using N-Gram

The downside of the ngram strategy is that it generates a lot more tokens, because it does not understand language and simply emits every substring within the configured length range. This will increase the RAM and disk space usage of ES.
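
To give a feel for the token growth, here is a rough Python sketch (not code from this PR) that counts how many grams one word produces with the min_gram: 3 and max_gram: 20 settings mentioned in the commits. The helper name `ngram_count` is hypothetical, and the sketch assumes the whole word is fed to the tokenizer as a single unit.

```python
def ngram_count(length: int, min_gram: int = 3, max_gram: int = 20) -> int:
    """Count the n-grams produced for a single word of `length` characters.

    Assumption: every substring with a size between min_gram and max_gram
    (inclusive) is emitted once per start position.
    """
    longest = min(max_gram, length)
    # For a gram size k there are (length - k + 1) possible start positions.
    return sum(length - k + 1 for k in range(min_gram, longest + 1))


word = "parkbegraafplaats"
print(f"{word!r}: 1 token with a standard analyzer, "
      f"{ngram_count(len(word))} tokens with the n-gram settings")
```

For this 17-character example the count is 120 grams for a single word, which is why the index size should be measured before enabling the flag everywhere.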

What's next?

I suggest we try this on the test environment, monitor how much larger the indexes grow, and then decide whether we want to do this or just implement this feature after the ES upgrade.


Ticket: https://jira.uitdatabank.be/browse/III-5123

Koen Eelen and others added 6 commits February 13, 2026 15:47
- Update OfferSearchServiceProvider to read word_breaker flag from config
- Pass flag through OfferSearchControllerFactory to query builder
- Enables dynamic switching between standard and N-Gram tokenized search fields

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
- Add analyzer_decompounder.json with ngram tokenizer (min_gram: 3, max_gram: 20)
- Create CreateDecompounderAnalyzer operation class to register analyzer template
- N-Gram tokenizer generates all overlapping substrings, enabling partial word matching
- Supports compound word searching (e.g., 'begraaf' finds 'parkbegraafplaats')

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
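
As an illustration of why the 'begraaf' example above matches, the following Python sketch mimics overlapping n-gram generation with min_gram 3 and max_gram 20. It is a simplified stand-in, not the Elasticsearch tokenizer itself; the function name `ngrams` is hypothetical, and details such as token_chars or a separate search analyzer are not modelled.

```python
def ngrams(text: str, min_gram: int = 3, max_gram: int = 20) -> list[str]:
    """Return all overlapping substrings of `text` with a length in [min_gram, max_gram]."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end > len(text):
                break  # longer grams from this start position would run past the end
            grams.append(text[start:end])
    return grams


indexed_terms = set(ngrams("parkbegraafplaats"))
# The partial word is one of the indexed grams, so a search for it can match.
print("begraaf" in indexed_terms)
```

Because every substring in the length range is indexed, the match does not depend on any knowledge of Dutch word boundaries, which is both the strength and the cost of this approach.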
- Create CreateDecompounderAnalyzerCommand to initialize analyzer template
- Register 'decompounder-analyzer:create' command in CommandServiceProvider
- Command can be invoked via: php bin/app.php decompounder-analyzer:create

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
- Add .decompounder field variants to all text fields in event mapping
- Add .decompounder field variants to all text fields in place mapping
- Fields: name.{lang}, description.{lang}, address.*.{field}, location.name.{lang}, organizer.name.{lang}
- Decompounder variants use N-Gram tokenizer for compound word matching
- Standard fields remain unchanged for backward compatibility

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
- Update OfferPredefinedQueryStringFields to accept useWordBreaker parameter
- When enabled, returns .decompounder field variants instead of standard fields
- Update ElasticSearchOfferQueryBuilder constructor to accept and pass useWordBreaker
- Update OfferSearchControllerFactory to inject word_breaker flag from config
- Enables dynamic field selection based on feature flag

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
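
The flag-based field selection described in this commit can be sketched roughly as follows. This is a Python illustration only; the actual change lives in the PHP class OfferPredefinedQueryStringFields, whose real signature and field list may differ. The function name `query_string_fields` and the reduced set of base fields are assumptions made for the example.

```python
def query_string_fields(languages: list[str], use_word_breaker: bool) -> list[str]:
    """Return the fields to search, switching to .decompounder variants when the flag is on."""
    # Hypothetical subset of the text fields listed in the mapping commits.
    base_fields = []
    for lang in languages:
        base_fields.extend([
            f"name.{lang}",
            f"description.{lang}",
            f"location.name.{lang}",
            f"organizer.name.{lang}",
        ])

    if not use_word_breaker:
        # Flag off: query the standard fields, unchanged from before.
        return base_fields

    # Flag on: query the n-gram analyzed sub-fields instead.
    return [f"{field}.decompounder" for field in base_fields]


print(query_string_fields(["nl"], use_word_breaker=True))
```

Since both the standard and the .decompounder variants exist in the index after the initial reindex, switching the flag only changes which fields the query targets.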
- Add volume mount for hyphenation patterns directory to Elasticsearch container
- Mounts ./docker/elasticsearch/config/analysis/hyphenation_patterns to container path
- Allows Elasticsearch to access pattern files and wordlists for decompounder analyzer
- Required for N-Gram analyzer template initialization

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@grubolsch grubolsch marked this pull request as ready for review February 13, 2026 15:15