- Update OfferSearchServiceProvider to read word_breaker flag from config
- Pass flag through OfferSearchControllerFactory to query builder
- Enables dynamic switching between standard and N-Gram tokenized search fields

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
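A minimal sketch of the wiring this commit describes, assuming a plain config array and these constructor signatures (only the class names and the word_breaker key appear in the commits):

```php
<?php
// Sketch only: config key shape, constructor signatures, and method bodies
// are assumptions; the class names come from the commit messages.

// In OfferSearchServiceProvider: read the flag and hand it to the factory.
$useWordBreaker = (bool) ($config['word_breaker'] ?? false);
$factory = new OfferSearchControllerFactory($useWordBreaker /* , ...other deps */);

// The factory forwards the flag to the query builder, which later decides
// between the standard fields and the .decompounder variants.
$queryBuilder = new ElasticSearchOfferQueryBuilder($useWordBreaker);
```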
- Add analyzer_decompounder.json with ngram tokenizer (min_gram: 3, max_gram: 20)
- Create CreateDecompounderAnalyzer operation class to register analyzer template
- N-Gram tokenizer generates all overlapping substrings, enabling partial word matching
- Supports compound word searching (e.g., 'begraaf' finds 'parkbegraafplaats')

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
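A sketch of what analyzer_decompounder.json plausibly contains. Only the ngram tokenizer type and the min_gram: 3 / max_gram: 20 settings come from the commit; the template pattern, the tokenizer/analyzer names, token_chars, and the lowercase filter are assumptions:

```json
{
  "template": "*",
  "settings": {
    "analysis": {
      "tokenizer": {
        "decompounder_ngram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 20,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "decompounder": {
          "type": "custom",
          "tokenizer": "decompounder_ngram",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```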
- Create CreateDecompounderAnalyzerCommand to initialize analyzer template
- Register 'decompounder-analyzer:create' command in CommandServiceProvider
- Command can be invoked via: php bin/app.php decompounder-analyzer:create

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
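A sketch of the command, assuming Symfony Console (suggested by the `decompounder-analyzer:create` invocation); the operation class's `run()` API is an assumption:

```php
<?php

use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

final class CreateDecompounderAnalyzerCommand extends Command
{
    protected function configure(): void
    {
        $this->setName('decompounder-analyzer:create')
            ->setDescription('Registers the decompounder analyzer template in Elasticsearch.');
    }

    protected function execute(InputInterface $input, OutputInterface $output): int
    {
        // The operation class wraps the actual PUT of the analyzer template
        // (its constructor and run() signature are assumptions).
        (new CreateDecompounderAnalyzer(/* es client */))->run();
        $output->writeln('Decompounder analyzer template created.');

        return 0;
    }
}
```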
- Add .decompounder field variants to all text fields in event mapping
- Add .decompounder field variants to all text fields in place mapping
- Fields: name.{lang}, description.{lang}, address.*.{field}, location.name.{lang}, organizer.name.{lang}
- Decompounder variants use N-Gram tokenizer for compound word matching
- Standard fields remain unchanged for backward compatibility
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
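The .decompounder variants are presumably multi-fields, which is how the standard fields can stay unchanged: the same source value is indexed twice, once with the default analyzer and once with the N-Gram one. A sketch for a single field (the analyzer name and exact structure are assumptions; the same pattern would repeat for the other fields listed above):

```json
{
  "properties": {
    "name": {
      "properties": {
        "nl": {
          "type": "text",
          "fields": {
            "decompounder": {
              "type": "text",
              "analyzer": "decompounder"
            }
          }
        }
      }
    }
  }
}
```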
- Update OfferPredefinedQueryStringFields to accept useWordBreaker parameter
- When enabled, returns .decompounder field variants instead of standard fields
- Update ElasticSearchOfferQueryBuilder constructor to accept and pass useWordBreaker
- Update OfferSearchControllerFactory to inject word_breaker flag from config
- Enables dynamic field selection based on feature flag

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
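A sketch of the field selection, assuming the method shape below (only the class name and the useWordBreaker flag come from the commit):

```php
<?php

final class OfferPredefinedQueryStringFields
{
    /**
     * @return string[]
     */
    public function getPredefinedFields(bool $useWordBreaker): array
    {
        // Standard fields, as before (abbreviated here).
        $fields = ['name.nl', 'description.nl' /* , ... */];

        if (!$useWordBreaker) {
            return $fields;
        }

        // With the flag on, query the N-Gram multi-field variants instead.
        return array_map(
            fn (string $field): string => $field . '.decompounder',
            $fields
        );
    }
}
```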
- Add volume mount for hyphenation patterns directory to Elasticsearch container
- Mounts ./docker/elasticsearch/config/analysis/hyphenation_patterns to container path
- Allows Elasticsearch to access pattern files and wordlists for decompounder analyzer
- Required for N-Gram analyzer template initialization

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
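The docker-compose.yml change presumably looks like this (the container-side path is an assumption based on the official image's config directory):

```yaml
services:
  elasticsearch:
    volumes:
      - ./docker/elasticsearch/config/analysis/hyphenation_patterns:/usr/share/elasticsearch/config/analysis/hyphenation_patterns
```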
Added
This PR implements a decompounder feature for the /offerssearch endpoint that allows users to search for parts of compound words and still find relevant results. It uses a feature flag called "word_breaker" to toggle the behavior without requiring a reindex every time.

HOWEVER: this will need a reindex on first deployment.
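Toggling is then just a config change, roughly like this (hypothetical sketch; where the key actually lives depends on the app's config layout):

```yaml
# Hypothetical location of the flag in the search config.
word_breaker: true   # true => query the .decompounder N-Gram fields
```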
Why N-Gram Instead of Hyphenation Decompounder?
Initially, the ticket said to use the hyphenation_decompounder token filter. However, this filter does not function reliably in ES 5.3.3, or at least I couldn't get it to work.
I will make another PR with the code I had for this filter; feel free to look at it, maybe somebody else can fix it.
The strange thing is that ES recognises the config for the hyphenation_decompounder and throws errors when invalid options are presented. The options just don't seem to do anything.
However, I cannot find the documentation for this decompounder in the docs for this version: https://www.elastic.co/guide/en/elasticsearch/reference/5.3/analysis-tokenizers.html
Both ChatGPT and Claude tell me this is a known limitation of Lucene 6.x (which ES 5.3.3 uses) that was fixed in ES 7.10+.
Downside of using N-Gram
The downside of the N-Gram strategy is that it generates a lot more tokens, because it does not understand language. This will increase the RAM and disk space usage of ES.
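To make that concrete: a word of length L yields (L − n + 1) n-grams for every gram length n between min_gram and min(max_gram, L). With min_gram 3 and max_gram 20, the 17-character 'parkbegraafplaats' alone produces 15 + 14 + … + 1 = 120 tokens where the standard analyzer produces one.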
What's next?
I suggest we try this on test, monitor how much larger the indexes grow, and then decide if we want to do this or just implement this feature after the ES upgrade.
Ticket: https://jira.uitdatabank.be/browse/III-5123