Change tokenization for more coherent search #6764
Open
Fixes #6709
This addresses point 3) of #6743
This is a very small change that requires a longer explanation. As discussed extensively in the linked issue, the Kitodo search for process titles behaves differently depending on whether the title is searched over `all` fields or only in the `process title` field of the index.

The reason is that Kitodo tokenizes the input string differently at search time depending on which field is searched. In the `process title` search, Kitodo effectively drops all tokens shorter than three characters. So `Heutwia_898482011-1794081501_01-s` is effectively searched as `Heutwia_898482011-1794081501`. This works because the same happens at index time.

The behaviour when searching over all fields is different, however. There, Kitodo drops no tokens and searches for `Heutwia_898482011-1794081501_01-s`. This is not what was indexed (only `Heutwia_898482011-1794081501`). As a result, the user gets no hits when searching for this process title over all fields, which is very confusing.

The reason for this implementation was that, when searching `projects` over all fields, it was considered necessary that "project A" can be distinguished from "project B". Therefore the tokenization for the project terms in the `all` field determines the search-time tokenization of all other fields, which leads to the confusing behaviour. (See: #6618 (comment))

As we already have a quite sophisticated `project` field search, the @kitodo/kitodo-community-board decided, after consultation with @matthias-ronge, that this special tokenization behaviour is not necessary. We opt for using the same default tokenization rules for all fields indexed in the global search field and for dropping the special constant.
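
To make the mismatch described above concrete, here is a minimal, self-contained sketch. It is not Kitodo's actual analyzer code: the class `TokenizationSketch`, the helpers `tokenize` and `matches`, the splitting on non-alphanumeric characters, the lowercasing, and the "all query tokens must be indexed" matching rule are all assumptions made for illustration. Only the minimum token length of three and the example title come from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; names and splitting rules are hypothetical.
final class TokenizationSketch {

    // Assumption: tokens are split on any non-alphanumeric character and
    // lowercased; tokens shorter than minLength are dropped.
    static List<String> tokenize(String input, int minLength) {
        List<String> tokens = new ArrayList<>();
        for (String part : input.toLowerCase(Locale.ROOT).split("[^\\p{Alnum}]+")) {
            if (!part.isEmpty() && part.length() >= minLength) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Assumption: a query matches only if every query token was indexed.
    static boolean matches(String indexedTitle, int indexMinLength,
                           String query, int searchMinLength) {
        List<String> indexed = tokenize(indexedTitle, indexMinLength);
        return indexed.containsAll(tokenize(query, searchMinLength));
    }

    public static void main(String[] args) {
        String title = "Heutwia_898482011-1794081501_01-s";

        // Index time and process title search: short tokens "01" and "s" are dropped.
        System.out.println(tokenize(title, 3));

        // Search over all fields before this change: no token is dropped.
        System.out.println(tokenize(title, 1));

        // Same rules on both sides: the title is found.
        System.out.println(matches(title, 3, title, 3));

        // Different rules at search time: "01" and "s" were never indexed, so no hit.
        System.out.println(matches(title, 3, title, 1));
    }
}
```

Under these assumptions, using the same minimum length at index time and at search time is what makes the full process title findable again, which is the effect this change aims for in the global search field.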