Wildcard queries can work incorrectly with char_filters #84191

Open
@romseygeek

Description

Elasticsearch Version

8.0.0

Installed Plugins

No response

Java Version

bundled

OS Version

n/a

Problem Description

Running query_string queries that contain wildcards against a text field whose analyzer uses a char_filter that can introduce extra tokens (for example, one that maps '-' to ' ') can miss results. The part of QueryStringQueryParser that handles wildcard queries applies only normalization to its input, not full analysis. Given the input foo-b*r, the wildcard path applies the char_filter to produce foo b*r and then builds a wildcard query on that single term. At index time, however, the text foo-bar was split into the two tokens foo and bar, so no indexed term contains a space and no match is found.

Note that prefix queries do apply full analysis and so a query for foo-ba* would correctly match the input.
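
The mismatch can be sketched outside Elasticsearch in a few lines of Python. This is only an illustration under stated assumptions: the `\w+` regex is a crude stand-in for the standard tokenizer, `fnmatchcase` stands in for Lucene's wildcard matching, and the helper names `char_filter`, `analyze` and `normalize` are hypothetical, not Elasticsearch APIs. The prefix query path is not modeled.

```python
import re
from fnmatch import fnmatchcase

# Mirrors the "hyphens" pattern_replace char_filter from the reproduction below.
HYPHENS = re.compile(r"([a-zA-Z])-([a-zA-Z])")


def char_filter(text: str) -> str:
    """Replace a hyphen between two letters with a space."""
    return HYPHENS.sub(r"\1 \2", text)


def analyze(text: str) -> list[str]:
    """Full analysis: char_filter, tokenize (assumption: \\w+ runs), lowercase."""
    return [token.lower() for token in re.findall(r"\w+", char_filter(text))]


def normalize(text: str) -> str:
    """Normalization only: char_filter and lowercase, with no tokenization."""
    return char_filter(text).lower()


# Index time: the document text goes through full analysis.
indexed_terms = analyze("foo-bar")

# Query time (wildcard path): the input is only normalized, so it stays one term.
wildcard_term = normalize("foo-b*r")

# A wildcard query is matched against individual indexed terms.
matches = [term for term in indexed_terms if fnmatchcase(term, wildcard_term)]

print(indexed_terms)        # ['foo', 'bar']
print(repr(wildcard_term))  # 'foo b*r'
print(matches)              # []
```

Because the normalized wildcard term contains a space and no indexed term does, the list of matches is empty, which corresponds to the missed result described above.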

Steps to Reproduce

DELETE test_index

PUT test_index
{
  "mappings": {
    "properties" : {
      "title" : {
        "type" : "text",
        "analyzer" : "filtered"
      }
    }
  },
  "settings" : {
    "index" : {
      "analysis": {
        "char_filter" : {
          "hyphens" : {
            "type": "pattern_replace",
            "pattern": "([a-zA-Z])-([a-zA-Z])",
            "replacement": "$1 $2"
          }
        },
        "analyzer": {
          "filtered" : {
            "char_filter" : [ "hyphens" ],
            "filter" : [
              "lowercase"
            ],
            "tokenizer" : "standard"
          }
        }
      }
    }
  }
}

PUT test_index/_doc/1
{
  "title": "foo-bar"
}

GET test_index/_validate/query?explain=true
{
  "query": {
    "query_string" : {
      "fields": [ "title" ],
      "query": "foo-b*r"
    }
  }
}

The resulting query is a wildcard query against the term foo b*r.

Logs (if relevant)

No response

Metadata

    Labels

    :Search Relevance/Analysis (How text is split into tokens)
    >bug
    Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)
    priority:normal (A label for assessing bug priority to be used by ES engineers)
    v8.0.0
