
Fixes bugs from word delimiter graph filter #184

Open
mboynes wants to merge 1 commit into Automattic:develop from mboynes:hotfix/word-delimiter-graph-preserve-original

Conversation


@mboynes mboynes commented Jan 4, 2023

VIP: This was PR'd upstream, though that PR was closed. I would encourage VIP to adopt this change in your fork regardless.


Description

This removes the preserve_original option for the word_delimiter_graph token filter. As noted in the ES docs:

Setting this parameter to true produces multi-position tokens, which are not supported by indexing.

If this parameter is true, avoid using this filter in an index analyzer or use the flatten_graph filter after this filter to make the token stream suitable for indexing.

As configured, this was producing multi-position tokens, which could lead to unexpected results and, potentially, indexing errors. I observed this specifically with a search_as_you_type field, which uses shingle token filters to create 2- and 3-gram sub-fields. Combined with word_delimiter_graph.preserve_original = true, if the field text is a word like "WordPress" and the analyzed token count is <= the gram size, the tokens can end up with a negative token position and indexing fails.
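The multi-position stream can be observed in isolation with the _analyze API (a minimal sketch; exact token output may vary slightly by Elasticsearch version):

```console
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "preserve_original": true
    }
  ],
  "text": "WordPress"
}
```

With preserve_original enabled, the filter emits the original token "WordPress" alongside the split tokens "Word" and "Press", with the original and "Word" stacked at the same position. That is exactly the kind of multi-position token graph the docs warn is unsuitable for indexing.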

For what it's worth, I tried using flatten_graph as the docs suggest as a workaround, and that didn't work 🤷.
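For reference, the flatten_graph variant the docs suggest looks roughly like this (illustrative index name and settings; as noted above, this did not resolve the error in my testing):

```console
PUT sayt-demo-flattened
{
  "settings": {
    "analysis": {
      "filter": {
        "word_delimiter": {
          "type": "word_delimiter_graph",
          "preserve_original": true
        }
      },
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": [
            "word_delimiter",
            "flatten_graph"
          ]
        }
      }
    }
  }
}
```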

Checklist

Please make sure the items below have been covered before requesting a review:

  • This change works and has been tested locally (or has an appropriate fallback).
  • This change works and has been tested on a Go sandbox.
  • This change has relevant unit tests (if applicable).
  • This change has relevant documentation additions / updates (if applicable).
  • This change has the fix PRed upstream (if applicable). If not applicable, it has the relevant "// VIP: reason for the discrepancy with upstream" comment in places where the code is discrepant.

Steps to Test

To replicate the error in isolation:

PUT sayt-demo
{
  "settings": {
    "analysis": {
      "filter": {
        "word_delimiter": {
          "type": "word_delimiter_graph",
          "preserve_original": true
        }
      },
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": [
            "word_delimiter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "test": {
        "type": "search_as_you_type"
      }
    }
  }
}

PUT sayt-demo/_doc/1?refresh
{
  "test": "WordPress"
}

RESPONSE:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "first position increment must be > 0 (got 0) for field 'test._2gram'"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "first position increment must be > 0 (got 0) for field 'test._2gram'"
  },
  "status" : 400
}

GET /sayt-demo/_analyze
{
  "field": "test._2gram",
  "text": "WordPress"
}

RESPONSE:
{
  "tokens" : [
    {
      "token" : "Word Press",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "shingle",
      "position" : -1
    }
  ]
}

Observe that after removing "preserve_original": true, the document indexes as expected and the token position is correctly calculated as 0, not -1.
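For comparison, the corrected settings simply drop the option (same demo index as above, shown here for illustration):

```console
PUT sayt-demo
{
  "settings": {
    "analysis": {
      "filter": {
        "word_delimiter": {
          "type": "word_delimiter_graph"
        }
      },
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": [
            "word_delimiter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "test": {
        "type": "search_as_you_type"
      }
    }
  }
}
```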

@rebeccahum

If I'm understanding the discussion on the linked PR, looks like a full solution hasn't been merged upstream yet.

@mboynes mboynes (Author) commented Jan 18, 2023

@rebeccahum correct. And it's as yet unclear whether the likely upstream response (adding default_search analyzers separate from the default indexing analyzers) will actually address the issue, or merely work around it for synonyms, the only "out-of-the-box" feature a user can enable. Separating search and index analyzers still requires changing this setting for indexing operations. At the risk of sounding like a broken record: the current configuration is invalid, and it creates a problem regardless of whether something else (such as search-as-you-type, synonyms, or highlighting) surfaces it.

@rebeccahum rebeccahum force-pushed the develop branch 9 times, most recently from dc6742b to 5c91c11 Compare May 9, 2024 18:17
