
[BUG] Missing support for simple_pattern_split and simple_pattern tokenizers #1444

Closed
@mcb-sprout

What is the bug?

The client throws an exception when it parses index settings that include a simple_pattern_split or simple_pattern tokenizer.

IndexSettings cannot be deserialized from settings that use either of these tokenizers, which prevents them from being used in a CreateIndexRequest. Issuing a GetIndexRequest through the client for an index that uses these settings throws the same exception.

Exception thrown:
org.opensearch.client.util.MissingRequiredPropertyException: Missing required property 'Builder.<variant kind>'

How can one reproduce the bug?

Reproduce the bug by deserializing from JSON:

String JSON = """
        {
          "analysis": {
            "tokenizer": {
              "my_pattern_split_tokenizer": {
                "type": "simple_pattern_split",
                "pattern": "-"
              }
            },
            "analyzer": {
              "my_pattern_split_analyzer": {
                "type": "custom",
                "tokenizer": "my_pattern_split_tokenizer"
              }
            }
          }
        }
    """;

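// `client` is an already-configured OpenSearchClient instance.
JsonpMapper mapper = client._transport().jsonpMapper();
JsonParser parser = mapper.jsonProvider().createParser(new StringReader(JSON));

// Throws MissingRequiredPropertyException for the simple_pattern_split tokenizer.
IndexSettings settings = IndexSettings._DESERIALIZER.deserialize(parser, mapper);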

Reproduce the bug by getting an index that was created with these settings:

GetIndexRequest req = new GetIndexRequest.Builder()
        .index("test-index")
        .build();
GetIndexResponse resp = client.indices().get(req);

What is the expected behavior?

IndexSettings should deserialize from these settings because, according to the documentation, both tokenizers are still supported. The client should also be able to retrieve an index that uses these settings.

What is your host/environment?

macOS Sequoia 15.3

Do you have any additional context?

These settings work when sending requests to OpenSearch directly, and they appear to be supported by the High Level REST Client. I ran into this issue while migrating to the Java client. Neither tokenizer type is present in TokenizerDefinition.
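To illustrate why a missing variant could surface as a "<variant kind>" error, here is a minimal, self-contained sketch of a "type"-discriminated union. It is a simplified model, not the client's actual implementation: VariantKindSketch, Definition, and register are hypothetical names, and the registered kinds are only examples. The idea is that the deserializer looks up the "type" value among the variants it knows about, and an unregistered value leaves the variant kind unset.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical, simplified model of a "type"-discriminated union deserializer. */
public class VariantKindSketch {

    /** Hypothetical stand-in for a built tokenizer definition. */
    public static final class Definition {
        public final String kind;
        public final Map<String, Object> properties;

        Definition(String kind, Map<String, Object> properties) {
            this.kind = kind;
            this.properties = properties;
        }
    }

    // Maps each known "type" value to a factory that builds the matching variant.
    private final Map<String, Function<Map<String, Object>, Definition>> variants = new HashMap<>();

    /** Registers a variant kind so that deserialize() can recognize it. */
    public VariantKindSketch register(String kind) {
        variants.put(kind, props -> new Definition(kind, props));
        return this;
    }

    /** Picks the variant from the "type" property; fails when no variant matches. */
    public Definition deserialize(Map<String, Object> json) {
        Object type = json.get("type");
        Function<Map<String, Object>, Definition> factory = variants.get(type);
        if (factory == null) {
            // No variant matched, so the kind was never set (message mirrors the one in this report).
            throw new IllegalStateException("Missing required property 'Builder.<variant kind>'");
        }
        return factory.apply(json);
    }

    public static void main(String[] args) {
        Map<String, Object> tokenizer = Map.of("type", "simple_pattern_split", "pattern", "-");

        // Example registry that knows some kinds but not simple_pattern_split.
        VariantKindSketch sketch = new VariantKindSketch().register("standard").register("pattern");
        try {
            sketch.deserialize(tokenizer);
        } catch (IllegalStateException e) {
            System.out.println("Before registering: " + e.getMessage());
        }

        // Once the kind is registered, the same input resolves to a variant.
        sketch.register("simple_pattern_split");
        System.out.println("After registering: " + sketch.deserialize(tokenizer).kind);
    }
}
```

Under this model, supporting the two tokenizers would amount to adding simple_pattern_split and simple_pattern as recognized variants. Whether the real fix takes that shape depends on how the client's tokenizer types are defined.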

OpenSearch DSL:

PUT test-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_pattern_split_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "-"
        }
      },
      "analyzer": {
        "my_pattern_split_analyzer": {
          "type": "custom",
          "tokenizer": "my_pattern_split_tokenizer"
        }
      }
    }
  }
}
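Until the typed client supports these tokenizers, one possible stopgap is to send the same request body over plain HTTP so that no typed IndexSettings deserialization is involved. The sketch below uses only the JDK's java.net.http client. It assumes an unsecured cluster at http://localhost:9200 with no authentication or TLS, and RawCreateIndex and buildCreateIndexRequest are hypothetical names chosen for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper that creates an index by sending raw JSON, bypassing the typed client. */
public class RawCreateIndex {

    /** Builds a PUT /{index} request carrying the given JSON body. */
    public static HttpRequest buildCreateIndexRequest(String baseUrl, String index, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Same settings as the DSL example in this report.
        String body = "{"
                + "\"settings\": {\"analysis\": {"
                + "\"tokenizer\": {\"my_pattern_split_tokenizer\": "
                + "{\"type\": \"simple_pattern_split\", \"pattern\": \"-\"}},"
                + "\"analyzer\": {\"my_pattern_split_analyzer\": "
                + "{\"type\": \"custom\", \"tokenizer\": \"my_pattern_split_tokenizer\"}}"
                + "}}}";

        // Assumption: a local, unsecured cluster; adjust the URL and add auth as needed.
        HttpRequest request = buildCreateIndexRequest("http://localhost:9200", "test-index", body);
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

This only covers index creation. Reading the index back through the typed GetIndexRequest would still hit the same deserialization error, so the raw response would need to be handled as plain JSON as well.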
