[Feature] Guardrails in MetadataMigration Tool to Preserve Routing-Related Settings #1465

@camerondurham

TLDR: When the MetadataMigration tool is used with indices that have routing_partition_size configured, the routing distribution is not preserved on the destination because the number_of_routing_shards setting is not carried over. Reproduction of the OpenSearch part of the issue: https://github.com/camerondurham/bug-repro-opensearch-routing

What is the issue/bug?

The configured routing partition size (5) is not respected during bulk writes to the destination cluster. Documents for a given routing value are distributed across fewer shards than configured (1-2 shards instead of 5), which can cause hotspotting. This appears to happen because number_of_routing_shards is not set when the metadata is copied and the index is created on the destination cluster.

What are your migration environments?

  • AWS managed OpenSearch 1.3.x
  • 6 x i3.4xlarge.search (Data nodes)
  • 3 x c5.large.search (Master nodes)

The indices on both clusters are configured with routing_partition_size: 5

How can one reproduce the issue/bug?

  1. Create a snapshot repository
  2. Use MetadataMigration tool to copy settings from source to destination cluster
  3. Execute reindex-from-snapshot operation
  4. Observe shard distribution patterns - documents will be concentrated in 1-2 shards instead of being distributed across 5 shards as configured

What is the expected behavior?

Documents should be evenly distributed across all 5 shards as specified by the routing_partition_size setting to prevent hotspotting. The shard distribution should match the source cluster's distribution pattern.
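One way a guardrail for this could look is a check that compares the routing-related settings of the source and destination index before any documents are written. The sketch below is illustrative only and is not taken from the MetadataMigration tool: the function name and the sample values are hypothetical, and it assumes the flat `index` settings block has already been fetched from each cluster (for example from `GET /<index>/_settings`, where values come back as strings and number_of_routing_shards may be absent if it was never set explicitly).

```python
# Hypothetical guardrail sketch: flag routing-related settings that differ
# between a source and a destination index. Not the tool's actual code.
ROUTING_KEYS = (
    "number_of_shards",
    "number_of_routing_shards",
    "routing_partition_size",
)


def routing_setting_mismatches(source_index_settings, target_index_settings):
    """Return {setting: (source_value, target_value)} for every routing key
    whose value differs; a key missing on one side counts as a difference."""
    mismatches = {}
    for key in ROUTING_KEYS:
        src = source_index_settings.get(key)
        dst = target_index_settings.get(key)
        # Compare as strings because the settings API reports strings.
        src_norm = None if src is None else str(src)
        dst_norm = None if dst is None else str(dst)
        if src_norm != dst_norm:
            mismatches[key] = (src_norm, dst_norm)
    return mismatches


# Made-up example values (the shard count of 60 is an assumption).
source = {
    "number_of_shards": "60",
    "number_of_routing_shards": "60",
    "routing_partition_size": "5",
}
target = {"number_of_shards": "60", "routing_partition_size": "5"}

print(routing_setting_mismatches(source, target))
```

A non-empty result would be a reason to stop or warn before running reindex-from-snapshot, since the destination may route documents differently from the source.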

Do you have any additional context?

This caused errors for us during the migration: documents for routing values that were previously spread across 5 shards were concentrated on 1, which pushed individual shards past Lucene's per-shard document limit and produced errors like:

"error": {
  "type": "illegal_argument_exception",
  "reason": "Number of documents in the index can't exceed [2147483519]"
}

Root Cause:
When routing_partition_size is present, two settings are needed to preserve routing, but the MetadataMigration tool copies only one of them:

  • settings.index.number_of_shards (copied)
  • settings.index.number_of_routing_shards (not copied)
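The effect of the missing setting can be sketched with the shard-selection formula documented for partitioned routing: `routing_value = hash(_routing) + hash(_id) % routing_partition_size`, then `shard = (routing_value % num_routing_shards) / routing_factor`, where `routing_factor = num_routing_shards / num_primary_shards`. The code below is a simplified model, not OpenSearch code: it takes an already-computed integer in place of the real Murmur3 hash, and the shard count (60), the routing-shard count (960), and the hash value (47) are assumptions chosen for illustration.

```python
def shards_for_route(routing_hash, num_shards, num_routing_shards, partition_size):
    """Return the sorted shard numbers a single routing value can land on.

    routing_hash stands in for hash(_routing); the loop covers every
    possible value of hash(_id) % partition_size.
    """
    if num_routing_shards % num_shards != 0:
        raise ValueError("num_routing_shards must be a multiple of num_shards")
    routing_factor = num_routing_shards // num_shards
    shards = set()
    for id_offset in range(partition_size):
        routing_value = routing_hash + id_offset
        shards.add((routing_value % num_routing_shards) // routing_factor)
    return sorted(shards)


ROUTING_HASH = 47  # made-up hash value for one routing key

# Routing shards equal to primary shards: routing_factor is 1, so the five
# consecutive routing values map to five distinct shards.
print(shards_for_route(ROUTING_HASH, 60, 60, 5))   # [47, 48, 49, 50, 51]

# Larger routing-shard count: routing_factor is 16, so the same five
# consecutive values collapse into one or two shards.
print(shards_for_route(ROUTING_HASH, 60, 960, 5))  # [2, 3]
```

Under these assumptions, the same index with the same number_of_shards and routing_partition_size spreads a route over five shards or over two, depending only on number_of_routing_shards.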

Resolution:
The issue can be resolved by explicitly setting both required settings during index creation on the destination, rather than relying on the MetadataMigration tool. This is related to poorly documented behavior inherited from Elasticsearch 7.x (elastic/elasticsearch #48863) that persists in OpenSearch 2.x (opensearch-project/OpenSearch #17472).
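As a rough illustration of the workaround, the sketch below builds a create-index request body that sets both values explicitly alongside routing_partition_size. The helper name and the numbers passed to it are hypothetical; the real values should be read from the source index. It also marks `_routing` as required in the mappings, which partitioned indices need, and it only prints the body rather than sending it to a cluster.

```python
import json


def build_create_index_body(num_shards, num_routing_shards, partition_size):
    """Build a create-index body that pins all routing-related settings.

    Hypothetical helper: the values should mirror the source index.
    """
    if num_routing_shards % num_shards != 0:
        raise ValueError("num_routing_shards must be a multiple of num_shards")
    if not 1 <= partition_size < num_shards:
        raise ValueError("partition_size must be at least 1 and below num_shards")
    return {
        "settings": {
            "index": {
                "number_of_shards": num_shards,
                "number_of_routing_shards": num_routing_shards,
                "routing_partition_size": partition_size,
            }
        },
        # A partitioned index requires a routing value on every document.
        "mappings": {"_routing": {"required": True}},
    }


# Assumed example values; send the result as the body of PUT /<index>.
body = build_create_index_body(num_shards=60, num_routing_shards=60, partition_size=5)
print(json.dumps(body, indent=2))
```

Creating the destination index with a body like this before running reindex-from-snapshot keeps the routing factor identical on both sides, so each routing value should map to the same set of shards as on the source.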

Verification Analysis:

Before fix: Source cluster shows 5-shard distribution (47,48,49,50,51), while target shows 2-shard distribution (5,6)
After fix: Both source and target show identical 5-shard distribution (47,48,49,50,51)

I reproduced the OpenSearch part of the issue here, in both OS 1.x and 2.x: https://github.com/camerondurham/bug-repro-opensearch-routing

Labels: enhancement (New feature or request)
