Description
TLDR: When the MetadataMigration tool is used with indices that have routing_partition_size configured, the routing distribution is not preserved because the number_of_routing_shards setting is not carried over.
Reproduction of the OpenSearch part of the issue: https://github.com/camerondurham/bug-repro-opensearch-routing
What is the issue/bug?
The configured routing partition size (5) is not respected during bulk writes to the destination cluster. Documents are distributed across fewer shards than configured (1-2 shards instead of 5), which can cause hotspotting. This appears to happen because number_of_routing_shards is not set when the metadata is copied and the index is created on the destination cluster.
What are your migration environments?
- AWS managed OpenSearch 1.3.x
- 6 x i3.4xlarge.search (Data nodes)
- 3 x c5.large.search (Master nodes)
The indices on both clusters are configured with routing_partition_size: 5
How can one reproduce the issue/bug?
- Create a snapshot repository
- Use MetadataMigration tool to copy settings from source to destination cluster
- Execute reindex-from-snapshot operation
- Observe shard distribution patterns - documents will be concentrated in 1-2 shards instead of being distributed across 5 shards as configured
What is the expected behavior?
Documents should be evenly distributed across all 5 shards as specified by the routing_partition_size setting to prevent hotspotting. The shard distribution should match the source cluster's distribution pattern.
Do you have any additional context?
This caused errors for us during migration: documents for a routing value that were previously spread across 5 shards were concentrated on 1, which pushed that shard past Lucene's limit on documents per index and produced errors like:
"error": {
"type": "illegal_argument_exception",
"reason": "Number of documents in the index can't exceed [2147483519]"
}
Root Cause:
The MetadataMigration tool copies only one of the two settings that are required when routing_partition_size is present:
- settings.index.number_of_shards (copied)
- settings.index.number_of_routing_shards (not copied)
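To illustrate why the missing setting matters, here is a minimal sketch of the shard-selection formula given in the Elasticsearch `_routing` field documentation for indices with a routing partition size, which OpenSearch appears to have inherited. The hashes are passed in as plain integers (the real implementation uses Murmur3), and the shard counts used below (60 primary shards, 960 routing shards) are hypothetical values chosen for illustration, not taken from the affected clusters.

```python
def shard_for(routing_hash, id_hash, num_shards, num_routing_shards, partition_size):
    """Shard number for one document, following the documented formula."""
    routing_factor = num_routing_shards // num_shards
    routing_value = routing_hash + id_hash % partition_size
    return (routing_value % num_routing_shards) // routing_factor


def shards_for_route(routing_hash, num_shards, num_routing_shards, partition_size):
    """All shards that documents sharing one routing value can land on."""
    return sorted({
        shard_for(routing_hash, offset, num_shards, num_routing_shards, partition_size)
        for offset in range(partition_size)
    })


# Hypothetical index: 60 primary shards, routing_partition_size = 5.
ROUTING_HASH = 47  # stand-in for hash(_routing); not a real Murmur3 value

# number_of_routing_shards equal to number_of_shards: routing factor is 1.
spread = shards_for_route(ROUTING_HASH, 60, 60, 5)

# A larger number_of_routing_shards (960): routing factor is 16.
collapsed = shards_for_route(ROUTING_HASH, 60, 960, 5)

print(spread)     # [47, 48, 49, 50, 51]
print(collapsed)  # [2, 3]
```

With a routing factor above 1, the five partition offsets fall into adjacent routing slots that mostly map to the same shard, which is consistent with the 1-2 shard concentration seen on the destination cluster.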
Resolution:
The issue can be resolved by explicitly setting both required settings when creating the index on the destination cluster, rather than relying on the MetadataMigration tool. It is related to poorly documented behavior inherited from Elasticsearch 7.x (elastic/elasticsearch #48863) that persists in OpenSearch 2.x (opensearch-project/OpenSearch #17472).
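As a rough sketch of that workaround, the snippet below builds a create-index request body that carries over all three routing-related settings from the source index. The function name `build_create_index_body` and the sample `source` values are hypothetical, and it assumes the source index settings expose these keys (values are typically returned as strings). The resulting body would be sent with the usual create-index request before any documents are written.

```python
ROUTING_KEYS = (
    "number_of_shards",
    "number_of_routing_shards",
    "routing_partition_size",
)


def build_create_index_body(source_index_settings):
    """Copy the routing-related settings from a source index's settings."""
    missing = [key for key in ROUTING_KEYS if key not in source_index_settings]
    if missing:
        # Fail loudly instead of letting the destination fall back to defaults.
        raise KeyError(f"source settings lack: {', '.join(missing)}")
    return {
        "settings": {
            "index": {key: int(source_index_settings[key]) for key in ROUTING_KEYS}
        }
    }


# Hypothetical source settings, shaped like the "index" section of the
# source index's settings.
source = {
    "number_of_shards": "60",
    "number_of_routing_shards": "60",
    "routing_partition_size": "5",
    "number_of_replicas": "1",
}

body = build_create_index_body(source)
print(body)
```

Raising on a missing key is a deliberate choice here: silently omitting number_of_routing_shards is exactly what lets the destination pick a different default.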
Verification Analysis:
Before fix: Source cluster shows 5-shard distribution (47,48,49,50,51), while target shows 2-shard distribution (5,6)
After fix: Both source and target show identical 5-shard distribution (47,48,49,50,51)
I reproduced the OpenSearch part of the issue here, in both OS 1.x and 2.x: https://github.com/camerondurham/bug-repro-opensearch-routing