Closed
Labels
bug: Something isn't working
Description
What is the bug?
Unexpected behavior occurs during backfill when migrating from Elasticsearch 7.10 to OpenSearch 2.17, both hosted on AWS OpenSearch Service. Specifically:
- The target cluster size (pri.store.size) unexpectedly drops by 5-10 GB multiple times during backfill.
- The document count on the target cluster stops increasing for periods of time while backfill is active.
- The RFS Reindexing Traffic graph shows discontinuities.
- The number of RFS workers fluctuates dramatically and unexpectedly.
What are your migration environments?
- Managed Service to Managed Service
- Source: Elasticsearch 7.10 on AWS OpenSearch Service
- Target: OpenSearch 2.17 on AWS OpenSearch Service
- Data size: ~100GB
- Number of shards: 21 (with most data in 5 shards)
How can one reproduce the bug?
- Take a snapshot and run console metadata migrate.
- Start backfill with 5 workers.
- Scale up to 21 workers.
- Further scale up to 42 workers.
- Monitor the following metrics during the process:
  - Target cluster size (pri.store.size from _cat/indices)
  - TARGET CLUSTER Document SUM Count
  - RFS Reindexing Traffic
  - Number of RFS workers reporting in
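
For anyone reproducing this, the first two metrics can also be sampled directly from the target cluster rather than read off the dashboard. The following is a minimal sketch, not part of the migration tooling: TARGET_URL and both function names are hypothetical, and authentication (basic auth or SigV4) is omitted even though most managed domains require it. It relies only on the _cat/indices API with format=json and bytes=b so that sizes come back as raw byte counts.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the target domain URL. Authentication
# is intentionally left out and is needed on most managed domains.
TARGET_URL = "https://target-domain.example.com"


def summarize_cat_indices(rows):
    """Sum docs.count and pri.store.size (bytes) over _cat/indices JSON rows."""
    total_docs = 0
    total_bytes = 0
    for row in rows:
        # Values arrive as strings; closed indices may report null (None).
        total_docs += int(row.get("docs.count") or 0)
        total_bytes += int(row.get("pri.store.size") or 0)
    return {"docs": total_docs, "pri_store_bytes": total_bytes}


def fetch_cat_indices(base_url=TARGET_URL):
    """Fetch _cat/indices as JSON with sizes reported in raw bytes."""
    url = (base_url
           + "/_cat/indices?format=json&bytes=b"
           + "&h=index,docs.count,pri.store.size")
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

Calling summarize_cat_indices(fetch_cat_indices()) once a minute and logging the result with a timestamp gives a series that can be compared against the CloudWatch graphs.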
What is the expected behavior?
- The target cluster size should steadily increase or remain stable during backfill.
- The document count should consistently increase during active backfill.
- The RFS Reindexing Traffic graph should show a continuous line without gaps.
- The number of RFS workers should remain stable at the configured number (5, 21, or 42) without unexpected drops.
Do you have any additional context?
- The target cluster size drops were verified through the CloudWatch "Target Cluster Used Space" graph, which showed significant negative slopes.
- The document count graph showed 4 flat periods (zero slope), meaning the total number of documents did not increase for extended periods.
- The RFS Reindexing Traffic graph showed 3-4 instances where traffic cut off and later resumed, with traffic ranging from approximately 500 MB to 2 GB.
- The RFS worker count unexpectedly dropped from 21 to 5, from 42 to 5, and from 42 to 24 at various points.
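
The two patterns described above (negative slopes in store size and zero slope in document count) can be flagged automatically from a series of periodic samples. This is a rough sketch under stated assumptions: the sample tuple layout, the function name find_anomalies, and the 1 GiB drop threshold are all illustrative choices, not values taken from the migration tooling.

```python
GIB = 1024 ** 3


def find_anomalies(samples, min_drop_bytes=1 * GIB):
    """Flag size drops and stalled document counts.

    samples: list of (timestamp, docs, pri_store_bytes) tuples in time order.
    Returns a list of (timestamp, kind, delta) tuples.
    """
    anomalies = []
    for prev, cur in zip(samples, samples[1:]):
        _, docs0, size0 = prev
        t1, docs1, size1 = cur
        # Negative slope in store size beyond the (assumed) threshold.
        if size0 - size1 >= min_drop_bytes:
            anomalies.append((t1, "size_drop", size0 - size1))
        # Zero or negative slope in document count during backfill.
        if docs1 <= docs0:
            anomalies.append((t1, "docs_stalled", docs1 - docs0))
    return anomalies
```

Correlating the flagged timestamps with the times at which the RFS worker count changed would show whether the stalls line up with workers dropping out.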

Additional question: How can one better identify if the TARGET Cluster has been overloaded?
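
One commonly used signal, offered here as a sketch rather than as official guidance for this tool, is growth in write thread pool rejections on the target, since rejected bulk requests usually mean the cluster cannot keep up with indexing. The helper below compares two snapshots of GET _cat/thread_pool/write?format=json&h=node_name,name,active,queue,rejected taken some time apart; the function name and the idea of diffing snapshots are assumptions made for illustration.

```python
def write_rejection_deltas(before, after):
    """Return per-node growth in write thread pool rejections.

    before, after: lists of JSON rows from _cat/thread_pool/write, each with
    at least "node_name" and "rejected" (values arrive as strings).
    """
    prev = {row["node_name"]: int(row["rejected"]) for row in before}
    deltas = {}
    for row in after:
        node = row["node_name"]
        # Nodes missing from the first snapshot are treated as starting at 0.
        delta = int(row["rejected"]) - prev.get(node, 0)
        if delta > 0:
            deltas[node] = delta
    return deltas
```

A non-empty result while backfill is running suggests the target is rejecting writes and that scaling workers further may not help; an empty result does not by itself rule out other bottlenecks such as CPU or JVM memory pressure.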