[BUG] Unusual behavior of RFS and Backfill #1267

### What is the bug?

During a migration from Elasticsearch 7.10 to OpenSearch 2.17, both hosted on Amazon OpenSearch Service, the backfill behaved unexpectedly. Specifically:
- Target cluster size (`pri.store.size`) dropped by 5-10 GB several times during backfill.
- The document count on the target cluster stopped increasing for periods while backfill was active.
- The RFS Reindexing Traffic graph shows discontinuities.
- The number of RFS workers fluctuated dramatically and unexpectedly.

### What are your migration environments?

- Managed Service to Managed Service
- Source: Elasticsearch 7.10 on Amazon OpenSearch Service
- Target: OpenSearch 2.17 on Amazon OpenSearch Service
- Data size: ~100 GB
- Number of shards: 21 (with most data in 5 shards)

### How can one reproduce the bug?

- Take a snapshot and run `console metadata migrate`.
- Start backfill with 5 workers.
- Scale up to 21 workers.
- Scale up further to 42 workers.
- Monitor the following metrics during the process:
  1. Target cluster size (`pri.store.size` from `_cat/indices`)
  2. TARGET CLUSTER Document SUM Count
  3. RFS Reindexing Traffic
  4. Number of RFS workers reporting in
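
For anyone reproducing this outside the dashboard, the first two metrics can be approximated by summing rows from `GET _cat/indices?format=json&bytes=b&h=index,pri.store.size,docs.count` on the target. The sketch below only does the summing; the `sample` rows and index names are made up for illustration, and fetching the response (with whatever auth the domain requires) is left out.

```python
import json


def summarize_cat_indices(rows):
    """Sum primary store size (bytes) and document count over _cat/indices rows."""
    total_bytes = 0
    total_docs = 0
    for row in rows:
        # Values arrive as strings; a missing or null value (assumed possible
        # for closed indices) is counted as 0.
        total_bytes += int(row.get("pri.store.size") or 0)
        total_docs += int(row.get("docs.count") or 0)
    return {"pri_store_bytes": total_bytes, "docs": total_docs}


# Hypothetical response body, not data from this migration.
sample = json.loads("""
[
  {"index": "logs-a", "pri.store.size": "1000", "docs.count": "10"},
  {"index": "logs-b", "pri.store.size": "2500", "docs.count": "25"},
  {"index": "logs-closed", "pri.store.size": null, "docs.count": null}
]
""")

print(summarize_cat_indices(sample))
```

Polling this at a fixed interval and recording the two totals gives a time series that can be compared against the CloudWatch graphs.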

### What is the expected behavior? 

- Target cluster size should steadily increase or remain stable during backfill.
- Document count should consistently increase during active backfill.
- The RFS Reindexing Traffic graph should show a continuous line without cut-offs.
- The number of RFS workers should remain stable at the set number (5, 21, or 42) without unexpected drops.

### Do you have any additional context?

- The target cluster size drops were verified through the CloudWatch "Target Cluster Used Space" graph, which shows significant negative slopes.
- The document count showed 4 instances of zero slope, meaning the total number of documents did not increase for extended periods.
- The RFS Reindexing Traffic graph showed 3-4 instances of cut-offs and resumptions, with traffic ranging from approximately 500 MB to 2 GB.
- The RFS worker count unexpectedly dropped from 21 to 5, from 42 to 5, and from 42 to 24 at various points.

<img width="1381" alt="Image" src="https://github.com/user-attachments/assets/fbff68eb-53b0-417c-9de3-cfb92db59903" />
<img width="1381" alt="Image" src="https://github.com/user-attachments/assets/2df085bd-0d03-4165-990f-7ce25fcbef4e" />
<img width="1381" alt="Image" src="https://github.com/user-attachments/assets/b877bfa7-cf16-4fed-ad34-0ec8901a1a83" />
<img width="1381" alt="Image" src="https://github.com/user-attachments/assets/9dd68cc3-4dfd-41d3-9539-e5d9e9be2459" />
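
To make the size drops and zero-slope periods described above easier to pin down from raw samples rather than by eye, something like the following could be run over polled values. It is a rough sketch: the function names, thresholds, and sample numbers are all hypothetical and not taken from the graphs.

```python
def find_drops(samples, min_drop):
    """Return (t_prev, t_cur, drop) for each step where the value falls by at least min_drop.

    samples is a time-ordered list of (timestamp, value) pairs; min_drop should be positive.
    """
    drops = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v0 - v1 >= min_drop:
            drops.append((t0, t1, v0 - v1))
    return drops


def find_plateaus(samples, min_intervals):
    """Return (t_start, t_end) for runs where the value is unchanged for at least min_intervals steps."""
    plateaus = []
    run_start = None  # index of the first sample in the current flat run
    for i in range(1, len(samples)):
        if samples[i][1] == samples[i - 1][1]:
            if run_start is None:
                run_start = i - 1
        else:
            if run_start is not None and (i - 1 - run_start) >= min_intervals:
                plateaus.append((samples[run_start][0], samples[i - 1][0]))
            run_start = None
    # Close a flat run that lasts until the final sample.
    if run_start is not None and (len(samples) - 1 - run_start) >= min_intervals:
        plateaus.append((samples[run_start][0], samples[-1][0]))
    return plateaus


# Hypothetical samples: (minutes since start, value).
store_gb = [(0, 40), (5, 48), (10, 41), (15, 50), (20, 55)]
doc_count = [(0, 100), (5, 200), (10, 200), (15, 200), (20, 300), (25, 300)]

print(find_drops(store_gb, min_drop=5))
print(find_plateaus(doc_count, min_intervals=2))
```

Lining up the reported intervals with the times the worker count changed would show whether the drops and plateaus coincide with scaling events.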

Additional question: How can one better identify whether the target cluster has been overloaded?
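
One signal I could imagine checking, offered as an assumption rather than a confirmed answer, is whether the target's write thread pool is rejecting requests. The `rejected` column of `GET _cat/thread_pool/write?format=json&h=node_name,name,active,queue,rejected` is cumulative, so the sketch below compares two snapshots taken some time apart. The node names and counts are made up.

```python
def nodes_with_new_rejections(before, after):
    """Return {node_name: newly rejected count} between two _cat/thread_pool/write snapshots."""
    # Values arrive as strings; "rejected" is a cumulative counter per node.
    previous = {row["node_name"]: int(row["rejected"]) for row in before}
    increases = {}
    for row in after:
        delta = int(row["rejected"]) - previous.get(row["node_name"], 0)
        if delta > 0:
            increases[row["node_name"]] = delta
    return increases


# Hypothetical snapshots taken a few minutes apart.
snapshot_1 = [
    {"node_name": "node-1", "name": "write", "active": "4", "queue": "0", "rejected": "0"},
    {"node_name": "node-2", "name": "write", "active": "8", "queue": "150", "rejected": "12"},
]
snapshot_2 = [
    {"node_name": "node-1", "name": "write", "active": "3", "queue": "0", "rejected": "0"},
    {"node_name": "node-2", "name": "write", "active": "8", "queue": "200", "rejected": "57"},
]

print(nodes_with_new_rejections(snapshot_1, snapshot_2))
```

A non-empty result would suggest the target is pushing back on bulk writes. On the managed service, the CloudWatch metrics `ThreadpoolWriteRejected` and `JVMMemoryPressure` may give a similar view, but I would like confirmation of which signals are recommended for RFS.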
