
[RFC] Delta Snapshot Reindex with RFS #15

@AndreKurait


RFC Proposal


What / Why

We propose Delta Snapshot Reindex (DSR), an enhancement to Reindex‑from‑Snapshot (RFS) that applies only the changes between two snapshots to a target cluster, rather than full reingestion. This reduces unnecessary I/O, network, and compute overhead when updating historical data already backfilled via RFS or Snapshot/Restore (see OpenSearch Migration Assistant docs: https://opensearch.org/docs/latest/migration-assistant/).


Who Wants This?

  • Standby-cluster operators: maintaining disaster-recovery clusters that need periodic, incremental updates.
  • Data-lake integrators: managing cold or archived OpenSearch clusters, periodically snapshotting data for long-term storage, and needing to apply only incremental changes (deltas) to downstream systems or refreshed clusters, avoiding full reprocessing.
  • Migration operators: optimizing backfill resource costs while reducing lag between source and target clusters.

Problem Statement

Updating a target index currently requires deleting it and re‑ingesting all documents to guarantee consistency. This:

  1. Wastes network and CPU on unchanged documents.
  2. Causes write amplification on the target cluster.
  3. Results in a high RTO for the migration unless paired with other solutions such as Capture and Replay.

Proposed Solution

  1. Inputs:
    • Old snapshot name (already applied to target).
    • New snapshot name (a later or earlier point in time).
  2. Per‑shard, per‑segment diff:
    • Parse Lucene segment files from both snapshots.
    • Compare segment names and live‑docs bitsets.
  3. Delta determination:

     | Segment scenario | Action |
     | --- | --- |
     | In old but not in new | Delete all live docs from that segment. |
     | In new but not in old | Add all live docs from that segment. |
     | In both snapshots | Compare live‑docs bitsets: Old ∖ New → delete; New ∖ Old → add. |

  4. Apply deltas:
    • First execute all deletes, then all adds/updates.
  5. Merge optimization (future enhancement):
    • Track a hashmap of _id → hash(_source); if a delete and an add target the same _id with an identical hash, skip both.
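The diff and apply steps above can be sketched as follows. This is an illustrative Python sketch, not the actual RFS implementation: a snapshot is modeled as a plain mapping of segment name to a set of live document IDs, and the function names (`segment_delta`, `apply_delta`) are hypothetical. The real implementation would parse Lucene segment files and compare live-docs bitsets directly.

```python
def segment_delta(old: dict, new: dict):
    """Return (deletes, adds) of doc ids, per the delta-determination table.

    old/new: {segment_name: set_of_live_doc_ids} for each snapshot.
    """
    deletes, adds = set(), set()
    for seg, live in old.items():
        if seg not in new:
            deletes |= live             # segment gone: delete its live docs
    for seg, live in new.items():
        if seg not in old:
            adds |= live                # new segment: add its live docs
    for seg in old.keys() & new.keys():
        deletes |= old[seg] - new[seg]  # live in old, not in new -> delete
        adds |= new[seg] - old[seg]     # live in new, not in old -> add
    return deletes, adds

def apply_delta(old: dict, new: dict, source_hash):
    """Order ops (all deletes before all adds) and apply the merge
    optimization: a doc rewritten by a segment merge shows up as both a
    delete and an add; if its _source hash is unchanged, skip both ops."""
    deletes, adds = segment_delta(old, new)
    unchanged = {d for d in deletes & adds
                 if source_hash(d, "old") == source_hash(d, "new")}
    return sorted(deletes - unchanged), sorted(adds - unchanged)
```

For example, if segments `_0` and `_1` are merged into `_2` while doc `c` is deleted and doc `d` is added, the raw diff would delete `{a, b, c}` and add `{a, b, d}`, but the hash check collapses this to one delete (`c`) and one add (`d`).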

Value for Disaster Recovery Standby Clusters

Building DR standby clusters—especially cross‑region—requires minimizing RTO/RPO and bandwidth/storage costs. DSR brings significant advantages:

  • Reduced RTO: Instead of full reingest (often hours/days for terabytes), DSR applies only deltas, enabling reindex in minutes.
  • Improved RPO: hourly or sub-hourly snapshots yield an RPO of 15 minutes to 1 hour.
  • Cross‑region efficiency: Transferring only changed segments (often <10% of total data) reduces inter‑region egress costs.
  • Cost savings: Fewer document ops (e.g. 90% fewer at 10% change rate) means smaller target cluster sizing and reduced compute/network spend.

Note: Cross‑Cluster Replication (CCR) offers continuous replication but incurs per‑write network cost and couples DR cluster performance to production load (see OpenSearch CCR docs: https://opensearch.org/docs/latest/cluster-management-ccr/).


Limitations & Drawbacks

  • Segment merges can trigger full‑segment deletes/adds unless optimized via the hashmap.
  • Indices with _source disabled (_source: false) are not compatible (same as RFS).
  • Plugin classes: custom plugins may need to be on the DSR application classpath.

Alternatives Considered

  1. Log‑based CDC: Invasive change‑stream setup. Limited compatibility across older versions.

  2. Reindex REST API: Burdens source cluster; no delete detection.

  3. Incremental Snapshot/Restore: Unsupported beyond one version.

  4. Native Delta Reindex in Snapshot/Restore API:
    Embedding delta logic directly into the Snapshot/Restore mechanism (rather than building around Reindex-from-Snapshot) could offer a cleaner, lower-latency path for applying snapshot diffs. This would allow:

    • Repository-level segment diffing and blob reuse.
    • Elimination of reindex overhead and custom coordination logic.
    • Tighter integration with existing DR workflows.

    Drawbacks: Requires deeper changes to core restore flows and compatibility handling across OpenSearch/Lucene versions. Slower to implement and validate across all repository types.


Feedback Requested

  • Feasibility and scale of per‑segment bitset comparisons.
  • Suggestions for efficiently processing segment merges.
  • Special data types needing extra care (e.g. nested docs, binary fields, geo‑shapes, custom analyzers, etc.).
  • Additional DR considerations for multi‑region compliance, network constraints, and security.
