# Delta Snapshot Reindex with RFS: RFC Proposal

## What / Why
We propose Delta Snapshot Reindex (DSR), an enhancement to Reindex‑from‑Snapshot (RFS) that applies only the changes between two snapshots to a target cluster, rather than full reingestion. This reduces unnecessary I/O, network, and compute overhead when updating historical data already backfilled via RFS or Snapshot/Restore (see OpenSearch Migration Assistant docs: https://opensearch.org/docs/latest/migration-assistant/).
## Who Wants This?
- Operators maintaining standby clusters who need periodic, incremental updates.
- Data-lake integrators managing cold or archived OpenSearch clusters, who periodically snapshot data for long-term storage and need to apply only incremental changes (deltas) to downstream systems or refreshed clusters, avoiding full reprocessing.
- Migration Operators: optimizing backfill resource costs while reducing lag between source and target clusters.
## Problem Statement
Updating a target index currently requires deleting it and re‑ingesting all documents to guarantee consistency. This:
- Wastes network/CPU on unchanged docs.
- Causes write amplification on the target cluster.
- Results in a high RTO for the migration unless combined with other solutions such as Capture and Replay.
## Proposed Solution
- Inputs:
  - Old snapshot name (already applied to the target).
  - New snapshot name (a later or earlier point in time).
- Per-shard, per-segment diff (see the segment-diff sketch after this list):
  - Parse Lucene segment files from both snapshots.
  - Compare segment names and live-docs bitsets.
- Delta determination:

  | Segment scenario | Action |
  |---|---|
  | In old but not in new | Delete all live docs from that segment. |
  | In new but not in old | Add all live docs from that segment. |
  | In both snapshots | Compare live-docs bitsets: docs in Old ∖ New → delete; docs in New ∖ Old → add. |

- Apply deltas (see the bulk sketch after this list):
  - First execute all deletes, then all adds/updates.
- Merge optimization (future enhancement; see the final sketch after this list):
  - Track a hashmap of `_id` → hash(`_source`); if a delete and an add target the same `_id` with an identical hash, skip both.
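To make the per-shard diff concrete, here is a minimal Java sketch of the segment and live-docs comparison, assuming both snapshots have already been unpacked to local Lucene directories (as RFS does when reading a snapshot). The class and method names (e.g. `SegmentDiff`) are illustrative, not an existing RFS API, and emission of the actual delete/add work items is elided.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.util.Bits;

public final class SegmentDiff {

    public static void diffShard(Path oldShardDir, Path newShardDir) throws IOException {
        try (Directory oldDir = FSDirectory.open(oldShardDir);
             Directory newDir = FSDirectory.open(newShardDir)) {
            Map<String, SegmentCommitInfo> oldSegs = byName(SegmentInfos.readLatestCommit(oldDir));
            Map<String, SegmentCommitInfo> newSegs = byName(SegmentInfos.readLatestCommit(newDir));

            for (SegmentCommitInfo oldSeg : oldSegs.values()) {
                SegmentCommitInfo newSeg = newSegs.get(oldSeg.info.name);
                if (newSeg == null) {
                    // In old but not new: delete every live doc of this segment on the target.
                } else {
                    diffLiveDocs(oldDir, oldSeg, newDir, newSeg);
                }
            }
            for (SegmentCommitInfo newSeg : newSegs.values()) {
                if (!oldSegs.containsKey(newSeg.info.name)) {
                    // In new but not old: add every live doc of this segment to the target.
                }
            }
        }
    }

    // Segments are immutable, so a segment present in both snapshots has the
    // same maxDoc; only its live-docs bitset can differ.
    private static void diffLiveDocs(Directory oldDir, SegmentCommitInfo oldSeg,
                                     Directory newDir, SegmentCommitInfo newSeg) throws IOException {
        Bits oldLive = readLiveDocs(oldDir, oldSeg);
        Bits newLive = readLiveDocs(newDir, newSeg);
        for (int docId = 0; docId < oldSeg.info.maxDoc(); docId++) {
            boolean inOld = oldLive == null || oldLive.get(docId);
            boolean inNew = newLive == null || newLive.get(docId);
            if (inOld && !inNew) {
                // Old \ New: delete this doc on the target.
            } else if (!inOld && inNew) {
                // New \ Old: add this doc to the target.
            }
        }
    }

    // A null Bits means every doc in the segment is live.
    private static Bits readLiveDocs(Directory dir, SegmentCommitInfo seg) throws IOException {
        return seg.hasDeletions()
                ? seg.info.getCodec().liveDocsFormat().readLiveDocs(dir, seg, IOContext.READONCE)
                : null;
    }

    private static Map<String, SegmentCommitInfo> byName(SegmentInfos infos) {
        Map<String, SegmentCommitInfo> out = new HashMap<>();
        for (SegmentCommitInfo seg : infos) {
            out.put(seg.info.name, seg);
        }
        return out;
    }
}
```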
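Applying the computed delta could then look like the following sketch using the opensearch-java client, with all deletes flushed before any adds per the ordering above. `DeltaApplier` and the `DeltaOp` work item are hypothetical names, and batching, retries, and error handling are omitted.

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.opensearch.client.opensearch.OpenSearchClient;
import org.opensearch.client.opensearch.core.BulkRequest;

public final class DeltaApplier {
    // Hypothetical work item: the document _id plus, for adds, its _source.
    public record DeltaOp(String id, Map<String, Object> source) {}

    public static void apply(OpenSearchClient client, String index,
                             List<DeltaOp> deletes, List<DeltaOp> adds) throws IOException {
        // Phase 1: all deletes, so a stale delete can never clobber a fresh add.
        BulkRequest.Builder deleteBatch = new BulkRequest.Builder();
        for (DeltaOp op : deletes) {
            deleteBatch.operations(b -> b.delete(d -> d.index(index).id(op.id())));
        }
        client.bulk(deleteBatch.build());

        // Phase 2: all adds/updates.
        BulkRequest.Builder addBatch = new BulkRequest.Builder();
        for (DeltaOp op : adds) {
            addBatch.operations(b -> b.index(i -> i.index(index).id(op.id()).document(op.source())));
        }
        client.bulk(addBatch.build());
    }
}
```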
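The merge optimization could be as simple as cancelling matching delete/add pairs before the delta is applied. This sketch assumes a stable 64-bit hash of the canonical `_source` bytes has already been computed per document; `MergeOptimizer` is an illustrative name.

```java
import java.util.Iterator;
import java.util.Map;

public final class MergeOptimizer {
    // Remove delete/add pairs that share an _id and an identical hash(_source):
    // those docs merely moved between segments during a merge, so neither op is needed.
    public static void cancelNoOps(Map<String, Long> deleteHashesById,
                                   Map<String, Long> addHashesById) {
        Iterator<Map.Entry<String, Long>> it = deleteHashesById.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> delete = it.next();
            Long addHash = addHashesById.get(delete.getKey());
            if (addHash != null && addHash.equals(delete.getValue())) {
                it.remove();
                addHashesById.remove(delete.getKey());
            }
        }
    }
}
```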
## Value for Disaster Recovery Standby Clusters
Building DR standby clusters—especially cross‑region—requires minimizing RTO/RPO and bandwidth/storage costs. DSR brings significant advantages:
- Reduced RTO: Instead of full reingest (often hours/days for terabytes), DSR applies only deltas, enabling reindex in minutes.
- Improved RPO: hourly or sub-hourly snapshots yield an RPO of roughly 15 minutes to 1 hour.
- Cross‑region efficiency: Transferring only changed segments (often <10% of total data) reduces inter‑region egress costs.
- Cost savings: fewer document operations (e.g. 90% fewer at a 10% change rate) mean smaller target cluster sizing and reduced compute/network spend.
Note: Cross‑Cluster Replication (CCR) offers continuous replication but incurs per‑write network cost and couples DR cluster performance to production load (see OpenSearch CCR docs: https://opensearch.org/docs/latest/cluster-management-ccr/).
## Limitations & Drawbacks
- Segment merges can trigger full-segment deletes/adds unless optimized via the `_id` → hash(`_source`) map described above.
- Indices with `_source: false` are not compatible (same as RFS).
- Plugin classes: custom plugins may need to be on the DSR application classpath.
## Alternatives Considered
- Log-based CDC: invasive change-stream setup; limited compatibility across older versions.
- Reindex REST API: burdens the source cluster; no delete detection.
- Incremental Snapshot/Restore: unsupported beyond one version.
- Native Delta Reindex in the Snapshot/Restore API: embedding delta logic directly into the Snapshot/Restore mechanism (rather than building around Reindex-from-Snapshot) could offer a cleaner, lower-latency path for applying snapshot diffs. This would allow:
  - Repository-level segment diffing and blob reuse.
  - Elimination of reindex overhead and custom coordination logic.
  - Tighter integration with existing DR workflows.

  Drawbacks: requires deeper changes to core restore flows and compatibility handling across OpenSearch/Lucene versions. Slower to implement and validate across all repository types.
## Feedback Requested
- Feasibility and scale of per‑segment bitset comparisons.
- Suggestions for efficiently handling segment merges.
- Special data types needing extra care (e.g. nested docs, binary fields, geo‑shapes, custom analyzers, etc.).
- Additional DR considerations for multi‑region compliance, network constraints, and security.