
[RFC] Delta Snapshot Reindex with RFS #15

@AndreKurait


RFC Proposal


What / Why

We propose Delta Snapshot Reindex (DSR), an enhancement to Reindex‑from‑Snapshot (RFS) that applies only the changes between two snapshots to a target cluster, rather than full reingestion. This reduces unnecessary I/O, network, and compute overhead when updating historical data already backfilled via RFS or Snapshot/Restore (see OpenSearch Migration Assistant docs: https://opensearch.org/docs/latest/migration-assistant/).


Who Wants This?

  • Standby-cluster operators: maintaining disaster-recovery clusters that need periodic, incremental updates.
  • Data-lake integrators: managing cold or archived OpenSearch clusters, periodically snapshotting data for long-term storage, and needing to apply only incremental changes (deltas) to downstream systems or refreshed clusters, avoiding full reprocessing.
  • Migration operators: optimizing backfill resource costs while reducing lag between source and target clusters.

Problem Statement

Updating a target index currently requires deleting it and re‑ingesting all documents to guarantee consistency. This:

  1. Wastes network and CPU on unchanged documents.
  2. Causes write amplification on the target cluster.
  3. Results in a high RTO for the migration unless paired with other solutions such as Capture and Replay.

Proposed Solution

  1. Inputs:
    • Old snapshot name (already applied to target).
    • New snapshot name (a later or earlier point in time).
  2. Per‑shard, per‑segment diff:
    • Parse Lucene segment files from both snapshots.
    • Compare segment names and live‑docs bitsets.
  3. Delta determination:

     | Segment scenario | Action |
     | --- | --- |
     | In old but not in new | Delete all live docs from that segment. |
     | In new but not in old | Add all live docs from that segment. |
     | In both snapshots | Compare live‑docs bitsets: Old ∖ New → delete; New ∖ Old → add. |

  4. Apply deltas:
    • First execute all deletes, then all adds/updates.
  5. Merge optimization (future enhancement):
    • Track a hashmap of _id → hash(_source); if a delete and an add target the same _id with an identical hash, skip both.
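The diff and apply steps above can be sketched as follows. This is an illustrative Python sketch, not the actual RFS implementation: a snapshot is modeled as a plain mapping of segment name to a set of live document IDs, and the function names (`segment_delta`, `apply_delta`) are hypothetical. The real implementation would parse Lucene segment files and compare live-docs bitsets directly.

```python
def segment_delta(old: dict, new: dict):
    """Return (deletes, adds) of doc ids, per the delta-determination table.

    old/new: {segment_name: set_of_live_doc_ids} for each snapshot.
    """
    deletes, adds = set(), set()
    for seg, live in old.items():
        if seg not in new:
            deletes |= live             # segment gone: delete its live docs
    for seg, live in new.items():
        if seg not in old:
            adds |= live                # new segment: add its live docs
    for seg in old.keys() & new.keys():
        deletes |= old[seg] - new[seg]  # live in old, not in new -> delete
        adds |= new[seg] - old[seg]     # live in new, not in old -> add
    return deletes, adds

def apply_delta(old: dict, new: dict, source_hash):
    """Order ops (all deletes before all adds) and apply the merge
    optimization: a doc rewritten by a segment merge shows up as both a
    delete and an add; if its _source hash is unchanged, skip both ops."""
    deletes, adds = segment_delta(old, new)
    unchanged = {d for d in deletes & adds
                 if source_hash(d, "old") == source_hash(d, "new")}
    return sorted(deletes - unchanged), sorted(adds - unchanged)
```

For example, if segments `_0` and `_1` are merged into `_2` while doc `c` is deleted and doc `d` is added, the raw diff would delete `{a, b, c}` and add `{a, b, d}`, but the hash check collapses this to one delete (`c`) and one add (`d`).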

Value for Disaster Recovery Standby Clusters

Building DR standby clusters—especially cross‑region—requires minimizing RTO/RPO and bandwidth/storage costs. DSR brings significant advantages:

  • Reduced RTO: Instead of full reingest (often hours/days for terabytes), DSR applies only deltas, enabling reindex in minutes.
  • Improved RPO: hourly or sub-hourly snapshots yield an RPO of 15 minutes to 1 hour.
  • Cross‑region efficiency: Transferring only changed segments (often <10% of total data) reduces inter‑region egress costs.
  • Cost savings: Fewer document ops (e.g. 90% fewer at 10% change rate) means smaller target cluster sizing and reduced compute/network spend.

Note: Cross‑Cluster Replication (CCR) offers continuous replication but incurs per‑write network cost and couples DR cluster performance to production load (see OpenSearch CCR docs: https://opensearch.org/docs/latest/cluster-management-ccr/).


Limitations & Drawbacks

  • Segment merges can trigger full‑segment deletes/adds unless optimized via the hashmap.
  • Indices with _source disabled (_source: false) are not compatible (same as RFS).
  • Plugin classes: custom plugins may need to be on the DSR application classpath.

Alternatives Considered

  1. Log‑based CDC: Invasive change‑stream setup. Limited compatibility across older versions.

  2. Reindex REST API: Burdens source cluster; no delete detection.

  3. Incremental Snapshot/Restore: Unsupported beyond one version.

  4. Native Delta Reindex in Snapshot/Restore API:
    Embedding delta logic directly into the Snapshot/Restore mechanism (rather than building around Reindex-from-Snapshot) could offer a cleaner, lower-latency path for applying snapshot diffs. This would allow:

    • Repository-level segment diffing and blob reuse.
    • Elimination of reindex overhead and custom coordination logic.
    • Tighter integration with existing DR workflows.

    Drawbacks: Requires deeper changes to core restore flows and compatibility handling across OpenSearch/Lucene versions. Slower to implement and validate across all repository types.


Feedback Requested

  • Feasibility and scale of per‑segment bitset comparisons.
  • Suggestions for efficiently processing segment merges.
  • Special data types needing extra care (e.g. nested docs, binary fields, geo‑shapes, custom analyzers, etc.).
  • Additional DR considerations for multi‑region compliance, network constraints, and security.
