Skip to content

Optimized Reindex-from-Snapshot with Sub-shard checkpoints #1095

@sumobrian

Description

@sumobrian

Is your feature request related to a problem? Please describe.

Currently, the reindex-from-snapshot process will attempt to migrate all docs in a shard with one worker and will only mark the shard as completed once all docs are migrated in a single attempt. This increases migration time and risk for larger shards as the first docs in each shard may be retried several times before succeeding. The reliance on a long-running single worker for large shards increases risk of failure increasing the time to complete the migration.

Describe the solution you'd like

Implement the ability to regularly checkpoint and resume migration of shards, which limits the amount of duplicate times a doc is migrated particularly for large shards. Implement a ceiling on the duration a lease for a work-item can reach.

Describe alternatives you've considered

  • We can upfront split up the shard into sub-shard work items that can be migrated in parallel, but this introduces complexity and increases unevenness of work distribution in the target cluster as when an index shard count remains the same during a migration, each sub-shard worker for a given source shard will hit a single node/shard in the target cluster.

Additional context

Jira Epic(s)

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions