ingest: persistent 503s after non-graceful scale-down of indexers #6480

@nico34638

Describe the bug

When one or more indexer nodes are removed from the cluster without going through the Retiring state (e.g., abrupt pod termination in Kubernetes), their ingest shards remain marked as Open in the metastore indefinitely.

This causes persistent 503 / NoShardsAvailable errors on all ingest requests for the affected indexes until the index is deleted and recreated.

Steps to reproduce (if applicable)

  1. Create an index with an ingest source
  2. Start ingesting
  3. Scale up indexers, then scale them back down abruptly (kill pods, no graceful drain)
  4. All ingest requests for the index return 503
  5. Restarting the whole cluster doesn't help (dead shards are reloaded from metastore)
  6. Only deleting and recreating the index restores ingestion

Root cause

Two compounding issues:

  • Control plane: compute_shards_to_rebalance only handles shards on Retiring ingesters. Shards whose leader is completely absent from the ingester pool (node removed without draining) are silently skipped, so no replacement shard is ever opened.
  • Router: when pick_node returns None because all routing-table entries point to dead ingesters (absent from the pool), the router records NoShardsAvailable, which does not add the dead nodes to unavailable_leaders. As a result, the control plane is never told the leaders are unreachable and never opens new shards.
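
To make the two gaps concrete, here is a minimal sketch of the behavior I would expect, using a simplified model. All types and both functions (`shards_to_rebalance`, `unavailable_leaders`) are hypothetical stand-ins written for this issue, not Quickwit's actual code or signatures. The assumption is that an open shard should be treated as needing replacement when its leader is either Retiring or missing from the ingester pool, and that the router should report leaders missing from the pool as unavailable.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical, simplified model. These are NOT Quickwit's real types.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum IngesterStatus {
    Ready,
    Retiring,
}

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum ShardState {
    Open,
    Closed,
}

#[derive(Clone, Debug)]
struct Shard {
    shard_id: u64,
    leader_id: String,
    state: ShardState,
}

/// Control-plane side (assumed behavior): pick open shards that need a
/// replacement because their leader is Retiring or absent from the pool.
fn shards_to_rebalance(shards: &[Shard], pool: &HashMap<String, IngesterStatus>) -> Vec<u64> {
    shards
        .iter()
        .filter(|shard| shard.state == ShardState::Open)
        .filter(|shard| match pool.get(&shard.leader_id) {
            Some(IngesterStatus::Retiring) => true,
            Some(IngesterStatus::Ready) => false,
            // Leader vanished without draining: the case that is skipped today.
            None => true,
        })
        .map(|shard| shard.shard_id)
        .collect()
}

/// Router side (assumed behavior): leaders still referenced by open shards
/// but missing from the pool, to be reported as unavailable.
fn unavailable_leaders(shards: &[Shard], pool: &HashMap<String, IngesterStatus>) -> HashSet<String> {
    shards
        .iter()
        .filter(|shard| shard.state == ShardState::Open && !pool.contains_key(&shard.leader_id))
        .map(|shard| shard.leader_id.clone())
        .collect()
}

fn main() {
    // "indexer-2" was killed abruptly, so it is not in the pool at all.
    let pool = HashMap::from([
        ("indexer-0".to_string(), IngesterStatus::Ready),
        ("indexer-1".to_string(), IngesterStatus::Retiring),
    ]);
    let shards = vec![
        Shard { shard_id: 1, leader_id: "indexer-0".to_string(), state: ShardState::Open },
        Shard { shard_id: 2, leader_id: "indexer-1".to_string(), state: ShardState::Open },
        Shard { shard_id: 3, leader_id: "indexer-2".to_string(), state: ShardState::Open },
        Shard { shard_id: 4, leader_id: "indexer-2".to_string(), state: ShardState::Closed },
    ];
    println!("rebalance: {:?}", shards_to_rebalance(&shards, &pool));
    println!("unavailable: {:?}", unavailable_leaders(&shards, &pool));
}
```

In this model, shard 3 (leader gone from the pool) is selected alongside shard 2 (leader Retiring), and `indexer-2` is reported as unavailable. Either signal on its own should be enough for the control plane to open a replacement shard.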

Expected behavior
Reshard lead

Configuration:

Quickwit edge tag quickwit:edge@sha256:fd88ac3e41148a13ba16f722d6855fd626de516663f63ded4b624b2cd74d0d47
