ingest: persistent 503s after non-graceful scale-down of indexers #6480

**Describe the bug**

When one or more indexer nodes are removed from the cluster without going through the Retiring state (e.g., abrupt pod termination in Kubernetes), their ingest shards remain marked as Open in the metastore indefinitely. 

This causes persistent 503 / NoShardsAvailable errors on all ingest requests for the affected indexes until the index is deleted and recreated.

**Steps to reproduce (if applicable)**

1. Create an index with an ingest source
2. Start ingesting
3. Scale up indexers, then scale them back down abruptly (kill pods, no graceful drain)
4. All ingest requests for the index return 503
5. Restarting the whole cluster doesn't help (the dead shards are reloaded from the metastore)
6. Only deleting and recreating the index restores ingestion

**Root cause**

Two compounding issues:
- Control plane: `compute_shards_to_rebalance` only handles shards on Retiring ingesters. Shards whose leader is completely absent from the ingester pool (node removed without draining) are silently skipped, so no replacement shard is ever opened.
- Router: when `pick_node` returns `None` because all routing-table entries point to dead ingesters (not in the pool), the router records `NoShardsAvailable`, which does not add the dead nodes to `unavailable_leaders`. As a result, the control plane is never told that the leaders are unreachable and never opens new shards.
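To make the control-plane gap concrete, here is a minimal, self-contained model. It is not Quickwit's code: `Shard`, `IngesterStatus`, and both function names are hypothetical simplifications, and the pool is reduced to a map from node ID to status. It only illustrates how a filter keyed on the Retiring state can never match a leader that has disappeared from the pool.

```rust
use std::collections::HashMap;

/// Simplified ingester lifecycle (hypothetical; only the states needed here).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum IngesterStatus {
    Ready,
    Retiring,
}

/// Hypothetical, pared-down shard record: an ID and the node leading it.
#[derive(Clone, Debug, PartialEq, Eq)]
struct Shard {
    shard_id: u64,
    leader_id: String,
}

/// Models the behavior described above: only shards led by a Retiring
/// ingester are selected. For a leader missing from the pool, the lookup
/// yields `None`, which never equals `Some(Retiring)`, so the shard is skipped.
fn shards_on_retiring_leaders(
    shards: &[Shard],
    pool: &HashMap<String, IngesterStatus>,
) -> Vec<u64> {
    shards
        .iter()
        .filter(|shard| pool.get(&shard.leader_id) == Some(&IngesterStatus::Retiring))
        .map(|shard| shard.shard_id)
        .collect()
}

/// Selects the shards the filter above misses: leader not in the pool at all.
fn shards_on_absent_leaders(
    shards: &[Shard],
    pool: &HashMap<String, IngesterStatus>,
) -> Vec<u64> {
    shards
        .iter()
        .filter(|shard| !pool.contains_key(&shard.leader_id))
        .map(|shard| shard.shard_id)
        .collect()
}

/// Example pool: `indexer-2` was killed abruptly, so it has no entry.
fn example_pool() -> HashMap<String, IngesterStatus> {
    HashMap::from([
        ("indexer-0".to_string(), IngesterStatus::Ready),
        ("indexer-1".to_string(), IngesterStatus::Retiring),
    ])
}

/// Example shards: one per indexer, including the one that was killed.
fn example_shards() -> Vec<Shard> {
    vec![
        Shard { shard_id: 1, leader_id: "indexer-0".to_string() },
        Shard { shard_id: 2, leader_id: "indexer-1".to_string() },
        Shard { shard_id: 3, leader_id: "indexer-2".to_string() },
    ]
}
```

With the example data, the Retiring-only filter selects shard 2, while shard 3 (led by the killed `indexer-2`) is only found by the second function. In this model, that is the shard that stays Open with nothing ever replacing it.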

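**Expected behavior**

Shards whose leader is no longer in the ingester pool should be detected and replaced (resharded), so that ingestion recovers without deleting and recreating the index.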
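
As an illustration of the router-side gap described under Root cause, the sketch below models one possible direction: when routing fails, the leaders that are missing from the pool are collected so they could be reported to the control plane. This is an assumption about a possible fix, not Quickwit's implementation. `RoutingEntry`, `RouteOutcome`, and `pick_node_with_report` are hypothetical names; only `unavailable_leaders` is borrowed from the description above.

```rust
use std::collections::{BTreeSet, HashSet};

/// Hypothetical routing-table entry, reduced to the node leading a shard.
struct RoutingEntry {
    leader_id: String,
}

/// Hypothetical result of a routing attempt in this simplified model.
#[derive(Debug, PartialEq, Eq)]
struct RouteOutcome {
    /// First leader found in the pool, if any.
    picked: Option<String>,
    /// Leaders referenced by the routing table but missing from the pool.
    unavailable_leaders: BTreeSet<String>,
}

/// Walks the routing table once. Live leaders are candidates for `picked`
/// (the first one wins); every leader absent from the pool is recorded in
/// `unavailable_leaders`, even when no node can be picked at all.
fn pick_node_with_report(entries: &[RoutingEntry], pool: &HashSet<String>) -> RouteOutcome {
    let mut outcome = RouteOutcome {
        picked: None,
        unavailable_leaders: BTreeSet::new(),
    };
    for entry in entries {
        if pool.contains(&entry.leader_id) {
            if outcome.picked.is_none() {
                outcome.picked = Some(entry.leader_id.clone());
            }
        } else {
            outcome.unavailable_leaders.insert(entry.leader_id.clone());
        }
    }
    outcome
}

/// Example routing table in which every entry points at a killed indexer.
fn example_dead_entries() -> Vec<RoutingEntry> {
    vec![
        RoutingEntry { leader_id: "indexer-2".to_string() },
        RoutingEntry { leader_id: "indexer-3".to_string() },
    ]
}
```

In the all-dead case, `picked` is `None` (the situation that currently surfaces as a 503), but `unavailable_leaders` is non-empty, which gives the control plane something to act on instead of an empty report.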

**Configuration:**

Quickwit edge tag `quickwit:edge@sha256:fd88ac3e41148a13ba16f722d6855fd626de516663f63ded4b624b2cd74d0d47`


