Add persistent retry queue for temporary write failures during repartition/failover

### What problem does the new feature solve?

During repartition, region failover, or transient routing inconsistency, the frontend may see temporary write failures. In many cases, these failures are caused not by invalid requests but by a target region that is temporarily unavailable, or by routing/metadata that has not yet converged.

Today, these failures are returned directly to clients as errors, and the clients must retry on their own, which hurts write availability during those transition windows.

### What does the feature do?

Add a retry queue in the GreptimeDB frontend to buffer write requests that fail due to transient errors, and replay them once the target region becomes available again.

The retry queue should support:

* Local buffering of retryable failed writes
* Persistence so queued requests survive frontend restarts
* Asynchronous replay after region recovery or route convergence
* Resource control, such as queue size, disk usage, retry limits, and TTL
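The buffering, replay, and resource-control requirements above could be sketched roughly as follows. This is a minimal in-memory illustration, not a proposed design: `RetryQueue`, `QueueLimits`, `QueuedWrite`, and the `send` callback are all hypothetical names, the payload is treated as opaque bytes, and persistence is left out entirely (a real implementation would also need to write entries to disk so they survive frontend restarts).

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical resource limits; field names and values are illustrative.
#[derive(Debug, Clone, Copy)]
pub struct QueueLimits {
    pub max_entries: usize,
    pub max_bytes: usize,
    pub max_attempts: u32,
    pub ttl: Duration,
}

/// One buffered write. `payload` stands in for the serialized request.
#[derive(Debug)]
struct QueuedWrite {
    region_id: u64,
    payload: Vec<u8>,
    attempts: u32,
    enqueued_at: Instant,
}

#[derive(Debug, PartialEq, Eq)]
pub enum EnqueueError {
    QueueFull,
    ByteLimitExceeded,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct ReplayStats {
    pub delivered: usize,
    pub dropped: usize,
}

pub struct RetryQueue {
    limits: QueueLimits,
    entries: VecDeque<QueuedWrite>,
    bytes: usize,
}

impl RetryQueue {
    pub fn new(limits: QueueLimits) -> Self {
        Self { limits, entries: VecDeque::new(), bytes: 0 }
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    /// Buffers a failed write, or rejects it so the caller can return the
    /// original error to the client (backpressure).
    pub fn enqueue(&mut self, region_id: u64, payload: Vec<u8>, now: Instant) -> Result<(), EnqueueError> {
        if self.entries.len() >= self.limits.max_entries {
            return Err(EnqueueError::QueueFull);
        }
        if self.bytes + payload.len() > self.limits.max_bytes {
            return Err(EnqueueError::ByteLimitExceeded);
        }
        self.bytes += payload.len();
        self.entries.push_back(QueuedWrite { region_id, payload, attempts: 0, enqueued_at: now });
        Ok(())
    }

    /// Replays writes for one region in FIFO order; `send` returns true on success.
    /// After the first failed send, later writes for that region stay queued
    /// untouched, so per-region order is preserved.
    pub fn replay<F>(&mut self, region_id: u64, now: Instant, mut send: F) -> ReplayStats
    where
        F: FnMut(&[u8]) -> bool,
    {
        let mut stats = ReplayStats::default();
        let mut blocked = false;
        let pending = std::mem::take(&mut self.entries);
        for mut entry in pending {
            if entry.region_id != region_id || blocked {
                self.entries.push_back(entry);
                continue;
            }
            let expired = now.saturating_duration_since(entry.enqueued_at) >= self.limits.ttl;
            if expired {
                stats.dropped += 1; // TTL exceeded: give up on this write
            } else if send(&entry.payload) {
                stats.delivered += 1;
            } else {
                entry.attempts += 1;
                blocked = true;
                if entry.attempts < self.limits.max_attempts {
                    self.entries.push_back(entry);
                    continue;
                }
                stats.dropped += 1; // retry limit reached
            }
            self.bytes -= entry.payload.len();
        }
        stats
    }
}
```

In this sketch, `replay` would be called when the frontend learns that a region has recovered or that routes have converged. Passing `now` explicitly keeps the TTL logic testable; how dropped writes are reported (logs, metrics, a dead-letter store) is left open.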

### Implementation challenges

1. Retryable error classification
2. Persistence format
3. Idempotency, ordering and isolation
4. Backpressure and limits
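To make the first and third challenges more concrete, here is one possible shape for the classification step. The `WriteFailure` variants below are hypothetical and do not correspond to GreptimeDB's actual error types; the sketch also assumes that a region-unavailable or stale-route failure means the write was rejected before being applied, while a timeout leaves the outcome unknown and is therefore only safe to queue when the write is idempotent.

```rust
/// Hypothetical failure kinds as seen by the frontend (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WriteFailure {
    RegionUnavailable,
    StaleRoute,
    Timeout,
    InvalidRequest,
    SchemaMismatch,
}

/// What the frontend does with a failed write.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Disposition {
    Enqueue,
    ReturnToClient,
}

/// Decides whether a failed write goes to the retry queue.
/// `idempotent` is an assumed flag saying a duplicate apply is harmless.
pub fn classify(failure: WriteFailure, idempotent: bool) -> Disposition {
    match failure {
        // Assumed to be rejected before being applied: safe to replay later.
        WriteFailure::RegionUnavailable | WriteFailure::StaleRoute => Disposition::Enqueue,
        // Outcome unknown: replay only if a duplicate apply is harmless.
        WriteFailure::Timeout if idempotent => Disposition::Enqueue,
        WriteFailure::Timeout => Disposition::ReturnToClient,
        // Retrying cannot fix a bad request, so the client must see the error.
        WriteFailure::InvalidRequest | WriteFailure::SchemaMismatch => Disposition::ReturnToClient,
    }
}
```

The exhaustive `match` means that adding a new failure kind forces an explicit decision about whether it is retryable, rather than letting it fall into a default bucket.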

Add persistent retry queue for temporary write failures during repartition/failover #8085
