Add persistent retry queue for temporary write failures during repartition/failover #8085

@killme2008

What problem does the new feature solve?

During repartition, region failover, or transient routing inconsistency, the frontend may see temporary write failures. In many cases, these failures are not caused by invalid requests but by a target region that is temporarily unavailable, or by routing/metadata that has not converged yet.

Today, these failures are returned directly to clients as errors, and clients must retry on their own, which hurts write availability during those transition windows.

What does the feature do?

Add a retry queue in the GreptimeDB frontend to buffer write requests that fail due to transient errors, and replay them once the target region becomes available again.

The retry queue should support:

  • Local buffering of retryable failed writes
  • Persistence so queued requests survive frontend restarts
  • Asynchronous replay after region recovery or route convergence
  • Resource control, such as queue size, disk usage, retry limits, and TTL
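To make the buffering, replay, and resource-control requirements more concrete, here is a minimal in-memory sketch. All names (`RetryLimits`, `QueuedWrite`, `RetryQueue`, `replay`) are hypothetical and not existing GreptimeDB types. Persistence and disk-usage accounting are deliberately left out, and the `send` callback stands in for whatever the frontend would use to re-dispatch a write.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical limits; field names are illustrative, not real config keys.
#[derive(Debug, Clone)]
pub struct RetryLimits {
    pub max_entries: usize,
    pub max_bytes: usize,
    pub max_attempts: u32,
    pub ttl: Duration,
}

/// A buffered write that previously failed with a retryable error.
#[derive(Debug, Clone)]
pub struct QueuedWrite {
    pub region_id: u64,
    pub payload: Vec<u8>,
    pub enqueued_at: Instant,
    pub attempts: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum EnqueueError {
    QueueFull,
}

pub struct RetryQueue {
    limits: RetryLimits,
    entries: VecDeque<QueuedWrite>,
    bytes: usize,
}

impl RetryQueue {
    pub fn new(limits: RetryLimits) -> Self {
        Self { limits, entries: VecDeque::new(), bytes: 0 }
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    /// Buffer a failed write; reject it if an entry or byte limit would be exceeded.
    pub fn enqueue(
        &mut self,
        region_id: u64,
        payload: Vec<u8>,
        now: Instant,
    ) -> Result<(), EnqueueError> {
        if self.entries.len() >= self.limits.max_entries
            || self.bytes + payload.len() > self.limits.max_bytes
        {
            return Err(EnqueueError::QueueFull);
        }
        self.bytes += payload.len();
        self.entries.push_back(QueuedWrite { region_id, payload, enqueued_at: now, attempts: 0 });
        Ok(())
    }

    /// Replay entries in FIFO order. `send` returns true when the write succeeded.
    /// Entries past their TTL or retry limit are dropped. Returns (replayed, dropped).
    pub fn replay<F>(&mut self, now: Instant, mut send: F) -> (usize, usize)
    where
        F: FnMut(&QueuedWrite) -> bool,
    {
        let (mut replayed, mut dropped) = (0, 0);
        let mut kept = VecDeque::new();
        while let Some(mut w) = self.entries.pop_front() {
            if now.duration_since(w.enqueued_at) > self.limits.ttl {
                self.bytes -= w.payload.len();
                dropped += 1;
                continue;
            }
            w.attempts += 1;
            if send(&w) {
                self.bytes -= w.payload.len();
                replayed += 1;
            } else if w.attempts >= self.limits.max_attempts {
                self.bytes -= w.payload.len();
                dropped += 1;
            } else {
                kept.push_back(w); // still failing; keep it for the next replay pass
            }
        }
        self.entries = kept;
        (replayed, dropped)
    }
}
```

In this sketch a full queue rejects the write so the error can still be returned to the client, which is one possible backpressure policy. A real implementation would also need to decide what happens to dropped entries (for example, logging or metrics).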

Implementation challenges

  1. Retryable error classification
  2. Persistence format
  3. Idempotency, ordering and isolation
  4. Backpressure and limits
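As an illustration of the first challenge, retryable error classification could be a single function that decides whether a failed write is queued or returned to the client. The error variants below are hypothetical placeholders, not GreptimeDB's actual error types. The sketch treats unknown errors as non-retryable, on the assumption that queueing a write that can never succeed is worse than surfacing the error.

```rust
/// Hypothetical write errors; real error types and codes would differ.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum WriteError {
    /// Target region is temporarily unavailable (e.g. during failover).
    RegionUnavailable,
    /// Frontend routing/metadata has not converged yet.
    StaleRoute,
    /// The request itself is invalid; retrying cannot help.
    InvalidRequest(String),
    /// Anything not explicitly classified.
    Other(String),
}

#[derive(Debug, PartialEq, Eq)]
pub enum Disposition {
    /// Buffer the write in the retry queue.
    Enqueue,
    /// Return the error to the client as today.
    ReturnToClient,
}

/// Classify an error: only known transient failures are queued.
pub fn classify(err: &WriteError) -> Disposition {
    match err {
        WriteError::RegionUnavailable | WriteError::StaleRoute => Disposition::Enqueue,
        WriteError::InvalidRequest(_) | WriteError::Other(_) => Disposition::ReturnToClient,
    }
}
```

Keeping the match exhaustive (no `_` arm) means that adding a new error variant forces an explicit decision about whether it is retryable.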
