Feature request: add min-ready-instances to clustering for smart leader dispatch #830

@drewelliott

Description

Problem

When running gNMIC in a Kubernetes StatefulSet cluster (15–25 replicas), the leader-wait-timer creates an unavoidable trade-off between cold-start safety and rolling-restart speed.

Cold start scenario: All pods start simultaneously (podManagementPolicy: Parallel). If the leader dispatches targets before other pods have registered with Consul, a small number of early pods receive all targets and OOM. A long leader-wait-timer (e.g. 300s) prevents this.
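For context, a minimal StatefulSet manifest showing the relevant setting (names, labels, and the image tag here are illustrative, not taken from our actual deployment):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gnmic
spec:
  serviceName: gnmic
  replicas: 15
  # Parallel launches all replicas at once (default is OrderedReady),
  # which is what creates the cold-start thundering herd described above.
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: gnmic
  template:
    metadata:
      labels:
        app: gnmic
    spec:
      containers:
        - name: gnmic
          image: ghcr.io/openconfig/gnmic:latest
```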

Rolling restart scenario: Pods restart one at a time. When the leader pod restarts and a new leader is elected, 14–24 other pods are already running and registered. The long leader-wait-timer still fires, causing an unnecessary 5-minute metrics collection gap despite no OOM risk.

There is no way to configure gNMIC to distinguish between these two scenarios — the timer is a fixed delay regardless of cluster state.

Current behavior

In pkg/app/clustering.go, after a pod wins the leader lock, it unconditionally sleeps for LeaderWaitTimer before starting the loader and dispatching targets:

go func() {
    go a.watchMembers(ctx)
    a.Logger.Printf("leader waiting %s before dispatching targets",
        a.Config.Clustering.LeaderWaitTimer)
    time.Sleep(a.Config.Clustering.LeaderWaitTimer)  // fixed delay
    a.Logger.Printf("leader done waiting, starting loader and dispatching targets")
    go a.startLoader(ctx)
    go a.dispatchTargets(ctx)
}()

Meanwhile, watchMembers() is already running concurrently and populating a.apiServices with healthy registered instances (via Consul TTL health checks). The leader already knows how many cluster members are ready — it just doesn't use that information.

Proposed solution

Add a new clustering config field, min-ready-instances, that allows the leader to dispatch targets as soon as a sufficient number of cluster members have registered — while keeping leader-wait-timer as a maximum timeout.

Config example

clustering:
  leader-wait-timer: 300s       # maximum wait (safety net / timeout)
  min-ready-instances: 12       # dispatch as soon as 12 members registered

Behavior

  • If min-ready-instances is set, the leader polls len(a.apiServices) during the wait period
  • As soon as len(a.apiServices) >= min-ready-instances, dispatch begins immediately
  • If the threshold isn't reached within leader-wait-timer, dispatch proceeds anyway (current behavior, prevents infinite blocking)
  • If min-ready-instances is not set (default 0), behavior is unchanged — pure timer-based wait

Implementation sketch

The change is localized to startCluster() in pkg/app/clustering.go and the config struct in pkg/config/clustering.go:

// In pkg/config/clustering.go — add to struct:
MinReadyInstances int `mapstructure:"min-ready-instances,omitempty" ...`

// In pkg/app/clustering.go — replace time.Sleep with:
deadline := time.After(a.Config.Clustering.LeaderWaitTimer)
ticker := time.NewTicker(2 * time.Second)
defer ticker.Stop()
WAIT:
for {
    select {
    case <-deadline:
        a.configLock.RLock()
        n := len(a.apiServices)
        a.configLock.RUnlock()
        a.Logger.Printf("leader-wait-timer expired, dispatching with %d instances", n)
        break WAIT
    case <-ticker.C:
        a.configLock.RLock()
        n := len(a.apiServices)
        a.configLock.RUnlock()
        if a.Config.Clustering.MinReadyInstances > 0 && n >= a.Config.Clustering.MinReadyInstances {
            a.Logger.Printf("min-ready-instances threshold met (%d/%d), dispatching",
                n, a.Config.Clustering.MinReadyInstances)
            break WAIT
        }
    case <-ctx.Done():
        return
    }
}
// fall through to startLoader / dispatchTargets

Impact

| Scenario             | Current (300s timer) | With min-ready-instances                          |
|----------------------|----------------------|---------------------------------------------------|
| Cold start (15 pods) | 5 min delay          | ~30-60s (pods register quickly with Parallel policy) |
| Rolling restart      | 5 min delay          | ~2-4s (14 pods already registered)                |
| Partial failure      | 5 min delay          | Waits until threshold OR timeout                  |

Our deployment context

We run gNMIC v0.43.0 in production across multiple Kubernetes clusters:

  • 15–25 replicas per cluster
  • 200+ Arista/Junos targets per cluster
  • Consul-based clustering with TTL health checks
  • podManagementPolicy: Parallel StatefulSets
  • The 5-minute gap during rolling restarts is our primary pain point

We're happy to contribute a PR if the maintainers are open to this approach.
