Feature request: add min-ready-instances to clustering for smart leader dispatch #830

@drewelliott

Description

Problem

When running gNMIC in a Kubernetes StatefulSet cluster (15–25 replicas), the leader-wait-timer creates an unavoidable trade-off between cold-start safety and rolling-restart speed.

Cold start scenario: All pods start simultaneously (podManagementPolicy: Parallel). If the leader dispatches targets before other pods have registered with Consul, a small number of early pods receive all targets and OOM. A long leader-wait-timer (e.g. 300s) prevents this.
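For context, a minimal StatefulSet manifest showing the relevant setting (names, labels, and the image tag here are illustrative, not taken from our actual deployment):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gnmic
spec:
  serviceName: gnmic
  replicas: 15
  # Parallel launches all replicas at once (default is OrderedReady),
  # which is what creates the cold-start thundering herd described above.
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: gnmic
  template:
    metadata:
      labels:
        app: gnmic
    spec:
      containers:
        - name: gnmic
          image: ghcr.io/openconfig/gnmic:latest
```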

Rolling restart scenario: Pods restart one at a time. When the leader pod restarts and a new leader is elected, 14–24 other pods are already running and registered. The long leader-wait-timer still fires, causing an unnecessary 5-minute metrics collection gap despite no OOM risk.

There is no way to configure gNMIC to distinguish between these two scenarios — the timer is a fixed delay regardless of cluster state.

Current behavior

In pkg/app/clustering.go, after a pod wins the leader lock, it unconditionally sleeps for LeaderWaitTimer before starting the loader and dispatching targets:

go func() {
    go a.watchMembers(ctx)
    a.Logger.Printf("leader waiting %s before dispatching targets",
        a.Config.Clustering.LeaderWaitTimer)
    time.Sleep(a.Config.Clustering.LeaderWaitTimer)  // fixed delay
    a.Logger.Printf("leader done waiting, starting loader and dispatching targets")
    go a.startLoader(ctx)
    go a.dispatchTargets(ctx)
}()

Meanwhile, watchMembers() is already running concurrently and populating a.apiServices with healthy registered instances (via Consul TTL health checks). The leader already knows how many cluster members are ready — it just doesn't use that information.

Proposed solution

Add a new clustering config field, min-ready-instances, that allows the leader to dispatch targets as soon as a sufficient number of cluster members have registered — while keeping leader-wait-timer as a maximum timeout.

Config example

clustering:
  leader-wait-timer: 300s       # maximum wait (safety net / timeout)
  min-ready-instances: 12       # dispatch as soon as 12 members registered

Behavior

  • If min-ready-instances is set, the leader polls len(a.apiServices) during the wait period
  • As soon as len(a.apiServices) >= min-ready-instances, dispatch begins immediately
  • If the threshold isn't reached within leader-wait-timer, dispatch proceeds anyway (current behavior, prevents infinite blocking)
  • If min-ready-instances is not set (default 0), behavior is unchanged — pure timer-based wait

Implementation sketch

The change is localized to startCluster() in pkg/app/clustering.go and the config struct in pkg/config/clustering.go:

// In pkg/config/clustering.go — add to struct:
MinReadyInstances int `mapstructure:"min-ready-instances,omitempty" ...`

// In pkg/app/clustering.go — replace time.Sleep with:
deadline := time.After(a.Config.Clustering.LeaderWaitTimer)
ticker := time.NewTicker(2 * time.Second)
defer ticker.Stop()
WAIT:
for {
    select {
    case <-deadline:
        a.configLock.RLock()
        n := len(a.apiServices)
        a.configLock.RUnlock()
        a.Logger.Printf("leader-wait-timer expired, dispatching with %d instances", n)
        break WAIT
    case <-ticker.C:
        a.configLock.RLock()
        n := len(a.apiServices)
        a.configLock.RUnlock()
        if a.Config.Clustering.MinReadyInstances > 0 && n >= a.Config.Clustering.MinReadyInstances {
            a.Logger.Printf("min-ready-instances threshold met (%d/%d), dispatching",
                n, a.Config.Clustering.MinReadyInstances)
            break WAIT
        }
    case <-ctx.Done():
        return
    }
}
// fall through to startLoader / dispatchTargets

Impact

| Scenario             | Current (300s timer) | With min-ready-instances                          |
|----------------------|----------------------|---------------------------------------------------|
| Cold start (15 pods) | 5 min delay          | ~30-60s (pods register quickly with Parallel policy) |
| Rolling restart      | 5 min delay          | ~2-4s (14 pods already registered)                |
| Partial failure      | 5 min delay          | Waits until threshold OR timeout                  |

Our deployment context

We run gNMIC v0.43.0 in production across multiple Kubernetes clusters:

  • 15–25 replicas per cluster
  • 200+ Arista/Junos targets per cluster
  • Consul-based clustering with TTL health checks
  • podManagementPolicy: Parallel StatefulSets
  • The 5-minute gap during rolling restarts is our primary pain point

We're happy to contribute a PR if the maintainers are open to this approach.
