[Bug]: SalvageCheckpoint feature unreachable after force-promote (GetReplicateInfo handler + retry policy) #50344

@czs007

Description

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: v2.6.18 (verified). The defect path is unchanged since v2.6.9 (force-promote + storage v2 era).
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): kafka (also affects pulsar / woodpecker — the bug is in the proxy handler, not MQ-specific)
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 3.1.0rc8 (any version that exposes GetReplicateInfo)
- OS(Ubuntu or CentOS): Ubuntu 22.04
- Helm chart: milvus/milvus 5.0.22 (appVersion 2.6.18)

Current Behavior

The SalvageCheckpoint feature is functionally unreachable from any client in v2.6.x. Two distinct defects compound — both surface as "GetReplicateInfo hangs to deadline-exceeded after force-promote", but the root causes are independent and each is worth fixing on its own.


Defect 1 — GetReplicateInfo handler returns early on standalone primary, blocking SalvageCheckpoint access

Affected location: internal/proxy/impl.go, Proxy.GetReplicateInfo handler.

The handler calls GetReplicateCheckpoint first, which is expected to fail in standalone-primary state ("wal is not a secondary cluster in replicating topology"). On that failure the handler returns immediately. The follow-up GetSalvageCheckpoint call — the entire reason SalvageCheckpoint exists — is never reached.

// internal/proxy/impl.go (current master / v2.6.18)
checkpoint, err := streaming.WAL().Replicate().GetReplicateCheckpoint(ctx, req.GetTargetPchannel())
if err != nil {
    return nil, err   // ← standalone primary fails here, GetSalvageCheckpoint is never called
}

// dead code on a standalone-primary cluster
salvageCheckpoints, err := streaming.WAL().Replicate().GetSalvageCheckpoint(ctx, req.GetTargetPchannel())

This is the only client-facing API that exposes salvage_checkpoint. No alternative entry point exists in the proxy. So after force_promote, the SalvageCheckpoint that milvus dutifully persisted to etcd can never be read back by business code — which is exactly when it's needed.

Defect 2 — Client retries STREAMING_CODE_REPLICATE_VIOLATION to deadline

Affected location: internal/streamingnode/client/handler/handler_client_impl.go, createHandlerAfterStreamingNodeReady.

When the streamingnode returns STREAMING_CODE_REPLICATE_VIOLATION ("wal is not a secondary cluster in replicating topology"), the client treats it as a transient error and re-creates the handler in a tight loop until the caller's context cancels.

This error is permanent for the lifetime of the current WAL role: no amount of retrying will make a standalone-primary WAL transition back to secondary. Callers see a deadline-exceeded error only after their full timeout, with no actionable error to program against.

User-visible symptom: GetReplicateInfo and any other call going through this client path appear to "hang" rather than fail clearly. This is what makes Defect 1 read like "the API is broken" rather than "this state isn't supported".

Streamingnode warn logs during the retry storm:

[WARN] [handler/handler_client_impl.go:293] ["create handler failed"]
    [pchannel=...] [handler="replicate checkpoint"]
    [error="/milvus.proto.streaming.StreamingNodeHandlerService/GetReplicateCheckpoint;
      streaming error: code = STREAMING_CODE_REPLICATE_VIOLATION,
      cause = wal is not a secondary cluster in replicating topology; rpc error: code = Unknown, desc = "]

…repeated every few hundred ms for the entire client timeout window.

Expected Behavior

  1. GetReplicateInfo on a standalone-primary cluster should still return salvage_checkpoint. That's the whole point of persisting it across the force-promote transition.
  2. STREAMING_CODE_REPLICATE_VIOLATION should surface as a clear, non-retriable error within RTT, not after a full client deadline.

Steps To Reproduce

1. Stand up two Milvus 2.6.18 clusters, A (primary) and B (standby), with 16 pchannels each and cluster replication configured via UpdateReplicateConfiguration.
2. Push some writes through A; let CDC advance.
3. Force-promote B:
       UpdateReplicateConfigurationRequest{
         replicate_configuration: { clusters: [B], cross_cluster_topology: [] },
         force_promote: true,
       }
4. Call GetReplicateInfo on B:
       GetReplicateInfoRequest{
         source_cluster_id: "<A's cluster_id>",
         target_pchannel: "<B's pchannel name>",
       }
5. Observe:
   - Client log fills with `create handler failed` lines carrying STREAMING_CODE_REPLICATE_VIOLATION.
   - RPC eventually returns DEADLINE_EXCEEDED (caller-side timeout).
   - `salvage_checkpoint` is never returned — even though milvus did write it to etcd under `streamingcoord-meta/salvage-checkpoint/<source_id>/<pchannel>`.

Anything else?

Suggested fix

Defect 1: any of these, ordered from smallest to largest diff:

  1. Treat GetReplicateCheckpoint failure as non-fatal in GetReplicateInfo; continue to GetSalvageCheckpoint and return whatever subset is available (with checkpoint=nil if live checkpoint is unreachable).
  2. Reorder: call GetSalvageCheckpoint first; only call GetReplicateCheckpoint if the target is still in secondary state.
  3. Expose a dedicated GetSalvageCheckpoint proxy RPC alongside GetReplicateInfo. This is the cleanest API-wise but the largest change.

Option 1 is backwards-compatible and only a few lines of change.
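
A minimal sketch of what Option 1 could look like, written against stand-in types so it runs on its own. The `ReplicateReader` interface, the `Checkpoint` and `ReplicateInfo` structs, and the `standalonePrimary` fake are all hypothetical; they only mirror the two calls quoted above and are not the actual Milvus signatures or response types.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Checkpoint is a hypothetical stand-in for the real checkpoint message.
type Checkpoint struct {
	Pchannel string
	TimeTick uint64
}

// ReplicateReader is a hypothetical interface mirroring the two calls the
// handler makes; the real signatures may differ.
type ReplicateReader interface {
	GetReplicateCheckpoint(ctx context.Context, pchannel string) (*Checkpoint, error)
	GetSalvageCheckpoint(ctx context.Context, pchannel string) ([]*Checkpoint, error)
}

// ReplicateInfo is a hypothetical response; Checkpoint may be nil.
type ReplicateInfo struct {
	Checkpoint         *Checkpoint
	SalvageCheckpoints []*Checkpoint
}

var errNotSecondary = errors.New("wal is not a secondary cluster in replicating topology")

// getReplicateInfo treats a live-checkpoint failure as non-fatal and still
// asks for the salvage checkpoint. It fails only when both lookups fail.
func getReplicateInfo(ctx context.Context, r ReplicateReader, pchannel string) (*ReplicateInfo, error) {
	info := &ReplicateInfo{}
	cp, cpErr := r.GetReplicateCheckpoint(ctx, pchannel)
	if cpErr == nil {
		info.Checkpoint = cp
	}
	salvage, salvageErr := r.GetSalvageCheckpoint(ctx, pchannel)
	if salvageErr == nil {
		info.SalvageCheckpoints = salvage
	}
	if cpErr != nil && salvageErr != nil {
		return nil, errors.Join(cpErr, salvageErr)
	}
	return info, nil
}

// standalonePrimary fakes a force-promoted cluster: no live checkpoint,
// but a persisted salvage checkpoint.
type standalonePrimary struct{}

func (standalonePrimary) GetReplicateCheckpoint(context.Context, string) (*Checkpoint, error) {
	return nil, errNotSecondary
}

func (standalonePrimary) GetSalvageCheckpoint(_ context.Context, pchannel string) ([]*Checkpoint, error) {
	return []*Checkpoint{{Pchannel: pchannel, TimeTick: 42}}, nil
}

// demoStandalonePrimary runs the handler sketch against the fake.
func demoStandalonePrimary() (*ReplicateInfo, error) {
	return getReplicateInfo(context.Background(), standalonePrimary{}, "pchannel-0")
}

func main() {
	info, err := demoStandalonePrimary()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("live checkpoint present:", info.Checkpoint != nil)
	fmt.Println("salvage checkpoints:", len(info.SalvageCheckpoints))
}
```

Whether a partial response should also carry the live-checkpoint error in a status field is a design question for the maintainers; the sketch simply drops it when the salvage lookup succeeds.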

Defect 2 — classify STREAMING_CODE_REPLICATE_VIOLATION as a permanent error in createHandlerAfterStreamingNodeReady's retry policy. Return on first occurrence with a typed sentinel callers can match.
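
The retry classification could be sketched roughly as follows. This is not the existing client code: `ErrReplicateViolation`, `isPermanent`, and `retryUntilReady` are hypothetical names, and the real handler would presumably match on the streaming error code rather than on a Go sentinel.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrReplicateViolation is a hypothetical typed sentinel that callers can
// match with errors.Is.
var ErrReplicateViolation = errors.New("replicate violation: wal is not a secondary cluster in replicating topology")

// isPermanent reports whether retrying cannot change the outcome.
func isPermanent(err error) bool {
	return errors.Is(err, ErrReplicateViolation)
}

// retryUntilReady calls create until it succeeds, returns a permanent
// error, or ctx is done. It returns the number of attempts made.
func retryUntilReady(ctx context.Context, backoff time.Duration, create func() error) (attempts int, err error) {
	for {
		attempts++
		err = create()
		if err == nil {
			return attempts, nil
		}
		if isPermanent(err) {
			// Fail fast instead of looping until the caller's deadline.
			return attempts, err
		}
		select {
		case <-ctx.Done():
			return attempts, errors.Join(ctx.Err(), err)
		case <-time.After(backoff):
		}
	}
}

// demoPermanent simulates a handler creation that hits the violation.
func demoPermanent() (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return retryUntilReady(ctx, time.Millisecond, func() error {
		return fmt.Errorf("create handler failed: %w", ErrReplicateViolation)
	})
}

// demoTransient simulates two transient failures followed by success.
func demoTransient() (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	calls := 0
	return retryUntilReady(ctx, time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("streaming node not ready")
		}
		return nil
	})
}

func main() {
	attempts, err := demoPermanent()
	fmt.Println("permanent:", attempts, "attempt(s), matches sentinel:", errors.Is(err, ErrReplicateViolation))
	attempts, err = demoTransient()
	fmt.Println("transient:", attempts, "attempt(s), err:", err)
}
```

Transient failures still go through the backoff path, so only the classification of the violation changes.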

Why this matters

SalvageCheckpoint is the documented mechanism for bounded-RPO data salvage after a force_promote. With these two defects in place there is currently no supported way for an application to actually retrieve it — only milvus internals can read it via etcd. Every production deployment relying on Data Salvage as part of its DR posture will hit this on the day they actually need it.

In our own DR drill we worked around this by snapshotting the salvage checkpoint before the force_promote RPC (while the target is still in secondary state and GetReplicateInfo works). That snapshot-and-then-promote pattern should not be required; a healthy implementation would let business code pull the checkpoint at any time after force_promote.
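
For reference, the workaround ordering can be sketched as below. The `ClusterClient` interface, its method signatures, and the `fakeCluster` type are hypothetical placeholders for whatever SDK calls a deployment actually uses; the only point is that every pchannel is read before the promote call is issued, and that the promote is skipped if any read fails.

```go
package main

import (
	"context"
	"fmt"
)

// ReplicateInfo is a hypothetical, simplified view of the response.
type ReplicateInfo struct {
	Pchannel string
	TimeTick uint64
}

// ClusterClient is a hypothetical client for the standby cluster.
type ClusterClient interface {
	GetReplicateInfo(ctx context.Context, sourceClusterID, pchannel string) (ReplicateInfo, error)
	ForcePromote(ctx context.Context) error
}

// snapshotThenPromote reads every pchannel while the target is still a
// secondary, and only then promotes. Any read failure aborts the promote.
func snapshotThenPromote(ctx context.Context, c ClusterClient, sourceClusterID string, pchannels []string) (map[string]ReplicateInfo, error) {
	snapshots := make(map[string]ReplicateInfo, len(pchannels))
	for _, pch := range pchannels {
		info, err := c.GetReplicateInfo(ctx, sourceClusterID, pch)
		if err != nil {
			return nil, fmt.Errorf("snapshot %s before promote: %w", pch, err)
		}
		snapshots[pch] = info
	}
	if err := c.ForcePromote(ctx); err != nil {
		return snapshots, fmt.Errorf("force promote: %w", err)
	}
	return snapshots, nil
}

// fakeCluster mimics the reported behavior: reads fail once promoted.
type fakeCluster struct{ promoted bool }

func (f *fakeCluster) GetReplicateInfo(_ context.Context, _ string, pchannel string) (ReplicateInfo, error) {
	if f.promoted {
		return ReplicateInfo{}, context.DeadlineExceeded
	}
	return ReplicateInfo{Pchannel: pchannel, TimeTick: 100}, nil
}

func (f *fakeCluster) ForcePromote(context.Context) error {
	f.promoted = true
	return nil
}

// demoSnapshotThenPromote returns the snapshot count and promote state.
func demoSnapshotThenPromote() (int, bool, error) {
	c := &fakeCluster{}
	snaps, err := snapshotThenPromote(context.Background(), c, "cluster-a", []string{"pch-0", "pch-1"})
	return len(snaps), c.promoted, err
}

func main() {
	n, promoted, err := demoSnapshotThenPromote()
	fmt.Println("snapshots:", n, "promoted:", promoted, "err:", err)
}
```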

Happy to provide further diagnostics or a draft PR — let us know which fix shape you'd prefer.

Labels: kind/bug, triage/accepted
