[Bug]: SalvageCheckpoint feature unreachable after force-promote (GetReplicateInfo handler + retry policy)

### Is there an existing issue for this?

- [x] I have searched the existing issues

### Environment

```markdown
- Milvus version: v2.6.18 (verified). The defect path is unchanged since v2.6.9 (force-promote + storage v2 era).
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): kafka (also affects pulsar / woodpecker — the bug is in the proxy handler, not MQ-specific)
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 3.1.0rc8 (any version that exposes GetReplicateInfo)
- OS(Ubuntu or CentOS): Ubuntu 22.04
- Helm chart: milvus/milvus 5.0.22 (appVersion 2.6.18)
```

### Current Behavior

**The `SalvageCheckpoint` feature is functionally unreachable from any client in v2.6.x.** Two distinct defects compound — both surface as "GetReplicateInfo hangs to deadline-exceeded after force-promote", but the root causes are independent and each is worth fixing on its own.

---

#### Defect 1 — `GetReplicateInfo` handler returns early on standalone primary, blocking SalvageCheckpoint access

Affected location: `internal/proxy/impl.go` — `Proxy.GetReplicateInfo` handler.

The handler calls `GetReplicateCheckpoint` first, which is **expected to fail** in standalone-primary state ("wal is not a secondary cluster in replicating topology"). On that failure the handler returns immediately. The follow-up `GetSalvageCheckpoint` call — the entire reason `SalvageCheckpoint` exists — is never reached.

```go
// internal/proxy/impl.go (current master / v2.6.18)
checkpoint, err := streaming.WAL().Replicate().GetReplicateCheckpoint(ctx, req.GetTargetPchannel())
if err != nil {
    return nil, err   // ← standalone primary fails here, GetSalvageCheckpoint is never called
}

// dead code on a standalone-primary cluster
salvageCheckpoints, err := streaming.WAL().Replicate().GetSalvageCheckpoint(ctx, req.GetTargetPchannel())
```

This is the only client-facing API that exposes `salvage_checkpoint`. No alternative entry point exists in the proxy. So after `force_promote`, the SalvageCheckpoint that Milvus persisted to etcd can never be read back by application code, which is exactly when it is needed.

#### Defect 2 — Client retries `STREAMING_CODE_REPLICATE_VIOLATION` to deadline

Affected location: `internal/streamingnode/client/handler/handler_client_impl.go` — `createHandlerAfterStreamingNodeReady`.

When the streamingnode returns `STREAMING_CODE_REPLICATE_VIOLATION` ("wal is not a secondary cluster in replicating topology"), the client treats it as a transient error and re-creates the handler in a tight loop until the caller's context cancels.

This error is **permanent for the lifetime of the current WAL role**: no amount of retrying will make a standalone-primary WAL transition back to secondary. Callers see a deadline-exceeded error only after their full timeout elapses, with no actionable error to program against.

User-visible symptom: `GetReplicateInfo` and any other call going through this client path appears to "hang" rather than fail clearly. This is what makes Defect 1 read like "the API is broken" rather than "this state isn't supported".

Streamingnode warn logs during the retry storm:
```
[WARN] [handler/handler_client_impl.go:293] ["create handler failed"]
    [pchannel=...] [handler="replicate checkpoint"]
    [error="/milvus.proto.streaming.StreamingNodeHandlerService/GetReplicateCheckpoint;
      streaming error: code = STREAMING_CODE_REPLICATE_VIOLATION,
      cause = wal is not a secondary cluster in replicating topology; rpc error: code = Unknown, desc = "]
```
…repeated every few hundred ms for the entire client timeout window.

### Expected Behavior

1. `GetReplicateInfo` on a standalone-primary cluster should still return `salvage_checkpoint`. That's the whole point of persisting it across the force-promote transition.
2. `STREAMING_CODE_REPLICATE_VIOLATION` should surface as a clear, non-retriable error within RTT, not after a full client deadline.

### Steps To Reproduce

```
1. Stand up two milvus 2.6.18 clusters A (primary) and B (standby), 16 pchannels each, cluster replication via UpdateReplicateConfiguration.
2. Push some writes through A; let CDC advance.
3. Force-promote B:
       UpdateReplicateConfigurationRequest{
         replicate_configuration: { clusters: [B], cross_cluster_topology: [] },
         force_promote: true,
       }
4. Call GetReplicateInfo on B:
       GetReplicateInfoRequest{
         source_cluster_id: "<A's cluster_id>",
         target_pchannel: "<B's pchannel name>",
       }
5. Observe:
   - Client log fills with `create handler failed` lines carrying STREAMING_CODE_REPLICATE_VIOLATION.
   - RPC eventually returns DEADLINE_EXCEEDED (caller-side timeout).
   - `salvage_checkpoint` is never returned — even though milvus did write it to etcd under `streamingcoord-meta/salvage-checkpoint/<source_id>/<pchannel>`.
```

### Anything else?

#### Suggested fix

**Defect 1** — any of these, in order of minimum diff:
1. Treat `GetReplicateCheckpoint` failure as non-fatal in `GetReplicateInfo`; continue to `GetSalvageCheckpoint` and return whatever subset is available (with `checkpoint=nil` if the live checkpoint is unreachable).
2. Reorder: call `GetSalvageCheckpoint` first; only call `GetReplicateCheckpoint` if the target is still in secondary state.
3. Expose a dedicated `GetSalvageCheckpoint` proxy RPC alongside `GetReplicateInfo`. This is the cleanest API-wise but the largest change.

Option 1 is backwards-compatible and needs only a few lines of change.

**Defect 2** — classify `STREAMING_CODE_REPLICATE_VIOLATION` as a permanent error in `createHandlerAfterStreamingNodeReady`'s retry policy. Return on first occurrence with a typed sentinel callers can match.

#### Why this matters

`SalvageCheckpoint` is the documented mechanism for bounded-RPO data salvage after a `force_promote`. With these two defects in place there is currently **no supported way** for an application to actually retrieve it; only Milvus internals can read it via etcd. Every production deployment relying on Data Salvage as part of its DR posture will hit this on the day it actually needs it.

In our own DR drill we worked around this by snapshotting the salvage checkpoint **before** the `force_promote` RPC (while the target is still in secondary state and `GetReplicateInfo` works). That snapshot-and-then-promote pattern should not be required; a healthy implementation would let business code pull the checkpoint at any time after `force_promote`.

Happy to provide further diagnostics or a draft PR — let us know which fix shape you'd prefer.


[Bug]: SalvageCheckpoint feature unreachable after force-promote (GetReplicateInfo handler + retry policy) #50344
