selector/node/ewma: context.Canceled incorrectly penalises backend nodes — client cancellation is not a backend failure

## Summary

In `selector/node/ewma/node.go`, the `Pick()` done callback marks a node health as `success = 0` (fully degraded) when `di.Err` is `context.Canceled`:

```go
// selector/node/ewma/node.go
if errors.Is(context.DeadlineExceeded, di.Err) || errors.Is(context.Canceled, di.Err) ||
    errors.IsServiceUnavailable(di.Err) || errors.IsGatewayTimeout(di.Err) || errors.As(di.Err, &netErr) {
    success = 0
}
```

`context.Canceled` means the **caller cancelled the request** — the user navigated away, the HTTP handler timed out on the ingress side, or an upstream context was cancelled. The backend node served the request correctly up until cancellation; its health is unrelated to why the client gave up.

Treating `context.Canceled` the same as `context.DeadlineExceeded` conflates two fundamentally different signals:

| Error | Source | Backend healthy? |
|---|---|---|
| `context.DeadlineExceeded` | Backend was too slow | Possibly not |
| `context.Canceled` | **Client cancelled** | Yes — unrelated to backend |

## Impact

In any system where clients frequently cancel requests — mobile apps, frontend with aggressive timeouts, load tests that stop mid-flight — the EWMA balancer will progressively lower the weight of healthy backends. In extreme cases (many concurrent cancellations), all nodes temporarily reach `success ≈ 0` and traffic shifts erratically.

The EWMA weight recovers via exponential decay (`tau = 600ms`), so the effect is transient per-node — but repeated cancellations cause sustained artificial bias against backends that happen to have in-flight requests at cancel time.

## Fix

Remove `context.Canceled` from the conditions that set `success = 0`. Only errors that indicate the **backend** failed to respond should penalise the node:

```go
// Before
if errors.Is(context.DeadlineExceeded, di.Err) || errors.Is(context.Canceled, di.Err) ||
    errors.IsServiceUnavailable(di.Err) || errors.IsGatewayTimeout(di.Err) || errors.As(di.Err, &netErr) {
    success = 0
}

// After
if errors.Is(context.DeadlineExceeded, di.Err) ||
    // context.Canceled removed: the client cancelled the call — backend health is unaffected
    errors.IsServiceUnavailable(di.Err) || errors.IsGatewayTimeout(di.Err) || errors.As(di.Err, &netErr) {
    success = 0
}
```

## Root principle

This is the same class of mistake as treating gRPC application-level status errors (NOT_FOUND, INVALID_ARGUMENT) as transport failures. Only errors that indicate the **connection or backend itself is broken** should degrade node health. Client-side cancellation says nothing about whether the backend is healthy.

## go-kratos version

v2.9.1 — confirmed in `selector/node/ewma/node.go`. The same pattern likely exists in any EWMA fork or P2C implementation that copied this logic.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

selector/node/ewma: context.Canceled incorrectly penalises backend nodes — client cancellation is not a backend failure #3834

Summary

Impact

Fix

Root principle

go-kratos version

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Error	Source	Backend healthy?
`context.DeadlineExceeded`	Backend was too slow	Possibly not
`context.Canceled`	Client cancelled	Yes — unrelated to backend

Uh oh!

selector/node/ewma: context.Canceled incorrectly penalises backend nodes — client cancellation is not a backend failure #3834

Description

Summary

Impact

Fix

Root principle

go-kratos version

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions