Summary
In selector/node/ewma/node.go, the Pick() done callback marks a node health as success = 0 (fully degraded) when di.Err is context.Canceled:
// selector/node/ewma/node.go
if errors.Is(context.DeadlineExceeded, di.Err) || errors.Is(context.Canceled, di.Err) ||
errors.IsServiceUnavailable(di.Err) || errors.IsGatewayTimeout(di.Err) || errors.As(di.Err, &netErr) {
success = 0
}
context.Canceled means the caller cancelled the request — the user navigated away, the HTTP handler timed out on the ingress side, or an upstream context was cancelled. The backend node served the request correctly up until cancellation; its health is unrelated to why the client gave up.
Treating context.Canceled the same as context.DeadlineExceeded conflates two fundamentally different signals:
| Error |
Source |
Backend healthy? |
context.DeadlineExceeded |
Backend was too slow |
Possibly not |
context.Canceled |
Client cancelled |
Yes — unrelated to backend |
Impact
In any system where clients frequently cancel requests — mobile apps, frontend with aggressive timeouts, load tests that stop mid-flight — the EWMA balancer will progressively lower the weight of healthy backends. In extreme cases (many concurrent cancellations), all nodes temporarily reach success ≈ 0 and traffic shifts erratically.
The EWMA weight recovers via exponential decay (tau = 600ms), so the effect is transient per-node — but repeated cancellations cause sustained artificial bias against backends that happen to have in-flight requests at cancel time.
Fix
Remove context.Canceled from the conditions that set success = 0. Only errors that indicate the backend failed to respond should penalise the node:
// Before
if errors.Is(context.DeadlineExceeded, di.Err) || errors.Is(context.Canceled, di.Err) ||
errors.IsServiceUnavailable(di.Err) || errors.IsGatewayTimeout(di.Err) || errors.As(di.Err, &netErr) {
success = 0
}
// After
if errors.Is(context.DeadlineExceeded, di.Err) ||
// context.Canceled removed: the client cancelled the call — backend health is unaffected
errors.IsServiceUnavailable(di.Err) || errors.IsGatewayTimeout(di.Err) || errors.As(di.Err, &netErr) {
success = 0
}
Root principle
This is the same class of mistake as treating gRPC application-level status errors (NOT_FOUND, INVALID_ARGUMENT) as transport failures. Only errors that indicate the connection or backend itself is broken should degrade node health. Client-side cancellation says nothing about whether the backend is healthy.
go-kratos version
v2.9.1 — confirmed in selector/node/ewma/node.go. The same pattern likely exists in any EWMA fork or P2C implementation that copied this logic.
Summary
In
selector/node/ewma/node.go, thePick()done callback marks a node health assuccess = 0(fully degraded) whendi.Erriscontext.Canceled:context.Canceledmeans the caller cancelled the request — the user navigated away, the HTTP handler timed out on the ingress side, or an upstream context was cancelled. The backend node served the request correctly up until cancellation; its health is unrelated to why the client gave up.Treating
context.Canceledthe same ascontext.DeadlineExceededconflates two fundamentally different signals:context.DeadlineExceededcontext.CanceledImpact
In any system where clients frequently cancel requests — mobile apps, frontend with aggressive timeouts, load tests that stop mid-flight — the EWMA balancer will progressively lower the weight of healthy backends. In extreme cases (many concurrent cancellations), all nodes temporarily reach
success ≈ 0and traffic shifts erratically.The EWMA weight recovers via exponential decay (
tau = 600ms), so the effect is transient per-node — but repeated cancellations cause sustained artificial bias against backends that happen to have in-flight requests at cancel time.Fix
Remove
context.Canceledfrom the conditions that setsuccess = 0. Only errors that indicate the backend failed to respond should penalise the node:Root principle
This is the same class of mistake as treating gRPC application-level status errors (NOT_FOUND, INVALID_ARGUMENT) as transport failures. Only errors that indicate the connection or backend itself is broken should degrade node health. Client-side cancellation says nothing about whether the backend is healthy.
go-kratos version
v2.9.1 — confirmed in
selector/node/ewma/node.go. The same pattern likely exists in any EWMA fork or P2C implementation that copied this logic.