Skip to content

selector/node/ewma: context.Canceled incorrectly penalises backend nodes — client cancellation is not a backend failure #3834

@akpradheeph

Description

@akpradheeph

Summary

In selector/node/ewma/node.go, the Pick() done callback marks a node health as success = 0 (fully degraded) when di.Err is context.Canceled:

// selector/node/ewma/node.go
if errors.Is(context.DeadlineExceeded, di.Err) || errors.Is(context.Canceled, di.Err) ||
    errors.IsServiceUnavailable(di.Err) || errors.IsGatewayTimeout(di.Err) || errors.As(di.Err, &netErr) {
    success = 0
}

context.Canceled means the caller cancelled the request — the user navigated away, the HTTP handler timed out on the ingress side, or an upstream context was cancelled. The backend node served the request correctly up until cancellation; its health is unrelated to why the client gave up.

Treating context.Canceled the same as context.DeadlineExceeded conflates two fundamentally different signals:

Error Source Backend healthy?
context.DeadlineExceeded Backend was too slow Possibly not
context.Canceled Client cancelled Yes — unrelated to backend

Impact

In any system where clients frequently cancel requests — mobile apps, frontend with aggressive timeouts, load tests that stop mid-flight — the EWMA balancer will progressively lower the weight of healthy backends. In extreme cases (many concurrent cancellations), all nodes temporarily reach success ≈ 0 and traffic shifts erratically.

The EWMA weight recovers via exponential decay (tau = 600ms), so the effect is transient per-node — but repeated cancellations cause sustained artificial bias against backends that happen to have in-flight requests at cancel time.

Fix

Remove context.Canceled from the conditions that set success = 0. Only errors that indicate the backend failed to respond should penalise the node:

// Before
if errors.Is(context.DeadlineExceeded, di.Err) || errors.Is(context.Canceled, di.Err) ||
    errors.IsServiceUnavailable(di.Err) || errors.IsGatewayTimeout(di.Err) || errors.As(di.Err, &netErr) {
    success = 0
}

// After
if errors.Is(context.DeadlineExceeded, di.Err) ||
    // context.Canceled removed: the client cancelled the call — backend health is unaffected
    errors.IsServiceUnavailable(di.Err) || errors.IsGatewayTimeout(di.Err) || errors.As(di.Err, &netErr) {
    success = 0
}

Root principle

This is the same class of mistake as treating gRPC application-level status errors (NOT_FOUND, INVALID_ARGUMENT) as transport failures. Only errors that indicate the connection or backend itself is broken should degrade node health. Client-side cancellation says nothing about whether the backend is healthy.

go-kratos version

v2.9.1 — confirmed in selector/node/ewma/node.go. The same pattern likely exists in any EWMA fork or P2C implementation that copied this logic.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions