Skip to content

fix(breaker): avoid markDrop feedback loop for breaker's own rejection#5551

Open
bigrocs wants to merge 1 commit into
zeromicro:masterfrom
bigrocs:fix/breaker-markdrop-feedback-loop
Open

fix(breaker): avoid markDrop feedback loop for breaker's own rejection#5551
bigrocs wants to merge 1 commit into
zeromicro:masterfrom
bigrocs:fix/breaker-markdrop-feedback-loop

Conversation

@bigrocs

@bigrocs bigrocs commented Apr 25, 2026

Copy link
Copy Markdown
Contributor

When googleBreaker rejects a request (ErrServiceUnavailable), calling markDrop() inflates bucket.Sum which feeds into history.total, creating a feedback loop that prevents the breaker from ever recovering.

Skip markDrop() when the error is the breaker's own rejection, since it is not a real backend failure and should not be counted in statistics.

When googleBreaker rejects a request (ErrServiceUnavailable), calling
markDrop() inflates bucket.Sum which feeds into history.total, creating
a feedback loop that prevents the breaker from ever recovering.

Skip markDrop() when the error is the breaker's own rejection, since
it is not a real backend failure and should not be counted in statistics.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@bigrocs

bigrocs commented Apr 27, 2026

Copy link
Copy Markdown
Contributor Author

@kevwan Please ask Mr. Wan to merge this bug fix as soon as possible. It has caused significant issues for our high-concurrency payment scenarios.

@kevwan kevwan added kind/bug Something isn't working area/core labels Apr 30, 2026
@kevwan

kevwan commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Review

The fix correctly identifies and resolves a feedback loop in the Google SRE breaker algorithm.

Analysis:

The original code called markDrop() whenever accept() returned ErrServiceUnavailable — i.e., when the breaker itself was rejecting a request. This is semantically incorrect: markDrop() should only be called for actual backend failures, not for the breaker's own preemptive rejections.

The feedback loop (original behavior):

  1. Backend failures accumulate → breaker opens
  2. Subsequent requests are rejected by the breaker → markDrop() is called for each
  3. The sliding window fills with breaker-caused "failures" → K * success / total stays low
  4. Breaker never gets a chance to probe the backend → permanently stuck open

The fix: Skip markDrop() when err == ErrServiceUnavailable — the backend failure that triggered the open state was already counted. The window will naturally age out, and the breaker can recover on its own schedule.

Code quality:

  • ✅ The change is minimal and targeted
  • ✅ Applied consistently in both allow() and doReq()
  • ✅ The comments clearly explain the rationale

One consideration: Is there any scenario where ErrServiceUnavailable should be counted? E.g., if a custom Acceptable function is used — but in allow(), accept() returns ErrServiceUnavailable before any backend request is made, so skipping markDrop() here is always correct.

Suggestion (non-blocking): A unit test that demonstrates the recovery path when the breaker was previously stuck (i.e., without the fix the breaker would never recover) would strengthen confidence, but the logic is sound without it.

LGTM — this is a valid bug fix for a real production issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/core kind/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants