
Retry transient HTTP errors (502/503/504) and network errors #155

Open
0ca wants to merge 1 commit into DopplerHQ:master from 0ca:fix/retry-transient-http-errors

0ca commented Feb 17, 2026

Summary

  • Adds retry logic for transient non-JSON HTTP errors (502, 503, 504) that were previously failing permanently
  • Extends network error retries to cover EOF and connection reset errors, not just timeouts

Problem

When Terraform refreshes state for many Doppler resources in parallel (~60+ resources), the API occasionally returns:

  • 502 Bad Gateway from Cloudflare — HTML error page (~6443 bytes), not JSON
  • EOF / connection reset errors from transient network issues

The current code at api.go:185 returns these errors with RetryAfter: nil, causing the retry loop to exit immediately. The error surfaces as:

Error: Doppler Error: Unable to load response
502 status code; 6439 bytes

Root Cause

Two issues in PerformRequest():

  1. Non-JSON error responses (line 185): When the API returns a non-JSON body (e.g., Cloudflare's HTML 502 page), RetryAfter is never set, so the retry loop gives up immediately
  2. Network errors (lines 130-136): Only net.Error timeouts trigger a retry. EOF and connection reset errors get RetryAfter: nil and fail permanently
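To illustrate why a nil RetryAfter ends the loop on the first failure, here is a minimal sketch of a retry loop of this shape. It is not the provider's code: `requestError`, `performWithRetry`, `failingThenOK`, and `noSleep` are hypothetical names, and only the RetryAfter field, the limit of 10 attempts, and the 1s backoff come from this description.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const maxRetries = 10 // this PR states MAX_RETRIES is 10

// requestError is a hypothetical stand-in for the provider's error type.
type requestError struct {
	Err        error
	RetryAfter *time.Duration // nil means "do not retry"
}

// performWithRetry calls attempt until it succeeds, returns a non-retryable
// error, or maxRetries attempts have been made. It returns the attempt count.
func performWithRetry(attempt func() *requestError, sleep func(time.Duration)) (int, *requestError) {
	var last *requestError
	for i := 1; i <= maxRetries; i++ {
		last = attempt()
		if last == nil {
			return i, nil
		}
		if last.RetryAfter == nil {
			return i, last // treated as permanent: the loop exits immediately
		}
		if i < maxRetries {
			sleep(*last.RetryAfter)
		}
	}
	return maxRetries, last
}

// failingThenOK simulates a request that fails `failures` times, then succeeds.
// When retryable is false, the error carries no RetryAfter (the old behavior
// for non-JSON 502 responses, as described above).
func failingThenOK(failures int, retryable bool) func() *requestError {
	calls := 0
	return func() *requestError {
		calls++
		if calls > failures {
			return nil
		}
		e := &requestError{Err: errors.New("502 status code; non-JSON body")}
		if retryable {
			d := time.Second
			e.RetryAfter = &d
		}
		return e
	}
}

// noSleep skips the backoff so the demonstration finishes instantly.
func noSleep(time.Duration) {}

func main() {
	n, err := performWithRetry(failingThenOK(2, false), noSleep)
	fmt.Printf("RetryAfter nil: attempts=%d failed=%v\n", n, err != nil)
	n, err = performWithRetry(failingThenOK(2, true), noSleep)
	fmt.Printf("RetryAfter 1s:  attempts=%d failed=%v\n", n, err != nil)
}
```

In this sketch the same two transient failures either surface to the user after one attempt or are absorbed by the third attempt, depending only on whether RetryAfter is set.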

Reproduction

We reproduced this using a Go program that mirrors the provider's exact HTTP client configuration (DisableKeepAlives: true, new http.Client per request, 30s timeout). Running 60 concurrent requests every 3 seconds:

  • Captured a 502 after ~6000 requests (response: 6443 bytes, Cloudflare HTML)
  • Response header: Proxy-Status: Cloudflare-Proxy;error=http_response_incomplete — the origin sent an incomplete response
  • Also observed: 18-22 timeouts per 10-minute run, same class of transient error
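The reproduction program itself is not included in this PR. The sketch below shows roughly what one batch of such a harness could look like, under assumptions: `newClient`, `runBatch`, and `flakyServer` are hypothetical names, and a local `httptest` server that returns an HTML 502 on every 20th request stands in for the real API so the example runs offline. Only the client settings (DisableKeepAlives: true, a new http.Client per request, 30s timeout) and the batch size of 60 are taken from the description above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"sync/atomic"
	"time"
)

// newClient builds a fresh client for every request, with keep-alives
// disabled and a 30s timeout, matching the configuration described above.
func newClient() *http.Client {
	return &http.Client{
		Timeout:   30 * time.Second,
		Transport: &http.Transport{DisableKeepAlives: true},
	}
}

// runBatch fires n concurrent GET requests and tallies the results by
// status code. Key 0 counts requests that failed with a network error.
func runBatch(url string, n int) map[int]int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	counts := make(map[int]int)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			code := 0
			resp, err := newClient().Get(url)
			if err == nil {
				code = resp.StatusCode
				resp.Body.Close()
			}
			mu.Lock()
			counts[code]++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return counts
}

// flakyServer is a local stand-in (not the Doppler API): it answers every
// `every`-th request with an HTML 502 and all others with an empty JSON object.
func flakyServer(every int64) *httptest.Server {
	var seen int64
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt64(&seen, 1)%every == 0 {
			w.Header().Set("Content-Type", "text/html")
			w.WriteHeader(http.StatusBadGateway)
			fmt.Fprint(w, "<html>502 Bad Gateway</html>")
			return
		}
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, "{}")
	}))
}

func main() {
	srv := flakyServer(20)
	defer srv.Close()
	counts := runBatch(srv.URL, 60)
	fmt.Println("200:", counts[200], "502:", counts[502], "network errors:", counts[0])
}
```

A real harness would repeat `runBatch` on a timer (every 3 seconds in the run described above) and log the body size and response headers of any non-200 response.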

Fix

  • isTransientStatusCode(): Returns true for 502, 503, 504
  • isTransientError(): Returns true for timeouts, EOF, connection reset, connection refused
  • Both non-JSON HTTP errors and network errors now set RetryAfter for transient cases, enabling the existing retry loop (up to 10 attempts with 1s backoff)
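For readers without the diff open, the two helpers could look roughly like the sketch below. This is one plausible way to implement the checks listed above, not necessarily the code in this PR: the function names come from the description, but the bodies (in particular the use of `errors.Is` with `syscall.ECONNRESET` and `syscall.ECONNREFUSED`, and the inclusion of `io.ErrUnexpectedEOF`) are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"syscall"
)

// isTransientStatusCode reports whether the status is one of the
// gateway-level codes this PR treats as retryable: 502, 503, 504.
func isTransientStatusCode(code int) bool {
	switch code {
	case http.StatusBadGateway, http.StatusServiceUnavailable, http.StatusGatewayTimeout:
		return true
	}
	return false
}

// isTransientError reports whether err looks like a transient network
// failure: a timeout, EOF, connection reset, or connection refused.
// Assumption: the real helper may detect these differently.
func isTransientError(err error) bool {
	if err == nil {
		return false
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}
	if errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) {
		return true
	}
	return errors.Is(err, syscall.ECONNRESET) || errors.Is(err, syscall.ECONNREFUSED)
}

func main() {
	for _, code := range []int{200, 429, 502, 503, 504} {
		fmt.Printf("status %d transient: %v\n", code, isTransientStatusCode(code))
	}
	wrapped := fmt.Errorf("read tcp: %w", syscall.ECONNRESET)
	fmt.Println("EOF transient:", isTransientError(io.EOF))
	fmt.Println("reset transient:", isTransientError(wrapped))
	fmt.Println("other transient:", isTransientError(errors.New("invalid token")))
}
```

Matching with `errors.Is` and `errors.As` rather than comparing error strings keeps the check working when the error is wrapped, for example inside a `*url.Error` returned by `http.Client`.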

What this does NOT change

  • Retry loop structure and MAX_RETRIES (10) unchanged
  • JSON error handling path unchanged
  • 429 rate limit handling unchanged
  • HTTP client config (DisableKeepAlives, timeouts) unchanged

Test plan

  • go build ./... compiles cleanly
  • go test ./... passes (no existing tests broken)
  • Reproduced 502 with exact provider HTTP client config before fix
  • Manual testing with large Terraform state refresh

🤖 Generated with Claude Code

When Terraform refreshes state for many resources in parallel, the Doppler
API occasionally returns transient 502 errors from Cloudflare (HTML error
pages, ~6443 bytes) or transient network errors (EOF, connection reset).

Previously, non-JSON error responses had RetryAfter set to nil, causing
the retry loop in PerformRequestWithRetry to exit immediately. Similarly,
only net.Error timeouts were retried, while EOF and connection reset errors
failed permanently.

This change:
- Adds retry with 1s backoff for 502, 503, 504 status codes regardless
  of response content type (these are typically infrastructure-level
  transient errors from load balancers/proxies)
- Extends network error retries to cover EOF and connection reset errors,
  not just timeouts

Reproduced by running parallel API requests matching the Terraform provider's
exact HTTP client config (DisableKeepAlives: true, new client per request).
Captured a Cloudflare 502 with Proxy-Status: error=http_response_incomplete,
confirming the origin occasionally sends incomplete responses under load.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>