
fix: retry HTTP requests on transient network errors and 5xx#2779

Closed
lacostej wants to merge 2 commits into lichess-org:main from lacostej:fix/http-retry-resilience

Conversation

@lacostej (Contributor)

Depends on #2777 — review that PR first; this one adds one commit on top.

Summary

  • Extend RetryClient to retry once (after 500ms) on SocketException, TimeoutException, and 5xx server errors, in addition to the existing 429 retry
  • Override lichessClientProvider in makeOfflineTestProviderScope so the offline mock (which throws SocketException) bypasses RetryClient and doesn't create pending timers in FakeAsync-based tests
  • Add 4 dedicated unit tests for retry behavior using fakeAsync for instant execution

Context

The offline test mock (offlineClient) throws SocketException on every request. Without the test fix, RetryClient would catch and retry these, creating delay timers that outlive FakeAsync and fail the test. The fix is to bypass RetryClient in offline tests — those tests verify UI behavior when the network is down, not retry logic. Retry is now tested separately.
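The bypass described above might look roughly like the following. This is a hypothetical sketch based only on the PR description: `lichessClientProvider`, `LichessClient`, and `offlineClient` are names taken from the PR text, and the exact override shape is an assumption, not the actual implementation.

```dart
import 'package:flutter/widgets.dart';
import 'package:flutter_riverpod/flutter_riverpod.dart';

Widget makeOfflineTestProviderScope({required Widget child}) {
  return ProviderScope(
    overrides: [
      // Hand the offline mock to LichessClient directly, skipping the
      // RetryClient wrapper: the SocketException it throws then surfaces
      // immediately instead of scheduling retry timers that would outlive
      // a FakeAsync zone.
      lichessClientProvider.overrideWith(
        (ref) => LichessClient(offlineClient, ref),
      ),
    ],
    child: child,
  );
}
```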

Test plan

  • All 747 tests pass (743 existing + 4 new)
  • No test time regression (31s, same as main)
  • New tests verify: retry on SocketException, retry on 5xx, no retry on 4xx, error propagation after retries exhausted

Adds a default request timeout in LichessClient.send() to prevent
requests from hanging indefinitely, which can cause screens to stay
stuck in loading state (e.g. friends screen).

Fixes lichess-org#2724
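A hedged sketch of what the #2777 timeout could look like: cap each request so a stalled connection raises a TimeoutException instead of leaving the UI stuck in a loading state. The 10-second duration and the wrapper shape are assumptions, not the actual implementation in `LichessClient.send()`.

```dart
import 'dart:async';

import 'package:http/http.dart' as http;

// Assumed default; the real value used by #2777 is not shown in this page.
const Duration _defaultRequestTimeout = Duration(seconds: 10);

Future<http.StreamedResponse> sendWithTimeout(
  http.Client inner,
  http.BaseRequest request,
) {
  // Future.timeout throws TimeoutException if no response arrives in time.
  return inner.send(request).timeout(_defaultRequestTimeout);
}
```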
Extend RetryClient to retry once (after 500ms) on SocketException,
TimeoutException, and 5xx server errors, in addition to 429.

Override lichessClientProvider in makeOfflineTestProviderScope to
bypass RetryClient, since the offline mock throws SocketException
which would otherwise trigger retries and leave pending timers in
FakeAsync-based tests.

Add dedicated unit tests for retry behavior: success after transient
error, success after 5xx, no retry on 4xx, and error propagation
after retries exhausted.
@lacostej force-pushed the fix/http-retry-resilience branch from 8dd2feb to 1bdbd85 on March 16, 2026 at 21:33
```diff
  retries: 1,
  delay: _defaultDelay,
- when: (response) => response.statusCode == 429,
+ when: (response) => response.statusCode == 429 || response.statusCode >= 500,
```
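For context, here is how a predicate like the one above plugs into `package:http`'s RetryClient. Only the `when` change is shown in the PR diff; the rest of this wiring (`retries`, `delay`, `whenError`) is a hedged reconstruction from the PR description, not the repository's actual code.

```dart
import 'dart:async';
import 'dart:io';

import 'package:http/http.dart' as http;
import 'package:http/retry.dart';

// Retry once after a fixed 500ms, per the PR description.
Duration _defaultDelay(int retryCount) => const Duration(milliseconds: 500);

http.Client makeRetryClient(http.Client inner) => RetryClient(
      inner,
      retries: 1,
      delay: _defaultDelay,
      // Retry on rate limiting (429) and on 5xx server errors.
      when: (response) =>
          response.statusCode == 429 || response.statusCode >= 500,
      // Also retry when the request itself fails with a transient error.
      whenError: (error, _) =>
          error is SocketException || error is TimeoutException,
    );
```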
Contributor

Not so sure about the utility of retrying on all 5xx server errors, even though it should not cause any harm. Also not sure about retrying on SocketException, as most of the time it means the network is off.

Also FYI there is a global retry mechanism in the riverpod layer:

```dart
if (error is ServerException && error.statusCode != 503) return null;
```

Of course the client can be used outside a provider, but the most common case is inside a provider. Ideally we should avoid having the two retry mechanisms conflict with each other and fire unnecessary requests.

Not sure what the best approach is here. We could handle the network retries in the client only, which means we would need to improve the logic to filter out network errors in the riverpod retry callback. It could be useful to throw a custom HttpTimeoutException in that case.
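A hedged sketch of that idea: a dedicated timeout exception plus a riverpod-style retry callback that filters network errors. None of this is code from the repository; the class, the callback shape, and the backoff values are assumptions for illustration.

```dart
import 'dart:async';
import 'dart:io';

// Hypothetical exception a client timeout could throw, so the retry
// callback can tell client-side timeouts apart from server errors.
class HttpTimeoutException implements Exception {
  const HttpTimeoutException(this.uri);
  final Uri uri;

  @override
  String toString() => 'HttpTimeoutException: request to $uri timed out';
}

/// Returns a delay before the next retry, or null to give up.
Duration? retryDelay(int retryCount, Object error) {
  // A SocketException usually means the network is simply off: don't retry.
  if (error is SocketException) return null;
  // Client-side timeouts are worth retrying with exponential backoff.
  if (error is HttpTimeoutException || error is TimeoutException) {
    return Duration(milliseconds: 200 * (1 << retryCount));
  }
  return null;
}
```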

@lacostej (Contributor, Author)

Thanks for the review and the pointer to the Riverpod retry — I had completely missed it.

Any reason 429 and 503 are handled differently, at different layers? RetryClient owns 429 retry (once, 900ms), while the Riverpod retry callback handles 503 (up to 5x, exponential backoff). It might make sense to consolidate these in the same place.

I think I need more input on what types of issues the infra is actually facing, and how in the long run the project wants to handle retries across client & server — maybe adding more advanced solutions like Retry-After headers, etc.

That's the reason I implemented this PR on top of #2777 — I thought the first one (request timeout) would be merged faster :)

@veloce (Contributor) commented Mar 18, 2026

I owe you a more detailed explanation indeed, as it is not properly documented in the code.

The situation we have now is partly due to the fact that riverpod had no retry mechanism when I first implemented the http client. Then the riverpod 3.0 refactoring was huge and I didn't want to delay it too much by dealing with that.

Now there is a real reason why 429 and 503 are handled differently.

Lichess API has IP-based rate limits and will return 429 when the limit is reached. While the app should avoid reaching the limits, there are a lot of concurrent requests and it can still (rarely) happen. In that case we decided with the server devs that retrying once was enough (cc @ornicar).

Now the lichess API does not use 503, AFAIK. I assume 503 can be returned by other network layers, and our http client also targets other hosts like the opening explorer, the CDN, etc.

The http retry is currently configured for the lichess client only (the one that targets lichess URIs) and only deals with 429 responses. I think it should stay like that.

The riverpod retry with exponential backoff is better suited to handle the other types of errors (network errors, but not only).
For now the logic is very simple and limited and could definitely be improved. 503 can be handled here, but the question remains for other 5xx errors (I don't think it makes sense to retry them).
We could also catch and retry requests that hit the global client timeout here, and add even more logic if needed.

The goal of this PR could be to improve the documentation and add doc comments to relevant places in the code. It could also improve the riverpod retry logic. But the http client retry logic should remain the same.

@lacostej (Contributor, Author)

Having this detailed explanation was very helpful. I'll see what I'll make of all of this. I'll close this PR for now. Thanks a lot.

@lacostej closed this Mar 18, 2026