
fix: retry HTTP requests on transient network errors and 5xx#2779

Closed
lacostej wants to merge 2 commits into lichess-org:main from lacostej:fix/http-retry-resilience

Conversation

@lacostej (Contributor)

Depends on #2777 — review that PR first; this one adds one commit on top.

Summary

  • Extend RetryClient to retry once (after 500ms) on SocketException, TimeoutException, and 5xx server errors, in addition to the existing 429 retry
  • Override lichessClientProvider in makeOfflineTestProviderScope so the offline mock (which throws SocketException) bypasses RetryClient and doesn't create pending timers in FakeAsync-based tests
  • Add 4 dedicated unit tests for retry behavior using fakeAsync for instant execution

Context

The offline test mock (offlineClient) throws SocketException on every request. Without the test fix, RetryClient would catch and retry these, creating delay timers that outlive FakeAsync and fail the test. The fix is to bypass RetryClient in offline tests — those tests verify UI behavior when the network is down, not retry logic. Retry is now tested separately.
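The bypass described above might look roughly like the following. This is a hypothetical sketch based only on the PR description: `lichessClientProvider`, `LichessClient`, and `offlineClient` are names taken from the PR text, and the exact override shape is an assumption, not the actual implementation.

```dart
import 'package:flutter/widgets.dart';
import 'package:flutter_riverpod/flutter_riverpod.dart';

Widget makeOfflineTestProviderScope({required Widget child}) {
  return ProviderScope(
    overrides: [
      // Hand the offline mock to LichessClient directly, skipping the
      // RetryClient wrapper: the SocketException it throws then surfaces
      // immediately instead of scheduling retry timers that would outlive
      // a FakeAsync zone.
      lichessClientProvider.overrideWith(
        (ref) => LichessClient(offlineClient, ref),
      ),
    ],
    child: child,
  );
}
```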

Test plan

  • All 747 tests pass (743 existing + 4 new)
  • No test time regression (31s, same as main)
  • New tests verify: retry on SocketException, retry on 5xx, no retry on 4xx, error propagation after retries exhausted

Adds a default request timeout in LichessClient.send() to prevent
requests from hanging indefinitely, which can cause screens to stay
stuck in loading state (e.g. friends screen).

Fixes lichess-org#2724
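A hedged sketch of what the #2777 timeout could look like: cap each request so a stalled connection raises a TimeoutException instead of leaving the UI stuck in a loading state. The 10-second duration and the wrapper shape are assumptions, not the actual implementation in `LichessClient.send()`.

```dart
import 'dart:async';

import 'package:http/http.dart' as http;

// Assumed default; the real value used by #2777 is not shown in this page.
const Duration _defaultRequestTimeout = Duration(seconds: 10);

Future<http.StreamedResponse> sendWithTimeout(
  http.Client inner,
  http.BaseRequest request,
) {
  // Future.timeout throws TimeoutException if no response arrives in time.
  return inner.send(request).timeout(_defaultRequestTimeout);
}
```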
Extend RetryClient to retry once (after 500ms) on SocketException,
TimeoutException, and 5xx server errors, in addition to 429.

Override lichessClientProvider in makeOfflineTestProviderScope to
bypass RetryClient, since the offline mock throws SocketException
which would otherwise trigger retries and leave pending timers in
FakeAsync-based tests.

Add dedicated unit tests for retry behavior: success after transient
error, success after 5xx, no retry on 4xx, and error propagation
after retries exhausted.
@lacostej force-pushed the fix/http-retry-resilience branch from 8dd2feb to 1bdbd85 on March 16, 2026 at 21:33
```diff
  retries: 1,
  delay: _defaultDelay,
- when: (response) => response.statusCode == 429,
+ when: (response) => response.statusCode == 429 || response.statusCode >= 500,
```
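For context, here is how a predicate like the one above plugs into `package:http`'s RetryClient. Only the `when` change is shown in the PR diff; the rest of this wiring (`retries`, `delay`, `whenError`) is a hedged reconstruction from the PR description, not the repository's actual code.

```dart
import 'dart:async';
import 'dart:io';

import 'package:http/http.dart' as http;
import 'package:http/retry.dart';

// Retry once after a fixed 500ms, per the PR description.
Duration _defaultDelay(int retryCount) => const Duration(milliseconds: 500);

http.Client makeRetryClient(http.Client inner) => RetryClient(
      inner,
      retries: 1,
      delay: _defaultDelay,
      // Retry on rate limiting (429) and on 5xx server errors.
      when: (response) =>
          response.statusCode == 429 || response.statusCode >= 500,
      // Also retry when the request itself fails with a transient error.
      whenError: (error, _) =>
          error is SocketException || error is TimeoutException,
    );
```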
Contributor

Not so sure about the utility of retrying on all 5xx server errors, even though it should not cause any harm. Also not sure about retrying on SocketException, as most of the time it means the network is off.

Also FYI there is a global retry mechanism in the riverpod layer:

```dart
if (error is ServerException && error.statusCode != 503) return null;
```

Of course the client can be used outside a provider, but the most common case is inside a provider. Ideally we should avoid having the two retry mechanisms conflict with each other and fire unnecessary requests.

Not sure what the best approach is here. We could handle the network retries in the client only, which means we would need to improve the logic to filter out network errors in the riverpod retry callback. It could be useful to throw a custom HttpTimeoutException in that case.
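A hedged sketch of that idea: a dedicated timeout exception plus a riverpod-style retry callback that filters network errors. None of this is code from the repository; the class, the callback shape, and the backoff values are assumptions for illustration.

```dart
import 'dart:async';
import 'dart:io';

// Hypothetical exception a client timeout could throw, so the retry
// callback can tell client-side timeouts apart from server errors.
class HttpTimeoutException implements Exception {
  const HttpTimeoutException(this.uri);
  final Uri uri;

  @override
  String toString() => 'HttpTimeoutException: request to $uri timed out';
}

/// Returns a delay before the next retry, or null to give up.
Duration? retryDelay(int retryCount, Object error) {
  // A SocketException usually means the network is simply off: don't retry.
  if (error is SocketException) return null;
  // Client-side timeouts are worth retrying with exponential backoff.
  if (error is HttpTimeoutException || error is TimeoutException) {
    return Duration(milliseconds: 200 * (1 << retryCount));
  }
  return null;
}
```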

@lacostej (Contributor, Author)

Thanks for the review and the pointer to the Riverpod retry — I had completely missed it.

Any reason 429 and 503 are handled differently, at different layers? RetryClient owns 429 retry (once, 900ms), while the Riverpod retry callback handles 503 (up to 5x, exponential backoff). It might make sense to consolidate these in the same place.

I think I need more input on what types of issues the infra is actually facing, and how in the long run the project wants to handle retries across client & server — maybe adding more advanced solutions like Retry-After headers, etc.

That's the reason I implemented this PR on top of #2777 — I thought the first one (request timeout) would be merged faster :)

@veloce (Contributor) commented Mar 18, 2026

I owe you a more detailed explanation indeed, as it is not properly documented in the code.

The situation we have now is partly due to the fact that riverpod had no retry mechanism when I first implemented the http client. Then the riverpod 3.0 refactoring was huge and I didn't want to delay it too much by dealing with that.

Now there is a real reason why 429 and 503 are handled differently.

Lichess API has IP-based rate limits and will return 429 when the limit is reached. While the app should avoid reaching the limits, there are a lot of concurrent requests and it can still (rarely) happen. In that case we decided with the server devs that retrying once was enough (cc @ornicar).

Now the lichess API does not use 503, AFAIK. I assume 503 can be returned by other network layers, and our http client also targets other hosts like the opening explorer, the CDN, etc.

The http retry is currently configured for the lichess client only (the one that targets lichess URIs) and only deals with 429 responses. I think it should stay like that.

The riverpod retry with exponential backoff is better suited to handle the other types of errors (network errors, but not only).
For now the logic is very simple and limited and could definitely be improved. 503 can be handled here, but the question remains for other 5xx errors (I don't think it makes sense to retry them).
We could also catch and retry requests that hit the global client timeout here, and add even more logic if needed.

The goal of this PR could be to improve the documentation and add doc comments to relevant places in the code. It could also improve the riverpod retry logic. But the http client retry logic should remain the same.

@lacostej (Contributor, Author)

Having this detailed explanation was very helpful. I'll see what I'll make of all of this. I'll close this PR for now. Thanks a lot.

@lacostej closed this Mar 18, 2026