fix: retry HTTP requests on transient network errors and 5xx #2779

lacostej wants to merge 2 commits into lichess-org:main
Conversation
Adds a default request timeout in LichessClient.send() to prevent requests from hanging indefinitely, which can cause screens to stay stuck in loading state (e.g. friends screen). Fixes lichess-org#2724
Extends RetryClient to retry once (after 500ms) on SocketException, TimeoutException, and 5xx server errors, in addition to 429. Overrides lichessClientProvider in makeOfflineTestProviderScope to bypass RetryClient, since the offline mock throws SocketException, which would otherwise trigger retries and leave pending timers in FakeAsync-based tests. Adds dedicated unit tests for retry behavior: success after a transient error, success after a 5xx, no retry on 4xx, and error propagation once retries are exhausted.
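A minimal sketch of the configuration described above, using `RetryClient` from `package:http/retry.dart`. The helper name `makeRetryClient` and the timeout value are illustrative, not taken from the PR:

```dart
import 'dart:async';
import 'dart:io';

import 'package:http/http.dart' as http;
import 'package:http/retry.dart';

const _defaultDelay = Duration(milliseconds: 500);
const _defaultTimeout = Duration(seconds: 10); // assumed value, not from the PR

/// Wraps [inner] so requests are retried once on 429, 5xx,
/// and transient network/timeout errors.
http.Client makeRetryClient(http.Client inner) => RetryClient(
      inner,
      retries: 1,
      delay: (_) => _defaultDelay,
      // Retry on rate limiting (429) and on server errors (5xx).
      when: (response) =>
          response.statusCode == 429 || response.statusCode >= 500,
      // Also retry when the request itself failed with a transient error.
      whenError: (error, _) =>
          error is SocketException || error is TimeoutException,
    );

// The request timeout from LichessClient.send() would be applied separately,
// e.g. inner.send(request).timeout(_defaultTimeout).
```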
```diff
  retries: 1,
  delay: _defaultDelay,
- when: (response) => response.statusCode == 429,
+ when: (response) => response.statusCode == 429 || response.statusCode >= 500,
```
Not so sure about the utility of retrying on all 5xx server errors, even though it should not cause any harm. Also not sure about retrying on SocketException, as most of the time it means the network is off.
Also FYI there is a global retry mechanism in the riverpod layer (see line 45 in e65f046).
Of course the client can be used outside a provider, but the most common case is inside a provider. Ideally we should avoid the two retry mechanisms conflicting with each other and causing unnecessary requests.

Not sure what the best approach is here. We could handle the network retries in the client only, which means we need to improve the logic to filter out network errors in the riverpod retry callback. It could be useful to throw a custom HttpTimeoutException in that case.
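A rough sketch of that suggestion: a custom `HttpTimeoutException` thrown by the client, and a riverpod retry callback that skips errors the http layer has already retried. All names are hypothetical, and the callback shape assumes riverpod 3.0's configurable `retry` parameter:

```dart
import 'dart:io';

/// Hypothetical exception the http client could throw when a request
/// exceeds its timeout, so the riverpod layer can recognize it.
class HttpTimeoutException implements Exception {
  const HttpTimeoutException(this.uri);
  final Uri uri;
}

/// Riverpod retry callback: return null to stop retrying,
/// or a Duration to retry after that delay.
Duration? retryCallback(int retryCount, Object error) {
  // Network errors were already retried once by the http client;
  // don't retry them again at the provider layer.
  if (error is SocketException || error is HttpTimeoutException) return null;
  if (retryCount >= 5) return null;
  // Exponential backoff for everything else: 200ms, 400ms, 800ms, ...
  return Duration(milliseconds: 200 * (1 << retryCount));
}
```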
Thanks for the review and the pointer to the riverpod retry; I had completely missed it. Any reason 429 and 503 are handled differently, at different layers? I think I need more input on what types of issues the infra is actually facing, and on how the project wants to handle retries across client and server in the long run, maybe adding more advanced solutions. That's the reason I implemented this PR on top of #2777: I thought the first one (request timeout) would be merged faster :)
I owe you a more detailed explanation indeed, as it is not properly documented in the code. The situation we have now is partly due to the fact that riverpod had no retry mechanism when I first implemented the http client. Then the riverpod 3.0 refactoring was huge and I didn't want to delay it further by dealing with that.

Now, there is a real reason why 429 and 503 are handled differently. The Lichess API has IP-based rate limits and will return 429 when the limit is reached. While the app should avoid reaching the limits, there are a lot of concurrent requests and it can still (rarely) happen. In that case we decided with the server devs that retrying once was enough (cc @ornicar).

The Lichess API does not use 503 AFAIK. I assume 503 can be returned by other network layers, and our http client also targets other hosts like the opening explorer, the CDN, etc. The http retry is currently configured for the lichess client only (the one that targets lichess URIs) and only deals with 429 responses. I think it should stay like that. The riverpod retry with exponential backoff is more suitable for handling the other types of errors (which can be network errors, but not only).

The goal of this PR could be to improve the documentation and add doc comments to relevant places in the code. It could also improve the riverpod retry logic. But the http client retry logic should remain the same.
Having this detailed explanation was very helpful. I'll see what I make of all of this. I'll close this PR for now. Thanks a lot.
Summary
- Extend `RetryClient` to retry once (after 500ms) on `SocketException`, `TimeoutException`, and 5xx server errors, in addition to the existing 429 retry
- Override `lichessClientProvider` in `makeOfflineTestProviderScope` so the offline mock (which throws `SocketException`) bypasses `RetryClient` and doesn't create pending timers in `FakeAsync`-based tests
- Unit tests use `fakeAsync` for instant execution

Context
The offline test mock (`offlineClient`) throws `SocketException` on every request. Without the test fix, `RetryClient` would catch and retry these, creating delay timers that outlive `FakeAsync` and fail the test. The fix is to bypass `RetryClient` in offline tests: those tests verify UI behavior when the network is down, not retry logic. Retry is now tested separately.

Test plan
- Success after a transient `SocketException`
- Retry on 5xx
- No retry on 4xx
- Error propagation after retries exhausted
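The override described in the summary might look roughly like this. The provider and mock names follow the PR text, but their definitions here are hypothetical stand-ins; the exact wiring in the real `makeOfflineTestProviderScope` will differ:

```dart
import 'dart:io';

import 'package:flutter/widgets.dart';
import 'package:flutter_riverpod/flutter_riverpod.dart';
import 'package:http/http.dart' as http;
import 'package:http/testing.dart';

// Hypothetical stand-ins for the app's real provider and offline mock.
final lichessClientProvider =
    Provider<http.Client>((ref) => throw UnimplementedError());
final http.Client offlineClient =
    MockClient((_) async => throw const SocketException('offline'));

/// Builds a test scope whose client bypasses RetryClient entirely, so the
/// SocketExceptions thrown by the mock never schedule retry delay timers
/// that would outlive fakeAsync.
ProviderScope makeOfflineTestProviderScope({required Widget child}) =>
    ProviderScope(
      overrides: [
        lichessClientProvider.overrideWith((ref) => offlineClient),
      ],
      child: child,
    );
```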