Add retry limiter support #6318
Conversation
```java
if (!RetryLimiterExecutor.shouldRetry(limiter, ctx, currentAttemptNo)) {
    return -1;
```
It would be nice to have metrics about this, but AFAIK it does not seem easy to add metrics here; maybe we could add some debug logging?
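For illustration only, a minimal sketch of what such debug logging could look like, reusing the variables from the diff above and assuming an SLF4J `logger` field is available in the class:

```java
if (!RetryLimiterExecutor.shouldRetry(limiter, ctx, currentAttemptNo)) {
    // Hypothetical debug log; `logger` is assumed to be the class's SLF4J logger.
    logger.debug("Retry suppressed by {} (attempt #{}) for {}", limiter, currentAttemptNo, ctx);
    return -1;
}
```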
```java
final RetryConfig<HttpResponse> config = mappedRetryConfig(ctx);
if (!ctx.exchangeType().isResponseStreaming() || config.requiresResponseTrailers()) {
    RetryLimiterExecutor.onCompletedAttempt(config.retryLimiter(), ctx, ctx.log().partial(),
```
I am not sure if this is the correct place to hook; it seems to work, but the logic is convoluted, so maybe there is a better place to do it.
In fact I think this should be in the handling of the aggregate, but I'm not 100% sure.
I agree with that being convoluted; I am about to clean up the retrying clients here: #6292 (on halt atm because I have exams).
So I think we can let that PR tackle this.
In the end we want to have only a single, centralized call to onCompletedAttempt.
@friscoMad At the point where you put the onCompletedAttempt we have just initiated the request; it is almost always not finished yet. Since it should be "Invoked when an attempt completes", we may want to register it as a listener on the whenComplete future of the attempt response.
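A rough sketch of that suggestion, assuming the variables used elsewhere in this PR (`config`, `derivedCtx`, `currentAttemptNo`) and Armeria's `RequestLogAccess.whenComplete()`; this illustrates the reviewer's idea rather than the PR's actual code:

```java
// Defer the limiter callback until the attempt's RequestLog is actually complete.
derivedCtx.log().whenComplete().thenAccept(completedLog -> {
    RetryLimiterExecutor.onCompletedAttempt(config.retryLimiter(), ctx, completedLog, currentAttemptNo);
});
```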
```java
import com.linecorp.armeria.common.logging.RequestLog;

/**
 * Executes {@link RetryLimiter} operations with proper exception handling.
```
During tests I found out that throwing inside the RetryLimiter methods caused the client to never return anything; it just stayed blocked waiting on the future. This fixes the issue; maybe we should do the same for the retry filter functions.
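A minimal sketch of the fail-safe wrapper being described (the class name and logger are illustrative; `RetryLimiter` is assumed to be the interface added by this PR, and the fail-open `return true` mirrors the behaviour visible later in this diff):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.linecorp.armeria.client.ClientRequestContext;
import com.linecorp.armeria.common.annotation.Nullable;

final class RetryLimiterExecutorSketch {
    private static final Logger logger = LoggerFactory.getLogger(RetryLimiterExecutorSketch.class);

    static boolean shouldRetry(@Nullable RetryLimiter limiter, ClientRequestContext ctx, int numAttemptsSoFar) {
        if (limiter == null) {
            return true;
        }
        try {
            return limiter.shouldRetry(ctx, numAttemptsSoFar);
        } catch (Throwable t) {
            // Fail open so a buggy limiter never blocks the client forever.
            logger.error("Failed to execute RetryLimiter.shouldRetry: limiter={}, attempts={}",
                         limiter, numAttemptsSoFar, t);
            return true;
        }
    }

    private RetryLimiterExecutorSketch() {}
}
```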
```diff
  */
 @SuppressWarnings("MethodMayBeStatic") // Intentionally left non-static for better user experience.
-protected final long getNextDelay(ClientRequestContext ctx, Backoff backoff, long millisAfterFromServer) {
+protected final long getNextDelay(ClientRequestContext ctx, @Nullable RetryLimiter limiter,
```
Note: AbstractRetryingClient is public so this and the change above are breaking changes. Let us note it in the PR description.
But these are protected methods; do they also count as public API?
As this class is also not final, people could inherit from it and use those methods. I cannot think of a use case for it, but theoretically it is possible. The doc comments also do not advise against it 🤷
```java
 * @throws IllegalArgumentException if any argument is invalid
 */
public GrpcRetryLimiter(int maxTokens, int tokenRatio, int threshold,
                        Collection<Integer> retryableStatuses) {
```
Why not increase type safety by demanding a Set<Status> with Status from the gRPC package? You could get it by first getting the Status.Code with valueOf: https://grpc.github.io/grpc-java/javadoc/io/grpc/Status.Code.html#valueOf(java.lang.String), and then obtain the Status with Status.fromCode.
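For illustration, a small sketch of the suggested conversion using the gRPC Java API (io.grpc.Status); the class and method names here are made up:

```java
import java.util.Set;
import java.util.stream.Collectors;

import io.grpc.Status;

final class GrpcStatusConversionSketch {
    // Converts status-code names such as "UNAVAILABLE" into io.grpc.Status instances.
    static Set<Status> toStatuses(Set<String> codeNames) {
        return codeNames.stream()
                        .map(Status.Code::valueOf) // String -> Status.Code
                        .map(Status::fromCode)     // Status.Code -> Status
                        .collect(Collectors.toSet());
    }

    private GrpcStatusConversionSketch() {}
}
```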
This lives in the core package, which does not depend on gRPC, and I think we want it that way. I could move the implementation to the grpc package, but then we cannot have a RetryLimiter.ofGrpc, which to me makes more sense; I tried to follow the pattern of the gRPC retry rule.
@friscoMad Maybe we can move GrpcRetryLimiter to the gRPC package and offer a static GrpcRetryLimiter.of factory method?
We can do that, but I would like to hear from @ikhoon as he was the one asking for the builders on the RetryLimiter class.
For example, the gRPC retry rule is in core instead of the grpc package, which makes more sense, so I was not sure what to do.
```java
    return -1;
}

if (!RetryLimiterExecutor.shouldRetry(limiter, ctx, currentAttemptNo)) {
```
I think in Armeria it is a convention to put the ctx as the first argument
```java
RetryConfig(RetryRule retryRule, int maxTotalAttempts, long responseTimeoutMillisForEachAttempt) {
    this(requireNonNull(retryRule, "retryRule"), null,
         maxTotalAttempts, responseTimeoutMillisForEachAttempt, 0);

RetryConfig(RetryRule retryRule, @Nullable RetryLimiter retryLimiter, int maxTotalAttempts,
```
This and the change below are also breaking
```java
final int threshold;

/**
 * 1000 times the tokenRatio field of the retryThrottling policy in service config.
```
This fact should also be documented in the constructor and class doc IMHO
Added in the class doc and updated here, not included in the constructor as it should be an implementation detail.
Yeah I misunderstood the scaling part: #6318 (comment)
```java
if (maxTokens <= 0 || tokenRatio <= 0 || threshold <= 0) {
    throw new IllegalArgumentException("maxTokens, tokenRatio, and threshold must be positive: " +
                                       "maxTokens=" + maxTokens + ", tokenRatio=" + tokenRatio +
                                       ", threshold=" + threshold);
}
if (threshold > maxTokens) {
    throw new IllegalArgumentException("threshold must be less than or equal to maxTokens: " +
                                       threshold + " > " + maxTokens);
}
if (retryableStatuses == null || retryableStatuses.isEmpty()) {
    throw new IllegalArgumentException("retryableStatuses cannot be null or empty");
}
```
As far as I can see Armeria uses Preconditions and Objects.requireNonNull for null and range checks
Thanks, I didn't know about this, so I'll change it.
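For illustration, the validation above rewritten in that style (a constructor-body sketch only; Armeria's exact message conventions may differ):

```java
import static com.google.common.base.Preconditions.checkArgument;
import static java.util.Objects.requireNonNull;

// Inside the constructor:
checkArgument(maxTokens > 0, "maxTokens: %s (expected: > 0)", maxTokens);
checkArgument(tokenRatio > 0, "tokenRatio: %s (expected: > 0)", tokenRatio);
checkArgument(threshold > 0 && threshold <= maxTokens,
              "threshold: %s (expected: 0 < threshold <= maxTokens(%s))", threshold, maxTokens);
requireNonNull(retryableStatuses, "retryableStatuses");
checkArgument(!retryableStatuses.isEmpty(), "retryableStatuses can't be empty");
```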
```java
        return true;
    }
} catch (Throwable t) {
    logger.error("Failed to execute retry limiter", t);
```
How about also attaching numAttemptsSoFar and RetryLimiter.toString() (yet to be implemented) to provide more context? The same applies to the other method.
```java
 * @param maxTokens Initial token count
 * @param tokenRatio Number of tokens a successful request increments
 */
public GrpcRetryLimiter(int maxTokens, int tokenRatio) {
```
AFAIU the whole purpose of scaling up is to accept floats, right? Then I think we need to:
- accept float
- scale up everything
- convert to int
- then divide by 2

(see the sketch below)
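A sketch of that scaling scheme, loosely mirroring gRPC Java's throttle (scale the float inputs by 1000, store ints, derive the threshold as half of the scaled maxTokens); the class and constant names are illustrative, not the PR's actual code:

```java
final class ScaledThrottleSketch {
    private static final int SCALE = 1000;

    private final int maxTokens;   // scaled by 1000
    private final int tokenRatio;  // scaled by 1000
    private final int threshold;   // half of the scaled maxTokens ("divide by 2")

    ScaledThrottleSketch(float maxTokens, float tokenRatio) {
        // Accept floats, scale everything up, convert to int, then divide by 2 for the threshold.
        this.maxTokens = (int) (maxTokens * SCALE);
        this.tokenRatio = (int) (tokenRatio * SCALE);
        this.threshold = this.maxTokens / 2;
    }
}
```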
I missed the float on the constructors, ok, I will need to split the constructors and update some tests. I will do that tomorrow. Thanks for the heads up, I totally missed that part.
Ok, it should be fixed now!
```java
 *
 * <p>The implementation is heavily based on GRPC Java <a href="https://github.com/grpc/grpc-java/blob/94532a6b56076c56fb9278e9195bba1190a9260d/core/src/main/java/io/grpc/internal/RetriableStream.java#L1463">implementation</a>.
 */
public class GrpcRetryLimiter implements RetryLimiter {
```
This implementation tries to mimic gRPC Java as much as possible. There are two points that are not implemented.
gRPC Java has support for the pushback header; as that is not implemented in the RetryingClient, I have not added support for it here.
The other difference is that gRPC Java also counts as errors (for reducing tokens) every status defined in the hedging policy as nonFatalStatusCodes. We don't have hedging implemented, but the same behaviour can be replicated with the extended constructor by passing the complete list of statuses.
Codecov Report

❌ Patch coverage is … — additional details and impacted files:

```
@@             Coverage Diff             @@
##               main    #6318     +/-   ##
============================================
- Coverage     74.46%   74.13%    -0.33%
- Complexity    22234    23030     +796
============================================
  Files          1963     2065     +102
  Lines         82437    86225    +3788
  Branches      10764    11326     +562
============================================
+ Hits          61385    63924    +2539
- Misses        15918    16882     +964
- Partials       5134     5419     +285
```

☔ View full report in Codecov by Sentry.
Commits: Refactor constructors · Update javadoc
```java
 * @param maxTokens the initial token count (must be positive)
 * @param tokenRatio the number of tokens a successful request increments (must be positive)
 */
static RetryLimiter ofGrpc(int maxTokens, int tokenRatio) {
```
:)
🤦 Why is this not breaking the build? It should fail to compile
The conversion from int to float is actually legal:
https://onecompiler.com/jshell/43rjpddqw
https://www.baeldung.com/java-primitive-conversions#widening-primitive-conversions
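A tiny self-contained example of the widening conversion in question (the method names are made up):

```java
public class WideningDemo {
    // Mirrors the (float, float) factory shape: int arguments are widened to float automatically.
    static float sum(float maxTokens, float tokenRatio) {
        return maxTokens + tokenRatio;
    }

    public static void main(String[] args) {
        System.out.println(sum(10, 1)); // compiles without a cast and prints 11.0
    }
}
```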
```java
 * @param retryableStatuses the collection of gRPC status codes that are considered retryable (must not be
 *                          null or empty)
 */
static RetryLimiter ofGrpc(int maxTokens, int tokenRatio, int threshold,
```
:))
Did another pass and only have a few nits
```java
    }
} catch (Throwable t) {
    logger.error(
            "Failed to execute retry limiter, limiter: " + limiter + " attempts: " + numAttemptsSoFar,
```
Let us make the error message different from the one below so users can easily separate them when they filter their errors:

```diff
-"Failed to execute retry limiter, limiter: " + limiter + " attempts: " + numAttemptsSoFar,
+"Failed to execute RetryLimiter.onCompletedAttempt: limiter={}, attempts={}", limiter, numAttemptsSoFar,
```
For that we need toString() methods implemented for the limiters:

```java
@Override
public String toString() {
    return MoreObjects.toStringHelper(this)
                      .add("maxTokens", maxTokens)
                      .add("threshold", threshold)
                      .add("tokenRatio", tokenRatio)
                      .add("retryableStatuses", retryableStatuses)
                      .add("currentTokenCount", tokenCount.get())
                      .toString();
}
```

```java
@Override
public String toString() {
    return MoreObjects.toStringHelper(this)
                      .add("permitsPerSecond", rateLimiter.getRate())
                      .toString();
}
```

(did not test it)
```java
 * @param permitsPerSecond the number of retry permits allowed per second; must be positive
 */
public RetryRateLimiter(double permitsPerSecond) {
    rateLimiter = RateLimiter.create(permitsPerSecond);
```
Let us be explicit here and check for permitsPerSecond being positive ourselves
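For illustration, an explicit check in the constructor could look like this (a sketch; Guava's RateLimiter.create would also reject a non-positive rate, but with its own message):

```java
import static com.google.common.base.Preconditions.checkArgument;

public RetryRateLimiter(double permitsPerSecond) {
    checkArgument(permitsPerSecond > 0,
                  "permitsPerSecond: %s (expected: > 0)", permitsPerSecond);
    rateLimiter = RateLimiter.create(permitsPerSecond);
}
```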
```java
    }
} catch (Throwable t) {
    logger.error(
            "Failed to execute retry limiter, limiter: " + limiter + " attempts: " + numAttemptsSoFar,
```
(see comment above)

```diff
-"Failed to execute retry limiter, limiter: " + limiter + " attempts: " + numAttemptsSoFar,
+"Failed to execute RetryLimiter.shouldRetry: limiter={}, attempts={}", limiter, numAttemptsSoFar,
```
```java
// Wait for 0.5 seconds to allow rate limiter to refill (2 permits per second = 0.5 seconds per permit)
Thread.sleep(600);
```
```diff
-// Wait for 0.5 seconds to allow rate limiter to refill (2 permits per second = 0.5 seconds per permit)
-Thread.sleep(600);
+// Wait for 0.5 seconds + some tolerance to allow rate limiter to refill (2 permits per second = 0.5 seconds per permit)
+Thread.sleep(500 + 100);
```
```java
// Wait for 2 seconds to allow rate limiter to refill (0.5 permits per second = 2 seconds per permit)
try {
    Thread.sleep(2100);
```
```diff
-// Wait for 2 seconds to allow rate limiter to refill (0.5 permits per second = 2 seconds per permit)
-try {
-    Thread.sleep(2100);
+// Wait for 2 seconds + some tolerance to allow rate limiter to refill (0.5 permits per second = 2 seconds per permit)
+try {
+    Thread.sleep(2000 + 100);
```
```java
 * and doing nothing for {@link #onCompletedAttempt}).
 * </p>
 */
public final class RetryLimiterExecutor {
```
I would not increase the API surface with this class, i.e. I would remove public.
Fixed
If we are accepting the breaking changes, then we should release this PR together with #6292 to prevent two successive breaking changes to the same API.
```java
import com.linecorp.armeria.common.logging.RequestLog;

/**
 * A strategy interface for limiting retries in Armeria clients.
```
As an implementor I would want clarity on how thread-safe my implementation needs to be.
https://faithlife.codes/blog/2008/03/degrees_of_thread_safety/
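For illustration, a sketch of the kind of wording the interface Javadoc could carry (the exact guarantees are for the PR author to decide):

```java
/**
 * A strategy interface for limiting retries in Armeria clients.
 *
 * <p>Thread safety: a single {@code RetryLimiter} instance may be shared across requests and
 * invoked concurrently from multiple event loop threads, so implementations must be fully
 * thread-safe (e.g. {@code GrpcRetryLimiter} keeps its token count in an atomic counter).
 */
public interface RetryLimiter { /* methods elided */ }
```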
I think I made it more explicit, let me know if it is better now.
```java
executor.submit(() -> {
    try {
        for (int j = 0; j < attemptsPerThread; j++) {
            if (limiter.shouldRetry(ctx, j + 2)) {
```
Why +2?
Most of the tests were generated with AI support, and even though I reviewed and changed several parts, I skipped some of those eccentricities 😄
But it's addressed now.
Commits: Fixed RetryLimiter.of constructors · Added test for RetryLimiter.of constructors · Improved logs · Added toString methods · Checked RateLimiter param · Add toString docs
schiemon left a comment:
Sorry for the incremental reviews, but I found a few important points here 😆
```java
}
// Extract the headers to be able to evaluate the gRPC status
// Check HTTP trailers first, because most gRPC responses have non-empty payload + trailers.
HttpHeaders maybeGrpcTrailers = requestLog.responseTrailers();
```
If the request log does not have trailers, this will throw a RequestLogAvailabilityException rather than return null. You can check for their existence with log.isAvailable(RequestLogProperty.RESPONSE_HEADERS, RequestLogProperty.RESPONSE_TRAILERS). Then you can also remove the cause check above.
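A minimal sketch of that suggestion, reusing the variable names from the diff above (whether to fall back to the response headers is left to the implementation):

```java
// Only read the trailers once the log actually has them; this avoids RequestLogAvailabilityException.
if (!requestLog.isAvailable(RequestLogProperty.RESPONSE_TRAILERS)) {
    return null; // or fall back to RequestLogProperty.RESPONSE_HEADERS
}
final HttpHeaders maybeGrpcTrailers = requestLog.responseTrailers();
```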
Changed 👍
```java
    final int incremented = currentCount + tokenRatio;
    updated = tokenCount.compareAndSet(currentCount, Math.min(incremented, maxTokens));
}
if (updated) {
```
nit: If you put !updated in the while-loop head then you can remove this if here.
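A sketch of the suggested shape, assuming the fields visible in the diff (an atomic tokenCount, tokenRatio and maxTokens); whatever ran inside the old if (updated) block simply follows the loop:

```java
boolean updated = false;
while (!updated) {
    final int currentCount = tokenCount.get();
    final int incremented = currentCount + tokenRatio;
    updated = tokenCount.compareAndSet(currentCount, Math.min(incremented, maxTokens));
}
// The loop only exits after a successful CAS, so the separate `if (updated)` check becomes unnecessary.
```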
Makes sense 👍
```java
                          HttpResponse originalRes) {
    // Notify the retry limiter that this attempt has completed
    final RetryConfig<HttpResponse> config = mappedRetryConfig(ctx);
    RetryLimiterExecutor.onCompletedAttempt(config.retryLimiter(), ctx, derivedCtx.log().partial(),
```
You can do this but then you also have to cover the exceptional case with handleException
A test case for this would be great
jrhee17 left a comment:
Sorry about the late review; some rough thoughts I had while taking an initial look:

- RetryConfig#requiresResponseTrailers dictates whether to wait for trailers before deciding whether to retry or not. Given the current implementation, I'm unsure if GrpcRetryLimiter will be able to consistently retrieve gRPC trailers for streaming responses (e.g. if RetryRule.onUnprocessed() is used, RetryLimiter#onCompletedAttempt may be invoked before the trailers arrive for a streaming response). In general (and not just gRPC), I guess users will have to keep track of what properties will be available at RetryLimiter based on how RetryConfig#requiresResponseTrailers is set.
- When determining whether a retry should be done or not, if users would like to retry on a certain grpc-status, it seems like they should set both the GrpcRetryLimiter and the RetryRule since they seem to have an AND relationship (if they would like to use GrpcRetryLimiter). In general, I'm still trying to organize the exact distinction between RetryRule and RetryLimiter. e.g.

```java
RetryRule.builder()
         .onGrpcTrailers((ctx, trailers) -> trailers.containsInt("grpc-status", 15))
         ...
new GrpcRetryLimiter(0, 0, 0, List.of(15))
```

- Maybe nit, but it seems like RetryLimiter#shouldRetry may be called even if a retry won't actually execute (e.g. due to setResponseTimeout checks, request abortion, etc.). While it may not be a big deal, this may be a little awkward for the RetryRateLimiter implementation, which acquires a permit.

I'm nitpicking because I'm still trying to grasp the full picture at the moment.
I would also like to do some research on gRPC's implementation (as well as envoy's) to have a better idea of retry budgets in general before leaving more comments.
Yes, and this is difficult even for me trying to understand how the code works. I am not sure if there is a reason for having a gRPC retry limiter without a gRPC retry rule, but maybe we can make some changes to make that decision explicit in RetryLimiter and enable
RetryRule decides if a retry should be done; the retry limiter can block a retry based on whatever rules are implemented (like max retries in flight, max retries per minute, token bucket...). The retry limiter is only triggered if a retry is going to be done.
This is a valid concern and I am happy to move the check to another point in the timeline; I have a lot of difficulty fully understanding the flow and finding where to add the 2 hooks (
jrhee17 left a comment:
I'm still organizing my thoughts, but I think it looks good overall. Left some minor comments (mostly to show that I'm not forgetting about this PR).
```java
 */
@Override
public boolean shouldRetry(ClientRequestContext ctx, int numAttemptsSoFar) {
    return rateLimiter.tryAcquire();
```
I think it makes sense to add a callback method onStartAttempt to RetryLimiter which indicates that a request is about to be sent.
For each invocation of onStartAttempt a matching onCompletedAttempt is guaranteed to be called.
I think shouldRetry is best left stateless - I imagine this will be useful not only for retries but for hedging later on as well, since we'll need to check asynchronously.
The additional onStartAttempt would also help implement generic rate-limiting retry budgets like the one in envoy.
ref: https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/circuit_breaker.proto#config-cluster-v3-circuitbreakers-thresholds-retrybudget
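A sketch of the interface shape this comment proposes (the reviewer's idea, not the PR's current API; the onCompletedAttempt signature is assumed from the rest of the discussion):

```java
public interface RetryLimiter {
    /** Returns whether a retry may be attempted; intended to stay stateless. */
    boolean shouldRetry(ClientRequestContext ctx, int numAttemptsSoFar);

    /** Proposed: invoked right before a (retry) attempt is actually sent. */
    default void onStartAttempt(ClientRequestContext ctx, int numAttemptsSoFar) {}

    /** Invoked when the attempt completes; guaranteed to match a prior onStartAttempt. */
    default void onCompletedAttempt(ClientRequestContext ctx, RequestLog requestLog, int numAttemptsSoFar) {}
}
```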
```java
 *
 * <p>The implementation is heavily based on GRPC Java <a href="https://github.com/grpc/grpc-java/blob/94532a6b56076c56fb9278e9195bba1190a9260d/core/src/main/java/io/grpc/internal/RetriableStream.java#L1463">implementation</a>.
 */
public class GrpcRetryLimiter implements RetryLimiter {
```
nit; GrpcRetryLimiter and its ctors can be package-private
```java
 */
final int tokenRatio;

final Set<String> retryableStatuses;
```
Note) Understood the reason behind having a separate set of conditions here is that RetryDecision can't convey whether a retry should count as being limited or not.
e.g. in pseudocode:

```
if cause instanceOf UnprocessedRequestException
    -> RetryDecision(backoff), should transparently retry without updating `RetryLimiter`
if grpc header received
    -> RetryDecision#NoRetry, should update `RetryLimiter#onSuccess`
if trailers-only
    -> RetryDecision(backoff), should retry and update `RetryLimiter#onFailure`
```

An alternative design could be to add a throttlingDecision flag to RetryDecision, e.g.

```java
class RetryDecision {
    enum ThrottleDecision {
        THROTTLE,
        NO_THROTTLE,
        IGNORE,
    }

    private final ThrottleDecision throttleDecision;
```

By doing so, the boundary between each interface could be clearer:

- RetryRule: responsible for retry rules (whether to retry/throttle depending on a set of rules)
- RetryLimiter: responsible for providing the throttling algorithm (time-based, bucket-based, etc.)

This seems like a design choice, and I'm fine with the current design as well - but wanted to at least mention options before moving forward.
We cannot decide on throttling at the same time as we decide whether we are going to retry, as we could be cancelling the retry later and we don't want to "lock/waste" resources.
Ideally the throttling check should be done at the last possible point in time before sending the retry.
The idea wasn't to implement throttling based on RetryRule - RetryLimiter#canRetry can still determine whether a retry should be throttled or not.
The idea was rather to add a property to RetryDecision which allows users to specify whether a RetryDecision should affect throttling:

```java
final Set<String> retryableStatusCodes = ImmutableSet.of("14");
RetryRule rule = new RetryRule() {
    @Override
    public CompletionStage<RetryDecision> shouldRetry(ClientRequestContext ctx,
                                                      @Nullable Throwable cause) {
        final RequestLog headers = ctx.log().getIfAvailable(RequestLogProperty.RESPONSE_HEADERS);
        if (cause instanceof UnprocessedRequestException || headers == null) {
            return RetryDecision.of(backoff, NO_THROTTLE);
        }
        final String status = headers.responseHeaders().get("grpc-status");
        if (status == null) {
            return RetryDecision.noRetry(THROTTLE_SUCCESS);
        } else if (!retryableStatusCodes.contains(status)) {
            return RetryDecision.of(backoff, NO_THROTTLE);
        } else {
            return RetryDecision.of(backoff, THROTTLE_FAILURE);
        }
    }
};
```

and users can still decide on how to limit retries based on the throttling decision:

```java
interface RetryLimiter {
    boolean shouldRetry(ClientRequestContext ctx, int numAttemptsSoFar);
    default void onCompletedAttempt(RetryDecision decision, ClientRequestContext ctx, RequestLog requestLog, int numAttemptsSoFar);
}
```

Updated API idea:

```java
RetryConfig.builder(RetryRule.builder().onUnprocessed().thenBackoff(backoff, ThrottleDecision.NONE)
                             .orElse(RetryRule.builder()
                                              .onResponseHeaders(noGrpcStatus)
                                              .thenNoRetry(ThrottleDecision.SUCCESS))
                             .orElse(RetryRule.builder()
                                              .onGrpcTrailers(isRetriableGrpcStatus)
                                              .thenBackoff(backoff, ThrottleDecision.FAILURE)))
           .build();
```
Ohh, OK, I understand your idea, but it seems more complex and requires more coupling.
AFAIK both subsystems can usually be configured independently. General solutions tend to provide only a single algorithm for retry and retry budget/throttling, so there is more room for coupling there, but Armeria seems open to custom implementations, so I think decoupling is better. That said, I understand the benefits of doing something like that for the gRPC rule/limiter, as the official implementation is more complex than other limiters; maybe we can fix that (checking the status twice, setting retryable statuses in both cases) with some context magic?
> maybe we can fix that (checking the status twice, setting retriable statuses in both cases) by using some context magic?

I don't have an idea for how to do this with the current APIs, but if users can specify retryableStatusCodes once and have it applied to both the retry rule and limiter, I have no issue with the current API proposal.
e.g. users don't have to specify statuses in both RetryRule and RetryLimiter:

```java
Set<Integer> retriableStatuses = ImmutableSet.of();
RetryRule customRetryRule = RetryRule.onUnprocessed()
        .orElse(RetryRule.builder()
                         .onGrpcTrailers((ctx, headers) -> retriableStatuses.contains(headers.getInt("grpc-status")))
                         .thenBackoff());
RetryConfig.builder(customRetryRule)
           .limiter(RetryLimiter.ofGrpc(1, 1, 1, retriableStatuses)).build();
```

In general (not just gRPC), I think it's possible that users will want to apply throttling only for a subset of responses (e.g. 429 too many requests, 503 gateway error, etc.), so if there is a general solution for this, that would be great as well, so that RetryRateLimiter can also benefit from it.
Also, I'm not sure if we'll want the flexibility for users to also throttle based on response content (e.g. imagining cases where the server returns the exception in the response: {"error": "ConflictException"}, {"error": "ThrottledException"}), which I think is slightly less common. I'm curious if the current proposed API can support this use-case (or if it should be supported at all).
The other maintainers suggested that @ikhoon review this PR first. Since he'll be back next week, let me hold further reviews until he comes back
I slightly lean toward making RetryLimiter work independently to achieve decoupling.

> RetryRule customRetryRule = RetryRule.onUnprocessed()
>         .orElse(RetryRule.builder()
>                          .onGrpcTrailers((ctx, headers) -> retriableStatuses.contains(headers.getInt("grpc-status")))
>                          .thenBackoff());
> RetryConfig.builder(customRetryRule)
>            .limiter(RetryLimiter.ofGrpc(1, 1, 1, retriableStatuses)).build();
>
> In general (not just gRPC), I think it's possible that users will want to apply throttling for only a subset of responses (e.g. 429 too many requests, 503 gateway error, etc..) so I think if there is a general solution for this, it would be great as well so RetryRateLimiter can also benefit from this.

In this case, the default implementation, which decreases the quota for each attempt, may cover it.

```java
// The default `RetryLimiter` respects the retry decision made in customRetryRule
RetryConfig.builder(customRetryRule)
           .limiter(RetryLimiter.ofDefault(maxTokens, ...)).build();
```
I see, if you don't think this is inconvenient for users, feel free to proceed with the current implementation. Let me revisit this PR later once my review comments are addressed
Motivation:
As we discussed over Discord, there is currently no support for proper retry limiting. This PR adds it, with a couple of working retry limiter implementations: one based on the gRPC logic and another that uses a RateLimiter.
Modifications:
Result: