GH-6372: Expose client request metrics. #6502
base: main
Conversation
Force-pushed from 9062ebf to 54dc0e3
/**
 * Collects simple client-side metrics such as:
 * <ul>
 * <li>The number of pending requests per {@link SessionProtocol}</li>
Question) If the objective is to determine the optimal number of event loops depending on pending requests for each endpoint group/protocol, would it be enough to check the duration instead?
final EndpointGroup endpointGroup = ctx.endpointGroup();
ctx.log().whenComplete().thenAccept(log -> {
    final long pendingDuration = log.connectionTimings().pendingAcquisitionDurationNanos();
    final SessionProtocol sessionProtocol = log.sessionProtocol();
});
@jrhee17 thanks for your comments!
Your comment also makes sense to me!
However, IMHO, it may or may not be enough. 😄
So I would like to vote for counting the number of pending requests.
pendingAcquisitionDurationNanos seems to be set only after the request acquires a channel, so it can be -1 or a very large value.
If channel acquisition is delayed due to a sudden surge in requests, pendingAcquisitionDurationNanos will continuously remain -1. Or, if the channel is acquired only after a very long time, the value will jump from -1 to a very large value. Therefore, the responsiveness of the operation that increases the number of event loops is likely to be degraded.
Additionally, when the channel is busy, we need to consider two states:
- The value is -1.
- The value is an enormously large number.
Consequently, the ClientMetrics object would need to contain code to account for this, which potentially implies that ClientMetrics is tightly coupled with the logic of ConnectionTimings.
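For illustration, here is a rough sketch of the kind of branching that would be needed when relying on ConnectionTimings; the helper name and threshold are hypothetical and not part of this PR:
import com.linecorp.armeria.client.ClientConnectionTimings;

// Hypothetical helper, not from this PR: classifying acquisition latency from
// pendingAcquisitionDurationNanos alone requires special-casing -1 and huge values.
static boolean channelAcquisitionLooksSlow(ClientConnectionTimings timings, long thresholdNanos) {
    if (timings == null) {
        return false; // No connection attempt has been recorded yet.
    }
    final long pendingNanos = timings.pendingAcquisitionDurationNanos();
    if (pendingNanos < 0) {
        // -1 while the pool is busy: the wait has not finished, so the signal is ambiguous.
        return false;
    }
    // A finished but very long wait also indicates congestion.
    return pendingNanos >= thresholdNanos;
}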
For these reasons, I vote for counting the number of pending requests.
What do you think?
Also, @ikhoon, please give your opinion when you have time!
I think using pending duration is also a good idea. However, adding ClientMetrics would be worthwhile because:
- ClientMetrics itself seems to provide useful information at the ClientFactory level.
- Using ConnectionTimings inside the maxNumEventLoopsFunction method doesn't look straightforward, as it may require additional processing.
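For context, a minimal sketch of how maxNumEventLoopsFunction is typically wired; the pendingRequests map here is an assumed stand-in, not part of this PR:
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import com.linecorp.armeria.client.ClientFactory;
import com.linecorp.armeria.client.Endpoint;

// Assumed stand-in counter maintained elsewhere; not part of this PR.
final Map<Endpoint, Integer> pendingRequests = new ConcurrentHashMap<>();

final ClientFactory factory =
        ClientFactory.builder()
                     // Give endpoints with many pending requests more event loops;
                     // the cap of 8 is an arbitrary example value.
                     .maxNumEventLoopsFunction(endpoint ->
                             Math.min(8, 1 + pendingRequests.getOrDefault(endpoint, 0)))
                     .build();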
I see - I don't think I can meaningfully review this PR at this point since I'm not sure from the issue/PR description how ClientMetrics is expected to be used.
Do you envision ClientMetrics as a general metric collector that collects metrics on the overall Client (or ClientFactory)? What would be the relationship between this metric collector and other metrics that are already being exported?
Otherwise, it might help if the class name were more specific.
Most APIs that are exposed at the request level will primarily be used for logging or collected by metric collectors such as Prometheus. Since they contain per-request details, it does not seem easy to use them directly to control the server’s runtime behavior.
ServerMetrics was first exposed as a Java API because it was difficult to obtain information about in-flight requests at runtime when implementing a custom graceful shutdown. Similarly, ClientMetrics will expose the information needed to control runtime behavior at the client or client-factory level. I expect these values to be exposed mostly as simple counters rather than histograms.
> Otherwise, it might help if the class name were more specific.
I’m open to changing it if you think there’s a better name.
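To make the intent concrete, a purely illustrative sketch follows; the interface and method names are assumptions for discussion, not the PR's actual API:
import com.linecorp.armeria.common.SessionProtocol;

// 'ClientMetricsView' and its methods are assumed names for illustration only.
interface ClientMetricsView {
    long pendingRequests(SessionProtocol protocol);
    long activeRequests(SessionProtocol protocol);
}

// Simple counters make runtime decisions straightforward, e.g. deciding whether
// to grow the number of event loops when too many requests are waiting.
static boolean shouldAddEventLoops(ClientMetricsView metrics, long threshold) {
    return metrics.pendingRequests(SessionProtocol.H2) > threshold;
}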
// EndpointGroup does not override equals() and hashCode().
// Call sites must use the same 'EndpointGroup' instance when invoking
// 'incrementActiveRequest(...)' and 'decrementActiveRequest(...)'.
private final ConcurrentMap<EndpointGroup, LongAdder> activeRequestsPerEndpointGroup;
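An illustrative sketch (not the PR's actual code) of why call sites must reuse the same EndpointGroup instance: because EndpointGroup does not override equals()/hashCode(), the map falls back to reference equality, and a different instance would silently create a second entry.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

import com.linecorp.armeria.client.endpoint.EndpointGroup;

final class ActiveRequestCounterSketch {
    private final ConcurrentMap<EndpointGroup, LongAdder> activeRequestsPerEndpointGroup =
            new ConcurrentHashMap<>();

    void incrementActiveRequest(EndpointGroup endpointGroup) {
        activeRequestsPerEndpointGroup
                .computeIfAbsent(endpointGroup, unused -> new LongAdder())
                .increment();
    }

    // Must be called with the exact instance passed to incrementActiveRequest(...),
    // otherwise the lookup misses and the count is never decremented.
    void decrementActiveRequest(EndpointGroup endpointGroup) {
        final LongAdder adder = activeRequestsPerEndpointGroup.get(endpointGroup);
        if (adder != null) {
            adder.decrement();
        }
    }
}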
Question) Couldn't we use Endpoint instead of EndpointGroup as the key to collect metrics?
@ikhoon nim, thanks for your comments.
Addressed it!
Motivation:
Because of #6372.
Modifications:
- ClientMetrics.
- HttpClientFactory has ClientMetrics as its field.
- HttpChannelPool has ClientMetrics as its field.
- HttpChannelPool calls ClientMetrics whenever calling setPendingAcquisition(...) and removePendingAcquisition(...).
- HttpSessionHandler calls ClientMetrics to increment the active request count and reserves the decrement of the active request count.

Result: