Bound HTTP timeouts in KafkaAgentClient to keep KafkaRoller responsive#12675
Bound HTTP timeouts in KafkaAgentClient to keep KafkaRoller responsive#12675chon3806 wants to merge 1 commit into
Conversation
b366d16 to
decc9a4
Compare
scholzj
left a comment
There was a problem hiding this comment.
Thanks for the PR. I left some comments.
| ## 1.1.0 | ||
|
|
||
| * _Nothing here yet, but we will surely develop something new pretty soon_ 😉 | ||
| * Add HTTP request and connect timeouts to the Kafka Agent client so that a broker stuck on IO can no longer block the rolling update. |
There was a problem hiding this comment.
This is a bugfix, not something we would track in a changelog.
There was a problem hiding this comment.
Reverted; CHANGELOG.md restored to the original placeholder.
| // Bounds the connect and full HTTP request lifecycle so that a broker which accepts the TCP connection but | ||
| // never produces a response (e.g. alive but stuck on IO) cannot block the KafkaRoller's single-threaded | ||
| // executor indefinitely. See https://github.com/strimzi/strimzi-kafka-operator/issues/12513. | ||
| /* test */ static final Duration HTTP_REQUEST_TIMEOUT = Duration.ofSeconds(30); |
There was a problem hiding this comment.
What is the logic for 30 seconds? Seems pretty long for what the KafkaAgent does. I also do not think we need to link the issue it is fixing (entiher here nor in tests).
There was a problem hiding this comment.
Reduced to 10 seconds — the Kafka Agent only serves a small broker-state JSON, so 10s is well above the expected response time on a healthy broker yet small enough to keep the roller responsive. Also dropped the issue URL from the comment here and from the test Javadocs; kept the rationale prose.
…cking KafkaAgentClient.getBrokerState() relies on a java.net.http.HttpClient that had neither a connect timeout nor a request timeout configured. When a broker is alive but stuck on IO (for example because the underlying storage has zero IOPS), TCP connection establishment succeeds but the Kafka Agent never produces an HTTP response. The call therefore blocks the KafkaRoller's single-threaded executor indefinitely, preventing any other broker from being processed and inflating reconciliation time. Add a bounded timeout (10s) on both the HttpClient connectTimeout and the HttpRequest timeout. The Kafka Agent only serves a small broker-state JSON, so 10 seconds is well above the expected response time on a healthy broker yet small enough to keep the roller responsive. On timeout the resulting HttpTimeoutException is wrapped as RuntimeException by doGet() and is already handled gracefully by getBrokerState(), which returns BrokerState(-1, null) and lets the roller move on to other brokers. Fixes strimzi#12513 Signed-off-by: chon3806 <93464148+chon3806@users.noreply.github.com>
decc9a4 to
cfbc8d4
Compare
| return httpClientBuilder() | ||
| .sslContext(sslContext) | ||
| .build(); |
There was a problem hiding this comment.
Sorry, I did not realize it before. But why not add the SSL context as a parameter and have it return the HTTP client instead of the half-finished builder?
There was a problem hiding this comment.
I understand these new methods make it easier to test but are they really necessary? It seems to me it is quite straight forward to set the timeout without additional methods that are called only once. Otherwise I agree with Jakub that it makes more sense that the method returns complete HTTP client, rather than finish building here.
There was a problem hiding this comment.
Taking into account that this method was written for adding tests which seem to be not that useful, I agree that we should just remove the method and configure timeout and SSLContext here.
There was a problem hiding this comment.
@chon3806 Are you happy with the suggestions in above comments? So you could set the timeout directly here instead of a separate method, for example:
return HttpClient.newBuilder()
.connectTimeout(HTTP_REQUEST_TIMEOUT)
.sslContext(sslContext)
.build();
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #12675 +/- ##
=========================================
Coverage 75.07% 75.07%
Complexity 6514 6514
=========================================
Files 377 377
Lines 25092 25090 -2
Branches 3269 3269
=========================================
- Hits 18838 18837 -1
Misses 4913 4913
+ Partials 1341 1340 -1
🚀 New features to boost your workflow:
|
tinaselenge
left a comment
There was a problem hiding this comment.
Thanks for PR. I left a couple of comments, I wonder if the fix could be as simple as just setting the timeout.
| return httpClientBuilder() | ||
| .sslContext(sslContext) | ||
| .build(); |
There was a problem hiding this comment.
I understand these new methods make it easier to test but are they really necessary? It seems to me it is quite straight forward to set the timeout without additional methods that are called only once. Otherwise I agree with Jakub that it makes more sense that the method returns complete HTTP client, rather than finish building here.
| * requests built via the package-private helper carry the configured timeout. | ||
| */ | ||
| @Test | ||
| public void testBuildRequestAppliesHttpRequestTimeout() { |
There was a problem hiding this comment.
Similar to my other comment, I'm not sure if these tests are really that helpful. We set the timeout on the client and request, then testing if these timeouts are set here. If we want to make sure, we are not indefinitely waiting for a response, maybe simulating a slow response that exceeds the timeout and check for timeout error to really test the behaviour. But even that might be overkill?
There was a problem hiding this comment.
I think testing it like this might be tricky and questionable. What is it we want to test?
- That the JDK uses the timeouts we configured and does not just ignore them?
- Or that they help with the problem in Kafka / KafkaAgent?
For the first one I would argue it is probably not our concern. For the second one, it might be hard to actually make sure the test hangs in the right way how it would in production. So I wonder how reliably can we replicate the same situation.
There was a problem hiding this comment.
I'm not sure these new tests add much value and we set timeouts in many places for requests and we don't test them like that. So I was trying to suggest alternative that is more for your second point but wasn't anyway sure if that is the right thing to do either.
There was a problem hiding this comment.
Yeah, that is a fair point that the current test is probably not much useful. I'm just not sure how easy it is to make it useful.
There was a problem hiding this comment.
I agree that this test is not really useful, I left a similar comment for the other one.
| */ | ||
| @Test | ||
| public void testHttpClientBuilderAppliesConnectTimeout() { | ||
| HttpClient client = KafkaAgentClient.httpClientBuilder().build(); |
There was a problem hiding this comment.
In the end the httpClientBuilder() is just setting the timeout on the JDK HttpClient class so this test is only checking that the timeout is set properly but ... it's not our goal testing that a JDK class works imho, so I can't see this test really useful.
| * requests built via the package-private helper carry the configured timeout. | ||
| */ | ||
| @Test | ||
| public void testBuildRequestAppliesHttpRequestTimeout() { |
There was a problem hiding this comment.
I agree that this test is not really useful, I left a similar comment for the other one.
| return httpClientBuilder() | ||
| .sslContext(sslContext) | ||
| .build(); |
There was a problem hiding this comment.
Taking into account that this method was written for adding tests which seem to be not that useful, I agree that we should just remove the method and configure timeout and SSLContext here.
Type of change
Description
Fixes #12513.
KafkaAgentClient.getBrokerState()is invoked from the KafkaRoller's single-threaded executor whenever a broker fails the readinessawait. The underlyingjava.net.http.HttpClientandHttpRequestbuilders had no timeouts configured, so if a broker was alive but stuck on IO (for example because the underlying block storage had degraded to zero IOPS), the kernel TCP stack would still accept the connection but the Kafka Agent handler thread — also blocked on disk IO — would never produce a response. The HTTP call could therefore block the roller's single thread for the entire duration of the storage outage, preventing any other broker from being processed and significantly inflating reconciliation time.This change adds a bounded 10 second timeout on both
HttpClient.connectTimeout(...)andHttpRequest.timeout(...). When the timeout fires, the resultingHttpTimeoutException(subclass ofIOException) is caught by the existingIOExceptionhandler indoGet()and wrapped asRuntimeException.getBrokerState()already handles that case gracefully by returningBrokerState(-1, null), so the roller can move on to other brokers and retry on the next reconciliation. No behaviour change for healthy brokers.The Kafka Agent only serves a tiny broker-state JSON, so 10 seconds is well above the expected response latency on a healthy broker yet small enough to keep the single-threaded roller responsive (worst-case extra delay per stuck broker is
operationTimeoutMs+ 10 s instead ofoperationTimeoutMs+ unbounded). The constant is exposed package-private (/* test */) following the established convention in this module so a regression test can assert on it. The two builder helpers that apply it —httpClientBuilder()andbuildRequest(URI)— are likewise package-private so the wiring can be verified without a TLS identity or a live HTTP endpoint.Tests
KafkaAgentClientTest.testBuildRequestAppliesHttpRequestTimeout— asserts thatHttpRequests built through the package-privatebuildRequest(URI)helper carryHTTP_REQUEST_TIMEOUTas the per-request timeout.KafkaAgentClientTest.testHttpClientBuilderAppliesConnectTimeout— asserts thatHttpClients built through the package-privatehttpClientBuilder()helper carryHTTP_REQUEST_TIMEOUTas the connect timeout.The two builder helpers (
buildRequest,httpClientBuilder) are extracted as/* test */package-private to centralise the timeout policy in a single, named place and to make the wiring directly assertable without a TLS identity or a live HTTP endpoint.Verified locally with Temurin 21 + Maven 3.9.9:
mvn -pl cluster-operator -am -DskipTests install— BUILD SUCCESS, 0 Checkstyle violations across all 10 reactor modules.mvn -pl cluster-operator -Dtest=KafkaAgentClientTest test— 6/6 pass.mvn -pl cluster-operator -Dtest=KafkaQuorumCheckTest test— 14/14 pass (sibling class on the same critical path).Checklist
CHANGELOG.mdentry per maintainer review (this is a bugfix); no user-facing docs affected.Notes
The fix is the one suggested by the issue reporter (@dariocazas) and acknowledged by maintainers (@scholzj, @ppatierno, @katheris) as a sensible enhancement. No backport to 0.47.x is requested as part of this PR (per maintainer feedback on the issue).