Bound HTTP timeouts in KafkaAgentClient to keep KafkaRoller responsive#12675

Open
chon3806 wants to merge 1 commit into
strimzi:mainfrom
chon3806:fix-12513-kafka-agent-client-http-timeout

Conversation

@chon3806 chon3806 commented Apr 26, 2026

Type of change

  • Bugfix

Description

Fixes #12513.

KafkaAgentClient.getBrokerState() is invoked from the KafkaRoller's single-threaded executor whenever a broker fails the readiness await. The underlying java.net.http.HttpClient and HttpRequest builders had no timeouts configured, so if a broker was alive but stuck on IO (for example because the underlying block storage had degraded to zero IOPS), the kernel TCP stack would still accept the connection but the Kafka Agent handler thread — also blocked on disk IO — would never produce a response. The HTTP call could therefore block the roller's single thread for the entire duration of the storage outage, preventing any other broker from being processed and significantly inflating reconciliation time.

This change adds a bounded 10 second timeout on both HttpClient.connectTimeout(...) and HttpRequest.timeout(...). When the timeout fires, the resulting HttpTimeoutException (subclass of IOException) is caught by the existing IOException handler in doGet() and wrapped as RuntimeException. getBrokerState() already handles that case gracefully by returning BrokerState(-1, null), so the roller can move on to other brokers and retry on the next reconciliation. No behaviour change for healthy brokers.
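The flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual Strimzi code: the class name `AgentClientSketch`, the simplified `BrokerState` type, and the method bodies are hypothetical. Only `HttpClient.Builder.connectTimeout`, `HttpRequest.Builder.timeout`, and the fact that `HttpTimeoutException` extends `IOException` come from the JDK.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical stand-in for KafkaAgentClient; names and structure are illustrative only.
public class AgentClientSketch {
    static final Duration HTTP_REQUEST_TIMEOUT = Duration.ofSeconds(10);

    // Simplified, hypothetical stand-in for the real BrokerState type.
    static final class BrokerState {
        final int code;
        final String body;

        BrokerState(int code, String body) {
            this.code = code;
            this.body = body;
        }
    }

    // Bounds the time spent establishing the TCP connection.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(HTTP_REQUEST_TIMEOUT)
                .build();
    }

    // Bounds the time spent waiting for the response to this request.
    static HttpRequest newRequest(URI uri) {
        return HttpRequest.newBuilder(uri)
                .timeout(HTTP_REQUEST_TIMEOUT)
                .GET()
                .build();
    }

    static String doGet(HttpClient client, URI uri) {
        try {
            return client.send(newRequest(uri), HttpResponse.BodyHandlers.ofString()).body();
        } catch (IOException e) {
            // HttpTimeoutException extends IOException, so timeouts land here too.
            throw new RuntimeException("Failed to reach " + uri, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    static BrokerState getBrokerState(HttpClient client, URI uri) {
        try {
            // Real JSON parsing is omitted; 0 is a placeholder code for this sketch.
            return new BrokerState(0, doGet(client, uri));
        } catch (RuntimeException e) {
            // Unknown state: the caller can move on and retry later.
            return new BrokerState(-1, null);
        }
    }
}
```

With this shape, a timeout and any other I/O failure take the same path: both surface as a `BrokerState` with code `-1` rather than blocking the calling thread.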

The Kafka Agent only serves a tiny broker-state JSON, so 10 seconds is well above the expected response latency on a healthy broker yet small enough to keep the single-threaded roller responsive (worst-case extra delay per stuck broker is operationTimeoutMs + 10 s instead of operationTimeoutMs + unbounded). The constant is exposed package-private (/* test */) following the established convention in this module so a regression test can assert on it. The two builder helpers that apply it — httpClientBuilder() and buildRequest(URI) — are likewise package-private so the wiring can be verified without a TLS identity or a live HTTP endpoint.

Tests

  • KafkaAgentClientTest.testBuildRequestAppliesHttpRequestTimeout — asserts that HttpRequests built through the package-private buildRequest(URI) helper carry HTTP_REQUEST_TIMEOUT as the per-request timeout.
  • KafkaAgentClientTest.testHttpClientBuilderAppliesConnectTimeout — asserts that HttpClients built through the package-private httpClientBuilder() helper carry HTTP_REQUEST_TIMEOUT as the connect timeout.

The two builder helpers (buildRequest, httpClientBuilder) are extracted as /* test */ package-private to centralise the timeout policy in a single, named place and to make the wiring directly assertable without a TLS identity or a live HTTP endpoint.

Verified locally with Temurin 21 + Maven 3.9.9:

  • mvn -pl cluster-operator -am -DskipTests install — BUILD SUCCESS, 0 Checkstyle violations across all 10 reactor modules.
  • mvn -pl cluster-operator -Dtest=KafkaAgentClientTest test — 6/6 pass.
  • mvn -pl cluster-operator -Dtest=KafkaQuorumCheckTest test — 14/14 pass (sibling class on the same critical path).

Checklist

  • Write tests
  • Make sure all tests pass — verified locally (see above) for the impacted test classes; full project CI handles the rest.
  • Update documentation — no CHANGELOG.md entry per maintainer review (this is a bugfix); no user-facing docs affected.
  • Check RBAC rights for Kubernetes / OpenShift roles — n/a, no RBAC change.
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally — n/a, change is internal to the Cluster Operator's HTTP client and exercised via unit tests.
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md — intentionally not updated (bugfix, per maintainer review).
  • Supply screenshots for visual changes — n/a.

Notes

The fix is the one suggested by the issue reporter (@dariocazas) and acknowledged by maintainers (@scholzj, @ppatierno, @katheris) as a sensible enhancement. No backport to 0.47.x is requested as part of this PR (per maintainer feedback on the issue).

@chon3806 chon3806 force-pushed the fix-12513-kafka-agent-client-http-timeout branch 2 times, most recently from b366d16 to decc9a4 Compare April 26, 2026 03:20
Member

@scholzj scholzj left a comment

Thanks for the PR. I left some comments.

Comment thread CHANGELOG.md Outdated
## 1.1.0

* _Nothing here yet, but we will surely develop something new pretty soon_ 😉
* Add HTTP request and connect timeouts to the Kafka Agent client so that a broker stuck on IO can no longer block the rolling update.
Member

This is a bugfix, not something we would track in a changelog.

Author

Reverted; CHANGELOG.md restored to the original placeholder.

// Bounds the connect and full HTTP request lifecycle so that a broker which accepts the TCP connection but
// never produces a response (e.g. alive but stuck on IO) cannot block the KafkaRoller's single-threaded
// executor indefinitely. See https://github.com/strimzi/strimzi-kafka-operator/issues/12513.
/* test */ static final Duration HTTP_REQUEST_TIMEOUT = Duration.ofSeconds(30);
Member

What is the logic for 30 seconds? Seems pretty long for what the KafkaAgent does. I also do not think we need to link the issue it is fixing (either here or in the tests).

Author

Reduced to 10 seconds — the Kafka Agent only serves a small broker-state JSON, so 10s is well above the expected response time on a healthy broker yet small enough to keep the roller responsive. Also dropped the issue URL from the comment here and from the test Javadocs; kept the rationale prose.

@scholzj scholzj requested review from katheris and tinaselenge April 26, 2026 20:56
@scholzj scholzj added this to the 1.1.0 milestone Apr 26, 2026
…cking

KafkaAgentClient.getBrokerState() relies on a java.net.http.HttpClient that had neither a connect timeout nor a request timeout configured. When a broker is alive but stuck on IO (for example because the underlying storage has zero IOPS), TCP connection establishment succeeds but the Kafka Agent never produces an HTTP response. The call therefore blocks the KafkaRoller's single-threaded executor indefinitely, preventing any other broker from being processed and inflating reconciliation time.

Add a bounded timeout (10s) on both the HttpClient connectTimeout and the HttpRequest timeout. The Kafka Agent only serves a small broker-state JSON, so 10 seconds is well above the expected response time on a healthy broker yet small enough to keep the roller responsive. On timeout the resulting HttpTimeoutException is wrapped as RuntimeException by doGet() and is already handled gracefully by getBrokerState(), which returns BrokerState(-1, null) and lets the roller move on to other brokers.

Fixes strimzi#12513

Signed-off-by: chon3806 <93464148+chon3806@users.noreply.github.com>
@chon3806 chon3806 force-pushed the fix-12513-kafka-agent-client-http-timeout branch from decc9a4 to cfbc8d4 Compare April 27, 2026 01:19
Comment on lines +101 to 103
return httpClientBuilder()
        .sslContext(sslContext)
        .build();
Member

Sorry, I did not realize it before. But why not add the SSL context as a parameter and have it return the HTTP client instead of the half-finished builder?
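A minimal sketch of the shape suggested here, for illustration only: the class name `HttpClientFactorySketch` and the method name `createHttpClient` are hypothetical, and the constant simply mirrors the 10 second value discussed in this PR.

```java
import java.net.http.HttpClient;
import java.time.Duration;
import javax.net.ssl.SSLContext;

// Hypothetical holder class; in the PR the logic would live in KafkaAgentClient.
public class HttpClientFactorySketch {
    static final Duration HTTP_REQUEST_TIMEOUT = Duration.ofSeconds(10);

    // Takes the SSL context as a parameter and returns a fully built client,
    // so callers never see a half-finished builder.
    static HttpClient createHttpClient(SSLContext sslContext) {
        return HttpClient.newBuilder()
                .connectTimeout(HTTP_REQUEST_TIMEOUT)
                .sslContext(sslContext)
                .build();
    }
}
```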

Contributor

@tinaselenge tinaselenge Apr 28, 2026

I understand these new methods make it easier to test, but are they really necessary? It seems quite straightforward to set the timeout without additional methods that are called only once. Otherwise, I agree with Jakub that it makes more sense for the method to return a complete HTTP client rather than finishing the build here.

Member

Given that this method was written to support tests that do not seem that useful, I agree that we should just remove the method and configure the timeout and SSLContext here.

Contributor

@tinaselenge tinaselenge May 28, 2026

@chon3806 Are you happy with the suggestions in the comments above? You could set the timeout directly here instead of in a separate method, for example:

return HttpClient.newBuilder()
        .connectTimeout(HTTP_REQUEST_TIMEOUT)
        .sslContext(sslContext)
        .build();

codecov Bot commented Apr 27, 2026

Codecov Report

❌ Patch coverage is 77.77778% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.07%. Comparing base (a2534e7) to head (cfbc8d4).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...or/cluster/operator/resource/KafkaAgentClient.java | 77.77% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff            @@
##               main   #12675   +/-   ##
=========================================
  Coverage     75.07%   75.07%           
  Complexity     6514     6514           
=========================================
  Files           377      377           
  Lines         25092    25090    -2     
  Branches       3269     3269           
=========================================
- Hits          18838    18837    -1     
  Misses         4913     4913           
+ Partials       1341     1340    -1     

| Files with missing lines | Coverage Δ |
|---|---|
| ...or/cluster/operator/resource/KafkaAgentClient.java | 41.66% <77.77%> (+9.52%) ⬆️ |

... and 5 files with indirect coverage changes

@ppatierno ppatierno self-requested a review April 28, 2026 07:50
Contributor

@tinaselenge tinaselenge left a comment

Thanks for the PR. I left a couple of comments. I wonder if the fix could be as simple as just setting the timeout.

* requests built via the package-private helper carry the configured timeout.
*/
@Test
public void testBuildRequestAppliesHttpRequestTimeout() {
Contributor

Similar to my other comment, I'm not sure these tests are really that helpful. We set the timeout on the client and the request, and then test that those timeouts are set. If we want to make sure we are not waiting indefinitely for a response, we could simulate a slow response that exceeds the timeout and check for a timeout error, which would test the actual behaviour. But even that might be overkill?
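For illustration, a rough sketch of what such a behavioural check might look like, under stated assumptions: the class and method names are hypothetical, and a `ServerSocket` that never calls `accept()` stands in for a stuck Kafka Agent (on typical platforms the kernel completes the TCP handshake from the listen backlog, so the client connects but never receives a response). Whether this reproduces the production hang faithfully is a separate question.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

// Hypothetical sketch; not part of the PR.
public class StalledAgentSketch {

    // Returns true only if the request fails with an HttpTimeoutException.
    static boolean timesOutAgainstSilentServer(Duration timeout) {
        // The socket listens but never accepts or answers.
        try (ServerSocket silent = new ServerSocket(0)) {
            URI uri = URI.create("http://127.0.0.1:" + silent.getLocalPort() + "/");
            HttpClient client = HttpClient.newBuilder()
                    .connectTimeout(timeout)
                    .build();
            HttpRequest request = HttpRequest.newBuilder(uri)
                    .timeout(timeout)
                    .GET()
                    .build();
            client.send(request, HttpResponse.BodyHandlers.ofString());
            return false; // a response arrived, which is unexpected here
        } catch (HttpTimeoutException e) {
            return true; // the bounded timeout fired instead of blocking forever
        } catch (IOException e) {
            return false; // some other I/O failure
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A test could call `timesOutAgainstSilentServer(Duration.ofMillis(500))` and assert that it returns `true`; the short timeout keeps the test fast.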

Member

I think testing it like this might be tricky and questionable. What is it we want to test?

  • That the JDK uses the timeouts we configured and does not just ignore them?
  • Or that they help with the problem in Kafka / KafkaAgent?

For the first one, I would argue it is probably not our concern. For the second one, it might be hard to make the test hang the same way it would in production, so I wonder how reliably we can replicate the same situation.

Contributor

I'm not sure these new tests add much value; we set timeouts on requests in many places and do not test them like that. I was trying to suggest an alternative that is closer to your second point, but I wasn't sure whether that is the right thing to do either.

Member

Yeah, that is a fair point that the current test is probably not very useful. I'm just not sure how easy it is to make it useful.

Member

I agree that this test is not really useful; I left a similar comment for the other one.

*/
@Test
public void testHttpClientBuilderAppliesConnectTimeout() {
HttpClient client = KafkaAgentClient.httpClientBuilder().build();
Member

In the end, httpClientBuilder() just sets the timeout on the JDK HttpClient class, so this test only checks that the timeout is set properly. Testing that a JDK class works is not our goal, imho, so I can't see this test being really useful.

Development

Successfully merging this pull request may close these issues.

KafkaAgentClient.getBrokerState() has no HTTP timeout, blocks KafkaRoller single-threaded executor indefinitely

4 participants