Conversation

@tank-500m

Summary

This PR fixes #1208 by making the client-side load balancing dial strategies (round_robin/random) reliable under high concurrency.
Strategy-driven dialing is now governed by the idle connection cap (MaxIdleConns) and uses an atomic reservation to prevent concurrent acquires from over-dialing beyond the cap.
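The atomic reservation described above could look roughly like the following minimal sketch (the type and field names here are illustrative, not the PR's actual code):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// dialReservation caps strategy-driven dials at maxIdle. A slot is
// claimed atomically before dialing, so concurrent acquirers cannot
// collectively over-dial beyond the cap.
type dialReservation struct {
	maxIdle int64
	open    int64 // connections open or currently being dialed
}

// tryReserve atomically claims a dial slot; it returns false when the
// cap is already reached.
func (r *dialReservation) tryReserve() bool {
	for {
		cur := atomic.LoadInt64(&r.open)
		if cur >= r.maxIdle {
			return false
		}
		if atomic.CompareAndSwapInt64(&r.open, cur, cur+1) {
			return true
		}
	}
}

// release returns a slot, e.g. when the dial fails.
func (r *dialReservation) release() {
	atomic.AddInt64(&r.open, -1)
}

func main() {
	r := &dialReservation{maxIdle: 2}
	fmt.Println(r.tryReserve(), r.tryReserve(), r.tryReserve()) // true true false
}
```

The CAS loop (rather than a plain counter increment) is what makes the check-then-reserve step race-free without a mutex.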

Checklist

Delete items not relevant to your PR:

  • Unit and integration tests covering the common scenarios were added
  • A human-readable description of the changes was provided to include in CHANGELOG
  • For significant changes, documentation in https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials

@CLAassistant

CLAassistant commented Dec 24, 2025

CLA assistant check
All committers have signed the CLA.

return r, err
}

func (ch *clickhouse) shouldDialForStrategy(currentConns int) bool {
Author

using a mutex might be safer.

@kavirajk
Contributor

kavirajk commented Jan 7, 2026

Hi @tank-500m thanks for the PR.

I'm more interested in your exact use case for client-side load balancing here (instead of server-side LB).

Are you trying to ingest/query across two completely separate ClickHouse setups with different URLs? Or are these just replicas of the same ClickHouse setup? (In that case, why not server-side LB? Is it using TCP?)

@tank-500m
Author

tank-500m commented Jan 7, 2026

Hi! @kavirajk

We are shifting our ingestion strategy from using Distributed tables to writing directly into Local tables. This change requires us to handle load balancing to distribute the data properly.

Regarding the server-side LB, our infrastructure team has advised against using an intermediate Load Balancer for this setup, so we need to handle it on the client side. And yes, we currently use TCP.

Also: these are replicas of the same ClickHouse setup.

@kavirajk
Contributor

kavirajk commented Jan 7, 2026

@tank-500m thanks for the context.

My concern is that the changes look like they ignore connection pooling entirely.

Look at this gist that I used for testing.

You will notice that even though we acquire connections across just two addresses (round-robin), we dial a "new" connection every single time, even when there is room for idle connections in the pool.

@tank-500m
Author

tank-500m commented Jan 7, 2026

@kavirajk

You’re understanding it correctly.

I first implemented it following the approach suggested here:
#1208 (comment)

I’m open to changing this behavior if needed.

However, I’m not sure whether it’s correct to decide pooling eligibility based on whether each connection’s destination address is unique (i.e., treating connections as interchangeable only when they target the same resolved endpoint). It’s hard for me to judge what the “right” rule should be here.

I’d appreciate your thoughts. Thanks.


Are you concerned about something along the lines of what’s described here (#1135 (comment))?

@kavirajk
Contributor

kavirajk commented Jan 9, 2026

However, I’m not sure whether it’s correct to decide pooling eligibility based on whether each connection’s destination address is unique (i.e., treating connections as interchangeable only when they target the same resolved endpoint). It’s hard for me to judge what the “right” rule should be here.

I’d appreciate your thoughts. Thanks.

My concern with not respecting the connection pool is that it can lead to significant latency and performance issues. The main goal is to avoid dialing every single time the client wants to talk to a server instance.

But this may need some refactoring on the connection pool side. Ideally, this is how I envision it:

  1. If there are multiple addresses (say 2) to load-balance across, then the connection pool should hold connections to both addresses (respecting MaxIdleConns and MaxOpenConns, of course).
  2. For every query, we dial only if no connection to that address exists (or the connection has expired, etc.).
  3. If connections exist, we load-balance across the connections already in the pool (without needing to establish a brand-new connection).
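The three points above could be sketched roughly like this (a toy per-address pool with hypothetical names; the real client's acquire path and connection type are more involved):

```go
package main

import "fmt"

// conn is a stand-in for a real client connection.
type conn struct{ addr string }

// addrPool keeps a separate idle list per address and round-robins
// across addresses, dialing only when the chosen address has no idle
// connection available.
type addrPool struct {
	addrs []string
	next  int // round-robin cursor
	idle  map[string][]*conn
	dials int // counts how often we actually had to dial
}

func newAddrPool(addrs []string) *addrPool {
	return &addrPool{addrs: addrs, idle: make(map[string][]*conn)}
}

// acquire picks the next address round-robin, reuses an idle connection
// to that address when one exists, and dials only otherwise.
func (p *addrPool) acquire() *conn {
	addr := p.addrs[p.next%len(p.addrs)]
	p.next++
	if free := p.idle[addr]; len(free) > 0 {
		c := free[len(free)-1]
		p.idle[addr] = free[:len(free)-1]
		return c
	}
	p.dials++
	return &conn{addr: addr} // stand-in for a real dial
}

// put returns a connection to its address's idle list.
func (p *addrPool) put(c *conn) { p.idle[c.addr] = append(p.idle[c.addr], c) }

func main() {
	p := newAddrPool([]string{"a:9000", "b:9000"})
	c1, c2 := p.acquire(), p.acquire() // one dial per address
	p.put(c1)
	p.put(c2)
	p.acquire()
	p.acquire()     // both reused from the idle pool, no new dials
	fmt.Println(p.dials) // 2
}
```

The key difference from the PR's current behavior is that round-robin selection happens on acquire, so idle connections are reused per address instead of dialing on every query.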

Currently the actual round-robin only happens during DialStrategy (I guess that's why you added a dial every time). But we may also need to integrate this into the acquire method, including when it picks connections from the idlePool.

Another important thing here is failover. Say one of the addresses is failing; can we avoid failing every other request (when using round robin)?

I know you wanted to merge a small PR :) But I'm just trying to generalize the solution so that it works all the time without any performance impact.

Any thoughts?

@tank-500m
Author

@kavirajk
Thanks for the detailed reply :)
There’s a lot more to think about than I expected...

My understanding is that you’re essentially proposing following “scene 1” from this comment (#1208 (comment)) as the baseline, and then adding some kind of mitigation for addresses that fail, e.g. temporarily excluding a bad address (or backing off from it for N minutes) so we don’t end up failing every other request in round-robin.
Is that correct?

@kavirajk
Contributor

My understanding is that you’re essentially proposing following “scene 1” from this comment (#1208 (comment)) as the baseline, and then adding some kind of mitigation for addresses that fail, e.g. temporarily excluding a bad address (or backing off from it for N minutes) so we don’t end up failing every other request in round-robin.
Is that correct?

That's correct :) The connection pooling and robust failovers are crucial IMO, if we want to go in this direction. Although I'm not sure how complicated the refactoring and implementation would be. Happy to guide to the best of my knowledge if you want to pick that up.

Also, my other thought: given the complexity of this design, for "production" use cases it's better to set up a proper proxy (like nginx), even for TCP load balancing. It's not that hard; I did it on this PR for my testing.

our infrastructure team has advised against using an intermediate Load Balancer for this setup,

May I know the rationale behind not going with the proxy approach? I'm asking so I can add it to my list of use cases for nicer client-side load balancing in the Go client.

@tank-500m
Author

@kavirajk

Although I'm not sure how complicated the refactoring and implementation would be. Happy to guide to the best of my knowledge if you want to pick that up.

Thanks! If we end up moving forward with this, I think I’ll need quite a bit of help. I really appreciate it.

Also, my other thought: given the complexity of this design, for "production" use cases it's better to set up a proper proxy

One concern I have with load balancing via NGINX is that, with TCP (native protocol), it becomes connection-based. In that case, I’m not sure the traffic distribution will behave as we expect.
Using HTTP could address that, but I expect it would require a fair amount of customization/operational tuning, and since we haven’t run it in production, I’m not fully confident about the details yet.

May I know the rationale behind not going with the proxy approach? I'm asking so I can add it to my list of use cases for nicer client-side load balancing in the Go client.

I’ll check with our infra team and get back to you. Thanks again.

@tank-500m
Author

@kavirajk
Sorry I'm late.

Our infra team generally recommends using a Load Balancer (preferably DSR) for download traffic.
However, the INSERT workload is upload traffic. If we rely on an LB to do per-request load distribution for this kind of heavy ingest traffic, it can put significant pressure on the LB itself, so it’s not recommended on our side.

And if the LB only spreads connections (rather than distributing individual requests), it still seems hard to avoid hotspots unless the client maintains a sufficiently large number of concurrent connections.



Development

Successfully merging this pull request may close these issues.

ConnOpenStrategy should trigger on every query and between life connections, to implement load balance

4 participants