feat: client-side load balancing #1744
Conversation
```go
	return r, err
}

func (ch *clickhouse) shouldDialForStrategy(currentConns int) bool {
```
using a mutex might be safer.
Hi @tank-500m, thanks for the PR. I'm more interested in your exact use-case for client-side load balancing here (instead of server-side LB). Are you trying to ingest/query from two completely separate ClickHouse setups with different URLs? Or are these just replicas of the same ClickHouse setup? (In that case, why not server-side LB? Is it using TCP?)
Hi @kavirajk! We are shifting our ingestion strategy from using Distributed tables to writing directly into Local tables. This change requires us to handle load balancing to distribute the data properly. Regarding server-side LB, our infrastructure team has advised against using an intermediate load balancer for this setup, so we need to handle it on the client side. And yes, we are currently using TCP. Also, these are replicas of the same ClickHouse setup.
@tank-500m thanks for the context. My concern is that the changes look like they ignore connection pooling entirely. Look at the gist that I used for testing. You will notice that even though we
You're understanding it correctly. I first implemented it following the approach suggested here. I'm open to changing this behavior if needed. However, I'm not sure whether it's correct to decide pooling eligibility based on whether each connection's destination address is unique (i.e., treating connections as interchangeable only when they target the same resolved endpoint). It's hard for me to judge what the "right" rule should be here; I'd appreciate your thoughts. Are you concerned about something along the lines of what's described here (#1135 (comment))?
My concern with not respecting connection pool is this can lead to lots of latency and performance issues. The main goal is to avoid But this may need some refactoring on the connection pool side. Ideally this is how I envision
Currently the actual round-robin only happens during Another important thing here is failover: say one of the addresses is failing, can we avoid failing every other request (when using round-robin)? I know you wanted to keep the PR small :) but I'm just trying to generalize the solution so that it works all the time without any performance impact. Any thoughts?
@kavirajk My understanding is that you're essentially proposing to follow "scene 1" from this comment (#1208 (comment)) as the baseline, and then to add some kind of mitigation for addresses that fail, e.g., temporarily excluding a bad address (or backing off from it for N minutes) so we don't end up failing every other request in round-robin.
That's correct :) The connection pooling and robust failovers are crucial IMO if we want to go in this direction, although I'm not sure how complicated the refactoring and implementation would be. Happy to guide to the best of my knowledge if you want to pick that up. My other thought is that, given the complexity of this design, for "production" use cases it may be better to set up a proper proxy (like nginx) even for TCP load balancing. It's not that hard; I did it on this PR for my testing.
May I know the rationale behind not going with the proxy approach? Asking because I'd like to add it to my list of use-cases for having nicer client-side load balancing in the Go client.
Thanks! If we end up moving forward with this, I think I'll need quite a bit of help. I really appreciate it.
One concern I have with load balancing via NGINX is that, with TCP (native protocol), it becomes connection-based. In that case, I'm not sure the traffic distribution will behave as we expect.
I'll check with our infra team and get back to you. Thanks again.
@kavirajk Our infra team generally recommends using a load balancer (preferably DSR) for download traffic. And if the LB only spreads connections (rather than distributing individual requests), it still seems hard to avoid hotspots unless the client maintains a sufficiently large number of concurrent connections.
Summary
This PR fixes #1208 by making the client-side load balancing dial strategies (round_robin/random) reliable under high concurrency.
Strategy-driven dialing is now governed by the idle connection cap (MaxIdleConns) and uses an atomic reservation to prevent concurrent acquires from over-dialing beyond the cap.