Add least_connections load balancing algorithm. #9025
base: main
Conversation
@alowde: This issue is currently awaiting triage. If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Welcome @alowde!
Hi @alowde. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hey @alowde, I came across your PR because I too would like the least_connections load balancing algorithm for nginx. Thanks for making this!
I added some comments, but I'm no expert in lua load balancing algorithms. Hope it helps though. Would be great to get a least_connections algorithm added in.
Another worry I have with this implementation, though, is that we may end up choosing the same host over and over. I wonder if we could do least connections but by bucket, i.e. split the peers into N buckets (N=10?) and choose a random host within the bucket. This may be out of scope of this initial implementation, but I thought it was worth sharing. I wonder how other least_connection algorithms handle this (if at all).
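For illustration, here is a rough sketch of the bucket idea above, assuming peers can be ranked by their tracked connection counts. The get_conns helper and the bucket count are placeholders, not part of this PR:

-- Hypothetical bucketed selection: sort peers by tracked connection count,
-- keep the least-loaded 1/N of them, then pick one of those at random.
local function pick_from_least_loaded_bucket(peers, get_conns, n_buckets)
  local sorted = {}
  for _, peer in pairs(peers) do
    table.insert(sorted, peer)
  end
  table.sort(sorted, function(a, b) return get_conns(a) < get_conns(b) end)

  -- the first bucket holds the least-loaded ceil(#peers / n_buckets) peers
  local bucket_size = math.max(1, math.ceil(#sorted / n_buckets))
  return sorted[math.random(bucket_size)]
end

This keeps the "prefer lightly loaded peers" property while spreading requests across several candidates instead of hammering a single host.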
for _, peer in pairs(peers) do
  local conns = endpoints:get(get_upstream_name(peer))
  if conns == nil then
    endpoints:set(get_upstream_name(peer),0,600)
Why do we set an expire time on this? If we go 600 seconds without a peer's connection state changing at all, won't we then think the peer has 0 connections? I guess this helps with garbage collecting the dict, but I thought that was handled in the sync method. If so, should we make this value configurable?
My concern with depending on the sync method to handle garbage collection is trusting that the sync method will be reliably called with updates, and that we won't lose the list of peers between updates. With that said, the main reason I was inclined to distrust sync was the case of nginx restarts - but in that case the shared dict would be reset anyway.
I think we should set the expiration time to infinite.
I also agree the expiration time should be infinite. Do you have a reason to be suspicious of the sync method not being called reliably? I hope it is something we can depend on.
No specific reason to be suspicious, call it a generalised sense of professional paranoia :). Will fix the expiration time shortly.
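For reference, ngx.shared.DICT:set treats an exptime of 0 as "never expire", so the fix discussed in this thread amounts to something like the following (sketch only):

-- an exptime of 0 means the entry never expires (OpenResty shared dict semantics)
endpoints:set(get_upstream_name(peer), 0, 0)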
local feasible_endpoints = {}

if #peers ~= 1 then
  local lowestconns = 9999
Should we set this to max int or something higher? There very well could be peers with more than 9999 connections, which would mess up this calculation, right?
I'm not certain what max int is in this case due to the various ways Lua could be compiled, but it's definitely higher than 9999 :). As a safe balance I've changed it to maximum signed 32 bit int, which should still be more connections than we're likely to support in other places.
This thread is marked as resolved, but I still see this set to 9999. Once we have a pod w/ more than 10,000 connections, this will behave unexpectedly, right?
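For clarity, the change described earlier in this thread would look roughly like this (sketch only; 2147483647 is the maximum signed 32-bit integer):

-- sentinel larger than any realistic connection count (max signed 32-bit int)
local lowestconns = 2147483647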
end

-- get peers with lowest connections
for _, peer in pairs(peers) do
Looks like we have two for loops over pairs(peers). I think we can get away with just one by tracking lowestconns and feasible_endpoints together. If conns == lowestconns, we just append to feasible_endpoints, but if conns < lowestconns, we set feasible_endpoints to a new list with just the one peer in it.
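A minimal sketch of that single-pass version, assuming the same endpoints shared dict and get_upstream_name helper used elsewhere in the PR (the function name here is illustrative):

-- Track the running minimum and the candidate list in one pass over peers.
local function get_feasible_endpoints(peers, endpoints)
  local lowestconns = math.huge
  local feasible_endpoints = {}

  for _, peer in pairs(peers) do
    local conns = endpoints:get(get_upstream_name(peer)) or 0
    if conns < lowestconns then
      -- new minimum: restart the candidate list with just this peer
      lowestconns = conns
      feasible_endpoints = { peer }
    elseif conns == lowestconns then
      -- tie: this peer is equally feasible
      table.insert(feasible_endpoints, peer)
    end
  end

  return feasible_endpoints
end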
Done
self.peers = backend.endpoints

for _, endpoint_string in ipairs(normalized_endpoints_removed) do
  ngx.shared.balancer_leastconn:delete(endpoint_string)
Removing these endpoints here may not have the desired effect. If a pod has a bunch of connections and briefly fails its readiness probe (causing it to no longer be included as a backend), we will remove it from being tracked here. But if the pod becomes ready again, it may still be handling a bunch of connections, yet we will initialize its value to 0 again.
Maybe this is OK, but it is worth calling out.
Yeah, that's not ideal behaviour. I think it would be safe to instead set expiration time to (say) 30 minutes. This would ensure we still track connection count for connections up to that time while not allowing the shared dict to grow indefinitely.
Setting it to expire asynchronously makes some sense, though it will be difficult to choose the TTL.
It makes sense that this least conn load balancer would still track pods that are handling connections, even if the pod is not ready. It should decrement the number of connections on a backend that finishes a request, even if that pod is not ready for example.
Is there a way we can tell the difference between a pod that is not ready now (but therefore may become ready soon) versus a pod that is being killed (and therefore will never become ready again, and would be safe to fully delete)?
Is there a way we can tell the difference between a pod that is not ready now (but therefore may become ready soon) versus a pod that is being killed (and therefore will never become ready again, and would be safe to fully delete)?
Not to the best of my knowledge, and that's the heart of the issue - all we know is that an upstream address has disappeared from the list of endpoints for that service. We could find out more by querying the API server, but that introduces substantially more complexity, which I think the ingress-nginx team would not want to take on.
This thread is still a bit concerning to me. It is unintuitive to me that a pod that becomes unready will result in ingress-nginx thinking it has 0 connections. Maybe setting the TTL to 2 days and just letting the entry in the map expire if we don't see it again in 2 days makes sense? (2 days is chosen somewhat arbitrarily)
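One way to express that idea in the sync path (sketch only; the dict name mirrors the PR, and the 2-day TTL is as arbitrary as noted above):

-- Instead of deleting the entry when an endpoint disappears, keep its current
-- count but let it expire if the endpoint is not seen again within the TTL.
local FORGET_AFTER_SECONDS = 2 * 24 * 60 * 60  -- arbitrary 2-day TTL

for _, endpoint_string in ipairs(normalized_endpoints_removed) do
  local dict = ngx.shared.balancer_leastconn
  local current = dict:get(endpoint_string)
  if current ~= nil then
    -- re-set the existing value with an expiry rather than dropping it outright
    dict:set(endpoint_string, current, FORGET_AFTER_SECONDS)
  end
end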
Update: all of these effects were due to removing nodes from the table when they "go missing", plus there was an off-by-one error that has now been fixed on my fork, see the other thread
Hey @adkafka, thanks very much for taking the time to review this. I've added commits addressing some of the issues and would be interested to discuss the rest further.
If there are multiple hosts with an equal lowest connection count, this implementation will select one at random. If one host is significantly faster than the others it will get repeated requests, but the slower hosts will still get new requests as they complete old ones. Originally my thinking was that if one server is handling most of the requests while others are slow, then the fast server should receive the requests. Your comment prompted me, though, to think about cases where one server is rapidly responding to requests with an error response (because of misconfiguration, an error condition, or similar) while others are taking more time but succeeding. In this case we would actually produce a much worse result unless and until a health check kicked in and removed the bad peer. I can think of a few options for handling this scenario and I'd be interested to hear thoughts from anyone:
Out of these I think 1, 2, 3, and 5 are reasonable. I think 1 and 5 have the most predictable behaviour. 2 feels like it doesn't offer a complete enough solution to be worth the trade-off in predictability, but I'm certainly happy to discuss it further. What do you think?
function _M.balance(self)
  local peers = self.peers
  local endpoint = peers[1]
How do we behave when peers is empty?
I believe that if we reached this line it would cause a Lua exception and Nginx would return an internal server error response, but I'm checking the actual behaviour with some more tests.
I haven't come across a test for this yet. Worth adding one?
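A guard along these lines would make the empty-peers case explicit and straightforward to test (sketch only; the log message and early return are illustrative, not the PR's current behaviour):

function _M.balance(self)
  local peers = self.peers
  -- hypothetical guard: with no peers there is nothing to balance, so log
  -- and bail out instead of indexing an empty table and raising a Lua error
  if not peers or #peers == 0 then
    ngx.log(ngx.WARN, "no peers available for least_connections balancing")
    return nil
  end
  local endpoint = peers[1]
  -- ... selection logic continues as in the PR ...
end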
I think option #1 is best for the scope of this PR. It is the solution of least surprise. I am curious how other load balancers handle these failures though. I know nginx and HA Proxy both have a native least connections LB. Maybe we can look into that a little bit? Another idea (likely out of scope of this PR) is instead of choosing between the endpoints with the least connections, we place the endpoints in some data structure and choose an endpoint randomly such that the endpoints with fewer connections are more likely to be chosen. This seems like the best balance between converging towards even connection count and limiting the blast radius of a single bad backend. I'm not sure how to implement this yet though...
I've now looked at the documentation for both HAProxy and Nginx Plus and it's interesting to see that neither of them mentions this particular issue. I suspect the tighter integration of health-checks and load balancing in both of these means that it's not a major concern. Nginx Plus does have another load balancing mode that integrates least connections with a weighted moving average response time, which would be less sensitive to this issue, but it still doesn't actually mitigate it.
Agree that it's not in scope for this PR, but I'm not against implementing a more interesting structure if we can show an advantage for it. If we identify performance issues with the current implementation at scale I would consider assessing peer connection counts asynchronously, and at that point it would be easy to update a second table with an appropriate set of peers to be chosen from randomly.
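To make the weighted-random idea concrete, one possible shape is below (sketch only; the 1/(conns+1) weighting and the get_conns helper are illustrative, not something this PR implements):

-- Choose a peer at random with probability inversely proportional to its
-- tracked connection count, so lightly loaded peers are favoured without
-- letting a single fast (or fast-failing) peer monopolise traffic.
local function weighted_random_pick(peers, get_conns)
  local weights, total = {}, 0
  for i, peer in ipairs(peers) do
    local w = 1 / (get_conns(peer) + 1)
    weights[i] = w
    total = total + w
  end

  local r = math.random() * total
  for i, peer in ipairs(peers) do
    r = r - weights[i]
    if r <= 0 then
      return peer
    end
  end
  return peers[#peers]  -- fallback for floating point rounding
end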
Hey @alowde, I see this PR hasn't had any changes for a while now. What work remains? Anything I can do to help? I would love to use least_connections.
That sounds good to me! Perhaps in a follow-up PR we (I'm happy to do so) can propose a new annotation on the ingress that toggles this behavior (choose two backends randomly, select the one with the fewest connections) on.
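The "choose two backends randomly, keep the least loaded" approach (often called power-of-two-choices) could look roughly like this, assuming the same connection tracking; the get_conns helper is a placeholder and this is purely illustrative:

-- Power-of-two-choices: sample two distinct peers at random and keep whichever
-- has fewer tracked connections. This bounds how much traffic any single peer
-- can attract while still converging towards even connection counts.
local function pick_two_choose_least(peers, get_conns)
  local n = #peers
  if n == 1 then return peers[1] end

  local i = math.random(n)
  local j = math.random(n - 1)
  if j >= i then j = j + 1 end  -- ensure two distinct indices

  local a, b = peers[i], peers[j]
  return get_conns(a) <= get_conns(b) and a or b
end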
Several of the comment threads are marked "resolved", but the outstanding issue still remains. One example is: https://github.com/kubernetes/ingress-nginx/pull/9025/files#r1200712394. Perhaps a commit was missed?
- Don't set an expiration time on connection tracking table
- Fix too-low maximum connection count in peer evaluation loop
- Fix redundant loops when evaluating peer connection counts
- Reduce log level severity, tweak log messages for clarity
- Remove unused variables
- Fix whitespace and linting issues
Sounds good. Please feel free to @ me if/when you do propose a new annotation and I'll be keen to help in any way I can. I think there's an opportunity here to write end-to-end tests that could exercise the load-balancing algorithms with a variety of different traffic patterns, and I'd be very interested to see the outcomes.
I'm very sorry about this; in the rebase process it appears I was drawing on an old commit to recreate the PR. I've reviewed the conversation and re-resolved the issues raised. Thanks again for the review and your patience.
// NewAlwaysSlowEchoDeploymentWithOptions creates a new deployment of the always slow echo server with functional options.
// This server always sleeps for the specified number of milliseconds before responding. NOTE: values for sleepMillis
// >= 2000 will cause the deployment to fail health checks, causing false positive test failures.
Worth adding an assert for this? Seems like a time sink on debugging :)
LGTM! I suggest adding a TTL for all entries in this map, and using that TTL to eventually forget the connection counts for pods. This seems "less surprising" at least to me, but does come at the cost of more memory consumption. Curious what others think.
To make this argument concrete: let's say we are using a readiness probe on our deployment to prevent a pod from receiving too many connections. Once a predetermined threshold is reached, the pod becomes unready, so the load balancer won't send it any more traffic. With the implementation as-is, after the pod becomes ready again (connections shed over time), ingress-nginx will think it is a new pod with 0 connections and send it all of the traffic.
if peer_conn_count == nil then
  -- Peer has never been recorded as having connections - add it to the connection
  -- tracking table and the list of feasible peers
  endpoints:set(peer_name,0,0)
I'm surprised we don't have a default expiry on these entries. Does it make sense to add a long expiry (2 days) by default, just to catch any leaks?
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
A few comments:
A couple things here:
/hold
Regarding what @ElvinEfendi pointed out about in-flight requests, I have a question: And regarding the algorithm, is there a reason you don't just mimic what already exists in nginx, on a high level?
if util.is_blank(upstream) then
  return
end
endpoints:incr(upstream,-1,0,0)
I have a feeling there is a bug somewhere (most likely with removing/adding endpoints) which either calls this twice, or otherwise does this the "wrong" way:
[screenshot: per-endpoint connection-count graphs]
These graphs show massively divergent usage, all steadily increasing, which leads me to think that the count either never reaches 0 after some TTL, or does not properly reset when it goes below zero.
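One hypothesis worth testing is that decrements against missing or already-zero entries drive the count negative (incr with an init of 0 creates a missing key at -1). A clamped decrement, sketched below, would rule that failure mode out; this is not the PR's current code:

-- Clamp the per-endpoint counter at zero so a decrement on a missing or
-- already-zero entry cannot leave the tracked count negative.
local function decrement_connection_count(endpoints, upstream)
  local newval, err = endpoints:incr(upstream, -1, 0, 0)
  if err then
    ngx.log(ngx.WARN, "failed to decrement count for ", upstream, ": ", err)
    return
  end
  if newval < 0 then
    endpoints:set(upstream, 0, 0)  -- reset rather than accumulate negatives
  end
end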
The way it is currently written, once a pod becomes unready, ingress-nginx will effectively set the "number of connections" it is tracking to 0. I mentioned this, as well as an alternative solution, in a comment above:
I know, that code is totally unnecessary, I've changed it around in my fork
(It still seems to exhibit an off-by-one error somewhere, after an indeterminate amount of time and following an indeterminate event, which is why I haven't yet suggested it here)
I've found the off-by-one error and committed a fix here, if anyone's interested: https://github.com/ShadowJonathan/ingress-nginx/tree/add-least_connections-load-balancing
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: adkafka, alowde. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@ElvinEfendi it seems to not particularly like those at the moment. It creates the same "desyncing" situation I've posted in a few threads; [screenshot omitted]
Btw, alternative_backends is not defined, meaning that canary deployments will not properly route and work.
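If I'm reading the other Lua balancer implementations correctly, sync() would also need to carry the canary-related fields over; something like the sketch below. The field names backend.alternativeBackends and backend.trafficShapingPolicy are my assumption based on how other balancers in this repo appear to handle it, not verified against this PR:

function _M.sync(self, backend)
  -- assumption: keep canary/traffic-shaping data alongside the peer list so
  -- that alternative-backend (canary) routing keeps working
  self.alternative_backends = backend.alternativeBackends
  self.traffic_shaping_policy = backend.trafficShapingPolicy
  self.peers = backend.endpoints
  -- ... existing connection-tracking cleanup continues here ...
end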
/assign @tao12345666333
Sorry for the long delay. I will be back on Tuesday.
What is the status of this pull request? @tao12345666333
+1
@exherb this PR is not in a particularly healthy state, sorry. Others have identified specific fixes required and general issues, and you might find @ShadowJonathan's fork useful, but before this code could be merged it would need a fair bit of cleaning up. It's been about 18 months since I worked at the company that used this code, but if I can get some spare time I'll try to at least bring it up to a usable state.
Any news regarding this? Can we open another PR if this one is becoming stale?
This commit adds a new option to the load-balance setting, 'least_connections'.
In this mode the controller keeps a count of connections to each upstream and
routes incoming connections to the upstream with the least connections open.
This is recommended for use with evenly-resourced upstream servers when
requests have a broad distribution in time to process, for example if some
requests require the upstream server to make a connection to a slow external
service.
If all requests take a fairly similar time to process or the upstreams serve at
different speeds then ewma or round_robin are likely more appropriate.
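As a rough illustration of the behaviour described above (not the exact code in this PR), the selection step boils down to picking randomly among the peers with the fewest tracked connections and bumping that peer's count, with a corresponding decrement when the request completes (as in the incr(..., -1, ...) hunk discussed earlier); the helper names here are illustrative:

-- Pick randomly among the currently least-loaded peers, then record the
-- new in-flight connection in the shared dict.
local function least_connections_pick(feasible_endpoints, endpoints)
  local endpoint = feasible_endpoints[math.random(#feasible_endpoints)]
  endpoints:incr(get_upstream_name(endpoint), 1, 0, 0)
  return endpoint
end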
Types of changes
How Has This Been Tested?
This PR includes Lua unit tests and end-to-end tests written using the ginkgo framework.
The framework has been extended somewhat to support end-to-end tests that require
multiple upstreams with different configurations. All tests, including pre-existing ones, have been run and pass.
Checklist:
Does my pull request need a release note?