
Add least_connections load balancing algorithm. #9025

Open

wants to merge 4 commits into base: main

Conversation

@alowde commented Sep 6, 2022

This commit adds a new option to the load-balance setting, 'least_connections'.
In this mode the controller keeps a count of open connections to each upstream and
routes incoming connections to the upstream with the fewest open connections.
This is recommended for evenly-resourced upstream servers whose requests have a
broad distribution of processing times, for example when some requests require the
upstream server to call a slow external service.

If all requests take a fairly similar time to process, or the upstreams serve at
different speeds, then ewma or round_robin are likely more appropriate.
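For illustration only (not the PR's exact code), a minimal Lua sketch of the selection step described above, assuming a shared dict that maps an "address:port" key to the current open-connection count; the actual implementation also breaks ties randomly among the peers sharing the lowest count:

-- illustrative sketch: route to the peer with the fewest tracked open connections
local function pick_least_connections(peers, conn_dict)
  local best, best_conns = nil, math.huge
  for _, peer in ipairs(peers) do
    local conns = conn_dict:get(peer.address .. ":" .. peer.port) or 0
    if conns < best_conns then
      best, best_conns = peer, conns
    end
  end
  return best
end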

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • CVE Report (Scanner found CVE and adding report)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation only

How Has This Been Tested?

This PR includes Lua unit tests and end-to-end tests written using the Ginkgo framework.
The framework has been extended somewhat to support end-to-end tests that require
multiple upstreams with different configurations. All tests, including the existing
suite, have been run and pass.

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I've read the CONTRIBUTION guide
  • I have added unit and/or e2e tests to cover my changes.
  • All new and existing tests passed.
  • Added Release Notes.

Does my pull request need a release note?

Added a new load-balancing algorithm, 'least_connections'. If the load-balance setting in the config map is set to 'least_connections', the controller will attempt to route incoming connections to one of the upstream endpoints with the fewest open connections.

linux-foundation-easycla bot commented Sep 6, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: alowde / name: Ashley Lowde (ce63bec)

@k8s-ci-robot (Contributor)

@alowde: This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and providing further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 6, 2022
@k8s-ci-robot (Contributor)

Welcome @alowde!

It looks like this is your first PR to kubernetes/ingress-nginx 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/ingress-nginx has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-kind Indicates a PR lacks a `kind/foo` label and requires one. label Sep 6, 2022
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 6, 2022
@k8s-ci-robot (Contributor)

Hi @alowde. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority area/docs area/lua Issues or PRs related to lua code size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Sep 6, 2022
@adkafka left a comment

Hey @alowde, I came across your PR because I too would like the least_connections load balancing algorithm for nginx. Thanks for making this!

I added some comments, but I'm no expert in lua load balancing algorithms. Hope it helps though. Would be great to get a least_connections algorithm added in.

Another worry I have with this implementation, though, is that we may end up choosing the same host over and over. I wonder if we could do least connections by bucket, i.e. split the peers into N buckets (N=10?) and choose a random host within a bucket. This may be out of scope of this initial implementation, but I thought it was worth sharing. I wonder how other least_connections algorithms handle this (if at all).

for _, peer in pairs(peers) do
local conns = endpoints:get(get_upstream_name(peer))
if conns == nil then
endpoints:set(get_upstream_name(peer),0,600)

Why do we set an expire time on this? If we go 600 seconds without a peer's connection state changing at all, won't we then think the peer has 0 connections? I guess this helps with garbage collecting the dict, but I thought that was handled in the sync method. If so, should we make this value configurable?

Author

My concern with depending on the sync method to handle garbage collection is trusting that the sync method will be reliably called with updates, and that we won't lose the list of peers between updates. With that said, the main reason I was inclined to distrust sync was the case of nginx restarts, but then the shared dict would be reset anyway.
I think we should set the expiration time to infinite.


I also agree the expiration time should be infinite. Do you have a reason to be suspicious of the sync method not being called reliably? I hope it is something we can depend on.

Author

No specific reason to be suspicious, call it a generalised sense of professional paranoia :). Will fix the expiration time shortly.
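For reference, a minimal sketch of the fix being discussed, reusing the endpoints shared dict and get_upstream_name helper from the diff; in ngx.shared.DICT:set, an exptime of 0 means the entry never expires:

if conns == nil then
  -- exptime of 0: the per-peer counter never expires
  endpoints:set(get_upstream_name(peer), 0, 0)
end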

local feasible_endpoints = {}

if #peers ~= 1 then
local lowestconns = 9999

Should we set this to max int or something higher? There very well could be peers with more than 9999 connections, which would mess up this calculation, right?

Author

I'm not certain what max int is in this case due to the various ways Lua could be compiled, but it's definitely higher than 9999 :). As a safe balance I've changed it to maximum signed 32 bit int, which should still be more connections than we're likely to support in other places.
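A sketch of the change described above, assuming standard Lua numeric semantics; either constant works as a sentinel that any real connection count will fall below:

-- maximum signed 32-bit integer as the initial "lowest connections" sentinel
local lowestconns = 2147483647
-- alternatively, math.huge compares greater than any finite count
-- local lowestconns = math.huge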


This thread is marked as resolved, but I still see this set to 9999. Once we have a pod w/ more than 10,000 connections, this will behave unexpectedly, right?

end

-- get peers with lowest connections
for _, peer in pairs(peers) do

Looks like we have two for loops over pairs(peers). I think we can get away with just one by tracking lowestconns and feasible_endpoints together. If conns == lowestconns, we just append to feasible_endpoints, but if conns < lowestconns, we set feasible_endpoints to a new list with just the one peer in it.
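A rough sketch of that single-loop suggestion, assuming the same peers table, endpoints shared dict, and get_upstream_name helper used in the diff:

local lowestconns = math.huge
local feasible_endpoints = {}

for _, peer in pairs(peers) do
  local conns = endpoints:get(get_upstream_name(peer)) or 0
  if conns < lowestconns then
    -- new minimum found: restart the feasible list with just this peer
    lowestconns = conns
    feasible_endpoints = { peer }
  elseif conns == lowestconns then
    -- peers tied with the current minimum are also feasible
    feasible_endpoints[#feasible_endpoints + 1] = peer
  end
end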

Author

Done

self.peers = backend.endpoints

for _, endpoint_string in ipairs(normalized_endpoints_removed) do
ngx.shared.balancer_leastconn:delete(endpoint_string)

Removing these endpoints here may not have the desired effect. If a pod has a bunch of connections and briefly fails its readiness probe (causing it to no longer be included as a backend), we will remove it from being tracked here. But if the pod becomes ready again, it may still be handling a bunch of connections, yet we will initialize its value to 0 again.

Maybe this is OK, but it is worth calling out.

Author

Yeah, that's not ideal behaviour. I think it would be safe to instead set expiration time to (say) 30 minutes. This would ensure we still track connection count for connections up to that time while not allowing the shared dict to grow indefinitely.
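A sketch of that idea, assuming a lua-nginx-module version that provides ngx.shared.DICT:expire (v0.10.11+); the 30-minute TTL is illustrative:

for _, endpoint_string in ipairs(normalized_endpoints_removed) do
  -- rather than deleting the counter outright, let it expire after 30 minutes,
  -- so a briefly-unready pod does not come back looking like it has 0 connections
  ngx.shared.balancer_leastconn:expire(endpoint_string, 1800)
end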


Setting it to expire asynchronously makes some sense, though it will be difficult to choose the TTL.

It makes sense that this least conn load balancer would still track pods that are handling connections, even if the pod is not ready. It should decrement the number of connections on a backend that finishes a request, even if that pod is not ready for example.

Is there a way we can tell the difference between a pod that is not ready now (but therefore may become ready soon) versus a pod that is being killed (and therefore will never become ready again, and would be safe to fully delete)?

Author

Is there a way we can tell the difference between a pod that is not ready now (but therefore may become ready soon) versus a pod that is being killed (and therefore will never become ready again, and would be safe to fully delete)?

Not to the best of my knowledge, and that's the heart of the issue: all we know is that an upstream address has disappeared from the list of endpoints for that service. We could find out more by querying the API server, but that introduces substantially more complexity than I think the ingress-nginx team would want to take on.


This thread is still a bit concerning to me. It is unintuitive to me that a pod that becomes unready will result in ingress-nginx thinking it has 0 connections. Maybe setting the TTL to 2 days and just letting the entry in the map expire if we don't see it again in 2 days makes sense? (2 days is chosen somewhat arbitrarily)


I've noticed some oddities here while working with pods whose readiness checks were interrupted (because the checks hit the same port/endpoint as the incoming requests), causing them to receive extra connections for a while.

See the graphs below.

[graph omitted]


Some more correlation to this effect:
[graph omitted]


Update: all of these effects were due to removing nodes from the table when they "go missing", plus there was an off-by-one error that has now been fixed on my fork, see the other thread

linux-foundation-easycla bot commented Oct 23, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: alowde / name: Ashley Lowde (e2b92f0)

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 23, 2022
@alowde (Author) commented Oct 24, 2022

Hey @adkafka, thanks very much for taking the time to review this. I've added commits addressing some of the issues and would be interested to discuss the rest further.

Another worry I have with this implementation, though, is that we may end up choosing the same host over and over. I wonder if we could do least connections by bucket, i.e. split the peers into N buckets (N=10?) and choose a random host within a bucket.

If there's multiple hosts with equal lowest connection count then this implementation will select one at random. If one host is significantly faster than the others it will get repeated requests, but the slower hosts will still get new requests as they complete old ones.

Originally my thinking was that if one server is handling most of the requests while others are slow, then the fast server should receive the requests. Your comment prompted me though to think about cases where one server is rapidly responding to requests with an error response (because of misconfiguration, error condition, or similar) and others are taking more time but succeeding. In this case we would actually produce a much worse result unless and until a health check kicked in and removed the bad peer.

I can think of a few options for handling this scenario and I'd be interested to hear thoughts from anyone:

  1. Ignore it, and document the issue.
  2. Randomly select from n servers with the lowest connection count, where n is adjustable or adaptive (i.e. bucketing with just the lowest bucket considered). Partly mitigates the 'one bad peer' effect at the cost of less predictable behaviour.
  3. Add an option for minimum connection count, effectively only balancing based on minimum connections above x. This would mean that peers with less than x open requests would be considered equal, which would reduce the impact of a peer returning quickly with a bad response.
     • Doesn't actually resolve the issue, just dilutes it at the cost of extra complexity.
  4. Add an option for minimum request time so that the open request counter delays decrement if a response returns too quickly.
     • Comes close to solving the problem but would require substantial additional complexity, and still requires the user to predict a good value for their workload.
  5. Same as 4, but for 5xx responses only. Possibly a more user-friendly implementation, but it makes me uneasy to be mixing application-layer logic here.
  6. Go crazy, implement full health-checking of responses and adjust weighting of peers based on (say) percentage of successful requests. Significantly more complexity/risk, and almost certainly more complexity than the ingress-nginx project maintainers are happy to take on.

Out of these I think 1, 2, 3, and 5 are reasonable. I think 1 and 5 have the most predictable behaviour. 2 feels like it doesn't offer a complete enough solution to be worth the trade off in predictability, but I'm certainly happy to discuss it further.

What do you think?


function _M.balance(self)
local peers = self.peers
local endpoint = peers[1]

How do we behave when peers is empty?

Author

I believe that if we reached this line it would cause a Lua exception and Nginx would return an internal server error response, but I'm checking the actual behaviour with some more tests.


I haven't come across a test for this yet. Worth adding one?
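One possible shape of such a guard (a sketch only, not the PR's code), bailing out before indexing an empty peer list:

function _M.balance(self)
  local peers = self.peers
  if not peers or #peers == 0 then
    -- avoid indexing peers[1] on an empty list, which would raise a Lua error
    ngx.log(ngx.WARN, "no peers available for least_connections balancing")
    return nil
  end
  local endpoint = peers[1]
  -- ... continue with the least-connections selection ...
end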


@adkafka commented Oct 26, 2022

Out of these I think 1, 2, 3, and 5 are reasonable. I think 1 and 5 have the most predictable behaviour. 2 feels like it doesn't offer a complete enough solution to be worth the trade off in predictability, but I'm certainly happy to discuss it further.

What do you think?

I think option #1 is best for the scope of this PR. It is the solution of least surprise. I am curious how other load balancers handle these failures though. I know nginx and HA Proxy both have a native least connections LB. Maybe we can look into that a little bit?

Another idea (likely out of scope of this PR) is instead of choosing between the endpoints with the least connections, we place the endpoints in some data structure and choose an endpoint randomly such that the endpoints with fewer connections are more likely to be chosen. This seems like the best balance between converging towards even connection count and limiting the blast radius of a single bad backend. I'm not sure how to implement this yet though...
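One hypothetical way to express that idea (names and weighting are illustrative, not part of this PR; get_upstream_name and the endpoints dict are assumed from the PR's module): pick an endpoint with probability inversely proportional to 1 + its open connections, so lightly loaded peers are favoured without letting a single fast-failing peer absorb all traffic.

local function weighted_pick(peers, endpoints)
  local weights, total = {}, 0
  for i, peer in ipairs(peers) do
    local conns = endpoints:get(get_upstream_name(peer)) or 0
    weights[i] = 1 / (1 + conns)
    total = total + weights[i]
  end
  -- sample a peer proportionally to its weight
  local r = math.random() * total
  for i, peer in ipairs(peers) do
    r = r - weights[i]
    if r <= 0 then
      return peer
    end
  end
  return peers[#peers] -- guard against floating-point rounding
end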

@alowde (Author) commented Nov 6, 2022

I think option #1 is best for the scope of this PR. It is the solution of least surprise. I am curious how other load balancers handle these failures though. I know nginx and HA Proxy both have a native least connections LB. Maybe we can look into that a little bit?

I've now looked at the documentation for both HAProxy and Nginx Plus and it's interesting to see that neither of them mention this particular issue. I suspect the tighter integration of health-checks and load balancing in both of these means that it's not a major concern. Nginx Plus does have another load balancing mode that integrates least connections with a weighted moving average response time that would be less sensitive to this issue, but it still doesn't actually mitigate it.

Another idea (likely out of scope of this PR) is instead of choosing between the endpoints with the least connections, we place the endpoints in some data structure and choose an endpoint randomly such that the endpoints with fewer connections are more likely to be chosen. This seems like the best balance between converging towards even connection count and limiting the blast radius of a single bad backend. I'm not sure how to implement this yet though...

Agree that it's not in scope for this PR, but I'm not against implementing a more interesting structure if we can show an advantage for it. If we identify performance issues with the current implementation at scale I would consider assessing peer connection counts asynchronously, and at that point it would be easy to update a second table with an appropriate set of peers to be chosen from randomly.

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 6, 2022
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 29, 2022
@adkafka commented Apr 9, 2023

Hey @alowde, I see this PR hasn't had any changes for a while now. What work remains? Anything I can do to help? I would love to use least_connections soon :)

@adkafka commented May 22, 2023

I'd like to leave this PR as is and then raise a new PR to either change the behaviour of the least_connections algorithm or (my preference) add a new algorithm option

That sounds good to me! Perhaps, in a follow-up PR we (I'm happy to do so) can propose a new annotation on the ingress that toggles this behavior (choose two backends randomly, select the one with the fewest connections) on.

@adkafka left a comment

Several of the comment threads are marked "resolved", but the outstanding issue still remains. One example is: https://github.com/kubernetes/ingress-nginx/pull/9025/files#r1200712394. Perhaps a commit was missed?

- Don't set an expiration time on connection tracking table
- Fix too-low maximum connection count in peer evaluation loop
- Fix redundant loops when evaluating peer connection counts
- Reduce log level severity, tweak log messages for clarity
- Remove unused variables
- Fix whitespace and linting issues
@alowde (Author) commented May 23, 2023

@adkafka,

That sounds good to me! Perhaps, in a follow-up PR we (I'm happy to do so) can propose a new annotation on the ingress that toggles this behavior (choose two backends randomly, select the one with the fewest connections) on.

Sounds good. Please feel free to @ me if/when you do propose a new annotation and I'll be keen to help in any way I can. I think there's an opportunity here to write end-to-end tests that could exercise the load-balancing algorithms with a variety of different traffic patterns, and I'd be very interested to see the outcomes.

Several of the comment threads are marked "resolved", but the outstanding issue still remains.

I'm very sorry about this; in the rebase process it appears I was drawing on an old commit to recreate the PR. I've reviewed the conversation and re-resolved the issues raised. Thanks again for the review and your patience.


// NewAlwaysSlowEchoDeploymentWithOptions creates a new deployment of the always slow echo server with functional options.
// This server always sleeps for the specified number of milliseconds before responding. NOTE: values for sleepMillis
// >= 2000 will cause the deployment to fail health checks, causing false positive test failures.

Worth adding an assert for this? Seems like a time sink on debugging :)

@adkafka left a comment

LGTM! I suggest adding a TTL for all entries in this map, and using that TTL to eventually forget the connection counts for pods. This seems "less surprising" at least to me, but does come at the cost of more memory consumption. Curious what others think.

To make this argument concrete: let's say we are using a readiness probe on our deployment to prevent a pod from receiving too many connections. Once a predetermined threshold is reached, the pod becomes unready, so the load balancer won't send it any more traffic. With the implementation as-is, after the pod becomes ready again (connections shed over time), ingress-nginx will think it is a new pod with 0 connections and send it all of the traffic.



if peer_conn_count == nil then
-- Peer has never been recorded as having connections - add it to the connection
-- tracking table and the list of feasible peers
endpoints:set(peer_name,0,0)

I'm surprised we don't have a default expiry on these entries. Does it make sense to add a long expiry (2 days) by default, just to catch any leaks?


@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 27, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 28, 2023
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 13, 2023
@k8s-ci-robot (Contributor)

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ElvinEfendi (Member)

A few comments:

  • This is not least-connections load balancing; it's load balancing based on the least number of in-flight requests per server. It's a decent approximation of connections, but I just wanted to point this out. You might want to rename the algorithm to better represent what it does.
  • I'd like to see how this implementation behaves in a k8s cluster with rolling endpoints. How does that compare to EWMA?
  • Similar to above, we need some benchmarking numbers.
  • I'd suggest using the power-of-two-choices technique used in the existing ewma lb algorithm, but instead of using response time as the metric, just use the number of active/in-flight requests per server (a rough sketch follows below). Moreover, I don't think you need to synchronize request stats across NGINX workers; you can keep them per worker. It's desired in ewma because the stats come from the endpoints in the response headers.
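A rough sketch of that power-of-two-choices suggestion (hypothetical: per-worker state kept in a plain Lua table rather than a shared dict, with get_upstream_name assumed from the PR's module):

-- per-worker table: upstream name -> in-flight request count
local inflight = {}

local function pick_power_of_two(peers)
  -- pick two peers at random and keep the one with fewer in-flight requests
  local a = peers[math.random(#peers)]
  local b = peers[math.random(#peers)]
  local conns_a = inflight[get_upstream_name(a)] or 0
  local conns_b = inflight[get_upstream_name(b)] or 0
  if conns_a <= conns_b then
    return a
  end
  return b
end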

@strongjz (Member)

A couple things here:

  1. I agree with everything @ElvinEfendi says; I trust his judgement on Lua issues. I'd like to see what @tao12345666333 thinks as well.
  2. Investigate using the njs library as a replacement for lua/openresty #9950
  3. Investigate using the nginx-rust library as a replacement for lua/openresty #9949

@strongjz (Member)

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 25, 2023
@mrunelov commented Aug 6, 2023

Regarding what @ElvinEfendi pointed out about in-flight requests, I have a question:
As mentioned, I hope to use least_conn for websockets; would those be considered in-flight requests here?

And regarding the algorithm, is there a reason you don't just mimic, at a high level, what already exists in nginx?
I'm not well-versed in nginx or ingress-nginx internals, mind you :)

if util.is_blank(upstream) then
return
end
endpoints:incr(upstream,-1,0,0)


I have a feeling there is a bug somewhere (most likely with removing/adding endpoints) which either calls this twice or otherwise does this the "wrong" way:

[graph omitted]

These graphs show massively divergent usage, all increasing in a similar way, which leads me to think that this counter either never reaches 0 after some TTL, or otherwise does not properly reset once it has gone below zero.
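For what it's worth, a hedged sketch of a decrement that clamps at zero, keeping the atomic incr from the diff and only correcting a counter that has already gone negative (the clamp itself is not atomic):

-- decrement atomically; incr returns the new value
local newval, err = endpoints:incr(upstream, -1, 0, 0)
if newval and newval < 0 then
  -- a double decrement (or a missed increment) pushed the counter negative; reset it
  endpoints:set(upstream, 0, 0)
end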


After restarting the nginx deployment, the usage (in-flight request %) nicely converges:

[graph omitted]


The way it is currently written, once a pod becomes unready, ingress-nginx will effectively set the "number of connections" it is tracking to 0. I mentioned this, as well as an alternative solution, in a comment above:

#9025 (review)


I know, that code is totally unnecessary; I've changed it around in my fork.

(It still seems to exhibit an off-by-one error somewhere, after an indeterminate amount of time following an indeterminate event, which is why I haven't yet suggested it here.)


I've found the off-by-one error and committed a fix here, if anyone's interested: https://github.com/ShadowJonathan/ingress-nginx/tree/add-least_connections-load-balancing

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: adkafka, alowde
Once this PR has been reviewed and has the lgtm label, please assign puerco for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ShadowJonathan commented Dec 8, 2023

I'd like to see how this implementation behaves in a k8s cluster with rolling endpoints. How does that compare to EWMA?

@ElvinEfendi it seems to not particularly like those at the moment. It creates the same "desyncing" situation I've posted in a few threads:

[graph omitted]

@ShadowJonathan left a comment

Btw, alternative_backends is not defined, meaning that canary deployments will not properly route and work

@strongjz (Member) commented Dec 8, 2023

/assign @tao12345666333

@tao12345666333 (Member)

Sorry for the long delay.

I will be back on Tuesday

@exherb commented May 19, 2024

What is the status of this pull request? @tao12345666333

@em-le-ts

+1

@alowde (Author) commented Jun 21, 2024

@exherb this PR is not in a particularly healthy state, sorry. Others have identified specific fixes required and general issues, and you might find @ShadowJonathan's fork useful, but before this code could be merged it would need a fair bit of cleaning up.

I haven't worked at the company which used this code for ~18 months, but if I can get some spare time I'll try to at least bring it up to a usable state.

@bashirmindee

Any news regarding this? Can we open another PR if this one is becoming stale?
