Skip to content

Conversation

@adilGhaffarDev
Copy link
Contributor

@adilGhaffarDev adilGhaffarDev commented Feb 6, 2025

What this PR does / why we need it:
Leader election fails when IPManagement is called from garbageCollectPodIPs in pod.go, this PR fixes the KubernetesIPAM and Client creation in pod.go, which resolves the leader election issue.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #483

Special notes for your reviewer (optional):

@adilGhaffarDev
Copy link
Contributor Author

/cc @dougbtv @s1061123 please check.

@adilGhaffarDev
Copy link
Contributor Author

@thomasferrandiz please check

@adilGhaffarDev
Copy link
Contributor Author

This patch works perfectly on top of version 0.8.0, but it still does not work on the main branch.

I would like to ask if it's possible to create a patch release (0.8.1) that includes this fix.

Additionally, I tried using Node Slice Fast IPAM since it addresses leader election issues, but instead of helping, it worsened the situation. I have opened an issue explaining the problem: #558. Please take a look.

@maiqueb
Copy link
Collaborator

maiqueb commented Feb 11, 2025

This patch works perfectly on top of version 0.8.0, but it still does not work on the main branch.

I think this is not how it should be. You need to make your fix to the main branch, and once it works, you back-port it to the release branch.

I would like to ask if it's possible to create a patch release (0.8.1) that includes this fix.

Additionally, I tried using Node Slice Fast IPAM since it addresses leader election issues, but instead of helping, it worsened the situation. I have opened an issue explaining the problem: #558. Please take a look.

@adilGhaffarDev
Copy link
Contributor Author

I think this is not how it should be. You need to make your fix to the main branch, and once it works, you back-port it to the release branch.

but there is no 0.8.0 branch and code changed quite a lot since 0.8.0 with addition of fast ipam node slice, so main will need a different fix. Is there a way to merge just this fix on 0.8.0? I know merging it on main is not going to help, I opened the PR just to share my patch.

/hold

@maiqueb
Copy link
Collaborator

maiqueb commented Feb 11, 2025

I think we should cut a release branch for 0.8 as you're pointing out.

@dougbtv could you assist in that? What do you think ?

@adilGhaffarDev
Copy link
Contributor Author

I think we should cut a release branch for 0.8 as you're pointing out.

yes, that would be great.

@mlguerrero12
Copy link
Collaborator

I think this is not how it should be. You need to make your fix to the main branch, and once it works, you back-port it to the release branch.

but there is no 0.8.0 branch and code changed quite a lot since 0.8.0 with addition of fast ipam node slice, so main will need a different fix. Is there a way to merge just this fix on 0.8.0? I know merging it on main is not going to help, I opened the PR just to share my patch.

/hold

There shouldn't be major changes. The fast ipam node slice was meant to be a optional feature. Just for experimentation purposes. We should aim at fixing master primarily

@adilGhaffarDev
Copy link
Contributor Author

There shouldn't be major changes. The fast ipam node slice was meant to be a optional feature. Just for experimentation purposes. We should aim at fixing master primarily

It looks like the pod controller is not running at all on main branch. I don’t see any logs from pkg/controlloop/pod.go, and I’m unsure why it isn’t running. Or maybe it's running but not called from anywhere.

@mlguerrero12
Copy link
Collaborator

That's a different issue. I'll have a look as well. But that shouldn't be a reason to create a fix for a tag and not in the master branch.

@adilGhaffarDev adilGhaffarDev force-pushed the leftover-podref-fix/adil branch 2 times, most recently from 7bb198a to bd450af Compare February 12, 2025 11:19
@coveralls
Copy link

coveralls commented Feb 13, 2025

Pull Request Test Coverage Report for Build 13952054331

Details

  • 8 of 23 (34.78%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.05%) to 51.575%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/storage/kubernetes/ipam.go 0 15 0.0%
Totals Coverage Status
Change from base Build 13947338124: 0.05%
Covered Lines: 1965
Relevant Lines: 3810

💛 - Coveralls

@dougbtv
Copy link
Member

dougbtv commented Feb 20, 2025

Really appreciate the contrib, and I'm definitely open to making a v0.8.1 and releasing it.

However, I'd really appreciate the eye towards getting it fixed against the current master branch. I will add this to the maintainer's agenda for next week (which I will miss, but can follow up subsequently)

@adilGhaffarDev
Copy link
Contributor Author

However, I'd really appreciate the eye towards getting it fixed against the current master branch. I will add this to the maintainer's agenda for next week (which I will miss, but can follow up subsequently)

yes, I will try to fix it on the main too.
Regarding this PR, it not only fixes the issue in 0.8.0, it's also fixing the following broken code in pod.go

https://github.com/k8snetworkplumbingwg/whereabouts/blob/d3633fa680bd4b8b5e8425b5a643fed1eec5f568/pkg/controlloop/pod.go#L226C1-L242C7

@adilGhaffarDev
Copy link
Contributor Author

@mlguerrero12 kindly check this when you get time.

@mlguerrero12
Copy link
Collaborator

@adilGhaffarDev, I also see that the pod controller is not running on master. Will provide a fix soon

@adilGhaffarDev
Copy link
Contributor Author

@adilGhaffarDev, I also see that the pod controller is not running on master. Will provide a fix soon

Thank you. This fix should still be needed. Can we merge this too once you have fixed the pod controller?

@mlguerrero12
Copy link
Collaborator

yes, I will review and test these changes after we fix master

@mlguerrero12
Copy link
Collaborator

Just realized that this has already been reported
#575
Suggested fix #577

@mlguerrero12
Copy link
Collaborator

@adilGhaffarDev, I still need to test some things from the current code, but it seems to me you are not fixing the issue with the leader election (occasional missing of pod names) but some missing data needed in IPManagementKubernetesUpdate to perform the deallocation. Am I seeing this correctly?

@adilGhaffarDev
Copy link
Contributor Author

@adilGhaffarDev, I still need to test some things from the current code, but it seems to me you are not fixing the issue with the leader election (occasional missing of pod names) but some missing data needed in IPManagementKubernetesUpdate to perform the deallocation. Am I seeing this correctly?

When cleanupFunc is called from garbageCollectPodIPs for deallocation, it fails due to improperly initialized client and wbClient in pod.go. Specifically, it fails at leader election.

The old code initializes the client and wbClient incorrectly, leading to issues with leader election:

client := *wbclient.NewKubernetesClient(nil, pc.k8sClient)
wbClient := &wbclient.KubernetesIPAM{
				Client: client,
				Config: *ipamConfig,
			}

This PR ensures proper initialization so that leader election and deallocation work correctly:

client := *wbclient.NewKubernetesClient(pc.wbClient, pc.k8sClient)
k8sIPAM := wbclient.NewKubernetesIPAMWithClient(allocation.ContainerID, allocation.IfName, *ipamConfig, pool.Namespace, client)

With this fix, cleanupFunc receives correctly initialized parameters, allowing leader election to function properly and ensuring successful deallocation.

@mlguerrero12
Copy link
Collaborator

I can also see that there are some stuff missing. So, pc.cleanupFunc is IPManagement. This ones calls newLeaderElector

	le, leader, deposed := newLeaderElector(client.clientSet, client.namespace, ipamConf.PodNamespace, ipamConf.PodName, ipamConf.LeaderLeaseDuration, ipamConf.LeaderRenewDeadline, ipamConf.LeaderRetryPeriod)

client.clientSet is the k8s client that is already defined but didn't realize until now that client.namespace is not set. So, yes, this one is missing for the leader election part.

After that, it calls IPManagementKubernetesUpdate. This one need besides the containterID for deallocation, the whereabouts client that is not set. These things are missing.

Your findings are correct. Just wanted to clarify the different sections.

Good catch @adilGhaffarDev

@adilGhaffarDev adilGhaffarDev force-pushed the leftover-podref-fix/adil branch from bd450af to e4bf4f3 Compare March 14, 2025 12:54
@mlguerrero12
Copy link
Collaborator

@adilGhaffarDev, could you please rebase?

Signed-off-by: Muhammad Adil Ghaffar <[email protected]>
@adilGhaffarDev adilGhaffarDev force-pushed the leftover-podref-fix/adil branch from e4bf4f3 to 2dd0c8f Compare March 19, 2025 16:45
@adilGhaffarDev
Copy link
Contributor Author

@adilGhaffarDev, could you please rebase?

done

@adilGhaffarDev
Copy link
Contributor Author

after the merge of #577 I have tested this PR on main branch and the issue is fixed on main too.
I am following the same steps explained here: #483 (comment) , here is in more detail how to reproduce the issue: #480 (comment)

@mlguerrero12 mlguerrero12 merged commit 18d7b92 into k8snetworkplumbingwg:master Mar 20, 2025
12 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Scale down of deployment left pod references undeleted

5 participants