[ARO-11484] Fix fixetcd GA #4034

Closed
wants to merge 4 commits into from

Contributor

@bitoku bitoku commented Jan 1, 2025

Which issue this PR addresses:

Fixes https://issues.redhat.com/browse/ARO-11484

What this PR does / why we need it:

This PR fixes the fixetcd GA.

  • Fix the label selector.
  • Change the image to ubi9, because ubi8 produced a glibc error.
  • Change the job to a pod. It only runs one pod, so there's no reason to use a job. Also, the job's watcher returns a job object, not a pod object.
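
The job-versus-pod point in the last bullet can be illustrated with a minimal sketch. It assumes simplified, hypothetical `Pod`, `Job`, and `Event` types standing in for the real Kubernetes API types, and a hypothetical helper `podPhase`; it is not the code in `fixetcd.go`. The idea is only that a watch event carries an untyped object, so code that expects a pod fails when the watched resource is a job.

```go
package main

import (
	"errors"
	"fmt"
)

// Pod and Job are simplified, hypothetical stand-ins for the Kubernetes
// Pod and Job API types; they are not the real types.
type Pod struct {
	Name  string
	Phase string
}

type Job struct {
	Name      string
	Succeeded int
}

// Event mimics the shape of a watch event: the payload is untyped, so the
// consumer has to assert the concrete type it expects.
type Event struct {
	Object any
}

var errNotAPod = errors.New("watch event does not carry a Pod")

// podPhase returns the phase of the Pod carried by ev, or an error when the
// event carries some other kind of object (for example a Job).
func podPhase(ev Event) (string, error) {
	pod, ok := ev.Object.(*Pod)
	if !ok {
		return "", fmt.Errorf("%w: got %T", errNotAPod, ev.Object)
	}
	return pod.Phase, nil
}

func main() {
	// Watching a Job yields Job objects, so Pod-oriented handling fails.
	_, err := podPhase(Event{Object: &Job{Name: "fixetcd"}})
	fmt.Println("job event:", err)

	// Watching the Pod directly yields the type the handler expects.
	phase, err := podPhase(Event{Object: &Pod{Name: "fixetcd", Phase: "Succeeded"}})
	fmt.Println("pod event:", phase, err)
}
```

Creating the pod directly means the object being watched and the object being inspected are the same kind, which removes the need for this type check.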

I also added an E2E test, but it takes a long time and can't run in parallel, so I made it a regression test.
It doesn't run by default in CI.

The E2E test is intended to ensure that the master replacement SOP is valid, because I couldn't reproduce the etcd issue 100% of the time.
I think this is a sufficient solution until we find a reliably reproducible scenario.

Test plan for issue:

e2e

Is there any documentation that needs to be updated for this PR?

no

How do you know this will function as expected in production?

e2e, master replacement & fixetcd GA

Contributor

@kimorris27 kimorris27 left a comment

Mostly LGTM, and it was thoughtful to add an E2E test. I made some small suggestions and have one other thing to point out.

There's a part of the user story that I don't see addressed in the PR: I want to add a conditional check for when the node IP address remains the same, and delete the existing etcd Pod if it's in a crashloop. Is that part of the story still needed?

Contributor Author

bitoku commented Jan 2, 2025

@kimorris27 Thank you for taking a look!

> There's a part of the user story that I don't see addressed in the PR: I want to add a conditional check for when the node IP address remains the same, and delete the existing etcd Pod if it's in a crashloop. Is that part of the story still needed?

I don't think it's actually needed.
During testing, I hit the error many times while the pod's IP address was unchanged, and the fixetcd API fixed the issue.
fixetcd only deletes the etcd member, but it seems that when there's a change in etcd, etcd-operator automatically redeploys all etcd pods.

You might be able to reproduce it by running the E2E test. When you get a 200 from the fixetcd API, the etcd pod should be in CrashLoopBackOff.
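
For anyone checking that state by hand, here is a minimal sketch of the kind of test involved. It assumes a hypothetical, trimmed-down `ContainerStatus` type and a hypothetical helper `crashLooping`; the only detail taken from Kubernetes itself is the waiting reason string `CrashLoopBackOff`.

```go
package main

import "fmt"

// ContainerStatus is a hypothetical, trimmed-down stand-in for a Kubernetes
// container status; only the waiting reason is modelled.
type ContainerStatus struct {
	Name          string
	WaitingReason string // empty when the container is not waiting
}

// crashLooping returns the names of the containers whose waiting reason is
// "CrashLoopBackOff".
func crashLooping(statuses []ContainerStatus) []string {
	var names []string
	for _, s := range statuses {
		if s.WaitingReason == "CrashLoopBackOff" {
			names = append(names, s.Name)
		}
	}
	return names
}

func main() {
	// Illustrative data only; the container names are assumptions.
	statuses := []ContainerStatus{
		{Name: "etcd", WaitingReason: "CrashLoopBackOff"},
		{Name: "etcd-metrics"},
	}
	fmt.Println(crashLooping(statuses))
}
```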

Contributor

@kimorris27 kimorris27 left a comment

LGTM given the responses to my original comments. The only issue I see now is that it looks like some unit tests in pkg/frontend need to be updated to reflect the changes to fixetcd.go, but maybe someone else will need to pick up this work first?

kimorris27 previously approved these changes Jan 3, 2025
@s-fairchild
Collaborator

Have you tested this with a dev cluster?

github-actions bot commented Feb 7, 2025

Please rebase pull request.

@github-actions github-actions bot removed the needs-rebase (branch needs a rebase) label Mar 19, 2025
@komidore64
Collaborator

rebased

s-fairchild previously approved these changes Mar 19, 2025
Collaborator

@s-fairchild s-fairchild left a comment

LGTM. Since we have E2E tests, we can merge without holding for manual testing.

@kimorris27
Contributor

> LGTM. Since we have E2E tests, we can merge without holding for manual testing.

Even though we have E2E testing, it still needs to be run manually per Ayato's original description:

> I also added an E2E test, but it takes a long time and can't run in parallel, so I made it a regression test.
> It doesn't run by default in CI.

I think it would be good for someone to run it once more before merge.

@s-fairchild
Collaborator

> I think it would be good for someone to run it once more before merge.

I agree. My only concern is that it's difficult to induce the problem (an etcd member list IP mismatch).
Originally, to achieve this, I added a new NIC to a master node and then deleted the old NIC, which caused a new IP address to be associated with the master. I can try this method again.
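
To make "etcd member list IP mismatch" concrete, here is a minimal sketch of the comparison. It assumes a hypothetical helper `staleMembers`, assumes both maps are keyed by the same node name, and uses made-up addresses; it does not talk to etcd or to the cluster.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
)

// staleMembers returns the names of etcd members whose peer URL host does
// not match the node's current IP. Both maps are assumed to be keyed by the
// same node name, which is a simplification for this sketch.
func staleMembers(peerURLs, nodeIPs map[string]string) ([]string, error) {
	var stale []string
	for name, raw := range peerURLs {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, fmt.Errorf("member %q: %w", name, err)
		}
		// A member is stale if its node is unknown or its IP has changed.
		if ip, ok := nodeIPs[name]; !ok || u.Hostname() != ip {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale) // map iteration order is random; keep output stable
	return stale, nil
}

func main() {
	// Illustrative values only: the names and addresses are made up.
	peers := map[string]string{
		"master-0": "https://10.0.0.4:2380",
		"master-1": "https://10.0.0.5:2380",
		"master-2": "https://10.0.0.6:2380",
	}
	nodes := map[string]string{
		"master-0": "10.0.0.4",
		"master-1": "10.0.0.5",
		"master-2": "10.0.0.9", // NIC replaced, so the node has a new IP
	}
	stale, err := staleMembers(peers, nodes)
	fmt.Println(stale, err)
}
```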

@bitoku bitoku closed this Apr 15, 2025
@komidore64
Collaborator

Since the PR was closed, is there any opposition to me deleting this branch?

I will assume there is no opposition if I don't see any comments by end-of-day 9 June 2025. Thanks!

4 participants