Manual rollback after grace period #9643
This pull request does not have a backport label. Could you fix it @pchila? 🙏
This pull request is now in conflicts. Could you fix it? 🙏
33afefd to 1c65022
74ae1d1 to c51fb1b
5199c2b to 38b9b96
4c89a3b to 411205b
29c1a6b to cb90912
e18f176 to 2dbed3f
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
fc1c983 to dcf0ede
buildkite test this
Code looks good to me, I went through the manual test steps and it worked with the following observations:
@ycombinator still has open comments to resolve, @ycombinator can you confirm these are addressed and approve? This LGTM but I want to avoid approving when there are open comments from someone else remaining.
ycombinator
left a comment
@pchila Thanks for addressing my comments. I've resolved most of them, just a couple where I'd like to see a bit more in the inline comments explaining the "why" + one other question about some code. After that, this LGTM!
💛 Build succeeded, but was flaky
cc @pchila
ycombinator
left a comment
LGTM!
* Allow for multiple directories to be specified during cleanup
* refactor manual rollback function and tests on a separate file
* Split manual rollback between watching and non-watching cases
* Implement manual rollback from list of agent installs
* fix lint errors
* Normalize install descriptors at startup
* Add integration test for manual rollback after grace period
* fix linter errors
* Set commit hash in TTLMarker when preparing available rollbacks
* Pass versionedHomesToKeep to installModifier.Cleanup()
* change check for running TTL marker normalization at startup
* remove references to install registry
* implement code review feedback
* fixup! implement code review feedback


What does this PR do?
This PR builds upon #8767 by allowing a manual rollback of an elastic agent even after the grace period ends.
This PR uses the registry of available elastic-agent rollbacks introduced with PR #10344.
Having a registry of agent versions available for rollback allows for detection of possible manual rollback targets. The same available rollbacks are written in the .update-marker file so that the watcher will preserve them at the end of the grace period, still allowing for a rollback after the watcher exits.
Installs available for rollbacks are assigned a TTL, so they can be cleaned up after a given time (governed by the agent.upgrade.rollback.window setting).
With this PR, cleanup and normalization of agent installs happens only at startup.
In a follow-up PR:
Why is it important?
This PR allows an elastic-agent upgrade to be rolled back manually within a configurable window that extends beyond the grace period (the period during which an automatic rollback may be triggered by the upgraded agent misbehaving).
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files
[ ] I have added an entry in ./changelog/fragments using the changelog tool
Disruptive User Impact
No impact for users, as the feature is deactivated when using the default value of agent.upgrade.rollback.window: 0
How to test this PR locally
SNAPSHOT=true EXTERNAL=true PACKAGES=tar.gz PLATFORMS="linux/amd64" mage -v package
9.3.0-SNAPSHOT as usual
Wait for the new agent to come online and check the upgrade details for UPG_WATCHING state. After the grace period of 1 minute, upgrade details should disappear and the watcher should exit.
Verify that the data/elastic-agent-9.3.0-SNAPSHOT-<hash> directory is still present after the watcher has exited.
Manually roll back to the previous version:
elastic-agent status and verify that the agent restarted with version 9.2.0-SNAPSHOT and that the directory data/elastic-agent-9.3.0+build20251022000000-<hash> contains only logs.
Related issues
Questions to ask yourself