@pchila
Member

@pchila pchila commented Aug 29, 2025

What does this PR do?

This PR builds upon #8767 by allowing an Elastic Agent to be rolled back manually even after the grace period ends.
It uses the registry of available elastic-agent rollbacks introduced in PR #10344.

Having a registry of agent versions available for rollback makes it possible to detect manual rollback targets. The same available rollbacks are written to the .update-marker file so that the watcher preserves them at the end of the grace period, which still allows a rollback after the watcher exits.

Installs available for rollbacks are assigned a TTL, so they can be cleaned up after a given time (governed by the agent.upgrade.rollback.window setting).

With this PR, cleanup and normalization of agent installs happens only at startup.
In a follow-up PR:

  • the agent will be able to schedule a cleanup to run without needing a restart
  • old agent installs available for rollback will be cleaned up when initiating a new upgrade operation (in order to save disk space)

Why is it important?

This PR makes it possible to manually roll back an elastic-agent upgrade within a configurable window that extends beyond the grace period (the period during which an automatic rollback may be triggered by the upgraded agent misbehaving).

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

No impact for users as the feature is deactivated when using the default value of agent.upgrade.rollback.window: 0

How to test this PR locally

  1. Package elastic agent twice from this PR:
       SNAPSHOT=true EXTERNAL=true PACKAGES=tar.gz PLATFORMS="linux/amd64" mage -v package
       AGENT_PACKAGE_VERSION="9.3.0+build20251022000000" BEAT_VERSION="9.3.0-SNAPSHOT" EXTERNAL=true PACKAGES=tar.gz PLATFORMS="linux/amd64" mage -v package
  2. Install version 9.3.0-SNAPSHOT as usual.
  3. Set a rollback window duration > 0 (for example, 10 minutes), along with a shorter grace period and error check interval for the watcher:
       agent.upgrade:
         watcher:
           grace_period: 1m
           error_check.interval: 5s
         rollback:
           window: 10m
  4. Trigger an upgrade to the other package (saved on disk):
       elastic-agent upgrade --skip-verify --source-uri=file:///vagrant/build/distributions 9.3.0+build20251022000000
  5. Wait for the new agent to come online and check the upgrade details for the UPG_WATCHING state. After the grace period of 1 minute, the upgrade details should disappear and the watcher should exit.
  6. Verify that the data/elastic-agent-9.3.0-SNAPSHOT-<hash> directory is still present after the watcher has exited.
  7. Manually roll back to the previous version:
       elastic-agent upgrade --rollback 9.3.0-SNAPSHOT
  8. Check the output of elastic-agent status and verify that the agent restarted with version 9.3.0-SNAPSHOT and that the directory data/elastic-agent-9.3.0+build20251022000000-<hash> contains only logs.

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@mergify
Contributor

mergify bot commented Aug 29, 2025

⚠️ The sha of the head commit of this PR conflicts with #8767. Mergify cannot evaluate rules on this PR. ⚠️

@mergify mergify bot assigned pchila Aug 29, 2025
@mergify
Contributor

mergify bot commented Aug 29, 2025

This pull request does not have a backport label. Could you fix it @pchila? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@mergify
Contributor

mergify bot commented Sep 9, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b rollback-watcher-cleanup upstream/rollback-watcher-cleanup
git merge upstream/main
git push upstream rollback-watcher-cleanup

@pchila pchila force-pushed the rollback-watcher-cleanup branch 2 times, most recently from 33afefd to 1c65022 Compare September 17, 2025 12:06
@mergify
Contributor

mergify bot commented Sep 23, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b rollback-watcher-cleanup upstream/rollback-watcher-cleanup
git merge upstream/main
git push upstream rollback-watcher-cleanup

@pchila pchila force-pushed the rollback-watcher-cleanup branch from 74ae1d1 to c51fb1b Compare September 23, 2025 12:37
@mergify
Contributor

mergify bot commented Sep 25, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b rollback-watcher-cleanup upstream/rollback-watcher-cleanup
git merge upstream/main
git push upstream rollback-watcher-cleanup

@pchila pchila force-pushed the rollback-watcher-cleanup branch from 5199c2b to 38b9b96 Compare September 25, 2025 13:31
@mergify
Contributor

mergify bot commented Sep 26, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b rollback-watcher-cleanup upstream/rollback-watcher-cleanup
git merge upstream/main
git push upstream rollback-watcher-cleanup

@pchila pchila force-pushed the rollback-watcher-cleanup branch from 4c89a3b to 411205b Compare September 29, 2025 16:50
@elastic-sonarqube

Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 40%)

See analysis details on SonarQube

@mergify
Contributor

mergify bot commented Oct 1, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b rollback-watcher-cleanup upstream/rollback-watcher-cleanup
git merge upstream/main
git push upstream rollback-watcher-cleanup

@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pchila pchila force-pushed the rollback-watcher-cleanup branch from fc1c983 to dcf0ede Compare October 28, 2025 10:48
@pchila pchila requested review from cmacknz and ycombinator October 28, 2025 13:46
@pchila
Member Author

pchila commented Oct 28, 2025

buildkite test this

@cmacknz
Member

cmacknz commented Oct 28, 2025

Code looks good to me. I went through the manual test steps and it worked, with the following observations:

  • I don't see the available rollbacks reported by the status command, just the watching state:
❯ sudo elastic-development-agent status
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
├─ elastic-agent
│  └─ status: (HEALTHY) Running
└─ upgrade_details
   ├─ target_version: 9.3.0+build20251022000000
   ├─ state: UPG_WATCHING
   └─ metadata
  • It appears the agent status will contain the rolled-back status indefinitely. This is fine, since that's what happened. Someone may ask us to change this later: I think it will permanently highlight the agent in the Fleet UI with the rolled-back status, which is less useful if the rollback was manual and not automatic.
❯ sudo elastic-development-agent status
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
├─ elastic-agent
│  └─ status: (HEALTHY) Running
└─ upgrade_details
   ├─ target_version: 9.3.0+build20251022000000
   ├─ state: UPG_ROLLBACK
   └─ metadata
      └─ reason: manual rollback requested to version 9.3.0-SNAPSHOT

@cmacknz
Member

cmacknz commented Oct 28, 2025

@ycombinator still has open comments to resolve. @ycombinator, can you confirm these are addressed and approve? This LGTM, but I want to avoid approving while open comments from someone else remain.

Contributor

@ycombinator ycombinator left a comment

@pchila Thanks for addressing my comments. I've resolved most of them, just a couple where I'd like to see a bit more in the inline comments explaining the "why" + one other question about some code. After that, this LGTM!

@elasticmachine
Contributor

💛 Build succeeded, but was flaky

Failed CI Steps

History

cc @pchila

@pchila pchila requested a review from ycombinator November 4, 2025 15:06
Contributor

@ycombinator ycombinator left a comment

LGTM!

@pchila pchila merged commit a9ce37d into elastic:main Nov 5, 2025
21 checks passed
@pchila pchila mentioned this pull request Nov 21, 2025
8 tasks
hayotbisonai pushed a commit to hayotbisonai/elastic-agent that referenced this pull request Nov 23, 2025
* Allow for multiple directories to be specified during cleanup

* refactor manual rollback function and tests on a separate file

* Split manual rollback between watching and non-watching cases

* Implement manual rollback from list of agent installs

* fix lint errors

* Normalize install descriptors at startup

* Add integration test for manual rollback after grace period

* fix linter errors

* Set commit hash in TTLMarker when preparing available rollbacks

* Pass versionedHomesToKeep to installModifier.Cleanup()

* change check for running TTL marker normalization at startup

* remove references to install registry

* implement code review feedback

* fixup! implement code review feedback

Labels

backport-skip, enhancement (New feature or request), skip-changelog, Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

Projects

None yet

4 participants