@m-sz m-sz commented Oct 31, 2025

Description

Zero-downtime deployment of the autopilot requires the newly created instance to warm up its caches before the previous instance steps down and leadership is transferred. The previous implementation reported readiness only after taking over leadership, and the readiness probe was not used, so this mechanism did not work.

Changes

Remove the usage of the readiness probe in the autopilot. Introduce a startup probe. Signal that startup has finished once the caches are warmed up, regardless of leader lock status.
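To illustrate the intended behaviour, here is a minimal, language-agnostic sketch written in Python. It is not the autopilot's actual code: the names `StartupProbe`, `mark_caches_warm`, `status_code`, and `warm_up_caches`, as well as the 200/503 status codes, are assumptions chosen for illustration. The point it shows is that the startup signal depends only on cache warm-up and never consults the leader lock.

```python
import threading


class StartupProbe:
    """Tracks whether startup (cache warm-up) has finished. Hypothetical helper."""

    def __init__(self) -> None:
        self._caches_warm = threading.Event()

    def mark_caches_warm(self) -> None:
        # Called once warm-up completes; deliberately ignores leader lock state.
        self._caches_warm.set()

    def status_code(self) -> int:
        # Assumed convention: 200 means startup finished, 503 means keep waiting.
        return 200 if self._caches_warm.is_set() else 503


def warm_up_caches(probe: StartupProbe) -> None:
    # Placeholder for the real warm-up work; only the final signal matters here.
    probe.mark_caches_warm()


probe = StartupProbe()
is_leader = False  # the new instance has not acquired the leader lock yet

before = probe.status_code()  # probe fails while caches are cold
warm_up_caches(probe)
after = probe.status_code()   # probe succeeds although is_leader is still False
```

With this shape, the orchestrator can delay shutting down the previous instance until the new one reports that startup has finished, and the leader lock is then acquired as a separate step.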

How to test

  1. Deploy the PR along with infrastructure's https://github.com/cowprotocol/infrastructure/pull/3855 to staging
  2. Observe that the previous autopilot shuts down only after the new one finishes warming up its caches.
  3. The transition between old autopilot shutdown and new acquiring leader lock should be quick (a couple of milliseconds).

@m-sz m-sz requested a review from a team as a code owner October 31, 2025 15:37
@m-sz m-sz marked this pull request as draft October 31, 2025 15:39

fafk commented Nov 3, 2025

LGTM, I think it's exactly as discussed. Approved, just pls add a PR description. 🤠

@m-sz m-sz force-pushed the autopilot-startup-probe branch 4 times, most recently from dc878bb to 93786e0 Compare November 4, 2025 13:03
@m-sz m-sz marked this pull request as ready for review November 4, 2025 13:26

m-sz commented Nov 4, 2025

I have done some test deployments on sepolia and here are my findings:

Tests done on sepolia-staging

   last auction by prev.	     shutdown signal	             stepped up	            new auction	       auction-to-auction time [s]
2025-11-04T13:46:03.255Z	2025-11-04T13:46:05.142Z	2025-11-04T13:46:13.658Z	2025-11-04T13:46:13.695Z	10
2025-11-04T13:47:03.623Z	2025-11-04T13:47:05.425Z	2025-11-04T13:47:13.584Z	2025-11-04T13:47:13.604Z	10
2025-11-04T13:48:38.718Z	2025-11-04T13:48:40.704Z	2025-11-04T13:48:53.491Z	2025-11-04T13:48:53.534Z	15
2025-11-04T13:50:13.584Z	2025-11-04T13:50:17.394Z	2025-11-04T13:50:25.719Z	2025-11-04T13:50:26.156Z	13
2025-11-04T13:51:40.366Z	2025-11-04T13:51:43.489Z	2025-11-04T13:51:49.987Z	2025-11-04T13:51:50.002Z	10

Average auction-to-auction time over 10 auctions: 12 s.

This shows that the fix is working as expected.
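For anyone re-checking the numbers, the auction-to-auction column for the five sepolia-staging rows can be reproduced as the gap between the first timestamp (last auction by the previous instance) and the fourth (first auction by the new instance), rounded to whole seconds. The sketch below assumes that interpretation of the column; `gap_seconds` is a hypothetical helper, not part of the PR.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%fZ"

# (last auction by previous instance, first auction by new instance),
# copied from the sepolia-staging measurements in this comment.
ROWS = [
    ("2025-11-04T13:46:03.255Z", "2025-11-04T13:46:13.695Z"),
    ("2025-11-04T13:47:03.623Z", "2025-11-04T13:47:13.604Z"),
    ("2025-11-04T13:48:38.718Z", "2025-11-04T13:48:53.534Z"),
    ("2025-11-04T13:50:13.584Z", "2025-11-04T13:50:26.156Z"),
    ("2025-11-04T13:51:40.366Z", "2025-11-04T13:51:50.002Z"),
]


def gap_seconds(last_auction: str, new_auction: str) -> float:
    """Return the time between two UTC timestamps in seconds."""
    start = datetime.strptime(last_auction, FMT)
    end = datetime.strptime(new_auction, FMT)
    return (end - start).total_seconds()


gaps = [round(gap_seconds(a, b)) for a, b in ROWS]
print(gaps)  # [10, 10, 15, 13, 10]
```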


@jmg-duarte jmg-duarte left a comment

Reads much better, some nits


m-sz commented Nov 5, 2025

Updated the PR based on the comments; will re-test on staging to sign off.

@m-sz m-sz force-pushed the autopilot-startup-probe branch from eabe512 to b5d1ea4 Compare November 5, 2025 10:36
@m-sz m-sz requested a review from jmg-duarte November 5, 2025 10:52

@squadgazzz squadgazzz left a comment

LG, only nits.


m-sz commented Nov 5, 2025

Latest timings

   last auction by prev.	     shutdown signal	             stepped up	            new auction	       auction-to-auction time [s]
2025-11-05T11:48:00.710Z	2025-11-05T11:48:04.445Z	2025-11-05T11:48:17.259Z	2025-11-05T11:48:26.421Z	17
2025-11-05T11:51:01.936Z	2025-11-05T11:51:04.740Z	2025-11-05T11:51:05.280Z	2025-11-05T11:51:15.203Z	14
2025-11-05T11:53:39.920Z	2025-11-05T11:53:42.283Z	2025-11-05T11:53:51.201Z	2025-11-05T11:53:51.850Z	12
2025-11-05T11:56:05.834Z	2025-11-05T11:56:08.628Z	2025-11-05T11:56:16.781Z	2025-11-05T11:56:16.816Z	11
2025-11-05T11:59:41.944Z	2025-11-05T11:59:44.769Z	2025-11-05T11:59:50.876Z	2025-11-05T11:59:50.890Z	9

Looks like the latest revision is also working as expected. Merging the PR.

@m-sz m-sz added this pull request to the merge queue Nov 5, 2025
Merged via the queue into main with commit 5090dc1 Nov 5, 2025
18 checks passed
@m-sz m-sz deleted the autopilot-startup-probe branch November 5, 2025 12:17
@github-actions github-actions bot locked and limited conversation to collaborators Nov 5, 2025