
Conversation

pcholakov
Contributor

@pcholakov pcholakov commented Feb 24, 2025

Adds a step to run the https://github.com/restatedev/jepsen suite (and specifically, the set-vo and set-mds workloads) against open PRs.

This now uses the latest Jepsen actions and provisions a worker cluster per run, rather than sharing one with other PRs. This saves us costs and removes the concurrency constraint.

I've managed to eliminate one cause of intermittent failures: sometimes the tests weren't waiting long enough for the cluster to be auto-provisioned. There is still one more failure that looks like a legitimate Restate issue, where we don't always recover after a partitioning event; I've seen it in maybe 5% of ad-hoc runs. It warrants further investigation, but I don't think it should block us from merging this.

The latest run-tests action now also adds a high-level summary to the CI checks page:

[screenshot: Jepsen run summary on the CI checks page]
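For reference, the overall shape of the new job is roughly as sketched below. The action references, the id-token permission, the continue-on-error setting, and the teardown step are taken from the workflow in this PR; the runner, the run-tests inputs, and the cluster provisioning details are simplified or omitted:

jepsen:
  needs: docker
  runs-on: ubuntu-latest  # assumed runner
  permissions:
    id-token: write # NB: can obtain OIDC tokens on behalf of this repository!
  steps:
    - uses: restatedev/jepsen/.github/actions/run-tests@reusable
      continue-on-error: true
      # run-tests inputs (workloads, cluster name, ...) omitted here
    - name: Tear down Jepsen cluster ${{ env.CLUSTER_NAME }}
      uses: restatedev/jepsen/.github/actions/teardown@main
      if: always()
      with:
        clusterName: ${{ env.CLUSTER_NAME }}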

@pcholakov pcholakov force-pushed the chore/integrate-jepsen-checks branch from 56b3194 to 2e79431 Compare February 24, 2025 08:18

github-actions bot commented Feb 24, 2025

Test Results

  7 files  ± 0    7 suites  ±0   4m 26s ⏱️ + 1m 8s
 54 tests + 9   53 ✅ + 9  1 💤 ±0  0 ❌ ±0 
223 runs  +49  220 ✅ +49  3 💤 ±0  0 ❌ ±0 

Results for commit ee8cfe6. ± Comparison against base commit b7302a7.

This pull request removes 10 and adds 19 tests. Note that renamed tests count towards both.
dev.restate.sdktesting.tests.CancelInvocation ‑ cancelInvocation(BlockingOperation, Client, URL)[1]
dev.restate.sdktesting.tests.CancelInvocation ‑ cancelInvocation(BlockingOperation, Client, URL)[2]
dev.restate.sdktesting.tests.CancelInvocation ‑ cancelInvocation(BlockingOperation, Client, URL)[3]
dev.restate.sdktesting.tests.Ingress ‑ idempotentInvokeVirtualObject(URL, Client)
dev.restate.sdktesting.tests.KafkaIngress ‑ handleEventInCounterService(URL, int, Client)
dev.restate.sdktesting.tests.KafkaIngress ‑ handleEventInEventHandler(URL, int, Client)
dev.restate.sdktesting.tests.KillInvocation ‑ kill(Client, URL)
dev.restate.sdktesting.tests.PrivateService ‑ privateService(URL, Client)
dev.restate.sdktesting.tests.UpgradeWithInFlightInvocation ‑ inFlightInvocation(Client, URL)
dev.restate.sdktesting.tests.UpgradeWithNewInvocation ‑ executesNewInvocationWithLatestServiceRevisions(Client, URL)
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromAdminAPI(BlockingOperation, Client, URI)[1]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromAdminAPI(BlockingOperation, Client, URI)[2]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromAdminAPI(BlockingOperation, Client, URI)[3]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromContext(BlockingOperation, Client)[1]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromContext(BlockingOperation, Client)[2]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromContext(BlockingOperation, Client)[3]
dev.restate.sdktesting.tests.Combinators ‑ awakeableOrTimeoutUsingAwaitAny(Client)
dev.restate.sdktesting.tests.Combinators ‑ awakeableOrTimeoutUsingAwakeableTimeoutCommand(Client)
dev.restate.sdktesting.tests.Combinators ‑ firstSuccessfulCompletedAwakeable(Client)
dev.restate.sdktesting.tests.Ingress ‑ idempotentInvokeVirtualObject(URI, Client)
…

♻️ This comment has been updated with latest results.

@pcholakov pcholakov force-pushed the chore/integrate-jepsen-checks branch from 1eac129 to 923afd0 Compare February 24, 2025 11:38
@pcholakov pcholakov changed the title from "Add Jepsen tests to CI workflow (main repo branch)" to "Add Jepsen tests to CI workflow" Feb 24, 2025
@pcholakov pcholakov added the jepsen-tests (PRs run Jepsen tests in CI) label Feb 25, 2025
@pcholakov pcholakov marked this pull request as ready for review February 26, 2025 13:32
Comment on lines +99 to +100
# additional features added for CI validation builds only
features: metadata-api
Contributor Author

This is only enabled on the internal Docker artifact we attach to the workflow run, the same one the SDK tests use.
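A hedged sketch of how this could be wired on the calling side; the reusable workflow path below is an assumption for illustration, only the features value itself comes from this PR:

docker:
  uses: ./.github/workflows/docker.yml   # assumed reusable build workflow
  with:
    # additional features added for CI validation builds only
    features: metadata-api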

Comment on lines 236 to 237
jepsen:
  needs: docker
Contributor Author

This test currently runs after docker, just like the SDK tests. At the moment it takes ~3 min, but we can tune it to run shorter or longer. The downside is that there's a single Jepsen worker cluster backing it, so we need to run multiple PRs sequentially. As long as there isn't a big backlog of PRs, it shouldn't add any more time to the overall PR checks duration.

Contributor

What's the problem with running multiple Jepsen test instances concurrently on the Jepsen test cluster? Would we deplete the resources of the cluster? Is it a problem of isolation between different Jepsen runs?

Contributor Author

Great question! The Jepsen control node needs sole control over things like network partitions; it's not that the load is heavy, it's that a Jepsen test assumes it's the only thing futzing with the infrastructure, and we do things like arbitrarily kill processes or inject firewall rules to isolate nodes.

The actual Jepsen clusters are reasonably lightweight though, and we could have multiple of them pretty easily. Spinning up the stack per PR/test is viable; it takes a few minutes at the moment, but the setup can be masked by the docker step. It's a reasonably easy refactor, let me try that.

Contributor Author

Resolved by provisioning a cluster per run. (And tearing it down afterwards.)
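With a dedicated cluster per run, the main thing the workflow has to guarantee is a unique cluster name for each run; a minimal sketch of one way to derive it (the naming scheme is hypothetical, only the CLUSTER_NAME variable itself appears in the workflow):

env:
  # hypothetical naming scheme: one cluster per workflow run attempt
  CLUSTER_NAME: jepsen-${{ github.run_id }}-${{ github.run_attempt }}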

id-token: write # NB: can obtain OIDC tokens on behalf of this repository!
steps:
  - uses: restatedev/jepsen/.github/actions/run-tests@reusable
    continue-on-error: true
Contributor Author

I expect this will be flaky for a while. I believe this will make it report an error without blocking PR merging, but I may be wrong about how this behaves :-)
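For reference, GitHub Actions' documented semantics for a step-level continue-on-error match that expectation: the step's failure is still visible in the run, but the job, and therefore the check on the PR, completes successfully.

    - uses: restatedev/jepsen/.github/actions/run-tests@reusable
      # If this step fails, the failure is recorded against the step, but
      # continue-on-error keeps the job green, so a required check on this
      # job would not block merging.
      continue-on-error: true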

Contributor

You mentioned that you know of one instability. What is causing this instability right now?

Contributor Author

There are actually two that I've seen happen; one seems like a legitimate issue where sometimes the cluster just doesn't come back healthy after the last network partition "healing". Here's an example from last night:

https://github.com/restatedev/jepsen/actions/runs/13555879460/job/37889949723#step:4:2394

I'm hoping I can fix this by tuning delays, but maybe we have a real issue that needs a closer look. This one seems reasonably under control.

The other issue happens sporadically: Jepsen itself crashes during the history-check phase with what appears to be a type mismatch error. As best as I can read the tea leaves, it's likely because I'm returning an event into the history that throws it off; the annoying thing is that when it crashes, it doesn't save the history, so it's hard to debug. I'll need to come up with a better strategy to get to the bottom of this one; so far it has eluded me.

@pcholakov pcholakov force-pushed the chore/integrate-jepsen-checks branch from 9737d1c to 8116cc0 Compare February 27, 2025 04:13
Contributor

@tillrohrmann tillrohrmann left a comment

Thanks for creating this PR @pcholakov. The changes look good to me. I had a question regarding the behavior of concurrency. If I understand things correctly, concurrent jobs will preempt each other if the jepsen job hasn't started yet. Was this your intention?

@pcholakov
Contributor Author

If I understand things correctly, concurrent jobs will preempt each other if the jepsen job hasn't started yet. Was this your intention?

Thanks for flagging this! No, it was definitely not; this property of the concurrency limits is a surprise to me. I don't think we need to limit ourselves to a single cluster, though.
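For context, the mechanism in question looks like the sketch below (the group name is illustrative): with a shared concurrency group and cancel-in-progress disabled, a running job is allowed to finish, but only the newest queued run stays pending, so any older queued run in the group is cancelled.

concurrency:
  group: jepsen-worker-cluster   # illustrative group name
  cancel-in-progress: false      # running jobs finish; older queued runs in the
                                 # group are superseded by the newest pending one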

@pcholakov pcholakov marked this pull request as draft March 1, 2025 10:10
@pcholakov pcholakov force-pushed the chore/integrate-jepsen-checks branch 3 times, most recently from 701a925 to e41b397 Compare March 5, 2025 16:21
@pcholakov pcholakov marked this pull request as ready for review March 6, 2025 14:55
@pcholakov
Contributor Author

OK, I think this is ready for merging at last. The key change since the previous review is that we now provision the test worker cluster on demand, per PR. I have seen one issue that intermittently happens with network partitions: occasionally we still do not recover within 20s of a partition healing. I have removed that checker for Virtual Object invocations only and left a to-do in the Jepsen tests repo to get to the bottom of it; in the meantime, we still benefit from the correctness checks. (The metadata service does not appear to have this problem, FWIW.)

@tillrohrmann tillrohrmann self-requested a review March 6, 2025 15:45
@pcholakov pcholakov force-pushed the chore/integrate-jepsen-checks branch from d7a592b to ee8cfe6 Compare March 7, 2025 11:18
Contributor

@tillrohrmann tillrohrmann left a comment

Thanks for enabling the Jepsen tests to run on PRs and on CI runs @pcholakov. The changes look good to me :-) The one question I have is whether a failing Tear down Jepsen cluster step could leave some of the EC2 instances running. If so, is it possible to configure a maximum lifetime before they get shut down by AWS?

Comment on lines +295 to +299
- name: Tear down Jepsen cluster ${{ env.CLUSTER_NAME }}
  uses: restatedev/jepsen/.github/actions/teardown@main
  if: always()
  with:
    clusterName: ${{ env.CLUSTER_NAME }}
Contributor

Can this fail and leave some AWS EC2 instances running?

Contributor Author

It's absolutely possible, not very likely with the current stack structure, and not something I've seen so far. Definitely something to keep an eye out for with the stack-per-PR approach.

@tillrohrmann
Contributor

I have seen one issue which intermittently happens with network partitions, which is that we occasionally still do not recover in the 20s following a partition healing.

Can we open a release-blocker issue to investigate this problem before the next minor release?

@pcholakov
Contributor Author

I have seen one issue which intermittently happens with network partitions, which is that we occasionally still do not recover in the 20s following a partition healing.

Can we open a release-blocker issue to investigate this problem before the next minor release?

Done! #2906

The one question I have is whether a failing Tear down Jepsen cluster step could leave some of the EC2 instances running. If so, is it possible to configure a maximum lifetime before they get shut down by AWS?

I haven't seen this happen yet, but it's certainly a possibility! There is no built-in AWS feature to do anything like this, but we can tag the stack with a TTL and have a simple background task that prunes expired stacks as a backstop. We have daily AWS billing anomaly reporting and will spot this if it ever goes wrong.

The other wishlist feature I had here is to retain the bucket containing snapshots and metadata when tests fail, which might be useful for investigations. I just wanted to ship what I had so far :-)
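A rough sketch of the TTL backstop idea above; the schedule, job name, and cleanup script are all hypothetical:

on:
  schedule:
    - cron: "0 * * * *"   # hourly sweep (illustrative)

jobs:
  prune-expired-jepsen-stacks:
    runs-on: ubuntu-latest
    steps:
      - name: Delete Jepsen stacks whose TTL tag has expired
        run: ./scripts/prune-expired-jepsen-stacks.sh   # hypothetical cleanup script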

@pcholakov pcholakov merged commit e801f7c into main Mar 14, 2025
82 of 105 checks passed
@pcholakov pcholakov deleted the chore/integrate-jepsen-checks branch March 14, 2025 17:39