
Conversation

pcholakov
Contributor

@pcholakov pcholakov commented Feb 24, 2025

Adds a step to run the https://github.com/restatedev/jepsen suite (and specifically, the set-vo and set-mds workloads) against open PRs.

This now uses the latest Jepsen actions and provisions a worker cluster per run, rather than sharing one with other PRs. This saves us costs and removes the concurrency constraint.

I've managed to eliminate one cause of intermittent failures: sometimes the tests weren't waiting long enough for the cluster to be auto-provisioned. There is still one more failure that looks like a legitimate Restate issue, where we don't always recover after a partitioning event; I've seen it in maybe 5% of ad-hoc runs. It warrants further investigation, but I don't think it should block us from merging this.

The latest run-tests action now also adds a high-level summary to the CI checks page:

[screenshot: Jepsen run summary on the CI checks page]
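For reference, the overall shape of the new job is roughly as sketched below. The action references, the id-token permission, the continue-on-error setting, and the teardown step are taken from the workflow in this PR; the runner, the run-tests inputs, and the cluster provisioning details are simplified or omitted:

jepsen:
  needs: docker
  runs-on: ubuntu-latest  # assumed runner
  permissions:
    id-token: write # NB: can obtain OIDC tokens on behalf of this repository!
  steps:
    - uses: restatedev/jepsen/.github/actions/run-tests@reusable
      continue-on-error: true
      # run-tests inputs (workloads, cluster name, ...) omitted here
    - name: Tear down Jepsen cluster ${{ env.CLUSTER_NAME }}
      uses: restatedev/jepsen/.github/actions/teardown@main
      if: always()
      with:
        clusterName: ${{ env.CLUSTER_NAME }}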

@pcholakov pcholakov force-pushed the chore/integrate-jepsen-checks branch from 56b3194 to 2e79431 Compare February 24, 2025 08:18

github-actions bot commented Feb 24, 2025

Test Results

  7 files  ± 0    7 suites  ±0   4m 26s ⏱️ + 1m 8s
 54 tests + 9   53 ✅ + 9  1 💤 ±0  0 ❌ ±0 
223 runs  +49  220 ✅ +49  3 💤 ±0  0 ❌ ±0 

Results for commit ee8cfe6. ± Comparison against base commit b7302a7.

This pull request removes 10 and adds 19 tests. Note that renamed tests count towards both.
dev.restate.sdktesting.tests.CancelInvocation ‑ cancelInvocation(BlockingOperation, Client, URL)[1]
dev.restate.sdktesting.tests.CancelInvocation ‑ cancelInvocation(BlockingOperation, Client, URL)[2]
dev.restate.sdktesting.tests.CancelInvocation ‑ cancelInvocation(BlockingOperation, Client, URL)[3]
dev.restate.sdktesting.tests.Ingress ‑ idempotentInvokeVirtualObject(URL, Client)
dev.restate.sdktesting.tests.KafkaIngress ‑ handleEventInCounterService(URL, int, Client)
dev.restate.sdktesting.tests.KafkaIngress ‑ handleEventInEventHandler(URL, int, Client)
dev.restate.sdktesting.tests.KillInvocation ‑ kill(Client, URL)
dev.restate.sdktesting.tests.PrivateService ‑ privateService(URL, Client)
dev.restate.sdktesting.tests.UpgradeWithInFlightInvocation ‑ inFlightInvocation(Client, URL)
dev.restate.sdktesting.tests.UpgradeWithNewInvocation ‑ executesNewInvocationWithLatestServiceRevisions(Client, URL)
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromAdminAPI(BlockingOperation, Client, URI)[1]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromAdminAPI(BlockingOperation, Client, URI)[2]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromAdminAPI(BlockingOperation, Client, URI)[3]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromContext(BlockingOperation, Client)[1]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromContext(BlockingOperation, Client)[2]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromContext(BlockingOperation, Client)[3]
dev.restate.sdktesting.tests.Combinators ‑ awakeableOrTimeoutUsingAwaitAny(Client)
dev.restate.sdktesting.tests.Combinators ‑ awakeableOrTimeoutUsingAwakeableTimeoutCommand(Client)
dev.restate.sdktesting.tests.Combinators ‑ firstSuccessfulCompletedAwakeable(Client)
dev.restate.sdktesting.tests.Ingress ‑ idempotentInvokeVirtualObject(URI, Client)
…

♻️ This comment has been updated with latest results.

@pcholakov pcholakov force-pushed the chore/integrate-jepsen-checks branch from 1eac129 to 923afd0 Compare February 24, 2025 11:38
@pcholakov pcholakov changed the title from "Add Jepsen tests to CI workflow (main repo branch)" to "Add Jepsen tests to CI workflow" Feb 24, 2025
@pcholakov pcholakov added the jepsen-tests (PRs run Jepsen tests in CI) label Feb 25, 2025
@pcholakov pcholakov marked this pull request as ready for review February 26, 2025 13:32
Comment on lines +99 to +100
# additional features added for CI validation builds only
features: metadata-api
Contributor Author

This is only enabled on the internal Docker artifact we attach to the workflow run, the same one the SDK tests use.
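A hedged sketch of how this could be wired on the calling side; the reusable workflow path below is an assumption for illustration, only the features value itself comes from this PR:

docker:
  uses: ./.github/workflows/docker.yml   # assumed reusable build workflow
  with:
    # additional features added for CI validation builds only
    features: metadata-api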

Comment on lines 236 to 237
jepsen:
  needs: docker
Contributor Author

This test currently runs after docker, just like the SDK tests. At the moment it takes ~3 min, but we can tune it to run shorter or longer. The downside is that there's a single Jepsen worker cluster backing it, so we need to run multiple PRs sequentially. As long as there isn't a big backlog of PRs, it shouldn't add any more time to the overall PR checks duration.

Contributor

What's the problem with running multiple Jepsen test instances concurrently on the Jepsen test cluster? Would we deplete the resources of the cluster? Is it a problem of isolation between different Jepsen runs?

Contributor Author

Great question! The Jepsen control node needs sole control over things like network partitions; it's not that the load is heavy, it's that a Jepsen test assumes it's the only thing futzing with the infrastructure, and we do things like arbitrarily kill processes or inject firewall rules to isolate nodes.

The actual Jepsen clusters are reasonably lightweight though, and we could have multiple of them pretty easily. Spinning up the stack per PR/test is viable; it takes a few minutes at the moment, but the setup can be masked by the docker step. It's a reasonably easy refactor, let me try that.

Contributor Author

Resolved by provisioning a cluster per run. (And tearing it down afterwards.)
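With a dedicated cluster per run, the main thing the workflow has to guarantee is a unique cluster name for each run; a minimal sketch of one way to derive it (the naming scheme is hypothetical, only the CLUSTER_NAME variable itself appears in the workflow):

env:
  # hypothetical naming scheme: one cluster per workflow run attempt
  CLUSTER_NAME: jepsen-${{ github.run_id }}-${{ github.run_attempt }}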

id-token: write # NB: can obtain OIDC tokens on behalf of this repository!
steps:
  - uses: restatedev/jepsen/.github/actions/run-tests@reusable
    continue-on-error: true
Contributor Author

I expect this will be flaky for a while. I believe this will make it report an error without blocking PR merging, but I may be wrong about how this behaves :-)
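For reference, GitHub Actions' documented semantics for a step-level continue-on-error match that expectation: the step's failure is still visible in the run, but the job, and therefore the check on the PR, completes successfully.

    - uses: restatedev/jepsen/.github/actions/run-tests@reusable
      # If this step fails, the failure is recorded against the step, but
      # continue-on-error keeps the job green, so a required check on this
      # job would not block merging.
      continue-on-error: true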

Contributor

You mentioned that you know of one instability. What is causing this instability right now?

Contributor Author

There are actually two that I've seen happen; one seems like a legitimate issue where sometimes the cluster just doesn't come back healthy after the last network partition "healing". Here's an example from last night:

https://github.com/restatedev/jepsen/actions/runs/13555879460/job/37889949723#step:4:2394

I'm hoping I can fix this by tuning delays, but maybe we have a real issue that needs a closer look. This one seems reasonably under control.

The other issue happens sporadically: Jepsen itself crashes during the history-check phase with what appears to be a type mismatch error. As best as I can read the tea leaves, it's likely because I'm returning an event into the history that throws it off; the annoying thing is that when it crashes, it doesn't save the history, so it's hard to debug. I'll need to come up with a better strategy to get to the bottom of this one; so far it has eluded me.

@pcholakov pcholakov force-pushed the chore/integrate-jepsen-checks branch from 9737d1c to 8116cc0 Compare February 27, 2025 04:13
Contributor

@tillrohrmann tillrohrmann left a comment

Thanks for creating this PR @pcholakov. The changes look good to me. I had a question regarding the behavior of concurrency. If I understand things correctly, concurrent jobs will preempt each other if the jepsen job hasn't started yet. Was this your intention?

@pcholakov
Contributor Author

If I understand things correctly, concurrent jobs will preempt each other if the jepsen job hasn't started yet. Was this your intention?

Thanks for flagging this! No, it was definitely not; this property of the concurrency limits is a surprise to me. I don't think we need to limit ourselves to a single cluster, though.
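For context, the mechanism in question looks like the sketch below (the group name is illustrative): with a shared concurrency group and cancel-in-progress disabled, a running job is allowed to finish, but only the newest queued run stays pending, so any older queued run in the group is cancelled.

concurrency:
  group: jepsen-worker-cluster   # illustrative group name
  cancel-in-progress: false      # running jobs finish; older queued runs in the
                                 # group are superseded by the newest pending one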

@pcholakov pcholakov marked this pull request as draft March 1, 2025 10:10
@pcholakov pcholakov force-pushed the chore/integrate-jepsen-checks branch 3 times, most recently from 701a925 to e41b397 Compare March 5, 2025 16:21
@pcholakov pcholakov marked this pull request as ready for review March 6, 2025 14:55
@pcholakov
Contributor Author

OK, I think this is ready for merging at last. The key change since the previous review is that we now provision the test worker cluster on demand, per PR. I have seen one issue that intermittently happens with network partitions: occasionally we still do not recover within 20s of a partition healing. I have removed that checker for Virtual Object invocations only and left a to-do in the Jepsen tests repo to get to the bottom of it; in the meantime, we still benefit from the correctness checks. (The metadata service does not appear to have this problem, FWIW.)

@tillrohrmann tillrohrmann self-requested a review March 6, 2025 15:45
@pcholakov pcholakov force-pushed the chore/integrate-jepsen-checks branch from d7a592b to ee8cfe6 Compare March 7, 2025 11:18
Contributor

@tillrohrmann tillrohrmann left a comment

Thanks for enabling the Jepsen tests to run on PRs and on CI runs @pcholakov. The changes look good to me :-) The one question I have is whether a failing Tear down Jepsen cluster step could leave some of the EC2 instances running. If so, is it possible to configure a maximum lifetime before they get shut down by AWS?

Comment on lines +295 to +299
- name: Tear down Jepsen cluster ${{ env.CLUSTER_NAME }}
  uses: restatedev/jepsen/.github/actions/teardown@main
  if: always()
  with:
    clusterName: ${{ env.CLUSTER_NAME }}
Contributor

Can this fail and leave some AWS EC2 instances running?

Contributor Author

It's absolutely possible, not very likely with the current stack structure, and not something I've seen so far. Definitely something to keep an eye out for with the stack-per-PR approach.

@tillrohrmann
Contributor

I have seen one issue which intermittently happens with network partitions, which is that we occasionally still do not recover in the 20s following a partition healing.

Can we open a release-blocker issue to investigate this problem before the next minor release?

@pcholakov
Contributor Author

I have seen one issue which intermittently happens with network partitions, which is that we occasionally still do not recover in the 20s following a partition healing.

Can we open a release-blocker issue to investigate this problem before the next minor release?

Done! #2906

The one question I have is whether a failing Tear down Jepsen cluster step could leave some of the EC2 instances running. If so, is it possible to configure a maximum lifetime before they get shut down by AWS?

I haven't seen this happen yet, but it's certainly a possibility! There is no built-in AWS feature to do anything like this, but we can tag the stack with a TTL and have a simple background task that prunes expired stacks as a backstop. We have daily AWS billing anomaly reporting and will spot this if it ever goes wrong.

The other wishlist feature I had here is to retain the bucket containing snapshots and metadata when tests fail, which might be useful for investigations. I just wanted to ship what I had so far :-)
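A rough sketch of the TTL backstop idea above; the schedule, job name, and cleanup script are all hypothetical:

on:
  schedule:
    - cron: "0 * * * *"   # hourly sweep (illustrative)

jobs:
  prune-expired-jepsen-stacks:
    runs-on: ubuntu-latest
    steps:
      - name: Delete Jepsen stacks whose TTL tag has expired
        run: ./scripts/prune-expired-jepsen-stacks.sh   # hypothetical cleanup script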

@pcholakov pcholakov merged commit e801f7c into main Mar 14, 2025
82 of 105 checks passed
@pcholakov pcholakov deleted the chore/integrate-jepsen-checks branch March 14, 2025 17:39