Add Jepsen tests to CI workflow #2774
Conversation
Force-pushed from 56b3194 to 2e79431
Test Results: 7 files ±0, 7 suites ±0, 4m 26s ⏱️ +1m 8s. Results for commit ee8cfe6. ± Comparison against base commit b7302a7. This pull request removes 10 and adds 19 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results.
Force-pushed from 1eac129 to 923afd0
# additional features added for CI validation builds only
features: metadata-api
This is only enabled on the internal docker artifact we attach to the workflow run - same one that gets used by the SDK tests.
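For context, a minimal sketch of where such a flag might live (the reusable workflow path and its inputs are assumptions for illustration; only the `features: metadata-api` line comes from the diff above):

```yaml
# Hypothetical wiring - the docker workflow name and its inputs are assumed.
docker:
  uses: ./.github/workflows/docker.yml    # assumed reusable build workflow
  with:
    # additional features added for CI validation builds only
    features: metadata-api
```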
.github/workflows/ci.yml (Outdated)
jepsen:
  needs: docker
This test currently runs after `docker`, just like the SDK tests. At the moment it takes ~3 min, but we can tune it to run for shorter or longer. The downside is that there's a single Jepsen worker cluster backing it, so we need to run multiple PRs sequentially. As long as there isn't a big backlog of PRs, it shouldn't add any more time to the overall PR checks duration.
What's the problem with running multiple Jepsen test instances concurrently on the Jepsen test cluster? Would we deplete the resources of the cluster? Is it a problem of isolation between different Jepsen runs?
Great question! The Jepsen control node needs to be solely in control of things like network partitions; it's not that it's a heavy load, it's that the Jepsen test assumes it's the only one futzing with the infrastructure, and we do things like arbitrarily kill processes or inject firewall rules to isolate nodes.
The actual Jepsen clusters are reasonably lightweight though, and we could have multiples of them pretty easily. Spinning up the stack per PR/test is viable - it takes a few minutes at the moment, but the setup can be masked by the `docker` step. It's a reasonably easy refactor, let me try that.
Resolved by provisioning a cluster per run. (And tearing it down afterwards.)
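Roughly, the per-run provisioning/teardown shape could look like the sketch below. The setup action path and the run-ID-based cluster naming are assumptions; the run-tests and teardown steps mirror the snippets quoted elsewhere in this thread:

```yaml
# Sketch: provision a dedicated Jepsen cluster for this run, always tear it down.
env:
  CLUSTER_NAME: jepsen-ci-${{ github.run_id }}   # assumed naming scheme

steps:
  - name: Provision Jepsen cluster ${{ env.CLUSTER_NAME }}
    uses: restatedev/jepsen/.github/actions/setup@main   # hypothetical setup action
    with:
      clusterName: ${{ env.CLUSTER_NAME }}

  - uses: restatedev/jepsen/.github/actions/run-tests@reusable
    continue-on-error: true

  - name: Tear down Jepsen cluster ${{ env.CLUSTER_NAME }}
    uses: restatedev/jepsen/.github/actions/teardown@main
    if: always()   # tear down even if the tests fail or are cancelled
    with:
      clusterName: ${{ env.CLUSTER_NAME }}
```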
.github/workflows/ci.yml (Outdated)
id-token: write # NB: can obtain OIDC tokens on behalf of this repository!
steps:
  - uses: restatedev/jepsen/.github/actions/run-tests@reusable
    continue-on-error: true
I expect this will be flaky for a while - I believe this will make it report an error but not block PR merging. But I may be wrong about how this behaves :-)
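For reference, with step-level `continue-on-error` the step's failure doesn't fail the job; a minimal sketch of surfacing the result anyway (the step id and the reporting step are hypothetical additions, not part of this PR):

```yaml
# Sketch: the step can fail without failing the job; its outcome is still
# recorded and can be echoed into the run summary.
- id: jepsen    # hypothetical step id
  uses: restatedev/jepsen/.github/actions/run-tests@reusable
  continue-on-error: true

- name: Report Jepsen outcome
  if: always()
  run: echo "Jepsen tests outcome: ${{ steps.jepsen.outcome }}" >> "$GITHUB_STEP_SUMMARY"
```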
You mentioned that you know of one instability. What is causing this instability right now?
There are actually two that I've seen happen; one seems like a legitimate issue where sometimes the cluster just doesn't come back healthy after the last network partition "healing". Here's an example from last night:
https://github.com/restatedev/jepsen/actions/runs/13555879460/job/37889949723#step:4:2394
I'm hoping I can fix this by tuning the delays, but maybe we have a real issue that needs a closer look. This one seems reasonably under control.
The other issue happens sporadically: Jepsen itself crashes during the history check phase with what appears to be a type mismatch error. As best as I can read the tea leaves, it's likely because I'm returning an event into the history that throws it off - and the annoying thing is that when it crashes, it doesn't write/save the history, so it's hard to debug. I'll need to come up with a better strategy to get to the bottom of this one; so far it's eluded me.
Force-pushed from 9737d1c to 8116cc0
Thanks for creating this PR @pcholakov. The changes look good to me. I had a question regarding the behavior of `concurrency`. If I understand things correctly, concurrent jobs will preempt each other if the `jepsen` job hasn't started yet. Was this your intention?
Thanks for flagging this! No, it was definitely not - this property of the concurrency limits is a surprise to me. I don't think we need to limit ourselves to just a single cluster though.
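For reference, a sketch of the concurrency setting being discussed (the group key is illustrative; the exact block in this PR may differ):

```yaml
# Sketch: runs in the same concurrency group queue behind an in-progress run,
# and a newly queued run replaces any run still pending in the group - which is
# why a jepsen job that hasn't started yet can be preempted by a later push.
concurrency:
  group: jepsen-${{ github.ref }}   # illustrative group key
  cancel-in-progress: false         # don't cancel a run that has already started
```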
Force-pushed from 701a925 to e41b397
Ok; I think this is ready for merging at last. The key change since the previous review is that we now provision the test worker cluster on-demand, per PR. I have seen one issue which intermittently happens with network partitions: we occasionally still do not recover within the 20s following a partition healing. I have removed that checker for Virtual Object invocations only, and left a to-do in the Jepsen tests repo to get to the bottom of it - in the meantime, we still benefit from the correctness checks. (The metadata service does not appear to have this problem, FWIW.)
Force-pushed from d7a592b to ee8cfe6
Thanks for enabling the Jepsen tests to run on PRs and on CI runs @pcholakov. The changes look good to me :-) The one question I have is whether a failing `Tear down Jepsen cluster` step could leave some of the EC2 instances running. If so, is it possible to configure a maximum lifetime before they get shut down by AWS?
- name: Tear down Jepsen cluster ${{ env.CLUSTER_NAME }}
  uses: restatedev/jepsen/.github/actions/teardown@main
  if: always()
  with:
    clusterName: ${{ env.CLUSTER_NAME }}
Can this fail and leave some AWS EC2 instances running?
It's absolutely possible, not very likely with the current stack structure, and not something I've seen so far. Definitely something to keep an eye out for with the stack-per-PR approach.
Can we open a release-blocker issue to investigate this problem before the next minor release?
Done! #2906
I haven't seen this happen yet, but it's certainly a possibility! There is no built-in AWS feature to do anything like this, but we can tag the stack with a TTL and have a simple background task that prunes expired stacks as a backstop. We have daily AWS billing anomaly reporting and will spot this if it ever goes wrong. The other wishlist feature I had here is to retain the bucket containing snapshots and metadata if tests fail, which might be useful for investigations. I just wanted to ship what I had so far :-)
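As an illustration only (not part of this PR), such a backstop could be a scheduled workflow that deletes CloudFormation stacks whose TTL tag has expired; the tag name, stack-name prefix, and schedule below are all assumptions:

```yaml
# Hypothetical cleanup sketch - assumes Jepsen clusters are CloudFormation stacks
# tagged with an "expires-at" epoch timestamp when they are created.
# (AWS credential setup via OIDC is omitted here.)
name: Prune expired Jepsen clusters
on:
  schedule:
    - cron: "0 * * * *"   # hourly
jobs:
  prune:
    runs-on: ubuntu-latest
    steps:
      - name: Delete stacks past their TTL
        run: |
          now=$(date -u +%s)
          aws cloudformation describe-stacks --output json \
            | jq -r --argjson now "$now" '
                .Stacks[]
                | select(.StackName | startswith("jepsen-ci-"))
                | select(any(.Tags[]? | select(.Key == "expires-at"); (.Value | tonumber) < $now))
                | .StackName' \
            | while read -r stack; do
                echo "Deleting expired Jepsen stack: $stack"
                aws cloudformation delete-stack --stack-name "$stack"
              done
```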
Adds a step to run the https://github.com/restatedev/jepsen suite (and specifically, the `set-vo` and `set-mds` workloads) against open PRs. This is now using the latest Jepsen actions and provisions a worker cluster per run, rather than sharing one with other PRs. This will save us costs and removes the concurrency constraint.
I've managed to eliminate one of the causes of intermittent failures - sometimes the tests weren't waiting long enough for the cluster to get autoprovisioned; there is still one more failure which seems like a legit Restate issue where we don't always recover after a partitioning event. I've seen it maybe in 5% of ad-hoc runs? Warrants further investigation, but I don't think that should block us merging this.
The latest `run-tests` action now also adds a high-level summary to the CI checks page.