-
Notifications
You must be signed in to change notification settings - Fork 236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add remote persistent worker support #787
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from BuildBuddy side.
CI failed due to a Github outage yesterday, but I think a rebase + push should retry it.
examples/persistent_worker/README.md
Outdated
export BUILDBUDDY_CONTAINER_USER=... # GitHub user name | ||
export BUILDBUDDY_CONTAINER_PASSWORD=... # GitHub access token |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just want to note that this is optional if the container image you are using is publicly downloadable.
remote_execution_properties = { | ||
"OSFamily": "Linux", | ||
"container-image": image, | ||
"workload-isolation-type": "podman", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: don't need to set isolation type specifically. We(BuildBuddy) may want to change the default isolation type underneath while maintaining backward compatibility. (In fact, we did recently stopped using podman as default isolation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to set it this way because I got a credentials error on image download with the default.
Thanks for the PR! This looks good. We started working on support for this internally but unfortunately it doesn't match the bazel spec exactly. For reasons that aren't entirely clear to me we can't attach the 'worker key' to the platform so it's attached elsewhere on the action. I can't see why we can't support both APIs though and have a default 'bazel mode' for this behavior. The main other difference is that on our end we construct an RE::Action for the worker, upload it and use the digest of that action as the 'worker key', instead of what I assume is requiring that the worker args are a prefix of the action args (in which case the 'worker key' doesn't really seem to matter?). Similarly we can support both in 'bazel mode' though. Just a heads up that there is likely to be some churn around this at some point and I'm slightly concerned we may not have an easy time testing that this does the right thing in all edge cases for 'bazel mode', but I suppose we can deal with that when we get to it. If it's easy, it would be nice to have a github action for testing the remote example. I'm not sure how difficult that is. |
That's great to hear! Yes, I was hoping for something along those lines.
In the Bazel version the worker key is used to associate a given action with a potentially already running worker instance on a remote executor node. But, it is not directly tied to any kind of previously uploaded blob. Bazel calculates a digest of the worker command and its inputs and uses that as a worker key. In this PR I went for the same approach.
That makes sense. I'll look into how to test this on the CI. |
I noticed that the example did not use the I will continue looking into ways to test this feature on CI. |
f179b29
to
f8c20ef
Compare
Just to give a heads up on expected progress here. I'm on leave for the next two weeks and will get back to this when I'm back. I've started making the setup independent on Nix, so that it is easier to integrate with CI here. That's already working locally. |
Hi @aherrmann, @KapJI is looking into setting up a build buddy account. |
d9c7b9b
to
e4e6667
Compare
@christolliday @KapJI I've rebased this PR and added the changes to make it independent of Nix so we no longer require a custom worker image. I've also added a CI test for the persistent worker examples, the test requires a GitHub secret named |
I added |
@KapJI Thank you! That sounds great, I don't think it should require any additional configuration. I'll test and debug the CI setup and let you know if I run into anything. |
One issue I'm encountering is that repository secrets are not exposed to GH actions runs that are initiated from forks (as is the case with this PR) for security reasons (see here). Here's what I've done now:
@KapJI could I ask you to trigger a CI run of this PR from within the Buck2 repo to test the remote execution cases? (After convincing yourself that this PR doesn't do anything dodgy with the token). |
@KapJI has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@aherrmann It's running on #800 |
@KapJI Thanks! Unfortunately, it looks like the token is still not available:
Is |
@aherrmann Sorry for the late reply. I fixed the secret and now it seems to run: https://github.com/facebook/buck2/actions/runs/12300843378/job/34330007436?pr=800
On #800 I also rebased PR to the latest version where anyhow is removed and applied some code formatting. |
62b3f8a
to
1ecb88d
Compare
@KapJI Thank you! I've updated this PR to capture the rebase and formatting fixes. I looked into the CI failure on #800. The issue seems to be that the actions are cached and don't report local or persistent worker execution as expected. I had tested this script with a read-only BuildBuddy token, but it looks like Buck2 CI has a read+write token. I've addressed this issue by adding a "cache buster". An env-var input to the action that is changed each time the test runs. I also tried the I've tested these changes against a read+write BuildBuddy token as well, and the tests passed. Hopefully they'll pass on Buck2 CI as well now. Please let me know if there is anything else I should address before this PR is ready for merge. |
Yes, random cache buster env var is a better choice. |
The linter errors on CI seem unrelated to the changes introduced by this PR. |
I'm merging fixes for failing lints right now. Once those are merged, let's rebase and make sure CI is passing. |
19e80bf
to
28975cb
Compare
@KapJI I've rebased on main. |
CI on main is passing now, can you please rebase again? |
28975cb
to
ba83bf9
Compare
@KapJI I rebased the PR. I also noticed that the |
Requires a repository secret to be set up for the BuildBuddy API key named `BUILDBUDDY_API_KEY`.
The test wants to make sure that the actions are executed correct using either the remote persistent worker or running as individual actions on the remote execution system. Caching interferes with this test. This injects a cache-silo-key that changes each time to force a re-run of the action.
I noticed a discrepancy on external PR GitHub Actions runs vs. upstream main branch GitHub Actions runs: The main branch CI runs on Ubuntu 22.04, while external PRs run on Ubuntu 24.04. This causes CI failures due to version mismatches in the distribution package repository. External PR CI run setup: https://github.com/aherrmann/buck2/actions/runs/12751410326/job/35538421552#step:1:4 Main branch CI run setup: https://github.com/facebook/buck2/actions/runs/12749831677/job/35533176968#step:1:4 External PR CI failure: https://github.com/aherrmann/buck2/actions/runs/12751410326/job/35538421552#step:3:491 Main branch CI success: https://github.com/facebook/buck2/actions/runs/12749831677/job/35533176968#step:3:461
This reverts commit 1693bf1. No longer needed as of facebook#845
44325ec
to
de7f3c0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aherrmann There is some feedback from internal review. Can you address it and come up with better names?
This looks good, the only thing that would be helpful is if the naming made clear that we are talking about Bazel's remote persistent worker protocol, since we're probably going to have to have separate support for our internal RE persistent workers.
I'll stick a couple of suggestions inline, but honestly they could probably be better
app/buck2_build_api/src/interpreter/rule_defs/provider/builtin/worker_info.rs
Outdated
Show resolved
Hide resolved
app/buck2_build_api/src/interpreter/rule_defs/command_executor_config.rs
Outdated
Show resolved
Hide resolved
@KapJI I've implemented the suggested change. I've tweaked the name slightly. I've also rebased the PR again. |
The latest CI failure is the same as on main and unrelated to this PR. |
Summary: Part of #787 Includes an example setup that works with - local builds without persistent worker - local builds with persistent worker (Buck2 protocol) - remote builds without persistent worker The demo worker included in the example in this PR distinguishes between Buck2 worker, Bazel remote worker, and one-shot modes depending on whether Buck2's WORKER_SOCKET, Bazel's --persistent_worker flag, or neither is set. The example includes a README with detailed instructions how to test this feature. - remote builds with persistent worker (Bazel protocol) Reviewed By: scottcao Differential Revision: D68157749 fbshipit-source-id: 51e2e247c75e0ca9736ddc0a5f383e662edee298
Closes #776.
Implements support for persistent workers in remote builds using the Bazel remote execution protocol and the approach documented in the Bazel remote persistent workers proposal:
https://github.com/bazelbuild/proposals/blob/main/designs/2021-03-06-remote-persistent-workers.md
Includes an example setup that works with
The Bazel remote persistent worker protocol includes an automatic fallback in cases where the remote execution system does not yet support persistent workers. To that end actions take the shape
The remote execution system separates worker arguments on the command-line from request arguments in the response file and adds the
--persistent_worker
flag.The demo worker included in the example in this PR distinguishes between Buck2 worker, Bazel remote worker, and one-shot modes depending on whether Buck2's
WORKER_SOCKET
, Bazel's--persistent_worker
flag, or neither is set.The example includes a README with detailed instructions how to test this feature.