Add remote persistent worker support #787

aherrmann · 2024-09-25T14:57:05Z

Closes #776.

Implements support for persistent workers in remote builds using the Bazel remote execution protocol and the approach documented in the Bazel remote persistent workers proposal:
https://github.com/bazelbuild/proposals/blob/main/designs/2021-03-06-remote-persistent-workers.md

Includes an example setup that works with:

- local builds without persistent worker
- local builds with persistent worker (Buck2 protocol)
- remote builds without persistent worker
- remote builds with persistent worker (Bazel protocol)

The Bazel remote persistent worker protocol includes an automatic fallback for cases where the remote execution system does not yet support persistent workers. To that end, actions take the shape

WORKER WORKER_ARGS... @REQUEST_ARGS_FILE

The remote execution system separates worker arguments on the command-line from request arguments in the response file and adds the --persistent_worker flag.
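As a rough illustration of the split described above, the following Python sketch separates an action command line into the worker start command and the request arguments file. The function name `split_worker_command` is hypothetical, and the assumption that the response file is always the last argument is taken from the action shape shown above; a real remote execution system may apply additional rules.

```python
from typing import List, Tuple


def split_worker_command(argv: List[str]) -> Tuple[List[str], str]:
    """Split `WORKER WORKER_ARGS... @REQUEST_ARGS_FILE`.

    Returns the command that would start the persistent worker (with
    `--persistent_worker` appended) and the path of the request arguments
    file. Assumes (hypothetically) that the response file is the final
    argument and is prefixed with `@`.
    """
    if len(argv) < 2 or not argv[-1].startswith("@"):
        raise ValueError("expected WORKER WORKER_ARGS... @REQUEST_ARGS_FILE")
    *worker_cmd, args_file = argv
    # Drop the leading `@` to obtain the file path itself.
    return worker_cmd + ["--persistent_worker"], args_file[1:]


if __name__ == "__main__":
    cmd, request_file = split_worker_command(["worker", "--opt", "@args.txt"])
    print(cmd, request_file)
```

If the remote execution system does not support persistent workers, it can simply run the original command line unchanged, which is what makes the fallback automatic.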

The demo worker included in the example in this PR distinguishes between Buck2 worker, Bazel remote worker, and one-shot modes depending on whether Buck2's WORKER_SOCKET, Bazel's --persistent_worker flag, or neither is set.
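The mode selection described above could look roughly like the sketch below. It is not the demo worker's actual code: the function name `detect_mode`, the returned labels, and the precedence of `WORKER_SOCKET` over `--persistent_worker` are assumptions made for illustration.

```python
import os
import sys
from typing import Mapping, Sequence


def detect_mode(argv: Sequence[str], env: Mapping[str, str]) -> str:
    """Pick the execution mode from the environment and command line."""
    # Buck2 signals local persistent worker mode via WORKER_SOCKET.
    if "WORKER_SOCKET" in env:
        return "buck2-worker"
    # Bazel-style remote persistent workers are started with this flag.
    if "--persistent_worker" in argv:
        return "bazel-remote-worker"
    # Neither is set: run the action once and exit.
    return "one-shot"


if __name__ == "__main__":
    print(detect_mode(sys.argv[1:], os.environ))
```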

The example includes a README with detailed instructions on how to test this feature.

Add example for remote persistent workers
Implement support for remote persistent workers

sluongng

LGTM from BuildBuddy side.

CI failed due to a GitHub outage yesterday, but I think a rebase + push should retry it.

sluongng · 2024-09-26T09:48:19Z

examples/persistent_worker/README.md

+export BUILDBUDDY_CONTAINER_USER=...  # GitHub user name
+export BUILDBUDDY_CONTAINER_PASSWORD=...  # GitHub access token


I just want to note that this is optional if the container image you are using is publicly downloadable.

sluongng · 2024-09-26T09:50:41Z

examples/persistent_worker/platforms/buildbuddy.bzl

+            remote_execution_properties = {
+                "OSFamily": "Linux",
+                "container-image": image,
+                "workload-isolation-type": "podman",


Nit: you don't need to set the isolation type explicitly. We (BuildBuddy) may want to change the default isolation type underneath while maintaining backward compatibility. (In fact, we recently stopped using podman as the default isolation type.)

aherrmann · 2024-09-26

I had to set it this way because I got a credentials error on image download with the default.

christolliday · 2024-10-01T20:55:20Z

Thanks for the PR! This looks good.

We started working on support for this internally but unfortunately it doesn't match the bazel spec exactly. For reasons that aren't entirely clear to me we can't attach the 'worker key' to the platform so it's attached elsewhere on the action.

I can't see why we can't support both APIs though and have a default 'bazel mode' for this behavior.

The main other difference is that on our end we construct an RE::Action for the worker, upload it and use the digest of that action as the 'worker key', instead of what I assume is requiring that the worker args are a prefix of the action args (in which case the 'worker key' doesn't really seem to matter?). Similarly we can support both in 'bazel mode' though.

Just a heads up that there is likely to be some churn around this at some point and I'm slightly concerned we may not have an easy time testing that this does the right thing in all edge cases for 'bazel mode', but I suppose we can deal with that when we get to it.

If it's easy, it would be nice to have a github action for testing the remote example. I'm not sure how difficult that is.

aherrmann · 2024-10-02T07:47:08Z

> I can't see why we can't support both APIs though and have a default 'bazel mode' for this behavior.

That's great to hear! Yes, I was hoping for something along those lines.

> The main other difference is that on our end we construct an RE::Action for the worker, upload it and use the digest of that action as the 'worker key', instead of what I assume is requiring that the worker args are a prefix of the action args (in which case the 'worker key' doesn't really seem to matter?).

In the Bazel version the worker key is used to associate a given action with a potentially already running worker instance on a remote executor node. But, it is not directly tied to any kind of previously uploaded blob. Bazel calculates a digest of the worker command and its inputs and uses that as a worker key. In this PR I went for the same approach.
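To make the idea of a digest-based worker key concrete, here is a minimal Python sketch. It is not Bazel's or this PR's actual algorithm: the function name `worker_key`, the choice of SHA-256, and the byte layout are assumptions. It only shows that the key can be derived from the worker command and its inputs without uploading a separate blob.

```python
import hashlib
from typing import Mapping, Sequence


def worker_key(worker_cmd: Sequence[str], input_digests: Mapping[str, str]) -> str:
    """Derive a stable key from the worker command and its input digests.

    `input_digests` maps input paths to content digests (hypothetical
    representation). Entries are hashed in sorted path order so that the
    key does not depend on dictionary ordering.
    """
    h = hashlib.sha256()
    for arg in worker_cmd:
        h.update(arg.encode("utf-8"))
        h.update(b"\0")  # separator so ["ab", "c"] differs from ["a", "bc"]
    h.update(b"\n")  # boundary between command and inputs
    for path in sorted(input_digests):
        h.update(path.encode("utf-8"))
        h.update(b"\0")
        h.update(input_digests[path].encode("utf-8"))
        h.update(b"\0")
    return h.hexdigest()


if __name__ == "__main__":
    print(worker_key(["worker", "--opt"], {"tools/worker.py": "abc123"}))
```

Two actions that share the same worker command and inputs produce the same key, so an executor can route them to an already running worker instance.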

> Just a heads up that there is likely to be some churn around this at some point and I'm slightly concerned we may not have an easy time testing that this does the right thing in all edge cases for 'bazel mode', [...] If it's easy, it would be nice to have a github action for testing the remote example. I'm not sure how difficult that is.

That makes sense. I'll look into how to test this on the CI.

aherrmann · 2024-10-04T15:24:09Z

I noticed that the example did not use the `WorkerRunInfo` attributes `worker` and `exe` appropriately to distinguish between (remote) persistent worker and non-worker modes, respectively. I've updated the example accordingly. This highlighted that the implementation wrongly used `request.all_args_vec` for the remote persistent worker case, when it should compose `worker.exe` and `request.args` instead. I've updated the implementation accordingly.

I will continue looking into ways to test this feature on CI.

aherrmann · 2024-10-11T09:35:42Z

Just to give a heads up on expected progress here. I'm on leave for the next two weeks and will get back to this when I'm back.

I've started making the setup independent of Nix, so that it is easier to integrate with CI here. That's already working locally.
The other open point is the remote execution system needed to test the remote persistent worker mode. I spoke with BuildBuddy: it would be possible to use BuildBuddy for this CI use case, but it would require setting up a free account for the Meta Buck2 repository and configuring an access token for CI. @christolliday would that be possible for Meta to do? Otherwise, I'd have to look into other options to set up a compatible remote execution system on CI.

christolliday · 2024-10-28T02:51:42Z

Hi @aherrmann, @KapJI is looking into setting up a BuildBuddy account.

aherrmann · 2024-10-29T16:04:36Z

@christolliday @KapJI I've rebased this PR and added the changes to make it independent of Nix, so we no longer require a custom worker image. I've also added a CI test for the persistent worker examples. The test requires a GitHub secret named BUILDBUDDY_API_KEY to hold the BuildBuddy token. I've tested the test script locally, but things may still fail in the GitHub Actions environment. Please let me know when the BuildBuddy account is set up and a token is added as a GitHub secret; then I can test and debug the CI configuration.

KapJI · 2024-10-29T16:18:19Z

I added BUILDBUDDY_API_KEY, which holds my 20-character API token, to our GitHub secrets. Does my BuildBuddy account need any extra setup?

aherrmann · 2024-10-29T16:51:01Z

@KapJI Thank you! That sounds great. I don't think it should require any additional configuration. I'll test and debug the CI setup and let you know if I run into anything.

aherrmann · 2024-10-30T15:44:43Z

One issue I'm encountering is that repository secrets are not exposed to GH actions runs that are initiated from forks (as is the case with this PR) for security reasons (see here). Here's what I've done now:

- I've added a CI step for the persistent worker examples.
- In that step, I check whether the token is available. If not, I skip the remote execution tests and generate a GH Actions notice annotation. So, by default, the remote execution cases will only be checked on main CI or on PRs coming from Meta engineers.
- I've added the workflow_dispatch trigger to the CI workflow. This should allow Meta engineers to manually trigger a CI run that has the token set, e.g. to test that an external PR doesn't break these tests. (This will only be available once the workflow_dispatch trigger is merged.)
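The token check and skip notice described in the list above can be sketched as follows. This is an assumption-laden illustration, not the PR's actual test script: the function names are hypothetical, the token is assumed to be exposed through an environment variable named `BUILDBUDDY_API_KEY`, and the annotation is emitted with the GitHub Actions `::notice` workflow command, whereas the real script may use a different mechanism.

```python
import os
from typing import Mapping, Optional


def remote_tests_enabled(env: Mapping[str, str]) -> bool:
    """Remote execution tests only run when a non-empty token is present."""
    return bool(env.get("BUILDBUDDY_API_KEY", "").strip())


def skip_notice(env: Mapping[str, str]) -> Optional[str]:
    """Return a GitHub Actions notice annotation if the tests are skipped."""
    if remote_tests_enabled(env):
        return None
    return (
        "::notice title=Remote execution tests skipped::"
        "BUILDBUDDY_API_KEY is not available, skipping remote execution tests."
    )


if __name__ == "__main__":
    notice = skip_notice(os.environ)
    if notice is not None:
        # GitHub Actions turns this stdout line into a notice annotation.
        print(notice)
```

Secrets that are unavailable to a workflow run show up as empty strings, which is why the check treats an empty value the same as a missing one.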

@KapJI could I ask you to trigger a CI run of this PR from within the Buck2 repo to test the remote execution cases? (After convincing yourself that this PR doesn't do anything dodgy with the token.)
You can do this by pulling this PR's branch and then pushing it to the facebook/buck2 repo (not onto the main branch, just as a separate branch), something along the lines of `gh pr checkout 787; git push origin persistent-remote-worker`. I'd expect the push trigger to fire at that point; if not, you may need to add a dummy commit.

facebook-github-bot · 2024-10-30T15:54:29Z

@KapJI has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

KapJI · 2024-10-30T16:25:10Z

@aherrmann It's running on #800

aherrmann · 2024-10-31T10:04:48Z

@KapJI Thanks! Unfortunately, it looks like the token is still not available:

- It is empty in the env section.
- The RE tests are skipped.

Is BUILDBUDDY_API_KEY configured as a repository secret? Does gh secret list include it in its output?

aherrmann · 2024-11-11T09:10:40Z

@KapJI friendly ping, did you have a chance to look into the above?

KapJI · 2024-12-12T17:08:01Z

@aherrmann Sorry for the late reply. I fixed the secret and now it seems to run: https://github.com/facebook/buck2/actions/runs/12300843378/job/34330007436?pr=800

Run ./.github/actions/build_example_persistent_worker
  with:
    buildbuddyApiKey: ***

On #800 I also rebased the PR onto the latest version, where anyhow is removed, and applied some code formatting.

aherrmann · 2024-12-18T10:37:52Z

@KapJI Thank you! I've updated this PR to capture the rebase and formatting fixes.

I looked into the CI failure on #800. The issue seems to be that the actions are cached and don't report local or persistent worker execution as expected. I had tested this script with a read-only BuildBuddy token, but it looks like Buck2 CI has a read+write token.

I've addressed this issue by adding a "cache buster": an env-var input to the action that changes each time the test runs. I also tried passing the --no-remote-cache flag to Buck2, but that had the unwanted side effect of completely omitting the action from the buck2 log what-ran output, which the test uses to determine what ran and how.
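A cache buster of this kind can be sketched in a few lines of Python. The variable name `CACHE_BUSTER` and the helper `cache_buster_env` are hypothetical; the point is only that a value which is fresh on every run changes the action's inputs, so the remote cache cannot serve a previous result.

```python
import uuid
from typing import Dict, Mapping


def cache_buster_env(base_env: Mapping[str, str], var_name: str = "CACHE_BUSTER") -> Dict[str, str]:
    """Return a copy of `base_env` with a fresh random value under `var_name`.

    Because the value differs on every call, any action that takes this
    environment variable as an input gets a new action digest each run.
    """
    env = dict(base_env)  # copy, so the caller's mapping is left untouched
    env[var_name] = uuid.uuid4().hex
    return env


if __name__ == "__main__":
    print(cache_buster_env({"PATH": "/usr/bin"}))
```

Unlike disabling the remote cache altogether, this keeps the action on the normal execution path, so it still appears in the build log.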

I've tested these changes against a read+write BuildBuddy token as well, and the tests passed. Hopefully they'll pass on Buck2 CI as well now. Please let me know if there is anything else I should address before this PR is ready for merge.

KapJI · 2024-12-18T10:39:46Z

Yes, a random cache-buster env var is a better choice.

aherrmann · 2024-12-18T12:15:29Z

The linter errors on CI seem unrelated to the changes introduced by this PR.

KapJI · 2024-12-18T15:33:00Z

I'm merging fixes for failing lints right now. Once those are merged, let's rebase and make sure CI is passing.

aherrmann · 2024-12-19T14:49:19Z

@KapJI I've rebased on main.
Linux CI passes.
The macOS build and test pipeline fails with the same failure that exists on main, so that should be unrelated to this PR.
The Windows build examples pipeline also fails with the same failure that exists on main, so that should also be unrelated to this PR.

KapJI · 2025-01-13T11:03:39Z

CI on main is passing now, can you please rebase again?

aherrmann · 2025-01-14T12:22:29Z

@KapJI I rebased the PR. I also noticed that the linux-build-examples step was failing on the PR because it was using a different Ubuntu version than the main branch pipeline. I've aligned that in 44325ec, see commit message for further details.
CI is now green apart from Facebook Internal checks, which I can't see.

The test checks that the actions are executed correctly, either using the remote persistent worker or running as individual actions on the remote execution system. Caching interferes with this test. This injects a cache-silo-key that changes on each run to force a re-run of the action.

I noticed a discrepancy between GitHub Actions runs on external PRs and on the upstream main branch: the main branch CI runs on Ubuntu 22.04, while external PRs run on Ubuntu 24.04. This causes CI failures due to version mismatches in the distribution package repository.

- External PR CI run setup: https://github.com/aherrmann/buck2/actions/runs/12751410326/job/35538421552#step:1:4
- Main branch CI run setup: https://github.com/facebook/buck2/actions/runs/12749831677/job/35533176968#step:1:4
- External PR CI failure: https://github.com/aherrmann/buck2/actions/runs/12751410326/job/35538421552#step:3:491
- Main branch CI success: https://github.com/facebook/buck2/actions/runs/12749831677/job/35533176968#step:3:461

KapJI

@aherrmann There is some feedback from internal review. Can you address it and come up with better names?

> This looks good, the only thing that would be helpful is if the naming made clear that we are talking about Bazel's remote persistent worker protocol, since we're probably going to have to have separate support for our internal RE persistent workers.
> I'll stick a couple of suggestions inline, but honestly they could probably be better

app/buck2_build_api/src/interpreter/rule_defs/provider/builtin/worker_info.rs

app/buck2_build_api/src/interpreter/rule_defs/command_executor_config.rs

aherrmann · 2025-01-20T14:05:22Z

@KapJI I've implemented the suggested change. I've tweaked the name slightly. I've also rebased the PR again.

aherrmann · 2025-01-20T17:03:03Z

The latest CI failure is the same as on main and unrelated to this PR.

Summary:
Part of #787

Includes an example setup that works with
- local builds without persistent worker
- local builds with persistent worker (Buck2 protocol)
- remote builds without persistent worker
- remote builds with persistent worker (Bazel protocol)

The demo worker included in the example in this PR distinguishes between Buck2 worker, Bazel remote worker, and one-shot modes depending on whether Buck2's WORKER_SOCKET, Bazel's --persistent_worker flag, or neither is set.

The example includes a README with detailed instructions how to test this feature.

Reviewed By: scottcao

Differential Revision: D68157749

fbshipit-source-id: 51e2e247c75e0ca9736ddc0a5f383e662edee298

facebook-github-bot · 2025-01-20T20:00:10Z

@KapJI merged this pull request in df48a53.

facebook-github-bot added the CLA Signed label Sep 25, 2024

sluongng reviewed Sep 26, 2024

aherrmann force-pushed the persistent-remote-worker branch from f179b29 to f8c20ef on October 7, 2024 09:04

aherrmann force-pushed the persistent-remote-worker branch from d9c7b9b to e4e6667 on October 29, 2024 14:59

aherrmann mentioned this pull request Nov 12, 2024

feature request - support remote persistent workers #776

Closed

aherrmann force-pushed the persistent-remote-worker branch from 62b3f8a to 1ecb88d on December 18, 2024 10:32

KapJI reviewed Dec 18, 2024

.github/workflows/build-and-test.yml

app/buck2_execute/src/execute/command_executor.rs

app/buck2_execute/src/lib.rs

examples/persistent_worker/.buckconfig

aherrmann force-pushed the persistent-remote-worker branch from 19e80bf to 28975cb on December 19, 2024 13:43

aherrmann force-pushed the persistent-remote-worker branch from 28975cb to ba83bf9 on January 13, 2025 16:15

aherrmann added 14 commits January 20, 2025 10:35

Test persistent worker example on CI

1977ac2

Requires a repository secret to be set up for the BuildBuddy API key named `BUILDBUDDY_API_KEY`.

fix typo

f1fd231

Remove old Nix toolchain configuration file

34f7ade

close GH actions output groups

f7fc5f7

Generate GH actions annotations on missing token

e259d89

Document BuildBuddy token availability

e496262

Enable manual pipeline runs

bf8580e

Fix annotations file path

a4e55d7

align formatting

3b33fb6

Fix extra space

72e35ad

Update persistent worker example extract_archive

d249932

See 779fead.

Revert "Pin linux-build-examples Ubuntu version"

de7f3c0

This reverts commit 1693bf1. No longer needed as of facebook#845

aherrmann force-pushed the persistent-remote-worker branch from 44325ec to de7f3c0 on January 20, 2025 09:36

buck2_error! signature changed

d9a7303

see 15d70a3

KapJI reviewed Jan 20, 2025

app/buck2_build_api/src/interpreter/rule_defs/provider/builtin/worker_info.rs

app/buck2_build_api/src/interpreter/rule_defs/command_executor_config.rs

aherrmann added 3 commits January 20, 2025 14:46

Explicit Bazel remote persistent worker support

19e9583

Adresses facebook#787 (comment)

Explicit Bazel remote persistent worker support

6a3e3dd

Addresses facebook#787 (comment)

Update remote persistent worker example

8204f04

aherrmann added 2 commits January 20, 2025 15:34

fix missing field update

6cee324

update test-case

61b9561

facebook-github-bot closed this in df48a53 Jan 20, 2025

facebook-github-bot added the Merged label Jan 20, 2025

aherrmann deleted the persistent-remote-worker branch January 21, 2025 09:49
