Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry build when RemoteActionFileSystem encounters a missing digest #25358

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Feb 21, 2025

Cache evictions encountered during reads of remote files in RemoteActionFileSystem now result in the build being retried when --experimental_remote_cache_eviction_retries is set to a positive value (the default).

This is enabled by aligning the logic behind "build rewinding" closer with that of "action rewinding". By throwing LostInputExecExceptions in the right locations and implementing the checkForLostInputs method on the RemoteOutputService, build rewinding can be realized as a coarse-grained fallback for the more fine-grained action rewinding. This approach avoids further divergence between Bazel and Blaze logic and ensures that further improvements to build rewinding also simplify future efforts to add action rewinding to Bazel.

This change also adds a test case that demonstrates how top-level artifacts that have been cache evicted result in a build failure that isn't retried, which needs to be fixed by follow-up work.

@fmeum fmeum force-pushed the build-rewinding-jdeps branch 3 times, most recently from aae13ad to 7d8cbef Compare February 22, 2025 16:26
@fmeum fmeum force-pushed the build-rewinding-jdeps branch from 7d8cbef to 9cb500c Compare February 24, 2025 12:28
@fmeum fmeum changed the title Build rewinding jdeps Rewind builds when the RemoteActionFileSystem encounters a missing digest Feb 25, 2025
@fmeum fmeum changed the title Rewind builds when the RemoteActionFileSystem encounters a missing digest Retry build when RemoteActionFileSystem encounters a missing digest Feb 25, 2025
@fmeum fmeum marked this pull request as ready for review February 25, 2025 11:56
@fmeum fmeum requested review from a team and lberki as code owners February 25, 2025 11:56
@github-actions github-actions bot added team-Performance Issues for Performance teams team-Rules-Java Issues for Java rules team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Feb 25, 2025
@fmeum fmeum removed the request for review from a team February 25, 2025 11:57
@github-actions github-actions bot added the awaiting-review PR is awaiting review from an assigned reviewer label Feb 25, 2025
@fmeum fmeum removed the request for review from lberki February 25, 2025 11:57
@fmeum
Copy link
Collaborator Author

fmeum commented Feb 25, 2025

@justinhorvitz Could you review the interaction with the action rewinding machinery, including the changes I had to make to RewindingTest?
@coeuvre Could you review the rest, i.e., the integration with build rewinding and the remote module?

@@ -639,9 +674,13 @@ private SpawnResult handleError(
status = Status.EXECUTION_FAILED_CATASTROPHICALLY;
detailedCode = FailureDetails.Spawn.Code.EXECUTION_FAILED;
catastrophe = true;
} else if (remoteCacheFailed) {
} else if (BulkTransferException.allCausedByCacheNotFoundException(exception)) {
// At this point, cache evictions that affect uploaded inputs have already been handled.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coeuvre This is a change in behavior, but I think it's for the better as it avoids retries that are very unlikely to succeed.


# Incremental build in toplevel build triggers remote cache eviction error
# but Bazel doesn't automatically retry the build yet.
# TODO: This documents the current behavior, but it's not intended.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinhorvitz To fix this, I would need to thread information about lost inputs obtained in

ensureToplevelArtifacts(env, importantArtifacts, inputMap);
through to the ImportantOutputHandler. I think I roughly understand what that would require, but the two calls to informImportantOutputHandler further above and the mentioning of error bubbling tell me that this is probably pretty difficult to get right. Do you have any advice for me?

Copy link
Member

@coeuvre coeuvre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I like the direction that you make build rewinding more similar to action rewinding.

If it is not too difficult to do, I would like you to split this PR into 3 PRs for easier reviews:

  1. A PR that overhauls the build rewinding mechanism.
  2. A PR that fixes build rewinding for jdeps.
  3. A PR that contains remaining changes in this PR that don't belong to above 2.

ImmutableMap<String, ActionInput> newlyLostInputs =
e.getLostInputs(inputArtifactData::getInput);
if (!newlyLostInputs.isEmpty()) {
if (lostInputs == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Access to lostInputs should be synchronized since an action may read its inputs concurrently.

throw new CacheNotFoundException(digest, path.getPathString());
throw new CacheNotFoundException(
digest,
context.getSpawnExecutionContext().getPathResolver().relativeToExecRoot(path));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer not to add new method relativeToExecRoot to ArtifactPathResolver for only this purpose.

An alternative is to pass in RemovePathResolver and use that to convert local path to exec path, which is consistent with what you've done in RemoteExecutionService.

throws IOException {
// Stream the virtual action input as parameter files, which can be very large, are lazily
// computed from the in-memory CommandLine object. This avoids allocating large byte arrays.
public static Digest compute(StreamWriter input, HashFunction hashFunction) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like changes in this file are not relevant to the feature the PR aims for, can you split it out?

}
} catch (UncheckedIOException e) {
throw e.getCause();
// The cache value computation is potentially expensive, e.g. when we have to download an input
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like changes in this file are not relevant to the feature the PR aims for, can you split it out?

@@ -921,6 +940,18 @@ protected void createHardLink(PathFragment linkPath, PathFragment originalPath)
localFs.getPath(linkPath).createHardLink(getPath(originalPath));
}

public void checkForLostInputs(Action action) throws LostInputsActionExecutionException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing some dots, but can you explain, before this PR, why Bazel didn't rewind the build? (the CacheNotFoundError was ignored by the call sites?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-review PR is awaiting review from an assigned reviewer team-Performance Issues for Performance teams team-Remote-Exec Issues and PRs for the Execution (Remote) team team-Rules-Java Issues for Java rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants