Retry build when `RemoteActionFileSystem` encounters a missing digest #25358
base: master
This reverts commit 5cc0347.
@justinhorvitz Could you review the interaction with the action rewinding machinery, including the changes I had to make to
@@ -639,9 +674,13 @@ private SpawnResult handleError(
status = Status.EXECUTION_FAILED_CATASTROPHICALLY;
detailedCode = FailureDetails.Spawn.Code.EXECUTION_FAILED;
catastrophe = true;
} else if (remoteCacheFailed) {
} else if (BulkTransferException.allCausedByCacheNotFoundException(exception)) {
// At this point, cache evictions that affect uploaded inputs have already been handled.
@coeuvre This is a change in behavior, but I think it's for the better as it avoids retries that are very unlikely to succeed.
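
The new branch depends on whether every failure bundled into the exception is a cache miss. A minimal sketch of such a check follows. `BulkFailure` and `MissingBlobException` are hypothetical stand-ins, not Bazel's `BulkTransferException` or `CacheNotFoundException`, and the sketch assumes individual failures are collected as suppressed exceptions.

```java
import java.io.IOException;

final class EvictionCheckSketch {
  /** Hypothetical stand-in for a "blob missing from the remote cache" error. */
  static final class MissingBlobException extends IOException {
    MissingBlobException(String digest) {
      super("Missing digest: " + digest);
    }
  }

  /** Hypothetical stand-in that bundles several transfer failures as suppressed exceptions. */
  static final class BulkFailure extends IOException {
    BulkFailure(Throwable... causes) {
      super("bulk transfer failed");
      for (Throwable cause : causes) {
        addSuppressed(cause);
      }
    }
  }

  /** Returns true only for a non-empty bundle that consists solely of missing blobs. */
  static boolean allCausedByMissingBlobs(Throwable error) {
    if (!(error instanceof BulkFailure)) {
      return false;
    }
    Throwable[] suppressed = error.getSuppressed();
    if (suppressed.length == 0) {
      return false;
    }
    for (Throwable t : suppressed) {
      if (!(t instanceof MissingBlobException)) {
        return false;
      }
    }
    return true;
  }
}
```

With a check of this shape, a bundle that mixes a missing blob with, say, a timeout is not classified as a pure cache eviction.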
# Incremental build in toplevel build triggers remote cache eviction error
# but Bazel doesn't automatically retry the build yet.
# TODO: This documents the current behavior, but it's not intended.
@justinhorvitz To fix this, I would need to thread information about lost inputs obtained in
bazel/src/main/java/com/google/devtools/build/lib/skyframe/CompletionFunction.java
Line 375 in 998e762
ensureToplevelArtifacts(env, importantArtifacts, inputMap);
`ImportantOutputHandler`. I think I roughly understand what that would require, but the two calls to `informImportantOutputHandler` further above and the mention of error bubbling tell me that this is probably pretty difficult to get right. Do you have any advice for me?
Thanks for the PR! I like that you are making build rewinding more similar to action rewinding.
If it is not too difficult to do, I would like you to split this PR into 3 PRs for easier reviews:
- A PR that overhauls the build rewinding mechanism.
- A PR that fixes build rewinding for jdeps.
- A PR that contains the remaining changes in this PR that don't belong to the above two.
ImmutableMap<String, ActionInput> newlyLostInputs =
e.getLostInputs(inputArtifactData::getInput);
if (!newlyLostInputs.isEmpty()) {
if (lostInputs == null) {
Access to `lostInputs` should be synchronized since an action may read its inputs concurrently.
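
One way to address this is to guard the lazily allocated map with a lock. The following is a minimal sketch under stated assumptions: `LostInputsSketch` and its methods are hypothetical, and plain strings stand in for `ActionInput` values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LostInputsSketch {
  private final Object lock = new Object();

  // Lazily allocated so the common no-eviction case costs nothing; guarded by `lock`.
  private Map<String, String> lostInputs;

  /** Records newly lost inputs (digest -> exec path); safe to call from several threads. */
  void addLostInputs(Map<String, String> newlyLostInputs) {
    if (newlyLostInputs.isEmpty()) {
      return;
    }
    synchronized (lock) {
      if (lostInputs == null) {
        lostInputs = new LinkedHashMap<>();
      }
      lostInputs.putAll(newlyLostInputs);
    }
  }

  /** Returns a copy of everything recorded so far and resets the accumulator. */
  Map<String, String> drainLostInputs() {
    synchronized (lock) {
      Map<String, String> snapshot =
          lostInputs == null ? new LinkedHashMap<>() : new LinkedHashMap<>(lostInputs);
      lostInputs = null;
      return snapshot;
    }
  }
}
```

Both the null check and the write happen under the same lock, so two concurrent readers cannot each allocate a map and lose one another's entries.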
throw new CacheNotFoundException(digest, path.getPathString());
throw new CacheNotFoundException(
digest,
context.getSpawnExecutionContext().getPathResolver().relativeToExecRoot(path));
I prefer not to add a new method `relativeToExecRoot` to `ArtifactPathResolver` for only this purpose.
An alternative is to pass in `RemotePathResolver` and use that to convert the local path to an exec path, which is consistent with what you've done in `RemoteExecutionService`.
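
The suggested alternative amounts to injecting a resolver that knows the exec root and asking it for the relative path. Below is a rough sketch of that conversion using `java.nio.file`. `ExecPathSketch` and `toExecPath` are hypothetical names for illustration and do not reflect the actual `RemotePathResolver` API.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class ExecPathSketch {
  private final Path execRoot;

  ExecPathSketch(Path execRoot) {
    this.execRoot = execRoot.normalize();
  }

  /** Converts a local path under the exec root into an exec-root-relative path string. */
  String toExecPath(Path localPath) {
    Path normalized = localPath.normalize();
    if (!normalized.startsWith(execRoot)) {
      throw new IllegalArgumentException(localPath + " is not under " + execRoot);
    }
    // Use forward slashes regardless of platform so the result is stable in messages.
    return execRoot.relativize(normalized).toString().replace('\\', '/');
  }

  public static void main(String[] args) {
    ExecPathSketch resolver = new ExecPathSketch(Paths.get("/work/execroot/main"));
    System.out.println(resolver.toExecPath(Paths.get("/work/execroot/main/bazel-out/bin/a.jar")));
  }
}
```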
throws IOException {
// Stream the virtual action input as parameter files, which can be very large, are lazily
// computed from the in-memory CommandLine object. This avoids allocating large byte arrays.
public static Digest compute(StreamWriter input, HashFunction hashFunction) throws IOException {
It seems like changes in this file are not relevant to the feature the PR aims for, can you split it out?
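
For context, the idea in the hunk above (hashing content as it is written instead of materializing it as a byte array) can be sketched with the JDK alone. This is an illustration under assumptions: the `StreamWriter` interface here is a hypothetical local stand-in, and SHA-256 via `MessageDigest` replaces the `HashFunction` parameter in the diff.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class StreamingDigestSketch {
  /** Hypothetical stand-in for an input that can write itself to a stream on demand. */
  interface StreamWriter {
    void writeTo(OutputStream out) throws IOException;
  }

  /** Hex hash and byte count of the streamed content. */
  static final class Result {
    final String hexHash;
    final long sizeBytes;

    Result(String hexHash, long sizeBytes) {
      this.hexHash = hexHash;
      this.sizeBytes = sizeBytes;
    }
  }

  /** Discards all bytes but remembers how many were written. */
  private static final class CountingSink extends OutputStream {
    long count;

    @Override
    public void write(int b) {
      count++;
    }

    @Override
    public void write(byte[] b, int off, int len) {
      count += len;
    }
  }

  static Result compute(StreamWriter input) throws IOException {
    MessageDigest sha256;
    try {
      sha256 = MessageDigest.getInstance("SHA-256");
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e); // SHA-256 is required on every Java platform.
    }
    CountingSink sink = new CountingSink();
    // The digest is updated chunk by chunk; the full content is never buffered.
    try (DigestOutputStream out = new DigestOutputStream(sink, sha256)) {
      input.writeTo(out);
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : sha256.digest()) {
      hex.append(String.format("%02x", b));
    }
    return new Result(hex.toString(), sink.count);
  }

  public static void main(String[] args) throws IOException {
    Result r = compute(out -> out.write("abc".getBytes(StandardCharsets.UTF_8)));
    System.out.println(r.hexHash + "/" + r.sizeBytes);
  }
}
```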
}
} catch (UncheckedIOException e) {
throw e.getCause();
// The cache value computation is potentially expensive, e.g. when we have to download an input
It seems like changes in this file are not relevant to the feature the PR aims for, can you split it out?
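
The `UncheckedIOException` lines in the hunk above follow a common idiom for passing a checked `IOException` through a cache loader that cannot declare it. Here is a minimal sketch of that idiom, assuming a plain `ConcurrentHashMap` as the cache. `UncheckedTunnelSketch` and `IoLoader` are hypothetical names, not the code under review.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.ConcurrentHashMap;

final class UncheckedTunnelSketch {
  /** Hypothetical loader that may perform I/O, for example downloading an input. */
  interface IoLoader {
    String load(String key) throws IOException;
  }

  private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();

  /** Returns the cached value, computing it at most once per key on success. */
  String get(String key, IoLoader loader) throws IOException {
    try {
      return cache.computeIfAbsent(
          key,
          k -> {
            try {
              return loader.load(k);
            } catch (IOException e) {
              // The mapping function cannot throw checked exceptions, so wrap it.
              throw new UncheckedIOException(e);
            }
          });
    } catch (UncheckedIOException e) {
      // Unwrap so callers see the original checked exception.
      throw e.getCause();
    }
  }
}
```

A failed computation leaves no entry behind, so a later call for the same key runs the loader again.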
@@ -921,6 +940,18 @@ protected void createHardLink(PathFragment linkPath, PathFragment originalPath)
localFs.getPath(linkPath).createHardLink(getPath(originalPath));
}

public void checkForLostInputs(Action action) throws LostInputsActionExecutionException {
I might be missing some dots, but can you explain, before this PR, why Bazel didn't rewind the build? (the CacheNotFoundError was ignored by the call sites?)
Cache evictions encountered during reads of remote files in `RemoteActionFileSystem` now result in the build being retried when `--experimental_remote_cache_eviction_retries` is set to a positive value (the default).

This is enabled by aligning the logic behind "build rewinding" closer with that of "action rewinding". By throwing `LostInputExecException`s in the right locations and implementing the `checkForLostInputs` method on the `RemoteOutputService`, build rewinding can be realized as a coarse-grained fallback for the more fine-grained action rewinding. This approach avoids further divergence between Bazel and Blaze logic and ensures that further improvements to build rewinding also simplify future efforts to add action rewinding to Bazel.

This change also adds a test case that demonstrates how top-level artifacts that have been cache evicted result in a build failure that isn't retried, which needs to be fixed by follow-up work.
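
The coarse-grained fallback described above can be pictured as a bounded retry loop around the whole build. The sketch below is only an illustration of that control flow. `BuildRetrySketch`, `LostInputsSignal`, and `Build` are hypothetical, and `maxRetries` plays the role of the value passed via `--experimental_remote_cache_eviction_retries`.

```java
final class BuildRetrySketch {
  /** Hypothetical signal that inputs were evicted from the remote cache mid-build. */
  static final class LostInputsSignal extends Exception {
    LostInputsSignal(String message) {
      super(message);
    }
  }

  /** Hypothetical whole-build invocation; returns a summary string on success. */
  interface Build {
    String run(int attempt) throws LostInputsSignal;
  }

  /** Reruns the entire build on lost inputs, at most maxRetries additional times. */
  static String runWithRetries(Build build, int maxRetries) throws LostInputsSignal {
    for (int attempt = 0; ; attempt++) {
      try {
        return build.run(attempt);
      } catch (LostInputsSignal e) {
        if (attempt >= maxRetries) {
          throw e; // Retry budget exhausted: surface the failure.
        }
        // Otherwise loop around and rerun the whole build.
      }
    }
  }
}
```

Action rewinding would instead rerun only the actions that produce the lost inputs, which is why the description calls the whole-build retry a fallback.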