action rewinding: recover lost CAS blobs by sluongng · Pull Request #1349 · facebook/buck2

sluongng · 2026-06-28T15:56:08Z

Remote CAS loss could leave deferred builds unable to recover generated
inputs or final outputs without a daemon restart. This matters because
the build graph still contains producer actions that can recreate those
blobs, but Buck was not always turning missing-CAS errors into graph
rewinds.

This is the Buck2 equivalent of Bazel's action-rewinding path. Bazel
added --rewind_lost_inputs in bazelbuild/bazel#25477, distinguishing
single-build action rewinding from build-level retries via
--experimental_remote_cache_eviction_retries. That Bazel work also
called out concurrency-sensitive cases around arbitrary --jobs values
and async cache uploads through --remote_cache_async.

This change tracks lost inputs as a typed batch so the build command can
dirty all producer BuildKey nodes and the consumer in one rewind.
Canonicalize rewind keys through the registered action lookup before
dirtying DICE. This makes dynamic_output redirects invalidate the
producer action that can recreate the missing blob.

When a rewound action is replayed, bypass both Buck action-cache lookups
and the remote executor cache lookup. Remote execution can otherwise
return the same cached ActionResult, leaving the missing CAS blob absent
and causing the consumer to hit the rewind cap.

When local materialization discovers an expired CAS entry, convert the
materializer not-found error into the same typed context. Also treat
default final materialization not-found errors as rewindable, since
materializations = deferred still materializes requested outputs unless
the stricter skip-final mode is selected.

When final materialization and final upload run together, upload can
report a missing CAS blob before materialization cleans the shared
queue. Because those branches run under try_compute2, the upload error
can drop the materialization side before it removes queue_tracker
entries. Clear those entries on the upload-side rewind path as well.
Also clear the per-transaction materialization queue after committing a
rewind, so the retry does not skip outputs that were queued before the
DICE transaction was invalidated.

The first version of this fix was developed and exercised against
BuildBuddy remote execution. For upstream, the tests were rewritten to
use Meta's remote-execution test hooks and a hybrid execution platform
instead of depending on external RE configuration. They cover remote
generated inputs, directory leaves, worker-side missing input reports,
local-only consumers, default final materialization, final upload with
materialization, and a missing-input count above the repeated-rewind
cap.

References:

Remote CAS loss could leave deferred builds unable to recover generated inputs or final outputs without a daemon restart. This matters because the build graph still contains producer actions that can recreate those blobs, but Buck was not always turning missing-CAS errors into graph rewinds. The existing retry covered some upload and forced materialization failures, but missed default final materialization, local executor input materialization, and upload passes with several generated inputs missing at once. Track lost inputs as a typed batch so the build command can dirty all producer BuildKey nodes and the consumer in one rewind. Canonicalize rewind keys through the registered action lookup before dirtying DICE. This makes dynamic_output redirects invalidate the producer action that can recreate the missing blob. When a rewound action is replayed, bypass both Buck action-cache lookups and the remote executor cache lookup. Remote execution can otherwise return the same cached ActionResult, leaving the missing CAS blob absent and causing the consumer to hit the rewind cap. When local materialization discovers an expired CAS entry, convert the materializer not-found error into the same typed context. Also treat default final materialization not-found errors as rewindable, since materializations = deferred still materializes requested outputs unless the stricter skip-final mode is selected. When final materialization and final upload run together, upload can report a missing CAS blob before materialization cleans the shared queue. Because those branches run under try_compute2, the upload error can drop the materialization side before it removes queue_tracker entries. Clear those entries on the upload-side rewind path as well. Also clear the per-transaction materialization queue after committing a rewind, so the retry does not skip outputs that were queued before the DICE transaction was invalidated. The tests use Buck remote-execution test hooks and a hybrid execution platform instead of external RE configuration. They cover remote generated inputs, directory leaves, worker-side missing input reports, local-only consumers, default final materialization, final upload with materialization, and a missing-input count above the repeated-rewind cap.

meta-codesync · 2026-06-28T15:59:33Z

This pull request has been imported. If you are a Meta employee, you can view this in D109988939. (Because this pull request was imported automatically, there will not be any future comments.)

JakobDegen · 2026-06-30T01:02:22Z

This is a really interesting PR. Despite me leaving what may be interpreted as negatively disposed comments below, thank you for making it, regardless of what happens it's a valuable thing for us to consider.

So, I'm going to be fully transparent: In my eyes, anything of this form is more or less cope for a missing feature in the RE API. We have not built something like this internally at Meta for the basic reason that our current approach to handling these errors is good enough for the very rare case they happen. The reason that they are so much more rare for us than they will be for you is because our RE API has explicit TTL support.

Before anything else, I want to confirm: The automatic restart on TTL expiration works in OSS, yes? This PR is motivated by a desire to improve the experience around that?

I think there are a variety of other potential solutions to the OSS versions of this problem that differ significantly in the extent to which they solve the problem, exactly what benefits they bring, what costs they have, and where those costs are localized. I'd like a change like this to come with a more explicit discussion of that solution space and justification for why we're picking the point in it that we are.

I think the first question to ask is this: Why can't OSS RE be amended to support explicit TTLs? Is there something that I am missing that makes that not clearly the best solution? (Note that I write this without having clicked on your links which very well may answer that question).

My second question is basically: What, in practice, is the TTL policy that various RE implementations actually use? Could that TTL policy be adjusted to make this less of a problem? In particular, it looks to me like there are a lot of knobs that could be tweaked without API changes to reduce the impact of this issue without requiring API changes.

When a rewound action is replayed, bypass [...] remote executor cache lookup

This in particular looks like a bit of a red flag to me - even API and knobs aside, it should be relatively easy for an RE implementation to at least keep its TTLs internally consistent; why isn't that happening?

Together, in terms of solution space I think that implies a bunch of alternatives:

This
This but without the cache bypassing on the retry
Fix RE implementations to not do blatantly wrong things like have AC entries pointing to non-existent CAS entries
Have RE implementations internally refresh their TTLs on - say - AC access
Have buck validate the presence of action outputs in CAS on AC hits (potentially in addition to the next item)
Make buck2 download things eagerly but in the background behind a config that can be set for RE providers that are untrusted in terms of TTL handling.
Add TTLs to the RE API
Probably more

scottcao

Review automatically exported from Phabricator review in Meta.

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 28, 2026

sluongng marked this pull request as ready for review June 28, 2026 15:57

scottcao requested changes Jun 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

action rewinding: recover lost CAS blobs#1349

action rewinding: recover lost CAS blobs#1349
sluongng wants to merge 1 commit into
facebook:mainfrom
sluongng:sluongng/action-rewinding-upstream-check

sluongng commented Jun 28, 2026

Uh oh!

meta-codesync Bot commented Jun 28, 2026

Uh oh!

JakobDegen commented Jun 30, 2026

Uh oh!

scottcao left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

sluongng commented Jun 28, 2026

Uh oh!

meta-codesync Bot commented Jun 28, 2026

Uh oh!

JakobDegen commented Jun 30, 2026

Uh oh!

scottcao left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants