Description
Consider a simple C++ binary. It generates two actions:

- Compile (`main.cc` -> `main.o`)
- Link (`main.o` -> `main`)
When you attempt to build `main` with remote execution enabled, the client needs to calculate the action digest of the link action. However, in order to determine the action digest of the link action, you need the digests of all of its input files (`main.o`). To retrieve those, you need to look up the action result of the compile action by its action digest. So, your critical path for a fully remotely cached build looks like this (sketched in code after the list):
- Calculate the digest of `main.cc` locally
- Calculate the action digest of the compile action
- Send `ActionCache.GetActionResult(compile_action_digest)` and wait for the response
- Read the response to get the digest of `main.o`, and use that to calculate the digest of the link action
- Send `ActionCache.GetActionResult(link_action_digest)` and wait for the response
- Read the digest of the `main` binary from the result, then read it from the CAS
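To make the serialization explicit, here is a minimal Python sketch of that lookup chain. The `cache` object, its `get_action_result`/`read_blob` methods, and the `output_digests` field on results are hypothetical stand-ins for a real REAPI client, not an existing API.

```python
import hashlib

def digest(data: bytes) -> str:
    # Content digest of a blob, as used for CAS entries.
    return hashlib.sha256(data).hexdigest()

def action_digest(cmd: str, input_digests: list[str]) -> str:
    # Digest of an action: its command plus the digests of its inputs.
    return digest(repr((cmd, input_digests)).encode())

def fully_cached_build(cache, main_cc: bytes) -> bytes:
    # Step 1: digest the only source file locally.
    main_cc_digest = digest(main_cc)

    # Step 2: compute the compile action's digest from it.
    compile_digest = action_digest("cc -c main.cc -o main.o", [main_cc_digest])

    # Step 3: first round trip; nothing below can start until it returns.
    compile_result = cache.get_action_result(compile_digest)

    # Step 4: only now is main.o's digest known, so only now can the
    # link action's digest be computed.
    main_o_digest = compile_result.output_digests["main.o"]
    link_digest = action_digest("cc main.o -o main", [main_o_digest])

    # Step 5: second round trip, strictly serialized after the first.
    link_result = cache.get_action_result(link_digest)

    # Step 6: fetch the final binary from the CAS.
    return cache.read_blob(link_result.output_digests["main"])
```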
In general, a fully cached build needs to execute O(#targets) `GetActionResult` requests, and the depth of the action graph determines the maximum number of requests that must be serialized.
However, we can make a simple observation: it is a sufficient (but not necessary) condition for two actions to generate the same output that they run the same command over the same set of transitive inputs. More specifically, the same set of transitive *source* inputs (where a source file is any file that is not generated). It is not a necessary condition because, for example, editing a comment in `main.cc` changes the source set, yet the compiler may still emit a bit-identical `main.o`.
We can now define two digest functions for an action. The first is the existing one:

```
action_digest(action) = hash((action.cmd, [input.digest for input in action.inputs]))
```

The second recurses through generating actions instead of using the digests of generated files:

```
transitive_source_action_digest(action) = hash((
    action.cmd,
    [transitive_source_action_digest(input.generating_action) if input.is_generated
     else input.digest
     for input in action.inputs],
))
```
The action digest is extremely accurate, but expensive to calculate, as computing it requires hitting the remote cache many times. The transitive source action digest, on the other hand, isn't particularly accurate (any change to a transitive source invalidates it, even when the intermediate outputs are unchanged), but it is cheap to calculate, since it can be computed fully locally with no network calls.
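To make the comparison concrete, here is a minimal runnable Python sketch of both digest functions over a toy in-memory action graph. The `Action` and `Input` shapes are assumptions for illustration, not part of the REAPI.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Input:
    digest: str = ""                           # Content digest; known locally for sources.
    generating_action: "Action | None" = None  # Set only for generated inputs.

    @property
    def is_generated(self) -> bool:
        return self.generating_action is not None

@dataclass
class Action:
    cmd: str
    inputs: list[Input] = field(default_factory=list)

def _hash(parts) -> str:
    return hashlib.sha256(repr(parts).encode()).hexdigest()

def action_digest(action: Action) -> str:
    # The existing digest. For generated inputs, input.digest is only known
    # after a cache round trip for the generating action.
    return _hash((action.cmd, [i.digest for i in action.inputs]))

def transitive_source_action_digest(action: Action) -> str:
    # The proposed digest. Recurses into generating actions instead of
    # reading generated outputs, so it needs no network calls at all.
    return _hash((
        action.cmd,
        [transitive_source_action_digest(i.generating_action) if i.is_generated
         else i.digest
         for i in action.inputs],
    ))

# Toy graph: compile produces main.o, which the link action consumes.
compile_action = Action("cc -c main.cc", [Input(digest=_hash("main.cc"))])
link_action = Action("cc main.o -o main", [Input(generating_action=compile_action)])
print(transitive_source_action_digest(link_action))  # Computed fully locally.
```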
We can combine the best of both worlds, however, by simply using both. My proposal is:

- We add a field `transitive_source_action_digest` to `UpdateActionResultRequest`.
- We either map `transitive_source_action_digest(action)` to `action_digest(action)`, or attempt to inline it and map it directly to the `ActionResult`.
- We add a field `transitive_source_action_digest` to `GetActionResultRequest`.
  - Could use the existing `action_digest` field instead.
- Allow `GetActionResultRequest` to query by either the `transitive_source_action_digest` field or the `action_digest` field (see the sketch after this list).
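Reusing the digest functions from the sketch above, client-side lookup logic could then look roughly like this. The keyword arguments to `get_action_result` mirror the proposed request fields, and `output_digest_for` is a hypothetical helper; neither exists in the current protocol.

```python
def get_cached_result(cache, action):
    # Fast path: a single lookup keyed by a digest computed entirely
    # locally, with no prior round trips.
    tsad = transitive_source_action_digest(action)
    result = cache.get_action_result(transitive_source_action_digest=tsad)
    if result is not None:
        return result

    # Slow path: the existing lookup. The content digests of generated
    # inputs must be resolved first (serialized round trips) before the
    # action digest can even be computed.
    for inp in action.inputs:
        if inp.is_generated:
            parent_result = get_cached_result(cache, inp.generating_action)
            inp.digest = parent_result.output_digest_for(inp)  # Hypothetical helper.
    return cache.get_action_result(action_digest=action_digest(action))
```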
With this mechanism, suppose you were building a large binary containing many targets. You would now instead do the following in parallel (you could do it sequentially, but it would probably be slower):

- Bottom-up, calculate the transitive source action digests. Once that completes, start querying `GetActionResult` with them, top-down.
- Bottom-up, calculate the action digests as before, as a fallback for any transitive source lookups that miss.
In the case that it was already built at that commit by another user, this would ensure that you can build Chrome with O(1) serialized cache requests, rather than the current behavior, which is somewhere between O(depth) and O(#targets), depending on a variety of factors such as network performance and network parallelism.
This optimization could be implemented without changes to RBE by simply executing two `UpdateActionResultRequest`s each time you execute an action, but supporting it in RBE directly would improve performance and halve the storage space required (a sketch of the protocol-unchanged variant follows). I also mention it here for visibility, in the hope that various build systems will see the idea.
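The protocol-unchanged variant amounts to writing the same result under two keys, which is exactly why native support would halve the storage. A minimal sketch, again with a hypothetical `cache` client:

```python
def update_both(cache, action, result):
    # Store the result under the existing action digest key...
    cache.update_action_result(action_digest(action), result)
    # ...and again under the transitive source key, so a later build can
    # find it with a single, fully local digest computation.
    cache.update_action_result(transitive_source_action_digest(action), result)
```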