56 changes: 40 additions & 16 deletions build/bazel/remote/execution/v2/remote_execution.proto
@@ -440,13 +440,21 @@ service ContentAddressableStorage {
option (google.api.http) = { get: "/v2/{instance_name=**}/blobs/{root_digest.hash}/{root_digest.size_bytes}:getTree" };
}

// Split a blob into chunks.
//
// This call splits a blob into chunks, stores the chunks in the CAS, and
// returns a list of the chunk digests. Using this list, a client can check
// which chunks are locally available and just fetch the missing ones. The
// desired blob can be assembled by concatenating the fetched chunks in the
// order of the digests in the list.
// Split retrieves information about how a blob is split into chunks.
Collaborator:
Compared to the original doc, this new version missed the part where the chunks MUST/SHOULD be stored in CAS. It's something obvious to an experienced maintainer, but should be documented to help newer implementations onboard the spec.

//
// This call returns information about how a blob is split into chunks, and
// returns a list of the chunk digests. The server returns a known manifest
// if one exists for the requested blob digest. If no manifest exists, the
// server MAY compute a new one by performing chunking on-demand if the blob
// exists in the CAS, though this should be rare as manifests are typically
// created when blobs are written. The original blob digest does not need to
// be present in the CAS for this call to succeed if a manifest already exists.
Comment on lines +449 to +451

Collaborator:
Suggested change:

```diff
-// exists in the CAS, though this should be rare as manifests are typically
-// created when blobs are written. The original blob digest does not need to
-// be present in the CAS for this call to succeed if a manifest already exists.
+// exists in the CAS.
```

Suggested change:

```diff
-// exists in the CAS, though this should be rare as manifests are typically
-// created when blobs are written. The original blob digest does not need to
-// be present in the CAS for this call to succeed if a manifest already exists.
+// exists in the CAS. The original blob digest does not need to
```

Leave the manifest creation as an implementation detail and avoid discussing it in the spec. That gives the server more freedom to pick the right tradeoffs (i.e. chunking eagerly vs lazily).

The server can keep both the chunked version and the non-chunked version of the same blob to serve both kinds of requests.

//
// Using the returned list of chunk digests, a client can check which chunks
// are locally available and just fetch the missing ones. The desired blob can
// be assembled by concatenating the fetched chunks in the order of the
// digests in the list. ZSTD-compressed chunks can be concatenated without
// decompression.
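The download flow in the comment above (check local availability, fetch only missing chunks, concatenate in manifest order) can be sketched client-side as follows. This is a hedged illustration, not part of the spec: `assemble`, `local_store`, and `fetch_chunk` are hypothetical names, and SHA-256 hex strings stand in for REAPI `Digest` messages.

```python
import hashlib

def assemble(chunk_digests, local_store, fetch_chunk):
    """Concatenate chunks in manifest order, fetching only what is missing."""
    parts = []
    for d in chunk_digests:
        if d not in local_store:
            local_store[d] = fetch_chunk(d)  # download the missing chunk from the CAS
        chunk = local_store[d]
        # Verify each chunk against its digest before using it.
        assert hashlib.sha256(chunk).hexdigest() == d
        parts.append(chunk)
    return b"".join(parts)

# Example: one chunk survived locally from an earlier download, the other is fetched.
chunks = [b"new header of version 2, ", b"shared tail of both versions"]
digests = [hashlib.sha256(c).hexdigest() for c in chunks]
remote = dict(zip(digests, chunks))
local = {digests[1]: chunks[1]}  # the shared chunk is already cached
blob = assemble(digests, local, remote.__getitem__)
assert blob == b"".join(chunks)
```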
Comment on lines +456 to +457

Collaborator:
> ZSTD-compressed chunks can be concatenated without decompression.

I don't think this is correct.

There are a few ways compression can be applied here:

a. Compress the original large blob before chunking; do not compress the chunks.
b. Compress the original large blob, chunk it, then compress the chunks additionally.
c. Chunk the uncompressed large blob, then compress the chunks individually.

From the previous discussions, I think we do not care about (a) and (b) and will mostly be doing (c). Theoretically, (c) should give us a better chunk "hit rate" when using a consistent chunking algorithm (i.e. FastCDC) across different versions of the same large blob, as the compression algorithm can shuffle contents around.

Under (c), each chunk will be compressed individually. To get back the original large blob, one would need to decompress each chunk before concatenating them based on the order in the response.
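Option (c) can be sketched end to end as follows. This is a hedged illustration of the comment above, not spec text: fixed-size chunking stands in for a content-defined chunker such as FastCDC, and `zlib` stands in for whatever per-chunk compressor is actually used.

```python
import hashlib
import zlib

def chunk_fixed(data: bytes, size: int = 64) -> list[bytes]:
    # Fixed-size chunking as a stand-in for a content-defined chunker (FastCDC).
    return [data[i:i + size] for i in range(0, len(data), size)]

blob = bytes(range(256)) * 8  # the "large" original blob

# (c): chunk the uncompressed blob, then compress each chunk individually.
chunks = chunk_fixed(blob)
compressed = [zlib.compress(c) for c in chunks]
manifest = [hashlib.sha256(c).hexdigest() for c in chunks]  # digests of uncompressed chunks

# Restoring the blob: decompress each chunk, then concatenate in manifest order.
restored = b"".join(zlib.decompress(c) for c in compressed)
assert restored == blob
```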

Author:

> I think we do not care about (a) and (b) and will mostly be doing (c).

Yes, agree

> To get back the original large blob, one would need to decompress each chunk before concatenating them based on the order in the response.

One nice property of ZSTD is you can concatenate first and then decompress. This is because with ZSTD: Decompress(Compress(A1) + Compress(A2)) == A1 + A2

So as long as we do (c) for creation of the chunks, the user can get the full file compressed by concatenating the compressed chunks.

I think this is probably also not important to the spec, and I should remove it
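The concatenation property described above can be demonstrated with the standard library. Hedged stand-in: zstd frames and gzip members share this property (independently compressed units can be concatenated as-is and decompressed in one pass); gzip is used here only because it ships with Python, and this says nothing about REAPI's other compressors.

```python
import gzip

a1, a2 = b"chunk one, ", b"chunk two"

# Each chunk is compressed independently; the compressed members are
# concatenated without decompression, and a single decompression pass
# yields the concatenation of the originals.
joined = gzip.compress(a1) + gzip.compress(a2)
assert gzip.decompress(joined) == a1 + a2
```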

Collaborator:

The spec has DEFLATE and BROTLI as well, which may not share that property.

//
// This rpc can be used to reduce the required data to download a large blob
// from CAS if chunks from earlier downloads of a different version of this
@@ -459,6 +467,8 @@ service ContentAddressableStorage {
// 1. The blob chunks are stored in CAS.
// 2. Concatenating the blob chunks in the order of the digest list returned
// by the server results in the original blob.
//
// A client should NOT expect that the original blob is available.


Suggested change:

```diff
-// A client should NOT expect that the original blob is available.
+// A client SHOULD NOT expect that the original blob is available.
```

Collaborator:

Again, I don't think this is true.

It's up to the implementation to decide whether to keep or discard the original blob after splitting. From the client perspective, nothing should change: calling ByteStream.Read or FindMissing over the original blob after splitting SHOULD still yield the original blob.

Keep in mind that we are assuming different types of clients are accessing the same remote cache. Some clients may support Split/Splice APIs, some may NOT. And it's important to remind the server implementations that both types MAY exist at the same time.

//
// Servers which implement this functionality MUST declare that they support
// it by setting the
@@ -491,26 +501,35 @@ service ContentAddressableStorage {
//
// Errors:
//
// * `NOT_FOUND`: The requested blob is not present in the CAS.
// * `NOT_FOUND`: No manifest exists for the blob and the blob is not present
// in the CAS.
// * `RESOURCE_EXHAUSTED`: There is insufficient disk quota to store the blob
// chunks.
rpc SplitBlob(SplitBlobRequest) returns (SplitBlobResponse) {
option (google.api.http) = { get: "/v2/{instance_name=**}/blobs/{blob_digest.hash}/{blob_digest.size_bytes}:splitBlob" };
}

// Splice a blob from chunks.
// Splice tells the CAS how chunks compose a blob.
//
// This is the complementary operation to the
// [ContentAddressableStorage.SplitBlob][build.bazel.remote.execution.v2.ContentAddressableStorage.SplitBlob]
// function to handle the chunked upload of large blobs to save upload
// traffic.
//
// When uploading a large blob using chunked upload, clients MUST first upload
// all chunks to the CAS, then call this RPC to store a manifest that describes
// how those chunks compose the original blob. The chunks referenced in the
// manifest must be available in the CAS before calling this RPC. The original
// blob does not need to be available: the correctness of the manifest can be
// validated from manifest correctness and by verifying that the chunks match
// their specified digests.
Comment on lines +519 to +525

Collaborator:

Suggested change:

```diff
 // When uploading a large blob using chunked upload, clients MUST first upload
 // all chunks to the CAS, then call this RPC to store a manifest that describes
 // how those chunks compose the original blob. The chunks referenced in the
-// manifest must be available in the CAS before calling this RPC. The original
+// manifest SHOULD be available in the CAS before calling this RPC. The original
 // blob does not need to be available: the correctness of the manifest can be
 // validated from manifest correctness and by verifying that the chunks match
 // their specified digests.
```

s/must/SHOULD/. We don't have a transactional guarantee between uploading chunks and calling Splice API. So a remote cache with a very short TTL can evict some of the chunks by the time the client finishes uploading and calls SpliceBlob. In those cases, the client should expect an error response from the server.

> The original blob does not need to be available: the correctness of the manifest can be validated from manifest correctness

Huh? Might need some rewording here

> and by verifying that the chunks match their specified digests.

Keep in mind that SpliceBlobRequest.blob_digest is an optional field.
The client can upload a bunch of chunks and Splice, then use SpliceBlobResponse.blob_digest to continue the build. That does imply that the client "trusts" the server's splicing result, but it will save the client from having to hash the large blob itself.
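The tradeoff above (trusting `SpliceBlobResponse.blob_digest` versus hashing the large blob) is cheaper than it sounds: a client can check the digest by hashing the chunks incrementally in order, without ever materializing the blob. A hedged sketch; `blob_digest` is a hypothetical helper and SHA-256 is assumed as the digest function.

```python
import hashlib

def blob_digest(chunks):
    # Hash the logical concatenation chunk by chunk, without
    # materializing the whole blob in memory.
    h = hashlib.sha256()
    for c in chunks:
        h.update(c)
    return h.hexdigest()

chunks = [b"part-1|", b"part-2|", b"part-3"]
assert blob_digest(chunks) == hashlib.sha256(b"".join(chunks)).hexdigest()
```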

//
// If a client needs to upload a large blob and is able to split a blob into
// chunks in such a way that reusable chunks are obtained, e.g., by means of
// content-defined chunking, it can first determine which parts of the blob
// are already available in the remote CAS and upload the missing chunks, and
// then use this API to instruct the server to splice the original blob from
// the remotely available blob chunks.
// then use this API to store the manifest describing how the chunks compose
// the original blob.
//
// Servers which implement this functionality MUST declare that they support
// it by setting the
@@ -522,17 +541,22 @@ service ContentAddressableStorage {
//
// In order to ensure data consistency of the CAS, the server MUST only add
// blobs to the CAS after verifying their digests. In particular, servers MUST NOT
// trust digests provided by the client. The server MAY accept a request as no-op
// if the client-specified blob is already in CAS; the lifetime of that blob SHOULD
// be extended as usual. If the client-specified blob is not already in the CAS,
// the server SHOULD verify that the digest of the newly created blob matches the
// digest specified by the client, and reject the request if they differ.
// trust digests provided by the client. The server MUST validate the manifest
// and verify that all referenced chunks exist in the CAS and match their
// specified digests. The server MAY optionally verify that concatenating the
// chunks results in a blob matching the original blob digest, particularly if
// the client is not trusted. The server MAY accept a request as no-op if a
// manifest for the client-specified blob already exists; the lifetime of that
// manifest and its chunks SHOULD be extended as usual.
Comment on lines +544 to +550

Collaborator:

> The server MUST validate the manifest and verify that all referenced chunks exist in the CAS and match their specified digests.

This seems to assume that the large blob exists only in the manifest + chunk form, not in its original form. In reality, servers may choose to store both. So this is an implementation detail our spec should avoid taking an opinion on.

> The server MAY optionally verify that concatenating the chunks results in a blob matching the original blob digest, particularly if the client is not trusted.

We already established a few sentences earlier that "the server MUST NOT trust ... the client", so the "if the client is not trusted" part should be implied? This sentence feels unnecessary to me.

> The server MAY accept a request as no-op if a manifest for the client-specified blob already exists; the lifetime of that manifest and its chunks SHOULD be extended as usual.

I know this is from the original text, but I wonder if this should be updated.

In particular, let's say I were to send this splice request to the server.

```
{
  "blob_digest": "<big-blob-does-exist-in-CAS>",
  "chunk_digests": []  // empty - an invalid list of chunks
}
```

What the current text is advising: if blob_digest already exists, then the server MAY accept this request as a no-op. It also implies that if the large blob exists in the manifest + chunks form, the manifest could be updated to the latest list of chunks. This seems wrong.

On one hand, I do think it's nice to have a short circuit way to skip having to verify the concatenated result. But it might be worth reminding the server implementations that the server should never update the underlying manifest without verifying the concatenated result. This would help us disambiguate cases where the clients want to update the chunking result to an improved chunking algorithm / configuration versus cases where malicious actors tried to replace an existing chunk manifest with a malicious one.

Collaborator:

Actually, now that I think about this a little bit more: I think that the server should never let SpliceBlob update an existing large blob manifest.

It's ok to create a new one. In that case, the server MUST verify the concat result.
But never update existing manifests (bumping TTL should be ok).
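The invariant proposed above can be sketched as a toy server. Hedged heavily: `ToyCAS` and `splice` are hypothetical names, an in-memory dict stands in for the CAS, SHA-256 hex strings stand in for `Digest` messages, and this reflects the reviewer's proposal, not the current spec text.

```python
import hashlib
import time

class ToyCAS:
    """Toy in-memory CAS sketching the proposed SpliceBlob invariant."""

    def __init__(self):
        self.chunks = {}     # chunk digest (sha256 hex) -> bytes
        self.manifests = {}  # blob digest -> (ordered chunk digest list, expiry)

    def splice(self, chunk_digests, blob_digest=None):
        # A missing chunk raises KeyError here, standing in for NOT_FOUND.
        data = b"".join(self.chunks[d] for d in chunk_digests)
        # Never trust client digests: verify the concatenated result.
        computed = hashlib.sha256(data).hexdigest()
        if blob_digest is not None and computed != blob_digest:
            raise ValueError("digest mismatch")
        if computed in self.manifests:
            # Never overwrite an existing manifest; only extend its lifetime.
            existing, _ = self.manifests[computed]
            self.manifests[computed] = (existing, time.time() + 3600)
        else:
            self.manifests[computed] = (list(chunk_digests), time.time() + 3600)
        return computed
```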

//
// When blob splitting and splicing is used at the same time, the clients and
// the server SHOULD agree out-of-band upon a chunking algorithm used by both
// parties to benefit from each other's chunk data and avoid unnecessary data
// duplication.
//
// After a successful Splice call, a subsequent Split call for the same blob digest
// SHOULD retrieve the manifest information that was stored.
//
// Errors:
//
// * `NOT_FOUND`: At least one of the blob chunks is not present in the CAS.