Skip to content

Conversation

@thoughtpolice
Copy link
Contributor

Currently, upload requests are handled in parallel without knowledge of other ongoing requests. If multiple actions depend on the same set of large locally available artifacts, then they will all be uploaded at the same time. This is particularly poor behavior for large files.

Just store in-flight requests in a dashmap, and wait if an upload is already extant.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 7, 2025
@facebook-github-bot
Copy link
Contributor

@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. (Because this pull request was imported automatically, there will not be any future comments.)

@thoughtpolice
Copy link
Contributor Author

This is an up-to-date refile of #750

));
// Mark artifact as uploaded and notify other potentially waiting tasks.
if upload_ret.is_ok() {
prev_uploads.alter(&digest, |_, _| OngoingUploadStatus::Done);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does prev_uploads grown unbounded? was expecting to see a removal here to keep it sized to the inflight digests

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern was with the following race:

  1. Task 1: started an upload of digest ABC
  2. Task 2: finds ABC is missing as the upload hasn't succeeded yet
  3. Task 1: finishes upload, notifies other ongoing uploads of success, removes entry from prev_uploads
  4. Task 2: starts upload, causing it to attempt to reupload

Thinking about this more the issue isn't that bad, as RE should immediately respond to the upload request that it is fully committed. To remove that final upload we could also directly populate REClient::find_missing_cache once the upload is finished and checking REClient::find_missing_cache before starting an upload.

@thoughtpolice thoughtpolice force-pushed the aseipp/push-unoqptxykrwl branch from 21b8a52 to 7165604 Compare May 7, 2025 17:26
@thoughtpolice thoughtpolice force-pushed the aseipp/push-unoqptxykrwl branch 3 times, most recently from 5b93158 to 2b946ad Compare May 9, 2025 14:25
@thoughtpolice thoughtpolice force-pushed the aseipp/push-unoqptxykrwl branch 15 times, most recently from 0eb7b2b to b93b023 Compare May 31, 2025 00:37
@thoughtpolice thoughtpolice force-pushed the aseipp/push-unoqptxykrwl branch from b93b023 to 71c5b28 Compare June 3, 2025 20:19
enum OngoingUploadStatus {
Active(tokio::sync::watch::Receiver<Result<(), ()>>),
Done,
Error,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to store actual error here?

);

enum UploadStatus {
New(tokio::sync::watch::Sender<Result<(), ()>>),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can use oneshot channel? My understanding is we use it only for synchronization

let _ = tx.send(upload_ret.as_ref().map_err(|_| ()).cloned());
} else {
prev_uploads.alter(&digest, |_, _| OngoingUploadStatus::Error);
let _ = tx.send(Err(()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably can send a real error here, otherwise I don't see a point of this error.

@thoughtpolice thoughtpolice force-pushed the aseipp/push-unoqptxykrwl branch 4 times, most recently from 21367ba to ae010d1 Compare June 8, 2025 04:41
@thoughtpolice thoughtpolice force-pushed the aseipp/push-unoqptxykrwl branch 5 times, most recently from f9a76a2 to 292b461 Compare June 17, 2025 19:59
@thoughtpolice thoughtpolice force-pushed the aseipp/push-unoqptxykrwl branch from 292b461 to 5001c5d Compare June 24, 2025 22:28
@thoughtpolice thoughtpolice force-pushed the aseipp/push-unoqptxykrwl branch from 5001c5d to 573850b Compare July 7, 2025 04:12
@thoughtpolice thoughtpolice force-pushed the aseipp/push-unoqptxykrwl branch from 573850b to 4b88a7c Compare July 23, 2025 00:27
@thoughtpolice thoughtpolice force-pushed the aseipp/push-unoqptxykrwl branch 2 times, most recently from 0dad29b to 8488f36 Compare August 4, 2025 15:52
@thoughtpolice thoughtpolice force-pushed the aseipp/push-unoqptxykrwl branch from 8488f36 to 1e5f3db Compare August 7, 2025 21:29
@thoughtpolice thoughtpolice force-pushed the aseipp/push-unoqptxykrwl branch from 1e5f3db to a984853 Compare September 5, 2025 06:07
@thoughtpolice thoughtpolice force-pushed the aseipp/push-unoqptxykrwl branch from a984853 to 189e8bc Compare September 25, 2025 07:39
@thoughtpolice thoughtpolice force-pushed the aseipp/push-unoqptxykrwl branch from 189e8bc to c69c68c Compare October 11, 2025 19:16
@bvinc
Copy link
Contributor

bvinc commented Oct 15, 2025

There is a subtle bug in this PR. All of these remote execution functions are being executed from tasks that can be canceled. When a second task gets a tokio::sync::watch::channel Receiver to wait on the outcome of the first task, you've got a problem: The first task may end up being canceled. I'm guessing that will cause the Sender side to drop and cause the Reciever to get a "channel closed" error. I think the best fix for this would probably be to used shared futures. https://docs.rs/futures/latest/futures/future/trait.FutureExt.html#method.shared

Another interesting observation here is that this PR only deduplicates larger blobs. Smaller blobs will still be uploaded multiple times from multiple ongoing tasks.

Currently, upload requests are handled in parallel without knowledge of
other ongoing requests. If multiple actions depend on the same set of
large locally available artifacts, then they will all be uploaded at
the same time. This is particularly poor behavior for large files.

Just store in-flight requests in a dashmap, and wait if an upload is
already extant.

Authored-by: Hugo van der Wijst <[email protected]>
Signed-off-by: Hugo van der Wijst <[email protected]>
Signed-off-by: Austin Seipp <[email protected]>
@thoughtpolice thoughtpolice force-pushed the aseipp/push-unoqptxykrwl branch from c69c68c to 519a2bd Compare October 25, 2025 19:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants