Description
@shcheklein in that case our concurrency level will be `jobs * jobs`, which is generally going to be way too high in the default case. I also considered splitting `jobs` between the two (so `batch_size = max(1, jobs // 2)` and the same for `max_concurrency`), but that would make us perform a lot worse in the cases where you are pushing a large number of files that are smaller than the chunk size.
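For illustration, here is a minimal sketch of the two options being compared. This is not actual dvc-objects code; only `jobs`, `batch_size`, and `max_concurrency` come from the discussion above, and the helper itself is hypothetical:

```python
def concurrency_options(jobs: int) -> dict:
    """Illustrate the two approaches discussed above (hypothetical helper)."""
    return {
        # current behaviour: `jobs` is used at both the file level and the
        # chunk level, so the worst case is jobs * jobs concurrent requests
        "worst_case_total": jobs * jobs,
        # the considered split: half the workers for files, half for chunks,
        # which hurts throughput when pushing many files smaller than one chunk
        "batch_size": max(1, jobs // 2),
        "max_concurrency": max(1, jobs // 2),
    }


print(concurrency_options(8))
# {'worst_case_total': 64, 'batch_size': 4, 'max_concurrency': 4}
```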
I think it will be worth revisiting this to properly determine what level of concurrency we should be using at both the file and chunk level, but that depends on the number of files being transferred, the sizes of all of those files, and the chunk size for the given cloud. This is all work that we can do at some point, but in the short term I prioritized getting a fix for the worst-case scenario for Azure (pushing a single large file).
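As a rough illustration only (this is not what DVC does today), a heuristic that takes those inputs into account might look something like the sketch below; the function name, signature, and formula are all hypothetical:

```python
def balance_concurrency(
    jobs: int, num_files: int, total_size: int, chunk_size: int
) -> tuple[int, int]:
    """Split `jobs` between file-level and chunk-level concurrency.

    Hypothetical heuristic: favor file-level parallelism for many small
    files, and chunk-level parallelism for a few large multi-chunk files,
    keeping the total roughly bounded by `jobs`.
    """
    avg_file_size = total_size // max(1, num_files)
    avg_chunks_per_file = max(1, avg_file_size // max(1, chunk_size))
    # transfer as many files at once as we usefully can
    batch_size = max(1, min(num_files, jobs))
    # spend any leftover parallelism on chunks within each file
    max_concurrency = max(1, min(avg_chunks_per_file, jobs // batch_size))
    return batch_size, max_concurrency


# single 1 GiB file, 8 jobs, 32 MiB chunks -> all parallelism goes to chunks
print(balance_concurrency(8, 1, 1 << 30, 32 << 20))          # (1, 8)
# 1000 small files -> all parallelism goes to files
print(balance_concurrency(8, 1000, 1000 * 4096, 32 << 20))   # (8, 1)
```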
Also, any work that we do on this right now would only benefit Azure, since adlfs is currently the only underlying fsspec implementation that actually does concurrent chunked/multipart uploads and downloads. It would be better for us to contribute upstream to make the s3/gcs/etc. implementations support chunk/multipart concurrency first, before we get into trying to make DVC optimize the balance between file- and chunk-level concurrency.
Originally posted by @pmrowla in iterative/dvc-objects#218 (comment)