cloud/fs concurrency for large files #9893

Open
@dberenbaum

Description

@shcheklein in that case our concurrency level will be jobs * jobs, which is generally going to be way too high in the default case. I also considered splitting jobs between the two (so batch_size=max(1, jobs // 2) and the same for max_concurrency), but that will make us perform a lot worse in the cases where you are pushing a large number of files that are smaller than the chunk size.
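
A minimal sketch of that trade-off, assuming a hypothetical helper name (DVC does not expose a function like this today):

```python
def split_jobs(jobs: int) -> tuple[int, int]:
    """Split the user-facing jobs value between file-level batching and
    per-file chunk concurrency, keeping both halves at least 1."""
    batch_size = max(1, jobs // 2)       # files transferred in parallel
    max_concurrency = max(1, jobs // 2)  # chunks per file in parallel
    return batch_size, max_concurrency


jobs = 8
print(split_jobs(jobs))  # (4, 4): at most 16 chunks in flight
print((jobs, jobs))      # (8, 8): at most 64 chunks in flight, usually far too high
```

For many small files (each smaller than one chunk), the split halves the effective file-level parallelism to jobs // 2 while the chunk-level budget goes unused, which is the regression described above.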

I think it will be worth revisiting this to properly determine what level of concurrency we should be using at both the file and chunk levels, but that depends on the number of files being transferred, the total size of those files, and the chunk size for the given cloud. This is all work that we can do at some point, but in the short term I prioritized getting a fix for the worst-case scenario for Azure (pushing a single large file).
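
A rough sketch of what that balancing could look like, with purely hypothetical names and thresholds; nothing like this exists in DVC or dvc-objects today:

```python
import math


def plan_concurrency(jobs: int, file_sizes: list[int], chunk_size: int) -> tuple[int, int]:
    """Return a hypothetical (batch_size, max_concurrency) pair for a transfer."""
    n_files = len(file_sizes)
    # Average number of chunks each file would be split into for this cloud.
    avg_chunks = math.ceil(sum(file_sizes) / max(1, n_files) / chunk_size) if n_files else 1

    if avg_chunks <= 1:
        # Files smaller than one chunk: chunk-level concurrency buys nothing.
        return max(1, min(jobs, n_files)), 1
    if n_files == 1:
        # A single large file: spend the whole budget on chunk-level concurrency.
        return 1, jobs
    # Mixed case: split the budget so the product stays close to jobs.
    batch_size = max(1, min(n_files, jobs // 2))
    max_concurrency = max(1, jobs // batch_size)
    return batch_size, max_concurrency
```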

Also, any work that we do on this right now would only apply to Azure, since adlfs is currently the only underlying fsspec implementation that actually does concurrent chunked/multipart uploads/downloads. It would be better for us to contribute upstream to make the s3/gcs/etc. implementations support chunk/multipart concurrency first, before we get into trying to make DVC optimize the balance between file- and chunk-level concurrency.

Originally posted by @pmrowla in iterative/dvc-objects#218 (comment)
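
For reference, the chunk-level concurrency that adlfs exposes comes from the Azure Storage SDK it builds on, where multipart parallelism is controlled by `max_concurrency` on `upload_blob`. A minimal illustration, with placeholder connection string, container, and blob names:

```python
from azure.storage.blob import BlobClient

client = BlobClient.from_connection_string(
    conn_str="<connection-string>",
    container_name="my-container",
    blob_name="large-file.bin",
)

with open("large-file.bin", "rb") as fobj:
    # The blob is uploaded in chunks, with up to 8 chunks in flight at once.
    client.upload_blob(fobj, overwrite=True, max_concurrency=8)
```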

Labels

    A: data-sync (Related to dvc get/fetch/import/pull/push)
    fs: gs (Related to the Google Cloud Storage filesystem)
    p3-nice-to-have (It should be done this or next sprint)
