Bug Report
- This affects multiple commands (`get`, `fetch`, `pull`, `push`), so I didn't put a tag.
Description
We have added a directory containing 70,000 small images to the Dataset Registry. There is also a tar.gz version of the dataset, which downloads quickly:
```
time dvc get https://github.com/iterative/dataset-registry mnist/images.tar.gz
dvc get https://github.com/iterative/dataset-registry mnist/images.tar.gz  3.41s user 1.36s system 45% cpu 10.411 total
```
When I issue:
```
dvc get https://github.com/iterative/dataset-registry mnist/images
```
I get an ETA of ~16 hours for 70,000 downloads on my VPS.
This is reduced to ~3 hours on my faster local machine. I didn't wait for these to finish, so the real times may differ, but you get the idea.
With `-j 10` it doesn't differ much. `dvc pull` is better; it takes about 20-25 minutes.
(At this point, while I was writing this report, a new version was released, so the rest of it is with 2.4.1 😄)
`dvc pull -j 100` seems to reduce the ETA to 10 minutes. (I waited for `dvc pull -j 100` to finish and it took ~15 minutes.)
I also had this issue while uploading the data in iterative/dataset-registry#18 and we have a discussion there.
Reproduce
```
git clone https://github.com/iterative/dataset-registry
cd dataset-registry
dvc pull mnist/images.dvc
```

or

```
dvc get https://github.com/iterative/dataset-registry mnist/images
```
Expected
We will use this dataset (and a similar fashion-mnist one) in example repositories, so we would like the whole directory to download in an acceptable time (<2 minutes).
Environment information
Output of `dvc doctor`:

Some of this report was produced with 2.3.0, but currently:
```
$ dvc doctor
DVC version: 2.4.1 (pip)
---------------------------------
Platform: Python 3.8.5 on Linux-5.4.0-74-generic-x86_64-with-glibc2.29
Supports: azure, gdrive, gs, hdfs, webhdfs, http, https, s3, ssh, oss
```
Discussion
DVC creates a new `requests.Session` object for each connection, which requires a new HTTP(S) connection for each file. Although the files are small, establishing a new connection for each one takes time.
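To illustrate the overhead, here is a minimal sketch (the URL and request count are made up; this is not DVC's actual code). A fresh `requests.Session` per file repeats the TCP and TLS handshake every time, while one shared session reuses the connection via keep-alive:

```python
import time

import requests

URL = "https://example.com/"  # hypothetical stand-in for one small remote file
N = 20

# New Session per file: repeats the TCP + TLS handshake every time.
start = time.perf_counter()
for _ in range(N):
    with requests.Session() as session:
        session.get(URL)
print(f"new session per file: {time.perf_counter() - start:.2f}s")

# One shared Session: keep-alive reuses the established connection.
start = time.perf_counter()
with requests.Session() as session:
    for _ in range(N):
        session.get(URL)
print(f"shared session:       {time.perf_counter() - start:.2f}s")
```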
HTTP/1.1 has a mechanism (pipelining) to send multiple requests over the same connection, but `requests` doesn't support it.
Note that increasing the number of jobs doesn't make much difference, because servers usually limit the number of connections per IP. Even if you have 100 threads/processes downloading, probably only a small number (~4-8) of them can be connected at a time. (I was banned from AWS once while testing the commands with a large `-j`.)
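As a hedged sketch of why large `-j` values flatten out, here is one way to bound concurrency with a shared `requests` session (the URLs and pool size are illustrative, not DVC's implementation):

```python
import concurrent.futures

import requests
from requests.adapters import HTTPAdapter

# Hypothetical file list; the paths are made up for illustration.
URLS = [f"https://example.com/mnist/images/{i}.png" for i in range(100)]
POOL = 8  # roughly what a server tolerates per IP

session = requests.Session()
# pool_block=True makes extra workers wait for a free connection instead of
# opening (and then discarding) new ones.
session.mount(
    "https://",
    HTTPAdapter(pool_connections=POOL, pool_maxsize=POOL, pool_block=True),
)

def fetch(url: str) -> int:
    return session.get(url).status_code

# 100 workers, but at most POOL connections are ever open at once, so the
# speedup flattens out well before -j 100.
with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
    statuses = list(executor.map(fetch, URLS))
```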
There may be two solutions for this:
- DVC can treat directories as implicit `tar` archives. Instead of a directory containing many files, it would work with a single tar file per directory in the cache and expand it in `checkout`. `tar` and `gzip` are supported by the Python standard library. This probably requires the whole `Repo` class to be updated, though. (See the first sketch below.)
- Instead of `requests`, DVC can use a custom solution or another library like `dugong` that supports HTTP pipelining. I didn't test any HTTP pipelining solution in Python, so I can't vouch for any of them, but this may be better for all asynchronous operations using HTTP(S). (See the second sketch below.)
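For the first option, here is a minimal sketch using only the standard library (`pack_dir` and `expand` are hypothetical names, not DVC API):

```python
import tarfile
from pathlib import Path

def pack_dir(directory: Path, archive: Path) -> None:
    """Store a directory as a single gzipped tar in the cache."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(directory, arcname=directory.name)

def expand(archive: Path, workspace: Path) -> None:
    """Expand the cached archive back into individual files on checkout."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(workspace)

# e.g.:
# pack_dir(Path("mnist/images"), Path(".dvc/cache/images.tar.gz"))
# expand(Path(".dvc/cache/images.tar.gz"), Path("mnist"))
```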
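For the second option, I haven't tested `dugong`, so here is the same idea sketched with `aiohttp` instead. Strictly speaking this is asynchronous connection reuse rather than HTTP/1.1 pipelining, but it overlaps many requests over a few keep-alive connections (again, the URLs are made up):

```python
import asyncio

import aiohttp

# Hypothetical file list; the paths are made up for illustration.
URLS = [f"https://example.com/mnist/images/{i}.png" for i in range(100)]

async def fetch(session: aiohttp.ClientSession, url: str) -> bytes:
    async with session.get(url) as resp:
        return await resp.read()

async def main() -> None:
    # Cap open connections to roughly what a server allows per IP.
    connector = aiohttp.TCPConnector(limit=8)
    async with aiohttp.ClientSession(connector=connector) as session:
        await asyncio.gather(*(fetch(session, url) for url in URLS))

asyncio.run(main())
```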