Bug Report
Description
I have the following DVC structure (output of dvc dag):
+------------------------------+   +---------------------+   +--------------------------+
| data/raw/image_infos.csv.dvc |   | data/raw/images.dvc |   | data/raw/annotations.dvc |
+------------------------------+   +---------------------+   +--------------------------+
                    ***                       *                       ***
                           ***                *                ***
                                  **          *          **
                                      +---------------+     +------------------------------------------------------+
                                      | generate_data |     | data/models/pretrained_model_template.state_dict.dvc |
                                      +---------------+     +------------------------------------------------------+
                                                  ****                           ****
                                                        ****               ****
                                                              **       **
                                                               +-------+
                                                               | train |
                                                               +-------+
and my dvc.yaml (simplified version) looks like this:
vars:
  - dvc_model:
      top_folder_dir: data/models/dvc_model
stages:
  generate_data:
    desc: 'Data generation'
    cmd: python3 code/data_processing/generate_data.py
    deps:
      - ${data_processing.paths.image_infos_csv}
      - ${data_processing.paths.raw_data_dir}/images
      - ${data_processing.paths.raw_data_dir}/annotations
      - code/scenario/classes.py
      - ../base/data_processing/generate_data.py
      - ../base/data_processing/annotations.py
      - ../base/data_processing/dataset.py
      - ../base/scenario/classes.py
    params:
      - data_processing.patch_extraction
      - data_processing.calculate_stats_for_helper_classes
    outs:
      - ${data_processing.paths.generated_data_dir}/patches/train:
          cache: true
          push: false
      - ${data_processing.paths.generated_data_dir}/annotations:
          cache: true
          push: false
      - ${data_processing.paths.generated_data_dir}/project_descriptor.json:
          cache: true
          push: false
  train:
    desc: 'Trains a model for k epochs'
    cmd: python3 code/training/run_training.py
      train.pl_environment.paths.dir_dvc_model_out=${dvc_model.top_folder_dir}
    deps:
      - ${data_processing.paths.generated_data_dir}/patches/train
      - ${data_processing.paths.generated_data_dir}/project_descriptor.json
      - ${data_processing.paths.generated_data_dir}/patches/train/stats_rgb.csv
      - ${train.pl_environment.paths.pretrained_model}
      - ../base/data_processing/dataset.py
      - ../base/training/run_training.py
      - code/training/run_training.py
      - code/scenario/classes.py
    params:
      - train.pl_hparams
      - train.pl_environment.best_model_scraper
      - train.pl_environment.seed
      - train.pl_environment.num_workers
    outs:
      - ${dvc_model.top_folder_dir}/model.pt:
          cache: true
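To make the link between the two stages explicit, here is a minimal Python sketch that derives stage-to-stage edges from deps and outs. It is only an illustration, not how DVC builds its graph internally: the `STAGES` dictionary is a hand-copied, trimmed subset of the dvc.yaml above, the `${...}` placeholders are treated as opaque path components, and the helper names `is_same_or_inside` and `stage_edges` are hypothetical.

```python
from pathlib import PurePosixPath

# Trimmed, hand-copied subset of the dvc.yaml above (assumption: the
# ${...} placeholders are left unresolved and compared as plain text).
GEN = "${data_processing.paths.generated_data_dir}"
STAGES = {
    "generate_data": {
        "deps": ["${data_processing.paths.image_infos_csv}"],
        "outs": [
            f"{GEN}/patches/train",
            f"{GEN}/annotations",
            f"{GEN}/project_descriptor.json",
        ],
    },
    "train": {
        "deps": [
            f"{GEN}/patches/train",
            f"{GEN}/project_descriptor.json",
            f"{GEN}/patches/train/stats_rgb.csv",
        ],
        "outs": ["${dvc_model.top_folder_dir}/model.pt"],
    },
}


def is_same_or_inside(path, parent):
    """True if `path` equals `parent` or lies somewhere below it."""
    p, q = PurePosixPath(path), PurePosixPath(parent)
    return p == q or q in p.parents


def stage_edges(stages):
    """Return sorted (producer, consumer) pairs linked by a dep/out match."""
    edges = set()
    for consumer, c_spec in stages.items():
        for producer, p_spec in stages.items():
            if consumer == producer:
                continue
            if any(
                is_same_or_inside(dep, out)
                for dep in c_spec["deps"]
                for out in p_spec["outs"]
            ):
                edges.add((producer, consumer))
    return sorted(edges)


print(stage_edges(STAGES))
```

With this subset the only edge is generate_data to train, which matches the lower half of the dvc dag output: the heavy patches/train directory is both an output of generate_data and a dependency of train.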
The stage generate_data has a heavy output, ${data_processing.paths.generated_data_dir}/patches/train, which contains a large number of patches (577,331 files). So, to initiate an isolated experiment (either a queued experiment or a temp experiment):
- the operation Collecting files and computing hashes in data/generated_datasets/default/patches/train takes a lot of time (sometimes even 12 minutes) 🐢
- the operation Collecting files and computing hashes ... is executed multiple times, and I don't understand why. 🤔
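For comparison with the rates DVC reports, the raw cost of walking and hashing such a directory can be measured independently. The following is a rough sketch only: it is not DVC's implementation, the function name `hash_tree` is hypothetical, and it simply MD5-hashes every file under a directory with the standard library and reports the elapsed time.

```python
import hashlib
import os
import time


def hash_tree(root, chunk_size=1 << 20):
    """Walk `root`, MD5-hash every file, return (file_count, seconds)."""
    start = time.perf_counter()
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            digest = hashlib.md5()
            with open(os.path.join(dirpath, name), "rb") as fh:
                # Read in chunks so large files do not need to fit in memory.
                for chunk in iter(lambda: fh.read(chunk_size), b""):
                    digest.update(chunk)
            count += 1
    return count, time.perf_counter() - start


def files_per_second(count, seconds):
    """Throughput in files/s; returns 0.0 when no time was measured."""
    return count / seconds if seconds > 0 else 0.0
```

Calling `hash_tree("data/generated_datasets/default/patches/train")` and passing the result to `files_per_second` gives a baseline to compare against the 14kfile/s and 700file/s figures in the verbose output. The second run on the same directory is usually faster because of the operating system's page cache, so both numbers are worth recording.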
More precisely, by running in verbose mode (-v), I get the following waiting times:
- It checks out the generated data to the temp directory (from the local cache). This takes some time, but that is understandable.
- 1st time: Collecting files and computing hashes in data/generated_datasets/default/patches/train (takes ~1 min at 14kfile/s)
.
.
2023-08-09 10:11:24,121 DEBUG: Assuming 'train' to be a stage inside 'dvc.yaml'
2023-08-09 10:11:24,225 DEBUG: Computed stage: 'data/raw/image_infos.csv.dvc' md5: 'None'
'data/raw/image_infos.csv.dvc' didn't change, skipping
2023-08-09 10:11:24,229 DEBUG: Computed stage: 'data/raw/images.dvc' md5: 'None'
2023-08-09 10:11:24,268 DEBUG: built tree 'object 852322da91aa5beeeb2df0040845c397.dir'
'data/raw/images.dvc' didn't change, skipping
2023-08-09 10:11:24,275 DEBUG: Computed stage: 'data/raw/annotations.dvc' md5: 'None'
2023-08-09 10:11:24,644 DEBUG: built tree 'object c3550914cbfc0de52b202987e2811746.dir'
'data/raw/annotations.dvc' didn't change, skipping
2023-08-09 10:11:24,680 DEBUG: built tree 'object 852322da91aa5beeeb2df0040845c397.dir'
2023-08-09 10:11:24,917 DEBUG: built tree 'object c3550914cbfc0de52b202987e2811746.dir'
Collecting files and computing hashes in data/generated_datasets/default/patches/train
- 2nd time: Collecting files and computing hashes in data/generated_datasets/default/patches/train (takes 12 mins at 700file/s)
.
.
Stage 'generate_data' didn't change, skipping
2023-08-09 10:12:31,275 DEBUG: Computed stage: 'data/models/pretrained_model_template.state_dict.dvc' md5: 'None'
'data/models/pretrained_model_template.state_dict.dvc' didn't change, skipping
Collecting files and computing hashes in data/generated_datasets/default/patches/train
- 3rd time: Collecting files and computing hashes in data/generated_datasets/default/patches/train (takes 1 min at 14kfile/s)
.
.
'data/models/pretrained_model_template.state_dict.dvc' didn't change, skipping
2023-08-09 10:24:58,062 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'
2023-08-09 10:24:58,170 DEBUG: Dependency 'data/generated_datasets/default/patches/train' of stage: 'train' changed because it is 'modified'.
2023-08-09 10:24:58,172 DEBUG: stage: 'train' changed.
2023-08-09 10:24:58,176 DEBUG: Removing output 'data/models/dvc_model/model.ckpt' of stage: 'train'.
2023-08-09 10:24:58,176 DEBUG: Removing '/fastdata/users/lefos/code/tissue_segmenter/tissue_segmenter/pdl1_gastric/.dvc/tmp/exps/standalone/tmp2_zrngw3/tissue_segmenter/pdl1_gastric/data/models/dvc_model/model.ckpt'
2023-08-09 10:24:58,176 DEBUG: Removing output 'data/models/dvc_model/model.state_dict' of stage: 'train'.
2023-08-09 10:24:58,176 DEBUG: Removing '/fastdata/users/lefos/code/tissue_segmenter/tissue_segmenter/pdl1_gastric/.dvc/tmp/exps/standalone/tmp2_zrngw3/tissue_segmenter/pdl1_gastric/data/models/dvc_model/model.state_dict'
2023-08-09 10:24:58,176 DEBUG: Removing output 'data/models/dvc_model/model.pt' of stage: 'train'.
2023-08-09 10:24:58,177 DEBUG: Removing '/fastdata/users/lefos/code/tissue_segmenter/tissue_segmenter/pdl1_gastric/.dvc/tmp/exps/standalone/tmp2_zrngw3/tissue_segmenter/pdl1_gastric/data/models/dvc_model/model.pt'
2023-08-09 10:24:58,177 DEBUG: Removing output 'data/models/dvc_model/stats_rgb.csv' of stage: 'train'.
2023-08-09 10:24:58,177 DEBUG: Removing '/fastdata/users/lefos/code/tissue_segmenter/tissue_segmenter/pdl1_gastric/.dvc/tmp/exps/standalone/tmp2_zrngw3/tissue_segmenter/pdl1_gastric/data/models/dvc_model/stats_rgb.csv'
2023-08-09 10:24:58,177 DEBUG: Removing output 'data/models/dvc_model/hparams.yaml' of stage: 'train'.
2023-08-09 10:24:58,177 DEBUG: Removing '/fastdata/users/lefos/code/tissue_segmenter/tissue_segmenter/pdl1_gastric/.dvc/tmp/exps/standalone/tmp2_zrngw3/tissue_segmenter/pdl1_gastric/data/models/dvc_model/hparams.yaml'
Collecting files and computing hashes in data/generated_datasets/default/patches/train
- 4th time: Collecting files and computing hashes in data/generated_datasets/default/patches/train (takes ~1 min at 14kfile/s)
.
.
2023-08-09 10:25:43,247 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'
2023-08-09 10:28:54,516 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'
Collecting files and computing hashes in data/generated_datasets/default/patches/train
- 5th time: Collecting files and computing hashes in data/generated_datasets/default/patches/train (takes ~1 min at 14kfile/s)
.
.
2023-08-09 10:29:38,244 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'
2023-08-09 10:29:38,341 DEBUG: {'data/generated_datasets/default/patches/train': 'modified'}
Collecting files and computing hashes in data/generated_datasets/default/patches/train
- 6th time: Collecting files and computing hashes in data/generated_datasets/default/patches/train (takes ~1 min at 14kfile/s)
.
.
Collecting files and computing hashes in data/generated_datasets/default/patches/train |301k [00:21, 15.5kfile/s]
2023-08-09 10:30:21,875 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'
2023-08-09 10:30:30,896 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'
Collecting files and computing hashes in data/generated_datasets/default/patches/train
- 7th time: Collecting files and computing hashes in data/generated_datasets/default/patches/train (takes ~1 min at 14kfile/s)
2023-08-09 10:30:21,875 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'
2023-08-09 10:30:30,896 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'
2023-08-09 10:31:14,382 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'
2023-08-09 10:31:23,380 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'
Collecting files and computing hashes in data/generated_datasets/default/patches/train
- Finally, the train stage is invoked ✔️
2023-08-09 10:32:16,223 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'
Running stage 'train':
> python3 code/training/run_training.py train.pl_environment.paths.dir_dvc_model_out=data/models/dvc_model
- After train is done, once again Collecting files and computing hashes in data/generated_datasets/default/patches/train (takes ~1 min at 14kfile/s)
[2023-08-09 10:39:15,126][segmenttools.data.copy][WARNING] - Folder data/models/dvc_model already exists. This folder will be replaced!
[2023-08-09 10:39:15,451][segmenttools.inference_pipeline.gpu_availability][INFO] - Released lock for GPU: 1 by process with id: 1916752
[2023-08-09 10:39:15,452][tissue_segmenter.base.training.run_training][INFO] - Training FINISHED with status success. ---
Collecting files and computing hashes in data/generated_datasets/default/patches/train
- And again Collecting files and computing hashes in data/generated_datasets/default/patches/train (takes ~1 min at 14kfile/s)
[2023-08-09 10:39:15,452][tissue_segmenter.base.training.run_training][INFO] - Training FINISHED with status success. ---
2023-08-09 10:40:00,411 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'
2023-08-09 10:40:09,511 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'
2023-08-09 10:40:10,658 DEBUG: Computed stage: 'train' md5: 'e4a099e883959330ae7111a6383d6006'
Collecting files and computing hashes in data/generated_datasets/default/patches/train
- And again Collecting files and computing hashes in data/generated_datasets/default/patches/train (takes ~1 min at 14kfile/s)
2023-08-09 10:41:03,037 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'
Collecting files and computing hashes in data/generated_datasets/default/patches/train
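The waiting times can also be read directly off the DEBUG timestamps. Below is a small Python sketch that reports every pause longer than a threshold between consecutive timestamped lines. The helper name `gaps` and the 30-second threshold are my own choices; the three sample lines are copied from the verbose output above.

```python
import re
from datetime import datetime

# Matches the leading timestamp of a log line, with or without a "[" prefix.
TS = re.compile(r"^\[?(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def gaps(lines, threshold_s=30.0):
    """Return (seconds, previous_line, next_line) for each long pause."""
    result = []
    prev_time = prev_line = None
    for line in lines:
        match = TS.match(line)
        if not match:
            continue  # progress bars and plain messages carry no timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if prev_time is not None:
            delta = (stamp - prev_time).total_seconds()
            if delta > threshold_s:
                result.append((delta, prev_line, line))
        prev_time, prev_line = stamp, line
    return result


SAMPLE = [
    "2023-08-09 10:11:24,917 DEBUG: built tree 'object c3550914cbfc0de52b202987e2811746.dir'",
    "2023-08-09 10:12:31,275 DEBUG: Computed stage: 'data/models/pretrained_model_template.state_dict.dvc' md5: 'None'",
    "2023-08-09 10:24:58,062 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'",
]

for seconds, before, after in gaps(SAMPLE):
    print(f"{seconds:7.1f} s pause before: {after}")
```

On these three lines the pauses come out at roughly 66 s and 747 s, which lines up with the "~1 min" first scan and the "12 mins" second scan. Feeding a full log saved from the -v run into `gaps` would list every pause in one pass.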
Expected
I would expect the operation Collecting files and computing hashes in data/generated_datasets/default/patches/train not to be invoked so many times!
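As a back-of-the-envelope check, the reported rates and the file count give the cost of a single scan and of all the scans together. This is an estimate under stated assumptions, not a measurement: the figure of ten scans is my count of the entries in the list above (seven before training and three after), with nine of them at 14kfile/s and one at 700file/s.

```python
FILES = 577_331   # patches in patches/train, from the description
FAST_RATE = 14_000  # files/s, the "14kfile/s" scans
SLOW_RATE = 700     # files/s, the slow 2nd scan
FAST_SCANS = 9      # assumption: nine of the ten listed scans were fast
SLOW_SCANS = 1


def scan_seconds(n_files, files_per_second):
    """Time for one pass over n_files at a constant rate."""
    return n_files / files_per_second


fast = scan_seconds(FILES, FAST_RATE)
slow = scan_seconds(FILES, SLOW_RATE)
total = FAST_SCANS * fast + SLOW_SCANS * slow

print(f"one fast scan : {fast:6.1f} s")
print(f"one slow scan : {slow / 60:6.1f} min")
print(f"all ten scans : {total / 60:6.1f} min")
print(f"one fast scan only would save {(total - fast) / 60:.1f} min")
```

The numbers come out at about 41 s per fast scan, about 13.7 min for the slow one, and about 20 min in total, which is in the same range as the roughly 21 minutes between the first log line (10:11:24) and the start of training (10:32:16), even though some of the scans counted here ran after training finished.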
Environment information
Output of dvc doctor:
$ dvc doctor
-------------------------
Platform: Python 3.8.13 on Linux-5.4.0-153-generic-x86_64-with-glibc2.17
Subprojects:
    dvc_data = 2.11.0
    dvc_objects = 0.24.1
    dvc_render = 0.5.3
    dvc_task = 0.3.0
    scmrepo = 1.0.4
Supports:
    http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
    s3 (s3fs = 2023.6.0, boto3 = 1.28.17),
    ssh (sshfs = 2023.4.1)
Config:
    Global: /home/deep/.config/dvc
    System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/fastdatagroup-fastdatavolume
Caches: local
Remotes: ssh, s3
Workspace directory: ext4 on /dev/mapper/fastdatagroup-fastdatavolume
Repo: dvc (subdir), git
Repo.site_cache_dir: /var/tmp/dvc/repo/840dbd56066b2803ce384b8868549044