
dvc exp run --temp: Collecting files and computing hashes takes a lot of time #9823

Open
@lefos99

Bug Report

Description

I have the following DVC structure (output of dvc dag):

+------------------------------+         +---------------------+         +--------------------------+         
| data/raw/image_infos.csv.dvc |         | data/raw/images.dvc |         | data/raw/annotations.dvc |         
+------------------------------+         +---------------------+ ********+--------------------------+         
                        ***               ***           *********                                             
                           ***         ***     *********                                                      
                              **     **   *****                                                               
                          +---------------+         +------------------------------------------------------+  
                          | generate_data |         | data/models/pretrained_model_template.state_dict.dvc |  
                          +---------------+*        +------------------------------------------------------+  
                                            ****                   ****                                       
                                                ****           ****                                           
                                                    **       **                                               
                                                    +-------+                                                 
                                                    | train |                                                 
                                                    +-------+   

and my dvc.yaml (simplified version) looks like this:

vars:
  - dvc_model:
      top_folder_dir: data/models/dvc_model

stages:
  generate_data:
    desc: 'Data generation'
    cmd: python3 code/data_processing/generate_data.py
    deps:
      - ${data_processing.paths.image_infos_csv}
      - ${data_processing.paths.raw_data_dir}/images
      - ${data_processing.paths.raw_data_dir}/annotations
      - code/scenario/classes.py
      - ../base/data_processing/generate_data.py
      - ../base/data_processing/annotations.py
      - ../base/data_processing/dataset.py
      - ../base/scenario/classes.py
    params:
      - data_processing.patch_extraction
      - data_processing.calculate_stats_for_helper_classes
    outs:
      - ${data_processing.paths.generated_data_dir}/patches/train:
          cache: true
          push: false
      - ${data_processing.paths.generated_data_dir}/annotations:
          cache: true
          push: false
      - ${data_processing.paths.generated_data_dir}/project_descriptor.json:
          cache: true
          push: false
  train:
    desc: 'Trains a model for k epochs'
    cmd: python3 code/training/run_training.py 
      train.pl_environment.paths.dir_dvc_model_out=${dvc_model.top_folder_dir}
    deps:
      - ${data_processing.paths.generated_data_dir}/patches/train
      - ${data_processing.paths.generated_data_dir}/project_descriptor.json
      - ${data_processing.paths.generated_data_dir}/patches/train/stats_rgb.csv
      - ${train.pl_environment.paths.pretrained_model}
      - ../base/data_processing/dataset.py
      - ../base/training/run_training.py
      - code/training/run_training.py
      - code/scenario/classes.py
    params:
      - train.pl_hparams
      - train.pl_environment.best_model_scraper
      - train.pl_environment.seed
      - train.pl_environment.num_workers
    outs:
      - ${dvc_model.top_folder_dir}/model.pt:
          cache: true

The stage generate_data has a heavy output, ${data_processing.paths.generated_data_dir}/patches/train, which contains a large number of patches (577,331 files). As a result, when I start an isolated experiment (either a queued experiment or a temp experiment), I see two problems:

  1. The operation Collecting files and computing hashes in data/generated_datasets/default/patches/train takes a long time (sometimes as long as 12 minutes). 🐢
  2. The operation Collecting files and computing hashes ... is executed multiple times, and I don't understand why. 🤔
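To get a baseline for how long a plain walk-and-hash pass over this directory takes outside of DVC, a minimal sketch like the one below could be used. It is only an approximation under stated assumptions: it hashes every file with MD5 via hashlib, the helper name walk_and_hash is hypothetical, and it is not how DVC itself collects files or computes hashes.

```python
import hashlib
import os
import time


def walk_and_hash(root, chunk_size=1 << 20):
    """Walk `root`, MD5-hash every file found, and time the whole pass.

    Hypothetical helper for a rough baseline; not DVC's implementation.
    Returns (file_count, total_bytes, elapsed_seconds).
    """
    start = time.perf_counter()
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.md5()
            with open(path, "rb") as fobj:
                # Read in chunks so large files do not have to fit in memory.
                for chunk in iter(lambda: fobj.read(chunk_size), b""):
                    digest.update(chunk)
                    total_bytes += len(chunk)
            file_count += 1
    elapsed = time.perf_counter() - start
    return file_count, total_bytes, elapsed
```

Calling walk_and_hash("data/generated_datasets/default/patches/train") and dividing the file count by the elapsed seconds gives a files-per-second figure that can be compared against the rates shown in the DVC progress bar.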

More precisely, running in verbose mode (-v) shows the following waiting times:

  1. It checks out the generated data to the temp directory (from the local cache). This takes some time, but that is understandable.
  2. Collecting files and computing hashes in data/generated_datasets/default/patches/train - 1st time: (takes ~1 min with 14kfile/s)
.
.
2023-08-09 10:11:24,121 DEBUG: Assuming 'train' to be a stage inside 'dvc.yaml'
2023-08-09 10:11:24,225 DEBUG: Computed stage: 'data/raw/image_infos.csv.dvc' md5: 'None'
'data/raw/image_infos.csv.dvc' didn't change, skipping                                                                                                                                                             
2023-08-09 10:11:24,229 DEBUG: Computed stage: 'data/raw/images.dvc' md5: 'None'
2023-08-09 10:11:24,268 DEBUG: built tree 'object 852322da91aa5beeeb2df0040845c397.dir'                                                                                                                            
'data/raw/images.dvc' didn't change, skipping                                                                                                                                                                      
2023-08-09 10:11:24,275 DEBUG: Computed stage: 'data/raw/annotations.dvc' md5: 'None'
2023-08-09 10:11:24,644 DEBUG: built tree 'object c3550914cbfc0de52b202987e2811746.dir'                                                                                                                            
'data/raw/annotations.dvc' didn't change, skipping                                                                                                                                                                 
2023-08-09 10:11:24,680 DEBUG: built tree 'object 852322da91aa5beeeb2df0040845c397.dir'                                                                                                                            
2023-08-09 10:11:24,917 DEBUG: built tree 'object c3550914cbfc0de52b202987e2811746.dir'                                                                                                                            
Collecting files and computing hashes in data/generated_datasets/default/patches/train 
  3. Collecting files and computing hashes in data/generated_datasets/default/patches/train - 2nd time: (takes 12 mins with 700file/s)
.
.
Stage 'generate_data' didn't change, skipping                                                                                                                                                                      
2023-08-09 10:12:31,275 DEBUG: Computed stage: 'data/models/pretrained_model_template.state_dict.dvc' md5: 'None'
'data/models/pretrained_model_template.state_dict.dvc' didn't change, skipping                                                                                                                                     
Collecting files and computing hashes in data/generated_datasets/default/patches/train
  4. Collecting files and computing hashes in data/generated_datasets/default/patches/train - 3rd time: (takes 1 min with 14kfile/s)
.
.
'data/models/pretrained_model_template.state_dict.dvc' didn't change, skipping                                                                                                                                     
2023-08-09 10:24:58,062 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'                                                                                                                            
2023-08-09 10:24:58,170 DEBUG: Dependency 'data/generated_datasets/default/patches/train' of stage: 'train' changed because it is 'modified'.                                                                      
2023-08-09 10:24:58,172 DEBUG: stage: 'train' changed.
2023-08-09 10:24:58,176 DEBUG: Removing output 'data/models/dvc_model/model.ckpt' of stage: 'train'.
2023-08-09 10:24:58,176 DEBUG: Removing '/fastdata/users/lefos/code/tissue_segmenter/tissue_segmenter/pdl1_gastric/.dvc/tmp/exps/standalone/tmp2_zrngw3/tissue_segmenter/pdl1_gastric/data/models/dvc_model/model.ckpt'
2023-08-09 10:24:58,176 DEBUG: Removing output 'data/models/dvc_model/model.state_dict' of stage: 'train'.
2023-08-09 10:24:58,176 DEBUG: Removing '/fastdata/users/lefos/code/tissue_segmenter/tissue_segmenter/pdl1_gastric/.dvc/tmp/exps/standalone/tmp2_zrngw3/tissue_segmenter/pdl1_gastric/data/models/dvc_model/model.state_dict'
2023-08-09 10:24:58,176 DEBUG: Removing output 'data/models/dvc_model/model.pt' of stage: 'train'.
2023-08-09 10:24:58,177 DEBUG: Removing '/fastdata/users/lefos/code/tissue_segmenter/tissue_segmenter/pdl1_gastric/.dvc/tmp/exps/standalone/tmp2_zrngw3/tissue_segmenter/pdl1_gastric/data/models/dvc_model/model.pt'
2023-08-09 10:24:58,177 DEBUG: Removing output 'data/models/dvc_model/stats_rgb.csv' of stage: 'train'.
2023-08-09 10:24:58,177 DEBUG: Removing '/fastdata/users/lefos/code/tissue_segmenter/tissue_segmenter/pdl1_gastric/.dvc/tmp/exps/standalone/tmp2_zrngw3/tissue_segmenter/pdl1_gastric/data/models/dvc_model/stats_rgb.csv'
2023-08-09 10:24:58,177 DEBUG: Removing output 'data/models/dvc_model/hparams.yaml' of stage: 'train'.
2023-08-09 10:24:58,177 DEBUG: Removing '/fastdata/users/lefos/code/tissue_segmenter/tissue_segmenter/pdl1_gastric/.dvc/tmp/exps/standalone/tmp2_zrngw3/tissue_segmenter/pdl1_gastric/data/models/dvc_model/hparams.yaml'
Collecting files and computing hashes in data/generated_datasets/default/patches/train
  5. Collecting files and computing hashes in data/generated_datasets/default/patches/train - 4th time: (takes ~1 min with 14kfile/s)
.
.
2023-08-09 10:25:43,247 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'                                                                                                                            
2023-08-09 10:28:54,516 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'                                                                                                                            
Collecting files and computing hashes in data/generated_datasets/default/patches/train
  6. Collecting files and computing hashes in data/generated_datasets/default/patches/train - 5th time: (takes ~1 min with 14kfile/s)
.
.                                                                                                                         
2023-08-09 10:29:38,244 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'                                                                                                                            
2023-08-09 10:29:38,341 DEBUG: {'data/generated_datasets/default/patches/train': 'modified'}                                                                                                                       
Collecting files and computing hashes in data/generated_datasets/default/patches/train
  7. Collecting files and computing hashes in data/generated_datasets/default/patches/train - 6th time: (takes ~1 min with 14kfile/s)
.
.                                                                                                                    
Collecting files and computing hashes in data/generated_datasets/default/patches/train                                                                                                   |301k [00:21, 15.5kfile/s]
2023-08-09 10:30:21,875 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'                                                                                                                            
2023-08-09 10:30:30,896 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'                                                                                                                            
Collecting files and computing hashes in data/generated_datasets/default/patches/train
  8. Collecting files and computing hashes in data/generated_datasets/default/patches/train - 7th time: (takes ~1 min with 14kfile/s)
2023-08-09 10:30:21,875 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'                                                                                                                            
2023-08-09 10:30:30,896 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'                                                                                                                            
2023-08-09 10:31:14,382 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'                                                                                                                            
2023-08-09 10:31:23,380 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'                                                                                                                            
Collecting files and computing hashes in data/generated_datasets/default/patches/train
  9. Finally, the train stage is invoked ✔️
2023-08-09 10:32:16,223 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'                                                                                                                            
Running stage 'train':                                                                                                                                                                                             
> python3 code/training/run_training.py train.pl_environment.paths.dir_dvc_model_out=data/models/dvc_model
  10. After train is done, once again Collecting files and computing hashes in data/generated_datasets/default/patches/train (takes ~1 min with 14kfile/s)
[2023-08-09 10:39:15,126][segmenttools.data.copy][WARNING] - Folder data/models/dvc_model already exists. This folder will be replaced!
[2023-08-09 10:39:15,451][segmenttools.inference_pipeline.gpu_availability][INFO] - Released lock for GPU: 1 by process with id: 1916752
[2023-08-09 10:39:15,452][tissue_segmenter.base.training.run_training][INFO] - Training FINISHED with status success. ---
Collecting files and computing hashes in data/generated_datasets/default/patches/train
  11. And again Collecting files and computing hashes in data/generated_datasets/default/patches/train (takes ~1 min with 14kfile/s)
[2023-08-09 10:39:15,452][tissue_segmenter.base.training.run_training][INFO] - Training FINISHED with status success. ---
2023-08-09 10:40:00,411 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'                                                                                                                            
2023-08-09 10:40:09,511 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'                                                                                                                            
2023-08-09 10:40:10,658 DEBUG: Computed stage: 'train' md5: 'e4a099e883959330ae7111a6383d6006'                                                                                                                     
Collecting files and computing hashes in data/generated_datasets/default/patches/train
  12. And again Collecting files and computing hashes in data/generated_datasets/default/patches/train (takes ~1 min with 14kfile/s)
2023-08-09 10:41:03,037 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'                                                                                                                            
Collecting files and computing hashes in data/generated_datasets/default/patches/train
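The waiting times above were read off the verbose log by hand. A rough sketch of how the same numbers could be extracted automatically is shown below. It assumes the log format seen in this report (DEBUG lines starting with a "YYYY-MM-DD HH:MM:SS,mmm" timestamp and progress lines starting with "Collecting files and computing hashes"). The names parse_ts and hash_pass_gaps are hypothetical. Each reported gap also includes any other work DVC did between the two timestamps, so it is an upper bound on the hashing time rather than an exact measurement.

```python
import re
from datetime import datetime

# Timestamp prefix used by the DEBUG lines in the log above.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")
MARKER = "Collecting files and computing hashes"


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    match = TS_RE.match(line)
    if match is None:
        return None
    base = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return base.replace(microsecond=int(match.group(2)) * 1000)


def hash_pass_gaps(lines):
    """For each marker line, return the seconds between the last timestamp
    seen before it and the first timestamp seen after it."""
    gaps = []
    last_ts = None   # most recent timestamp seen so far
    pending = None   # timestamp preceding a marker that is still unresolved
    for line in lines:
        ts = parse_ts(line)
        if ts is not None:
            if pending is not None:
                gaps.append((ts - pending).total_seconds())
                pending = None
            last_ts = ts
        elif line.startswith(MARKER) and last_ts is not None and pending is None:
            pending = last_ts
    return gaps
```

Feeding the saved output of the verbose run to hash_pass_gaps (for example, via open("exp.log").read().splitlines(), where exp.log is a hypothetical file name) yields one gap per hashing pass, and the length of the returned list is the number of passes that were bracketed by timestamps.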

Expected

I would expect the operation Collecting files and computing hashes in data/generated_datasets/default/patches/train not to be invoked so many times within a single experiment run.
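To make the expectation concrete, here is a purely illustrative sketch of the kind of reuse I have in mind: a file is re-read and re-hashed only when its stat signature (mtime, size, inode) has changed since the previous pass. The class DirHashCache and its methods are hypothetical names, the use of MD5 is an assumption, and this is not a description of how DVC works internally or how a fix should be implemented.

```python
import hashlib
import os


class DirHashCache:
    """Hypothetical in-process cache keyed by path and stat signature."""

    def __init__(self):
        self._entries = {}  # path -> ((mtime_ns, size, inode), md5 hex digest)
        self.hashed = 0     # number of files actually read and hashed

    def file_md5(self, path):
        st = os.stat(path)
        signature = (st.st_mtime_ns, st.st_size, st.st_ino)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == signature:
            return cached[1]  # unchanged since the last pass: reuse the digest
        with open(path, "rb") as fobj:
            digest = hashlib.md5(fobj.read()).hexdigest()
        self._entries[path] = (signature, digest)
        self.hashed += 1
        return digest

    def dir_md5(self, root):
        """Combine relative paths and per-file digests in a deterministic order."""
        combined = hashlib.md5()
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # in-place sort makes os.walk visit subdirs in order
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                combined.update(os.path.relpath(path, root).encode())
                combined.update(self.file_md5(path).encode())
        return combined.hexdigest()
```

With such a cache, a second dir_md5 call on an unchanged directory still has to stat every file, but it does not read any file contents again, which is where most of the time goes for a directory with hundreds of thousands of files.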

Environment information

Output of dvc doctor:

$ dvc doctor
-------------------------
Platform: Python 3.8.13 on Linux-5.4.0-153-generic-x86_64-with-glibc2.17
Subprojects:
        dvc_data = 2.11.0
        dvc_objects = 0.24.1
        dvc_render = 0.5.3
        dvc_task = 0.3.0
        scmrepo = 1.0.4
Supports:
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2023.6.0, boto3 = 1.28.17),
        ssh (sshfs = 2023.4.1)
Config:
        Global: /home/deep/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/fastdatagroup-fastdatavolume
Caches: local
Remotes: ssh, s3
Workspace directory: ext4 on /dev/mapper/fastdatagroup-fastdatavolume
Repo: dvc (subdir), git
Repo.site_cache_dir: /var/tmp/dvc/repo/840dbd56066b2803ce384b8868549044

Labels: A: experiments (related to dvc exp), performance (improvement over resource / time consuming tasks)
