
Improving training speed and CPU usage (on tamia) #31

@plbenveniste

Description


These recommendations were made by the tamia folks. I told them that I was not using the GPUs extensively (training 5 models for 4 GPUs) because my trainings are very CPU-hungry; otherwise, the training would take too long.

Here is their answer:

Well, that's an important clue: the data-loading pipeline is the bottleneck. But there might still be improvements that increase your CPU efficiency, decrease your CPU:GPU ratio, and thus increase your GPU efficiency.

  1. The cheap gain might just be in the software package builds that you use. On TamIA, you have access to dual 24-core (48 cores total) Intel Sapphire Rapids CPUs of almost the latest generation.
    - If you are not using an AVX-512-enabled build of your software, you should; it might give a 2x gain or more for free (a quick check is sketched after this list).
    - But I just don't know what software you have installed and from where. The ticket doesn't have your environment, sbatch script or anything else.
  2. There might be algorithmic changes that increase efficiency.
    - For example, in image pipelines there is usually no sense in performing colour-space conversion and filtering before resizing. If you resize first, the rest of the augmentation pipeline has to handle far fewer pixels (see the resize-first sketch after this list).
    - But every pipeline and workload is different. Consider carefully what you can and can't get away with.
    - Each of the machines has 500 GB of RAM. It might be legitimate for you to preload the dataset once into memory if I/O is the bottleneck, or predecode it once into memory if CPU is, provided you have the space (an in-memory dataset is sketched after this list).
  3. Accidental misconfigurations of multi-processing/multi-threading.
    - It is surprisingly common to forget to set the number of processes spawned and the number of threads within each correctly. A common pathology is that auto-detection sees the full 48 cores of the machine, spawns 48 workers, and each worker again auto-detects 48 cores and thinks it is entitled to 48 threads of its own. You then have far too many threads stepping on each other's toes, and the entire program runs slower than it could because of the constant context switching.
    - The solution is to enforce the right number of threads for the right number of processes. PyTorch has torch.set_num_threads(), NumPy has its own mechanism, OpenCV has its own, MKL and OpenMP software honour the MKL_NUM_THREADS and OMP_NUM_THREADS variables, etc. But you often have to set these yourself and think carefully about their values. For example, if you have 48 cores and want 1 process per GPU, then you want at most 12 threads per process. If for each GPU you spawn 4 data-loader processes, then each one should be given only 48/4/4 = 3 CPU cores (see the thread-pinning sketch after this list).
  4. Alternative data-loading techniques.
    - NVIDIA offers GPU-accelerated image data loading with the DALI framework. It does a lot of the very expensive work on-device using the GPUs' texture samplers (see the DALI sketch after this list).

These should be explored to improve training speed on tamia; some illustrative sketches follow.
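
For point 1, a quick way to check whether the node's CPUs expose AVX-512 and which CPU capability the installed PyTorch build targets (a minimal sketch; it assumes a Linux node with `lscpu` available and that PyTorch is the main framework in use):

```python
import subprocess

import torch

# CPU flags as reported by the kernel; Sapphire Rapids should list avx512f (and friends).
flags = subprocess.check_output(["lscpu"], text=True)
print("AVX-512 in hardware:", "avx512f" in flags.lower())

# PyTorch prints its build configuration, including the CPU capability its kernels dispatch to.
print(torch.__config__.show())
```

If the hardware reports AVX-512 but the installed build does not use it, switching to a wheel or module built with AVX-512 support is the "free" gain mentioned above.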
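
For point 2, a sketch of the resize-first ordering using torchvision transforms (the specific augmentations are illustrative, not the project's actual pipeline):

```python
import torchvision.transforms as T

# Resize early so every subsequent CPU-bound op touches far fewer pixels.
fast_pipeline = T.Compose([
    T.Resize(256),                  # shrink first: downstream ops see ~256x256 images, not full resolution
    T.RandomCrop(224),
    T.ColorJitter(0.2, 0.2, 0.2),   # colour-space work now runs on the small crop
    T.GaussianBlur(kernel_size=3),  # filtering likewise
    T.ToTensor(),
])
```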
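
Also for point 2, a minimal sketch of predecoding the dataset into RAM once, assuming it fits in the ~500 GB available (the Dataset shown is hypothetical and kept deliberately simple):

```python
from PIL import Image
from torch.utils.data import Dataset


class InMemoryImageDataset(Dataset):
    """Decode every image once at construction time and keep the decoded images in RAM."""

    def __init__(self, paths, labels, transform=None):
        # Decoding happens here, once, instead of in every __getitem__ of every epoch.
        self.images = [Image.open(p).convert("RGB") for p in paths]
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = self.images[idx]
        if self.transform is not None:
            img = self.transform(img)
        return img, self.labels[idx]
```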
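
For point 3, a sketch of pinning thread counts explicitly, using the 48 cores / 4 GPUs / 4 loader workers split from the answer (the dataset and batch size are placeholders; the environment variables must be set before the heavy imports):

```python
import os

# Set these before importing numpy/torch, otherwise OpenMP/MKL may already
# have sized their thread pools from the 48 visible cores.
os.environ["OMP_NUM_THREADS"] = "3"
os.environ["MKL_NUM_THREADS"] = "3"

import torch
from torch.utils.data import DataLoader, TensorDataset

torch.set_num_threads(3)  # 48 cores / 4 GPUs / 4 loader workers = 3 threads per process

# Placeholder dataset just to make the example runnable.
dataset = TensorDataset(torch.randn(64, 3, 224, 224), torch.zeros(64, dtype=torch.long))

loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=4,            # 4 data-loader processes for this training process / GPU
    pin_memory=True,
    persistent_workers=True,  # avoid respawning workers every epoch
)
```

If OpenCV is part of the pipeline, cv2.setNumThreads() deserves the same treatment.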
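
For point 4, a minimal DALI pipeline sketch that moves JPEG decoding and resizing to the GPU (the file_root path, sizes, and batch size are placeholders; this assumes the nvidia-dali package is installed):

```python
from nvidia.dali import fn, pipeline_def, types


@pipeline_def(batch_size=32, num_threads=3, device_id=0)
def image_pipeline():
    # Read encoded JPEGs on the CPU, then decode and resize on the GPU ("mixed" device).
    jpegs, labels = fn.readers.file(file_root="/path/to/images", random_shuffle=True)
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    images = fn.resize(images, resize_x=224, resize_y=224)
    return images, labels


pipe = image_pipeline()
pipe.build()
```

DALI also provides a PyTorch plugin (DALIGenericIterator) that wraps a pipeline like this for direct use in a training loop.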
