update handling of GPU workflows in runTheMatrix.py #46069

Open
@fwyzard

Description

runTheMatrix.py has some GPU-related options:

GPU-related options:
  These options are only meaningful when --gpu is used, and is not set to forbidden.

  --gpu [{forbidden,optional,required}], --requires-gpu [{forbidden,optional,required}]
                        Enable GPU workflows. Possible options are "forbidden" (default), "required" (implied if no argument is given), or "optional". (default: forbidden)
  --gpu-memory GPUMEMORYMB
                        Specify the minimum amount of GPU memory required by the job, in MB. (default: 8000)
  --cuda-capabilities CUDACAPABILITIES
                        Specify a comma-separated list of CUDA "compute capabilities", or GPU hardware architectures, that the job can use. (default: 6.0,6.1,6.2,7.0,7.2,7.5,8.0,8.6)
  --cuda-runtime CUDARUNTIME
                        Specify major and minor version of the CUDA runtime used to build the application. (default: 12.4)
  --force-gpu-name GPUNAME
                        Request a specific GPU model, e.g. "Tesla T4" or "NVIDIA GeForce RTX 2080". The default behaviour is to accept any supported GPU. (default: )
  --force-cuda-driver-version CUDADRIVERVERSION
                        Request a specific CUDA driver version, e.g. 470.57.02. The default behaviour is to accept any supported CUDA driver version. (default: )
  --force-cuda-runtime-version CUDARUNTIMEVERSION
                        Request a specific CUDA runtime version, e.g. 11.4. The default behaviour is to accept any supported CUDA runtime version. (default: )

However, they affect only the creation of WMAgent (?) workflows, not the actual content of the workflow generated by cmsDriver.py and executed by cmsRun.


I would like to propose two changes:

  1. change the default for the --gpu option from forbidden to optional;
  2. propagate the meaning of the --gpu option to cmsDriver, via the --accelerators option.

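The first change is IMHO something we should do in its own right, but here it is motivated by minimising the impact of the second change on the cmsDriver workflows: with optional as the default, runTheMatrix.py would pass no extra option to cmsDriver.py unless --gpu is given explicitly, so existing workflows keep their current behaviour.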
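As a rough illustration of the first change, here is a minimal sketch of how the option could be declared with the new default. It assumes the option is defined with argparse, as the help output above suggests; the `build_parser` helper and the shortened help string are hypothetical and are not taken from the actual runTheMatrix.py code.

```python
import argparse


def build_parser(default_gpu="optional"):
    """Hypothetical helper: declare --gpu with 'optional' as the proposed default."""
    parser = argparse.ArgumentParser(
        prog="runTheMatrix.py",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "--gpu", "--requires-gpu",
        dest="gpu",
        choices=["forbidden", "optional", "required"],
        nargs="?",              # the argument itself may be omitted
        const="required",       # bare "--gpu" keeps meaning "required"
        default=default_gpu,    # proposed: "optional" instead of "forbidden"
        help="Enable GPU workflows.",
    )
    return parser


if __name__ == "__main__":
    parser = build_parser()
    for argv in ([], ["--gpu"], ["--gpu", "forbidden"]):
        print(argv, "->", parser.parse_args(argv).gpu)
```

Only the `default` value changes with respect to the behaviour described in the help text; a bare `--gpu` would still imply `required`.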


The second change proposes to map:

  • --gpu optional to the current behaviour, that is, no extra cmsDriver options
  • --gpu forbidden to cmsDriver.py --accelerators cpu
  • --gpu required to cmsDriver.py --accelerators gpu-*

By default cmsDriver does not impose any restrictions on the usage of GPUs.
Passing --accelerators cpu sets the job's process.options.accelerators to [ 'cpu' ], which prevents the use of GPUs in a CUDA or Alpaka workflow.
Passing --accelerators gpu-* sets the job's process.options.accelerators to [ 'gpu-*' ], which requires the use of GPUs in a CUDA or Alpaka workflow.
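The proposed mapping could be sketched as follows. This is only an illustration of the idea: the `ACCELERATORS_FOR_GPU_MODE` table and the `cmsdriver_accelerator_args` function are hypothetical names, and where exactly runTheMatrix.py would append these arguments to the cmsDriver.py command is left open.

```python
# Hypothetical table: value of --gpu -> value passed to cmsDriver.py --accelerators.
# None means "pass nothing", i.e. keep the current cmsDriver behaviour.
ACCELERATORS_FOR_GPU_MODE = {
    "optional": None,
    "forbidden": "cpu",
    "required": "gpu-*",
}


def cmsdriver_accelerator_args(gpu_mode):
    """Return the extra cmsDriver.py arguments implied by the --gpu value."""
    if gpu_mode not in ACCELERATORS_FOR_GPU_MODE:
        raise ValueError(f"unknown --gpu value: {gpu_mode!r}")
    accelerators = ACCELERATORS_FOR_GPU_MODE[gpu_mode]
    if accelerators is None:
        return []
    return ["--accelerators", accelerators]


if __name__ == "__main__":
    for mode in ("optional", "forbidden", "required"):
        print(mode, "->", cmsdriver_accelerator_args(mode))
```

If the cmsDriver.py command is assembled as a single shell string rather than an argument list, the `gpu-*` value would need to be quoted so that the shell does not try to expand it as a glob.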

The advantage of this approach is that we no longer need to triplicate all Alpaka-related workflows: one version to run on any backend, one version to run only on CPU, one version to run only on GPUs.


As this change would affect O&C and PPD operations, what is their opinion?
