Description
Recently, I merged #26236 which changed the defaults for CHPL_LAUNCHER.
Prior to that PR, the logic was:
- if we are on a cray-x* or hpe-cray-*, use a slurm/aprun launcher as available
- If COMM=gasnet, use a gasnet launcher
- if we are on cray-cs or hpe-apollo and slurm is available, use a slurm gasnet launcher
- otherwise, if we are on cray-cs or hpe-apollo, use slurm as available
See #26170 for a more in-depth discussion of the history here, but essentially these defaults arose simply because they were the easiest thing to do for our testing infrastructure. Based on the fact that users on a slurm system probably want to be using slurm (and not accidentally running on a login node), #26236 changed the default to the following:
- if we are on a cray-x* or hpe-cray-*, use a slurm/aprun launcher as available
- If COMM=gasnet, use a gasnet launcher
- if slurm is available, use a slurm gasnet launcher
- otherwise, use slurm as available
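To make the difference between the two rule sets concrete, here is a rough sketch of how I read them. This is not the actual chplenv code: the function name, the `generic_slurm` switch, and the `"slurm-gasnet"`/`"gasnet"` return labels are made up for illustration, and the aprun-vs-slurm ordering and the fall-through on cray-x*/hpe-cray-* are assumptions on my part.

```python
def default_launcher(platform, comm, has_slurm, has_aprun=False, generic_slurm=True):
    """Hypothetical model of the CHPL_LAUNCHER default.

    generic_slurm=False models the rules before #26236,
    generic_slurm=True models the rules after it.
    """
    is_cray_x = platform.startswith(("cray-x", "hpe-cray-"))
    is_cs_or_apollo = platform in ("cray-cs", "hpe-apollo")

    # Before the PR, only cray-cs/hpe-apollo got the slurm defaults;
    # after it, any platform with slurm available does.
    slurm_applies = has_slurm and (generic_slurm or is_cs_or_apollo)

    if is_cray_x:
        # Assumption: prefer slurm over aprun when both are present.
        if has_slurm:
            return "slurm-srun"
        if has_aprun:
            return "aprun"
        # Assumption: fall through to the remaining rules otherwise.

    if comm == "gasnet":
        # Placeholder labels, not real launcher names.
        return "slurm-gasnet" if slurm_applies else "gasnet"

    if slurm_applies:
        return "slurm-srun"
    return "none"
```

Under this model, the only case that changes for `COMM=none` is a generic platform with slurm available: `default_launcher("linux64", "none", True, generic_slurm=False)` gives `"none"`, while the same call with `generic_slurm=True` gives `"slurm-srun"`.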
This meant any generic system that also had slurm installed would default to a slurm launcher, regardless of the value of CHPL_COMM. This was for a number of reasons:
- The previous behavior of defaulting to LAUNCHER=none seems to have come from a desire to make our test infrastructure easier, not from what's best for users.
- If a user is running on a system with slurm installed, even with COMM=none, it's more likely they are looking to run Chapel code on the compute nodes using slurm.
  - e.g., they are looking to debug Chapel code with COMM=none on a compute node
  - e.g., they really only need a single node and want to compile with COMM=none to reduce the generated code that comes from a multi-locale build
  - e.g., they are running GPU code on a cluster and don't need/want multinode. And more than likely, there are no GPUs on the login node
- Why would a user want to run on a slurm cluster and not use a slurm launcher? The only thing that comes to mind for me is a large system with long queue wait times. But if that's the case, the login node is likely a bad place to run large computations.
This didn't feel like a big change, as on HPE/Cray systems with slurm the default LAUNCHER for COMM=none was already slurm. All this changed was removing the platform specialization. However, after making these changes there was a feeling that this was a step too far. This issue is about the specific case of defaulting to CHPL_LAUNCHER=slurm-srun when CHPL_COMM=none, slurm is available, and we are on a generic cluster (i.e., not a specific HPE/Cray system).