Add support for motioncor2/motioncor3's Serial Flag for large runs / batch multiple images into 1 command invocation

`motioncor2` is primarily I/O bound. In cursory, non-rigorous tests, roughly 1/3 of the total runtime goes to loading the gain, dark, defect, and input images. Warming the page cache (pre-loading these images) before a `motioncor2` run cuts its total walltime by roughly that same 1/3.
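
As a rough illustration of what "warming the page cache" means here, the sketch below reads each file end to end and throws the data away, so that a later `motioncor2` run finds the pages already in memory. The function names (`warm_file`, `warm_page_cache`) and the chunk size are made up for this example and are not part of the pipeline.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

CHUNK_SIZE = 8 * 1024 * 1024  # read in 8 MiB chunks so memory use stays flat


def warm_file(path):
    """Read a file end to end and discard the data; return the bytes read.

    The read itself is what pulls the file into the kernel page cache.
    """
    total = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            total += len(chunk)
    return total


def warm_page_cache(paths, max_workers=4):
    """Warm the page cache for several files concurrently.

    Returns a dict mapping each path to the number of bytes read.
    """
    paths = [Path(p) for p in paths]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        sizes = list(pool.map(warm_file, paths))
    return dict(zip(paths, sizes))
```

This only helps if the node has enough free memory to keep the files cached until `motioncor2` reads them, and if the warming happens on the same node that runs the job.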

We'd like to exploit this fact to use our GPUs more efficiently.  Here's what we've tried so far:

* [Slurm GRES Sharding](https://slurm.schedmd.com/gres.html#Sharding):  We can have two `motioncor2` Dask workers warm the cache, with each invocation of `motioncor2` taking a lock on the GPU before running so that no more than one `motioncor2` command runs on the GPU at a time (testing showed that `motioncor2` does not support multiple processes on the same GPU simultaneously).  With the specific version of Slurm we use (`20.11.7+really20.11.4-2+deb11u1` on Debian 11), this mostly does not work.  If we update `gres.conf` explicitly, we generally get this error in the `slurmd` logs on the nodes:

```
fatal: Invalid GRES record for shard, count does not match File value
```

[Someone else](https://groups.google.com/g/slurm-users/c/hMg3RGUppV0/m/cF9J6PiBAAAJ) had the same issue, but there are no responses there.  Without updating `gres.conf`, sharding appears to work, but the GPU device is not added to the batch job's cgroup (as confirmed in an interactive `salloc` session).  The Slurm docs mention the `CUDA_VISIBLE_DEVICES` environment variable being set, which makes me think that cgroups may not be supported as an isolation mechanism when using shards.
* `dask-jobqueue` allows you to run 2 workers in parallel within the same Slurm job by setting the `processes` keyword argument in the `SLURMCluster` constructor (see [here](https://jobqueue.dask.org/en/latest/generated/dask_jobqueue.SLURMCluster.html)).  I updated the app logic itself to warm the cache within the pipeline and run `motioncor2` under a GPU lock.  This actually worsened performance somehow (runtime went from ~30s to ~60s), and one or two images did not process properly (possibly due to a bug I introduced).
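
For reference, the "GPU lock" idea in both attempts above can be sketched with an advisory file lock, so that only one worker process on a node runs its command at a time while the others keep warming the cache. This is a minimal sketch, not the code that was actually tested: the names `gpu_lock` and `run_exclusive`, the lock file name, and the use of `fcntl.flock` are all assumptions made for illustration.

```python
import fcntl
import os
import subprocess
import tempfile
from contextlib import contextmanager


@contextmanager
def gpu_lock(gpu_id, lock_dir=None):
    """Hold an exclusive advisory lock for one GPU on this node.

    The lock file name is hypothetical. lock_dir should be node-local
    (e.g. /tmp), since flock semantics on network filesystems vary.
    """
    lock_dir = lock_dir or tempfile.gettempdir()
    lock_path = os.path.join(lock_dir, f"motioncor2-gpu{gpu_id}.lock")
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield lock_path
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def run_exclusive(cmd, gpu_id, lock_dir=None):
    """Run cmd (an argv list) while holding the lock for gpu_id."""
    with gpu_lock(gpu_id, lock_dir):
        return subprocess.run(cmd, check=True)
```

A worker would warm the cache first and only then call `run_exclusive(...)` with its `motioncor2` command line, so the I/O of one worker overlaps with the GPU time of the other.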

Here's what we haven't tried, but should try:

`motioncor2`/`motioncor3` provide built-in logic for overlapping I/O with computation: while one image is being processed, the inputs for the next image are loaded in the background.  Specifically, the `-Serial` flag implements this.  For this to work, we'd need to group inputs that share the same gain reference, dark reference, and defect map into subdirectories, ideally also imposing a batch size limit (maybe 10 images per subdirectory?).  We would then want to move outputs back to the main processing directory and have the post-task link them to the Leginon directory, so we'd need some sort of map from each batched input back to its original image.
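
The grouping and mapping step could look roughly like the sketch below. It is only a sketch under several assumptions: the function names (`plan_batches`, `stage_batches`), the `batch_NNNN` directory naming, and the use of symlinks for staging are all hypothetical, and whether `motioncor2` in serial mode is happy reading symlinked inputs has not been checked.

```python
import os
from collections import defaultdict
from pathlib import Path


def plan_batches(images, batch_size=10):
    """Group images by shared references, then split into size-limited batches.

    images is an iterable of (image_path, gain, dark, defect) tuples.
    Returns a list of ((gain, dark, defect), [image_path, ...]) pairs.
    """
    groups = defaultdict(list)
    for image, gain, dark, defect in images:
        groups[(gain, dark, defect)].append(image)

    batches = []
    for refs, members in groups.items():
        for start in range(0, len(members), batch_size):
            batches.append((refs, members[start:start + batch_size]))
    return batches


def stage_batches(batches, work_dir):
    """Symlink every batch into its own subdirectory of work_dir.

    Returns a dict mapping each staged path to the original image path,
    which is the map needed to route outputs back afterwards.
    Assumes image basenames are unique within a batch.
    """
    work_dir = Path(work_dir)
    staged_to_original = {}
    for index, (_refs, members) in enumerate(batches):
        batch_dir = work_dir / f"batch_{index:04d}"
        batch_dir.mkdir(parents=True, exist_ok=True)
        for image in members:
            link = batch_dir / Path(image).name
            os.symlink(os.path.abspath(image), link)
            staged_to_original[str(link)] = str(image)
    return staged_to_original
```

Each `batch_NNNN` directory would then be handed to a single `motioncor2` invocation, and the returned map would let the post-task find which original image each output belongs to.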
