Add support for motioncor2/motioncor3's Serial Flag for large runs / batch multiple images into 1 command invocation #27

@jpellman

motioncor2 is primarily I/O bound. In cursory/non-rigorous tests, roughly 1/3 of the total runtime was spent loading the gain, dark, defect and input images. Warming the page cache (pre-loading these images) should therefore cut the total walltime of a motioncor2 run by around 1/3.
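As a rough illustration of what "warming the page cache" means here, the sketch below reads each file once and discards the data, so that a later read by motioncor2 can be served from memory. The function name `warm_page_cache` and the chunk size are made up for this example, and it assumes the node has enough free RAM to keep the files cached.

```python
from pathlib import Path

CHUNK_SIZE = 8 * 1024 * 1024  # read in 8 MiB chunks to keep memory use flat


def warm_page_cache(paths):
    """Read each file once and discard the bytes so the kernel caches them.

    Returns the total number of bytes read.
    """
    total = 0
    for path in map(Path, paths):
        # buffering=0 avoids a second, Python-level copy of the data
        with path.open("rb", buffering=0) as handle:
            while True:
                chunk = handle.read(CHUNK_SIZE)
                if not chunk:
                    break
                total += len(chunk)
    return total
```

A worker could call this on the gain, dark, defect and input files for the next image while the current image is still being processed on the GPU.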

In order to make more efficient usage of our GPUs, we'd like to exploit this fact. Here's what we've tried so far:

  • Slurm GRES Sharding: The idea is to have two motioncor2 Dask workers warm the cache, with each motioncor2 invocation taking a lock on a GPU before running so that only one motioncor2 command runs on a GPU at a time (testing showed that motioncor2 does not support multiple motioncor2 processes on the same GPU simultaneously). With the specific version of Slurm we use (20.11.7+really20.11.4-2+deb11u1 Debian 11), this does not work for the most part. If we update gres.conf explicitly, we generally get this error in the slurmd logs on the nodes:
fatal: Invalid GRES record for shard, count does not match File value

Someone else had the same issue, but there are no responses there. Without updating gres.conf, sharding appears to work, but the GPU device is not added to the batch job's cgroup (as confirmed in an interactive salloc session). The Slurm docs mention the CUDA_VISIBLE_DEVICES environment variable being set, which makes me think that cgroups may not be supported as an isolation mechanism when using shards.

  • dask-jobqueue allows you to run 2 workers in parallel within the same Slurm job by setting the processes keyword arg in the SLURMCluster constructor (see here). I updated the app logic itself to warm the cache within the pipeline and run motioncor2 with a GPU lock. For reasons I haven't pinned down, this actually worsened performance (runtime went from ~30s to ~60s), and one or two images did not process properly (possibly due to a bug I introduced).
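For reference, the GPU lock mentioned in both attempts could look something like the sketch below. This is not the code from the app: `gpu_lock`, the lock file name and the `/tmp` location are hypothetical, and it assumes that all workers sharing a GPU run on the same node, so that a node-local advisory lock (`fcntl.flock`) is enough.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def gpu_lock(gpu_index, lock_dir="/tmp"):
    """Hold an exclusive, node-local advisory lock for one GPU.

    Blocks until no other process on this node holds the lock for the
    same gpu_index. The lock file name is an assumption of this sketch.
    """
    lock_path = os.path.join(lock_dir, f"motioncor2-gpu{gpu_index}.lock")
    with open(lock_path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # waits here if the GPU is busy
        try:
            yield lock_path
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

A worker would warm the cache first and only then enter `with gpu_lock(0):` around its motioncor2 subprocess call, so that I/O for one image overlaps with GPU work for another.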

Here's what we haven't tried, but should try:

motioncor2/motioncor3 provide built-in logic for overlapping I/O with computation so that processing is performed while the inputs for the next image to be processed are being loaded in the background. Specifically, there is a Serial flag that implements this. For this to work, we'd need to group together inputs with common gains/dark references/defect maps into subdirectories, ideally also imposing a batch size limit (maybe 10 images per subdirectory?). We would then ideally want to move outputs back to the main processing directory and then have the post-task link them to the Leginon directory, so we'd need some sort of map.
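The grouping step could be sketched roughly as follows. Everything here is hypothetical (`plan_batches`, `stage_batches`, the `batch_NNNN` directory naming), and it assumes that motioncor2 will follow symlinks in the batch subdirectory, which we'd need to verify. The returned map is the piece that would let us move outputs back to the main processing directory afterwards.

```python
from collections import defaultdict
from pathlib import Path


def plan_batches(images, batch_size=10):
    """Group inputs by shared references, then cap each group at batch_size.

    images: iterable of (input_path, gain, dark, defect) tuples.
    Returns a list of {"refs": (gain, dark, defect), "inputs": [...]} dicts.
    """
    groups = defaultdict(list)
    for input_path, gain, dark, defect in images:
        groups[(gain, dark, defect)].append(input_path)

    batches = []
    for refs, inputs in groups.items():
        for start in range(0, len(inputs), batch_size):
            batches.append({"refs": refs, "inputs": inputs[start:start + batch_size]})
    return batches


def stage_batches(batches, work_dir):
    """Symlink each batch's inputs into its own subdirectory.

    Returns a map of staged path -> original path. Raises FileExistsError
    if two inputs in the same batch share a file name.
    """
    work_dir = Path(work_dir)
    staged_to_original = {}
    for index, batch in enumerate(batches):
        batch_dir = work_dir / f"batch_{index:04d}"
        batch_dir.mkdir(parents=True, exist_ok=True)
        for original in map(Path, batch["inputs"]):
            link = batch_dir / original.name
            link.symlink_to(original.resolve())
            staged_to_original[str(link)] = str(original)
    return staged_to_original
```

Each `batch_NNNN` directory would then be handed to a single motioncor2 invocation, with the batch's `refs` supplying the gain, dark and defect arguments.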
