Description
Is your feature request related to a problem? Please describe.
It is not possible to checkpoint and restore running SingularityCE containers using in-built or simple methods.
Describe the solution you'd like
It should be possible to checkpoint and restore batch and interactive containers that are launched with `singularity run/shell/exec`.
The solution should be similar to what is possible with podman and docker, so it is familiar to users working in a mixed OCI and Singularity environment.
Describe alternatives you've considered
The apptainer project has implemented checkpoint/restore of instances only, using DMTCP: apptainer/apptainer#109
This is definitely useful for checkpointing instances. However:
- Most workloads run with SingularityCE are `run/exec` batch jobs or interactive `shell` tasks, rather than instances. We anticipate most use of instances would be for persistent services, which are likely to be able to maintain state themselves across shutdown/startup.
- It's not clear why the implementation is limited to instances... whether it's a technical limitation of the DMTCP approach or a prioritization. We ultimately want checkpoint/restore for batch and interactive `run/shell/exec`, so we won't bring in the apptainer code unless it's clear that is possible.
- There are some concerns posted around the internet about the impact of DMTCP's wrapping of `getpid()` and whether this is safe/compatible in all cases. The posts detail that the PID under `/proc/n/` might not match the process's real PID on resume: the wrapped `getpid()` will return the original PID, but code using `/proc/` directly bypasses the wrapper. This may be old information but needs to be investigated and understood. EDIT - this appears to be out of date: "PID virtualization fails for a simple program that reads from /proc" dmtcp/dmtcp#461
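
To make the PID concern above concrete, here is a minimal sketch of the kind of check that could be run before a checkpoint and after a restore. It compares the PID returned by the `getpid()` API with the PID the kernel exposes through the `/proc/self` symlink. The helper names `pid_from_proc_self` and `compare_pids` are hypothetical, and it is an assumption (not something verified here) that a mismatch would only appear when a checkpointing tool virtualizes one view and not the other. In an ordinary process the two values are expected to agree.

```python
import os


def pid_from_proc_self():
    """Return the PID named by the /proc/self symlink, or None if unavailable.

    On Linux, /proc/self is a symlink whose target is the calling
    process's PID as seen by the mounted procfs.
    """
    try:
        return int(os.readlink("/proc/self"))
    except (OSError, ValueError):
        # No procfs (non-Linux) or an unexpected link target.
        return None


def compare_pids():
    """Compare the getpid() view with the /proc view of this process."""
    api_pid = os.getpid()            # goes through the (possibly wrapped) getpid()
    proc_pid = pid_from_proc_self()  # reads procfs directly
    match = None if proc_pid is None else (api_pid == proc_pid)
    return {"getpid": api_pid, "proc_self": proc_pid, "match": match}


if __name__ == "__main__":
    print(compare_pids())
```

Running this inside a container before checkpointing and again after restore, and comparing the `match` field, would be one simple way to see whether the behaviour described in the older posts still applies.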
On the plus side for DMTCP over CRIU - it supports some HPC relevant concepts that CRIU does not, or did not when the comparison was last updated: