Description
Executive summary: I've read the manual, but it's not clear to me why, when checkpointing a Java app running inside a Docker container, I should also checkpoint the Docker container itself; I just want to "snapshot" my running Java app. Could you kindly clarify why a Docker checkpoint is needed when performing a CRaC/CRIU checkpoint of a Java app running inside a Docker container, even though I store the dump files persistently?
Details
Hi,
I've been experimenting with this project following this video from Devoxx and this great tutorial.
Since I'm on Mac OSX (and not Linux), I operate inside a Docker container.
My goal is to "snapshot" a running Java app at a point in time using CRaC/CRIU and restore it later, following the tutorials mentioned and the documentation I could find here on GitHub.
Because CRaC runs inside a container, I make sure the dump files are written to a mounted volume, so I can mount them again across container restarts.
I have created a trivial Java app to test this, here: https://github.com/tarilabs/demo20230223-counting-on-crac
# create directory to host crac files dump
mkdir crac-files
# prepare a docker container as the lab environment to operate within
docker build -f src/main/docker/Dockerfile.jvm -t demo20230223-counting-on-crac .
docker run -it --privileged -v $(pwd)/crac-files:/opt/crac-files --rm --name demo20230223-counting-on-crac demo20230223-counting-on-crac
java -XX:CRaCCheckpointTo=/opt/crac-files $JAVA_OPTS -jar $JAVA_APP_JAR
In another shell I perform:
docker exec -it -u root demo20230223-counting-on-crac /bin/bash
ps -u root
# typically java is PID 9, used below
jcmd 9 JDK.checkpoint
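Rather than hard-coding PID 9, the PID could be looked up first. This is a minimal sketch, not part of my original steps: `first_java_pid` is a hypothetical helper name, and it assumes that `jcmd -l` prints one JVM per line with the PID in the first column and that jcmd's own entry contains `sun.tools.jcmd.JCmd`.

```shell
# Hypothetical helper: read `jcmd -l` output on stdin and print the PID of the
# first listed JVM that is not the jcmd tool itself.
# Assumption: each line starts with the PID, and jcmd's own line contains
# "sun.tools.jcmd.JCmd".
first_java_pid() {
  awk '!/sun\.tools\.jcmd\.JCmd/ { print $1; exit }'
}

# Usage inside the container (assumes jcmd is on the PATH):
#   JAVA_PID=$(jcmd -l | first_java_pid)
#   jcmd "$JAVA_PID" JDK.checkpoint
```

If more than one application JVM runs in the container, this picks whichever is listed first, so the filter would need to be narrowed.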
Up to here, everything works as expected, the app is checkpointed and dump files are created.
Now I want to restore, using the command:
java -XX:CRaCRestoreFrom=/opt/crac-files
I have tried 3 use-cases
Case A
If, in the first shell, while the Docker container is still running, I execute the restore command, it works.
Case B
If, in the second shell, I capture a Docker checkpoint with something like:
exit
docker ps -a
docker commit CONTAINER_ID demo20230223-counting-on-crac:checkpoint
then, in the first shell, I restart from the checkpoint with something like:
exit
docker run -it --privileged -v $(pwd)/crac-files:/opt/crac-files -p 8080:8080 --rm --name demo20230223-counting-on-crac demo20230223-counting-on-crac:checkpoint
java -XX:CRaCRestoreFrom=/opt/crac-files
it works.
Case C
In the second shell, I just exit.
In the first shell, I just exit.
No container is running and no docker-checkpoint was taken.
In the first shell I go with:
docker run -it --privileged -v $(pwd)/crac-files:/opt/crac-files --rm --name demo20230223-counting-on-crac demo20230223-counting-on-crac
java -XX:CRaCRestoreFrom=/opt/crac-files
I get:
Error (criu/cr-restore.c:1506): Can't fork for 9: File exists
Error (criu/cr-restore.c:2593): Restoring FAILED.
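One possible reading of this error, which I state as an assumption rather than a confirmed diagnosis: CRIU tries to recreate the process under its original PID (9 here), and "File exists" suggests that PID is already taken in the fresh container. A quick check that could be run in the container before restoring is sketched below; `pid_in_use` is a hypothetical helper name and it relies on the Linux `/proc` layout.

```shell
# Hypothetical helper: succeeds if a process with the given PID exists in the
# current PID namespace (assumes a Linux /proc filesystem is mounted).
pid_in_use() {
  [ -e "/proc/$1" ]
}

# PID 9 is the value taken from the error message; adjust it to match the
# PID that was checkpointed.
if pid_in_use 9; then
  echo "PID 9 is already taken by: $(cat /proc/9/comm 2>/dev/null)"
else
  echo "PID 9 is free"
fi
```

If the PID turns out to be occupied, that would at least be consistent with the restore failing in Case C, although it would not by itself explain why Case B works.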
I don't understand why I cannot just restore the Java app from the dumped files, which remain available across container restarts because they are on the host disk. Must some additional state of the Docker container also be captured (with the Docker checkpoint)?
Is this a limitation of my setup on Mac OSX? If I were on Linux, could I power the machine off and on again between the Java app checkpoint and the restore?
Thanks!