Background
One downside of the way SPRAS distributes Snakemake steps with HTCondor in CHTC/OSPool is that the workflow's inner PRM container images aren't pulled through an appropriate image cache. When many SPRAS steps run at a given execution point (EP), potentially alongside tens if not hundreds of other users who landed on the same node and also need to pull images, Docker may start to throttle the EP.
While some of @tristan-f-r's recent analysis hints that what we thought were throttling errors may actually be a different class of Docker errors, the risk is still there, and getting EPs blacklisted by Docker is a good way to make CHTC sysadmins grouchy.
Normal containerized jobs in HTCondor set something along these lines in their submit description:
universe = container
container_image = docker://foo/bar:latest
which tells HTCondor it's responsible for pulling/transferring the image when it's needed. As part of this, HTCondor manages a local cache of these images to avoid DDoSing external image repositories.
However, HTCondor doesn't have semantics for managing multiple, nested containers, which is needed by SPRAS workflows, so the inner image must be pulled by the job itself.
Proposed solution
When I last spoke with CHTC admins, they suggested we bake some extra config into the outer SPRAS image we use to run with HTCondor:
Prerequisites (notes from Matt Westphall):
* Confirm that your Apptainer version is 1.3.6 (Docker mirror config is broken in 1.4) <-- Still true??
* Check that /etc/containers/registries.conf.d/registry-mirror.conf has contents along the lines of:
[[registry]]
location="docker.io"
[[registry.mirror]]
location="dockercache2007.chtc.wisc.edu"
When we try this approach, we should also come up with a way to convince ourselves the cache is actually being used, rather than the mirror lookup failing silently and the pull skipping the cache altogether.
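One possible signal, offered as a rough idea rather than a tested method: Docker Hub reports the remaining anonymous pull quota for the calling IP in a `ratelimit-remaining` header (formatted like `76;w=21600`) on a HEAD request against its `ratelimitpreview/test` repository. If that count does not drop across an inner image pull, the pull was probably not charged to the EP, which is what we would expect when the mirror served it. The function names here are hypothetical, and on a shared EP other users' pulls from the same IP will add noise, so a single reading proves little.

```python
import json
import urllib.request

# Endpoints from Docker's published instructions for checking pull rate limits.
TOKEN_URL = (
    "https://auth.docker.io/token"
    "?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull"
)
MANIFEST_URL = "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest"


def parse_ratelimit(header_value):
    """Parse a header value such as '76;w=21600' into (count, window_seconds)."""
    count, _, rest = header_value.partition(";")
    window = None
    for param in rest.split(";"):
        key, _, value = param.strip().partition("=")
        if key == "w" and value:
            window = int(value)
    return int(count), window


def fetch_remaining_pulls():
    """Ask Docker Hub how many anonymous pulls this IP has left (needs network)."""
    with urllib.request.urlopen(TOKEN_URL) as resp:
        token = json.load(resp)["token"]
    request = urllib.request.Request(
        MANIFEST_URL, method="HEAD", headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as resp:
        # Raises if the header is absent, e.g. for IPs that are not rate limited.
        return parse_ratelimit(resp.headers["ratelimit-remaining"])[0]


def pulls_consumed(before, after):
    """Pulls charged to this IP between two readings (0 if the window reset)."""
    return max(before - after, 0)
```

A job could call `fetch_remaining_pulls()` before and after pulling an inner image and log `pulls_consumed(before, after)`. Seeing zero consistently across many jobs would be weak evidence that the mirror is in use, while a steady decrease would suggest pulls are still going straight to Docker Hub.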