Description
I'm currently working on https://github.com/pytorch/torchx which is a project that aims to make it easier to train and deploy ML models.
Quite a few of the cloud services / cluster tools for running ML jobs use OCI/Docker containers, so I've been looking into how to make dealing with these easier.
Container-based services:
- Kubernetes / Volcano scheduler
- AWS EKS / Batch
- Google AI Platform training
- Recent versions of Slurm: https://slurm.schedmd.com/containers.html
TorchX currently supports patches on top of existing images to make it fast to iterate and then launch a training job. These patches simply overlay files from the local directory on top of a base image. Our current patching implementation relies on a local Docker daemon to build the patch layer and push it: https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L437-L493
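On the build side, a patch layer is just a gzipped tarball of the overlay files plus an OCI descriptor (mediaType, digest, size), so it can be produced with the standard library alone. A minimal sketch, assuming the overlay should land under a hypothetical `/app` root inside the image (function and path names here are illustrative, not existing TorchX API):

```python
import gzip
import hashlib
import io
import os
import tarfile


def build_patch_layer(src_dir: str, image_root: str = "/app") -> dict:
    """Tar up src_dir, gzip it, and return an OCI layer descriptor.

    Paths are rewritten so files land under image_root inside the
    container. The raw blob bytes are carried alongside the descriptor
    so a later step can upload them to the registry.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                arcname = os.path.join(image_root, os.path.relpath(path, src_dir))
                tar.add(path, arcname=arcname)
    # mtime=0 keeps the gzip header reproducible, so identical inputs
    # produce an identical digest.
    blob = gzip.compress(buf.getvalue(), mtime=0)
    digest = "sha256:" + hashlib.sha256(blob).hexdigest()
    return {
        "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
        "digest": digest,
        "size": len(blob),
        "_blob": blob,  # not part of the descriptor; kept for the push step
    }
```

The descriptor (minus `_blob`) is exactly what would get appended to the `layers` array of the base image's manifest.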
Ideally we could build a patch layer and push it in pure Python without requiring a local Docker instance, since that's an extra burden on ML researchers/users. Building a patch should be fairly straightforward since it's just appending a layer; pushing will require some ability to talk to the registry to download/upload images.
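The registry side is specified by the OCI distribution spec, and the push boils down to a handful of HTTP calls: start a blob upload, commit the blob under its digest, then re-upload a manifest that references the new layer. A sketch of the endpoints involved, using the monolithic upload flow (registry/repo names below are placeholders):

```python
def push_endpoints(registry: str, repo: str, digest: str, reference: str) -> dict:
    """Return the OCI distribution-spec endpoints needed to push one
    patch layer blob and an updated manifest (monolithic upload flow)."""
    base = f"https://{registry}/v2/{repo}"
    return {
        # Step 1: POST here; the 202 response's Location header is the
        # session URL for the upload.
        "start_upload": f"{base}/blobs/uploads/",
        # Step 2: PUT the blob bytes to the Location URL with this digest
        # query parameter appended ("&digest=..." if the Location already
        # carries a query string).
        "commit_suffix": f"?digest={digest}",
        # Step 3: GET the base image's manifest, append the new layer
        # descriptor to its "layers" array, then PUT it back under the
        # patched tag with Content-Type
        # "application/vnd.oci.image.manifest.v1+json".
        "manifest": f"{base}/manifests/{reference}",
    }
```

Auth (token exchange for Docker Hub / ghcr.io etc.) sits on top of this but is also plain HTTP, so the whole round trip looks doable without a daemon.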
It seems like OCI containers are a logical choice to use for packaging ML training jobs/apps but the current Python tooling is fairly lacking as far as I can see.
@vsoch curious what your thoughts are and whether that's something you'd be interested in having merged into this repo.