Description
I'm currently working on https://github.com/pytorch/torchx which is a project that aims to make it easier to train and deploy ML models.
Quite a few of the cloud services / cluster tools for running ML jobs use OCI/Docker containers, so I've been looking into how to make dealing with these easier.
Container-based services:
- Kubernetes / Volcano scheduler
- AWS EKS / Batch
- Google AI Platform training
- Recent versions of Slurm: https://slurm.schedmd.com/containers.html
TorchX currently supports patches on top of existing images to make it fast to iterate and then launch a training job. These patches simply overlay files from the local directory on top of a base image. Our current patching implementation relies on a local Docker daemon to build the patch layer and push it: https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L437-L493
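On the build side, a patch layer is just a gzipped tarball of the overlay files plus an OCI descriptor (mediaType, digest, size), so it can be produced with the standard library alone. A minimal sketch, assuming the overlay should land under a hypothetical `/app` root inside the image (function and path names here are illustrative, not existing TorchX API):

```python
import gzip
import hashlib
import io
import os
import tarfile


def build_patch_layer(src_dir: str, image_root: str = "/app") -> dict:
    """Tar up src_dir, gzip it, and return an OCI layer descriptor.

    Paths are rewritten so files land under image_root inside the
    container. The raw blob bytes are carried alongside the descriptor
    so a later step can upload them to the registry.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                arcname = os.path.join(image_root, os.path.relpath(path, src_dir))
                tar.add(path, arcname=arcname)
    # mtime=0 keeps the gzip header reproducible, so identical inputs
    # produce an identical digest.
    blob = gzip.compress(buf.getvalue(), mtime=0)
    digest = "sha256:" + hashlib.sha256(blob).hexdigest()
    return {
        "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
        "digest": digest,
        "size": len(blob),
        "_blob": blob,  # not part of the descriptor; kept for the push step
    }
```

The descriptor (minus `_blob`) is exactly what would get appended to the `layers` array of the base image's manifest.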
Ideally we could build a patch layer and push it in pure Python without requiring a local Docker instance, since that's an extra burden on ML researchers/users. Building a patch should be fairly straightforward since it's just appending a layer; pushing will require some ability to talk to the registry to download/upload images.
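The registry side is specified by the OCI distribution spec, and the push boils down to a handful of HTTP calls: start a blob upload, commit the blob under its digest, then re-upload a manifest that references the new layer. A sketch of the endpoints involved, using the monolithic upload flow (registry/repo names below are placeholders):

```python
def push_endpoints(registry: str, repo: str, digest: str, reference: str) -> dict:
    """Return the OCI distribution-spec endpoints needed to push one
    patch layer blob and an updated manifest (monolithic upload flow)."""
    base = f"https://{registry}/v2/{repo}"
    return {
        # Step 1: POST here; the 202 response's Location header is the
        # session URL for the upload.
        "start_upload": f"{base}/blobs/uploads/",
        # Step 2: PUT the blob bytes to the Location URL with this digest
        # query parameter appended ("&digest=..." if the Location already
        # carries a query string).
        "commit_suffix": f"?digest={digest}",
        # Step 3: GET the base image's manifest, append the new layer
        # descriptor to its "layers" array, then PUT it back under the
        # patched tag with Content-Type
        # "application/vnd.oci.image.manifest.v1+json".
        "manifest": f"{base}/manifests/{reference}",
    }
```

Auth (token exchange for Docker Hub / ghcr.io etc.) sits on top of this but is also plain HTTP, so the whole round trip looks doable without a daemon.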
It seems like OCI containers are a logical choice to use for packaging ML training jobs/apps but the current Python tooling is fairly lacking as far as I can see.
@vsoch curious what your thoughts are and whether that's something you'd be interested in having merged into this repo.