Hello
The purpose of this issue is to add an SDL/container for running PyTorch benchmarks with torchbench on GPU providers.
I have written an SDL and Dockerfile. Pushing the Docker image is taking forever (>3 hrs to Docker Hub/ECR) because of its size, so I may change the approach slightly to shrink the image at the cost of a small delay at runtime start. I wanted to get some feedback from the GPU team/community before proceeding further.
We need to set some requirements for the provider benchmarks:
- What is our timeout for large Docker image pulls on Akash?
- What is our desired time budget for running the benchmark, i.e. how long should it be allowed to run?
- What is our computational budget for running the benchmark, i.e. the smallest provider resources we should assume are available?
- Which models are most important? Older models such as ResNet are more comparable across platforms, but newer, application-focused models, e.g. a DALL·E 2 or LLaMA inference run, are more relevant to the end user.
Some added context:
The container image is around 20 GB uncompressed; 6 GB of this is the PyTorch runtime and the other 14 GB is the models and code used for benchmarking. We could make the container a lot smaller by running the torchbench install script and downloading the models at run time, which would add roughly a 5-10 minute start delay.
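To make that tradeoff concrete, here is a rough sketch of what a runtime bootstrap entrypoint could look like, assuming we clone the pytorch/benchmark (torchbench) repo and run its install script when the container starts. The paths and pinning strategy are placeholders, not a tested setup:

```bash
#!/usr/bin/env bash
# Hypothetical entrypoint: instead of baking ~14 GB of models into the image,
# fetch torchbench and its model data when the container starts.
set -euo pipefail

# Clone the benchmark suite; pinning a tag/commit would keep runs comparable across providers.
git clone --depth 1 https://github.com/pytorch/benchmark.git /opt/torchbench
cd /opt/torchbench

# install.py pulls per-model dependencies and downloads model data; this is the
# step that trades image size for the ~5-10 minute startup delay mentioned above.
python install.py

# Hand off to whatever benchmark command the SDL/deployment specifies.
exec "$@"
```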
The full benchmark suite takes about 5-8 hours when run sequentially on a MacBook Pro with the GPU benchmarks skipped; Meta currently runs the benchmark on a GPU cluster.
We don’t have to run every benchmark, though. The fastest approach would be to run a small subset of relevant benchmarks, with the torchbench repo install deferred to runtime to keep the container size down.
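As a strawman for that subset, something like the following could benchmark a handful of representative models once the repo is available in the container. The model names and CLI flags are illustrative assumptions and should be checked against the torchbench README:

```bash
# Hypothetical subset run: benchmark only a few representative models on the GPU.
cd /opt/torchbench

# Mix of an older CNN and newer transformer-style workloads (names assumed from torchbench).
MODELS="resnet50 hf_Bert timm_vision_transformer"

# Limit the install to just these models so we don't pull the whole suite
# (assumes install.py accepts a model list; the exact flag may differ).
python install.py --models $MODELS

# Run each model's eval benchmark on CUDA via torchbench's run.py.
for m in $MODELS; do
    python run.py "$m" -d cuda -t eval
done
```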