Hello
The purpose of this issue is to add an SDL/container for running PyTorch benchmarks with torchbench on GPU providers.
I have written an SDL and Dockerfile. Pushing the Docker image is taking forever (>3 hrs to Docker Hub/ECR) because of its size, so I may change the approach slightly to shrink the image at the cost of a small delay at runtime start. I wanted to get some feedback from the GPU team/community before proceeding further.
We need to set some requirements for the provider benchmarks:
- What is our timeout for large Docker image pulls on Akash?
- What is our desired time budget for running the benchmark, i.e. how long should it be allowed to run?
- What is our computational budget for running the benchmark, i.e. the smallest provider resources we should assume are available?
- Which models are most important? Older models such as ResNet are more comparable across platforms, but newer, application-focused models, e.g. a DALL·E 2 or LLaMA inference run, are more relevant to the end user.
Some added context:
The container image is around 20 GB uncompressed; 6 GB of this is the PyTorch runtime and the other 14 GB is the models and code used for benchmarking. We could make the container a lot smaller by running the torchbench install script and downloading the models at run time, which would add roughly a 5-10 minute start delay.
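To make that tradeoff concrete, here is a rough sketch of what a runtime bootstrap entrypoint could look like, assuming we clone the pytorch/benchmark (torchbench) repo and run its install script when the container starts. The paths and pinning strategy are placeholders, not a tested setup:

```bash
#!/usr/bin/env bash
# Hypothetical entrypoint: instead of baking ~14 GB of models into the image,
# fetch torchbench and its model data when the container starts.
set -euo pipefail

# Clone the benchmark suite; pinning a tag/commit would keep runs comparable across providers.
git clone --depth 1 https://github.com/pytorch/benchmark.git /opt/torchbench
cd /opt/torchbench

# install.py pulls per-model dependencies and downloads model data; this is the
# step that trades image size for the ~5-10 minute startup delay mentioned above.
python install.py

# Hand off to whatever benchmark command the SDL/deployment specifies.
exec "$@"
```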
The full benchmark suite takes about 5-8 hours when run sequentially on a MacBook Pro with the GPU benchmarks skipped; Meta currently runs the benchmark on a GPU cluster.
We don’t have to run every benchmark, though. The fastest approach would be to run a small subset of relevant benchmarks, with the torchbench repo install deferred to runtime to keep the container size down.
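As a strawman for that subset, something like the following could benchmark a handful of representative models once the repo is available in the container. The model names and CLI flags are illustrative assumptions and should be checked against the torchbench README:

```bash
# Hypothetical subset run: benchmark only a few representative models on the GPU.
cd /opt/torchbench

# Mix of an older CNN and newer transformer-style workloads (names assumed from torchbench).
MODELS="resnet50 hf_Bert timm_vision_transformer"

# Limit the install to just these models so we don't pull the whole suite
# (assumes install.py accepts a model list; the exact flag may differ).
python install.py --models $MODELS

# Run each model's eval benchmark on CUDA via torchbench's run.py.
for m in $MODELS; do
    python run.py "$m" -d cuda -t eval
done
```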