Adding TorchBench SDL / Container Example for Gpu Benchmarking  #387

@rakataprime


Hello

The purpose of this issue is to add an SDL/container for running PyTorch benchmarks with torchbench on GPU providers.

I have written an SDL and a Dockerfile. Pushing the Docker image to Docker Hub/ECR is taking more than 3 hours because of its size. I may change the approach slightly to shrink the image, at the cost of a short delay at runtime start. I wanted to get some feedback from the GPU team/community before proceeding further.

We need to set some requirements for the provider benchmarks:

  1. What is our timeout for large Docker container pulls on Akash?
  2. What is our desired time budget for the benchmark, i.e. how long should it run?
  3. What is our desired computational budget for the benchmark, i.e. what are the smallest provider resources we should assume?
  4. Which models are most important? Older models such as ResNet are more comparable across platforms, but newer, application-focused models such as a dalle2 or llama inference run are more relevant to the end user.

Some added context:
The container image is around 20 GB uncompressed: 6 GB of this is the PyTorch runtime and the other 14 GB are the models and code used for benchmarking. We could make the container much smaller by running the torchbench install script and downloading the models at runtime, which would add a start delay of about 5-10 minutes.
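To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. It assumes the uncompressed sizes are a usable proxy for what actually gets transferred (registries move compressed layers, so real transfers should be smaller), and it treats the 3-hour push as a lower bound, so the implied throughput is an upper bound for my connection. Pull speed on a provider will likely be very different.

```python
def transfer_seconds(size_gb: float, throughput_mb_s: float) -> float:
    """Seconds needed to move size_gb gigabytes at throughput_mb_s MB/s."""
    return size_gb * 1000 / throughput_mb_s


# Figures taken from this issue.
FULL_IMAGE_GB = 20.0    # whole image, uncompressed
RUNTIME_ONLY_GB = 6.0   # PyTorch runtime only, models fetched at start
PUSH_HOURS = 3.0        # the push took longer than this

# Effective throughput implied by the push (upper bound, since it took > 3 h).
throughput = FULL_IMAGE_GB * 1000 / (PUSH_HOURS * 3600)

# Time to move the runtime-only image at the same rate.
slim_minutes = transfer_seconds(RUNTIME_ONLY_GB, throughput) / 60

print(f"implied throughput: {throughput:.2f} MB/s")
print(f"runtime-only image at that rate: {slim_minutes:.0f} min")
```

At the same rate, the smaller image would push in roughly under an hour instead of more than three, in exchange for the 5-10 minute start delay.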

The full benchmark would take about 5-8 hours if run sequentially on a MacBook Pro, skipping the GPU benchmarks. Meta currently runs the benchmark on a GPU cluster.

We don't have to run every benchmark, though. The fastest approach would be to run a small subset of relevant benchmarks, with installation of the torchbench repo deferred until runtime to decrease the container size.
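Once we agree on a time budget and model priorities, picking the subset could be as simple as the sketch below. Everything in it is a placeholder: the model names are illustrative labels rather than confirmed torchbench identifiers, and the per-model minutes and priorities are made-up values that would need to be measured and agreed on.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str           # illustrative label, not a confirmed torchbench model id
    est_minutes: float  # hypothetical runtime estimate, to be measured
    priority: int       # lower number = more important


def pick_subset(candidates, budget_minutes):
    """Greedily keep the highest-priority benchmarks that fit the time budget."""
    chosen, used = [], 0.0
    # Most important first; among equal priorities, cheaper benchmarks first.
    for c in sorted(candidates, key=lambda c: (c.priority, c.est_minutes)):
        if used + c.est_minutes <= budget_minutes:
            chosen.append(c.name)
            used += c.est_minutes
    return chosen, used


# Hypothetical inputs for illustration only.
candidates = [
    Candidate("resnet50", est_minutes=5, priority=1),
    Candidate("llama_inference", est_minutes=20, priority=1),
    Candidate("dalle2", est_minutes=30, priority=2),
    Candidate("bert", est_minutes=10, priority=3),
]

subset, total = pick_subset(candidates, budget_minutes=40)
print(subset, total)
```

With these made-up numbers and a 40-minute budget, the 30-minute model is skipped because it does not fit, and a lower-priority but cheaper one is run instead. Whether that fallback is what we want is part of the question about which models matter most.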
