
[Q] Resolving CUDA OOM Errors When Running run_tests.sh on GPUs #857

Open

Description

@apivovarov

Hello,

I’m running the run_tests.sh script found in the root of the project on a system with 8x A100 GPUs (40GB each).
Internally, the script runs two pytest commands simultaneously:

  • One for fast tests.
  • One for all tests.

I noticed that the script uses pytest-xdist with the following options (a rough sketch of the invocation is included after this list):

  • -n auto: spawns as many worker processes as there are CPU cores.
  • --dist worksteal: distributes tests across the workers and lets idle workers "steal" pending tests from busy ones.
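For context, the relevant part of run_tests.sh looks roughly like this. The test paths and the -m fast marker below are placeholders I'm using for illustration, not the script's exact arguments:

```bash
#!/usr/bin/env bash
# Rough sketch of the current behavior (paths/markers are placeholders):
# both pytest invocations start at the same time, each spawning one
# xdist worker per CPU core.
pytest -n auto --dist worksteal -m fast tests/ &
pytest -n auto --dist worksteal tests/ &

# Wait for both background runs to finish.
wait
```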

However, when I execute the script as-is, I frequently encounter CUDA Out-Of-Memory (OOM) errors. My understanding is that the many pytest worker processes running in parallel are all allocating on the GPUs at once, causing contention for GPU memory.

Questions:

  1. What is the recommended number of workers (-n) when running tests on a GPU instance?
    Should it be based on the number of GPUs, memory per GPU, or another factor?
  2. Should the two pytest commands in run_tests.sh be executed sequentially?
    They currently run simultaneously via background execution (&). Would running them one after the other help mitigate the CUDA OOM issues? (A sketch of what I have in mind follows this list.)
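For question 2, this is roughly the variant I'm considering. It is only a sketch: the test paths and the fast marker are placeholders as above, and deriving -n from nvidia-smi -L is just my guess at a sizing rule, not something the project documents:

```bash
#!/usr/bin/env bash
# Hypothetical sequential variant (placeholders as above).
# Size the worker pool from the GPU count instead of the CPU count.
NUM_GPUS="$(nvidia-smi -L | wc -l)"

# No '&': the full suite only starts after the fast tests have
# finished and released their GPU memory.
pytest -n "$NUM_GPUS" --dist worksteal -m fast tests/
pytest -n "$NUM_GPUS" --dist worksteal tests/
```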

Any guidance on optimizing the script for a multi-GPU setup would be greatly appreciated!

Thank you!
