Hello,

I'm running the `run_tests.sh` script found in the root of the project on a system with 8x A100 GPUs (40 GB each).
Internally, the script runs two pytest commands simultaneously:
- One for fast tests.
- One for all tests.
I noticed that the script uses pytest-xdist with the following options:
- `-n auto`: spawns a number of workers based on the CPU core count.
- `--dist worksteal`: distributes tests evenly across workers and allows idle workers to "steal" remaining tests.
However, when I execute the script as-is, I frequently encounter CUDA Out-Of-Memory (OOM) errors. My understanding is that running multiple pytest processes in parallel might be causing resource contention across the GPUs.
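As a point of reference, I experimented with pinning each xdist worker to a single GPU by deriving a device index from the `PYTEST_XDIST_WORKER` environment variable that pytest-xdist sets in each worker process (worker names are `gw0`, `gw1`, ...). This is just a sketch of the idea, not something `run_tests.sh` currently does:

```shell
# Hypothetical sketch: map each xdist worker (gw0, gw1, ...) to one of 8 GPUs.
# PYTEST_XDIST_WORKER is set by pytest-xdist inside each worker process.
worker="${PYTEST_XDIST_WORKER:-gw0}"
gpu_index=$(( ${worker#gw} % 8 ))            # strip the "gw" prefix, wrap at 8 GPUs
export CUDA_VISIBLE_DEVICES="$gpu_index"     # restrict this worker to one device
echo "worker $worker -> GPU $CUDA_VISIBLE_DEVICES"
```

I'm unsure whether this is the intended way to partition GPUs across workers for this project, which is part of why I'm asking.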
Questions:
- What is the recommended number of workers (`-n`) when running tests on a GPU instance? Should it be based on the number of GPUs, the memory per GPU, or another factor?
- Should the two pytest commands in `run_tests.sh` be executed sequentially? They currently run simultaneously using background execution (`&`). Would running them sequentially help mitigate the CUDA OOM issues?
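To illustrate what I mean by the second question, here is a sequential variant of the structure I'd imagine, with a placeholder `run_suite` function standing in for the actual pytest invocations (the arguments below are my guesses, not the literal contents of `run_tests.sh`):

```shell
# Hypothetical sketch: run the two suites one after another instead of
# backgrounding both with "&", so only one pytest process group uses the
# GPUs at a time.
run_suite() {
  echo "running suite: $*"   # stands in for: pytest -n auto --dist worksteal "$@"
}

status=0
run_suite -m fast tests/ || status=1   # fast tests first
run_suite tests/ || status=1           # then the full suite
echo "overall status: $status"
```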
Any guidance on optimizing the script for a multi-GPU setup would be greatly appreciated!
Thank you!