
[Q] Resolving CUDA OOM Errors When Running run_tests.sh on GPUs #857

Open

Description

@apivovarov

Hello,

I’m running the run_tests.sh script found in the root of the project on a system with 8x A100 GPUs (40GB each).
Internally, the script runs two pytest commands simultaneously:

  • One for fast tests.
  • One for all tests.

I noticed that the script uses pytest-xdist with the following options (a rough sketch of the invocation is included after this list):

  • -n auto: spawns as many worker processes as there are CPU cores.
  • --dist worksteal: distributes tests across the workers and lets idle workers "steal" pending tests from busy ones.
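For context, the relevant part of run_tests.sh looks roughly like this. The test paths and the -m fast marker below are placeholders I'm using for illustration, not the script's exact arguments:

```bash
#!/usr/bin/env bash
# Rough sketch of the current behavior (paths/markers are placeholders):
# both pytest invocations start at the same time, each spawning one
# xdist worker per CPU core.
pytest -n auto --dist worksteal -m fast tests/ &
pytest -n auto --dist worksteal tests/ &

# Wait for both background runs to finish.
wait
```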

However, when I execute the script as-is, I frequently encounter CUDA Out-Of-Memory (OOM) errors. My understanding is that the many pytest worker processes running in parallel are all allocating on the GPUs at once, causing contention for GPU memory.

Questions:

  1. What is the recommended number of workers (-n) when running tests on a GPU instance?
    Should it be based on the number of GPUs, memory per GPU, or another factor?
  2. Should the two pytest commands in run_tests.sh be executed sequentially?
    They currently run simultaneously via background execution (&). Would running them one after the other help mitigate the CUDA OOM issues? (A sketch of what I have in mind follows this list.)
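For question 2, this is roughly the variant I'm considering. It is only a sketch: the test paths and the fast marker are placeholders as above, and deriving -n from nvidia-smi -L is just my guess at a sizing rule, not something the project documents:

```bash
#!/usr/bin/env bash
# Hypothetical sequential variant (placeholders as above).
# Size the worker pool from the GPU count instead of the CPU count.
NUM_GPUS="$(nvidia-smi -L | wc -l)"

# No '&': the full suite only starts after the fast tests have
# finished and released their GPU memory.
pytest -n "$NUM_GPUS" --dist worksteal -m fast tests/
pytest -n "$NUM_GPUS" --dist worksteal tests/
```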

Any guidance on optimizing the script for a multi-GPU setup would be greatly appreciated!

Thank you!
