[BUG] "7b_arc_longcontext" evo2 training unit test on 1 gpu too memory consuming #731

@dorotat-nv

BioNeMo Framework Version

4b59b06

Bug Description

The unit test sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py::test_train_single_gpu[7b_arc_longcontext] fails on a single NVIDIA L40 GPU: the pytest process is killed and the CI script reports exit code 137.

Steps to Reproduce

  1. Run the test on a single NVIDIA L40 with the following specification:

12:12:10 Fri Mar 7 11:12:10 2025
12:12:10 +-----------------------------------------------------------------------------------------+
12:12:10 | NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.8     |
12:12:10 |-----------------------------------------+------------------------+----------------------+
12:12:10 | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
12:12:10 | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
12:12:10 |                                         |                        |               MIG M. |
12:12:10 |=========================================+========================+======================|
12:12:10 |   0  NVIDIA L40                     On  |   00000000:C1:00.0 Off |                    0 |
12:12:10 | N/A   31C    P8             33W /  300W |       1MiB /  46068MiB |      0%      Default |
12:12:10 |                                         |                        |                  N/A |
12:12:10 +-----------------------------------------+------------------------+----------------------+
12:12:10
12:12:10 +-----------------------------------------------------------------------------------------+
12:12:10 | Processes:                                                                              |
12:12:10 |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
12:12:10 |        ID   ID                                                               Usage      |
12:12:10 |=========================================================================================|
12:12:10 |  No running processes found                                                             |
12:12:10 +-----------------------------------------------------------------------------------------+
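
To isolate the failing case and get a rough idea of how much host memory it needs, the single parametrized test can be run in a child process. The following is only a sketch, assuming a Linux host, an environment with pytest and the BioNeMo sub-packages installed, and the repository root as working directory. `run_and_report` and `REPRO_CMD` are hypothetical names introduced here, not part of BioNeMo. The measurement covers host RAM only, not GPU memory.

```python
import resource
import subprocess
import sys

# Test ID copied from the bug description above.
TEST_ID = (
    "sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py"
    "::test_train_single_gpu[7b_arc_longcontext]"
)

# Hypothetical reproduction command: run only the failing parametrized case.
REPRO_CMD = [sys.executable, "-m", "pytest", TEST_ID]


def run_and_report(cmd):
    """Run cmd as a child process and return (returncode, peak_rss_kib).

    Note: subprocess reports a child killed by signal N as returncode -N,
    whereas a shell reports it as exit status 128 + N.
    On Linux, ru_maxrss is in KiB. With RUSAGE_CHILDREN it is the peak
    resident set size of the largest descendant that has terminated and
    been waited for, so it is a lower bound on total host memory use.
    """
    proc = subprocess.run(cmd)
    peak_rss_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return proc.returncode, peak_rss_kib


def format_report(returncode, peak_rss_kib):
    """Format the result with the peak RSS converted from KiB to GiB."""
    return f"returncode={returncode}, peak host RSS={peak_rss_kib / 1024**2:.2f} GiB"
```

Calling `print(format_report(*run_and_report(REPRO_CMD)))` from the repository root would run the failing case and print the child's return code together with its peak resident set size.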

Error Messages and Logs

12:23:15  sub-packages/bionemo-evo2/tests/bionemo/evo2/data/test_tokenizer.py::test_tokenizer_processes_special_characters PASSED [ 40%]
12:24:01  sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_infer.py::test_run_infer PASSED [ 43%]
12:24:11  sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_inference.py::test_infer_model_generates_expected_single_token_output PASSED [ 46%]
12:25:48  sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_predict.py::test_predict_evo2_runs PASSED [ 50%]
12:28:09  sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py::test_train_evo2_runs PASSED [ 53%]
12:28:48  sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py::test_train_single_gpu[7b_nv] PASSED [ 56%]
12:29:02  sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py::test_train_single_gpu[7b_arc_longcontext] ci/scripts/run_pytest.sh: line 112:  9935 Killed                  pytest "${PYTEST_OPTIONS[@]}" --junitxml=$(basename $dir).junit.xml -o junit_family=legacy "$dir"
12:29:02  + exit_code=137
12:29:02  + [[ 137 -ne 0 ]]
12:29:02  + [[ false == true ]]
12:29:02  + echo 'Error: pytest failed with exit code 137'
12:29:02  Error: pytest failed with exit code 137
12:29:02  + error=true
12:29:02  + clean_pycache ./sub-packages/bionemo-evo2/
12:29:02  + local base_dir=./sub-packages/bionemo-evo2/
12:29:02  + echo 'Cleaning Python cache files in ./sub-packages/bionemo-evo2/...'
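
The exit code in the log can be decoded with the usual shell convention that a status above 128 means "terminated by signal (status - 128)". For 137 that is signal 9, SIGKILL on Linux. Together with the "Killed" message from the shell, this is consistent with the kernel out-of-memory killer stopping pytest because host memory ran out, rather than a CUDA out-of-memory exception raised inside Python. The log alone does not prove which limit was hit. A minimal sketch of the decoding, where `describe_exit_code` is a hypothetical helper:

```python
import signal


def describe_exit_code(code):
    """Describe a shell exit status.

    Assumes the POSIX shell convention: status > 128 means the process
    was terminated by signal number (status - 128).
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))  # on Linux: terminated by SIGKILL
```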

Docker Image

No response

Additional Context

No response
