Description
BioNeMo Framework Version
Bug Description
The unit test sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py::test_train_single_gpu[7b_arc_longcontext] fails on an L40 GPU: the pytest process is killed mid-test and the CI script reports exit code 137.
Steps to Reproduce
- Run the test on an L40 with the following specification:
12:12:10 Fri Mar 7 11:12:10 2025
12:12:10 +-----------------------------------------------------------------------------------------+
12:12:10 | NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.8 |
12:12:10 |-----------------------------------------+------------------------+----------------------+
12:12:10 | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
12:12:10 | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
12:12:10 | | | MIG M. |
12:12:10 |=========================================+========================+======================|
12:12:10 | 0 NVIDIA L40 On | 00000000:C1:00.0 Off | 0 |
12:12:10 | N/A 31C P8 33W / 300W | 1MiB / 46068MiB | 0% Default |
12:12:10 | | | N/A |
12:12:10 +-----------------------------------------+------------------------+----------------------+
12:12:10
12:12:10 +-----------------------------------------------------------------------------------------+
12:12:10 | Processes: |
12:12:10 | GPU GI CI PID Type Process name GPU Memory |
12:12:10 | ID ID Usage |
12:12:10 |=========================================================================================|
12:12:10 | No running processes found |
12:12:10 +-----------------------------------------------------------------------------------------+
Error Messages and Logs
12:23:15 sub-packages/bionemo-evo2/tests/bionemo/evo2/data/test_tokenizer.py::test_tokenizer_processes_special_characters PASSED [ 40%]
12:24:01 sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_infer.py::test_run_infer PASSED [ 43%]
12:24:11 sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_inference.py::test_infer_model_generates_expected_single_token_output PASSED [ 46%]
12:25:48 sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_predict.py::test_predict_evo2_runs PASSED [ 50%]
12:28:09 sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py::test_train_evo2_runs PASSED [ 53%]
12:28:48 sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py::test_train_single_gpu[7b_nv] PASSED [ 56%]
12:29:02 sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py::test_train_single_gpu[7b_arc_longcontext] ci/scripts/run_pytest.sh: line 112: 9935 Killed pytest "${PYTEST_OPTIONS[@]}" --junitxml=$(basename $dir).junit.xml -o junit_family=legacy "$dir"
12:29:02 + exit_code=137
12:29:02 + [[ 137 -ne 0 ]]
12:29:02 + [[ false == true ]]
12:29:02 + echo 'Error: pytest failed with exit code 137'
12:29:02 Error: pytest failed with exit code 137
12:29:02 + error=true
12:29:02 + clean_pycache ./sub-packages/bionemo-evo2/
12:29:02 + local base_dir=./sub-packages/bionemo-evo2/
12:29:02 + echo 'Cleaning Python cache files in ./sub-packages/bionemo-evo2/...'
Docker Image
No response
System Information
Environment Details:
- OS: [e.g., Ubuntu 20.04]
- CPU: [e.g., Intel i9-12900K]
- RAM: [e.g., 64GB]
GPU Details:
- GPU Model: [e.g., NVIDIA RTX 4090]
- GPU Memory: [e.g., 24GB]
- CUDA Version: [e.g., 12.1]
- CUDA Driver: [e.g., 525.85.05]
- cuDNN Version: [e.g., 8.9.0]
Additional Context
No response
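A note on reading the log above: exit code 137 follows the shell convention of 128 plus the signal number, so it corresponds to signal 9 (SIGKILL), which is consistent with the `Killed` message printed by ci/scripts/run_pytest.sh. The sketch below decodes such exit codes. `decode_exit_code` is a hypothetical helper written for illustration only; it is not part of BioNeMo or the CI scripts. It cannot tell what sent the signal. A host out-of-memory kill is one common cause, but the log does not confirm that here.

```python
import signal


def decode_exit_code(code: int) -> str:
    """Describe a shell-style exit code (hypothetical helper, illustration only)."""
    if code == 0:
        return "success"
    # Shells report "killed by signal N" as exit status 128 + N.
    if 128 < code < 256:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


# Exit code taken from the CI log in this report.
print(decode_exit_code(137))
```

If the decoded signal is SIGKILL, checking the host's kernel log and memory limits on the CI runner would be a reasonable next step, under the assumption that the runner exposes them.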