Description
I'm using the NGC container nvcr.io/nvidia/clara/bionemo-framework:2.6 to run prediction with the Evo2 40b 1M model, which I downloaded and converted with the evo2_convert_to_nemo2 command. My node has 2×H800 GPUs (compute capability 9.0). Running the command below, I get a CUDA out-of-memory error. It looks like only one GPU (80 GB) is being used instead of both (160 GB total), even though torch.cuda.device_count() returns 2. Could you help me figure out what is wrong with my settings? What are the correct steps to run predict_evo2 on multiple GPUs? Thanks!
predict_evo2 --fasta sequences.fasta --ckpt-dir models/bionemo_evo2_40b_1m --output-dir predictions --model-size 40b_arc_longcontext --tensor-parallel-size 1 --pipeline-model-parallel-size 1 --context-parallel-size 1 --output-log-prob-seqs --fp8
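For reference, my understanding is that the Megatron-style parallel sizes multiply together to give the world size, so with all three set to 1 the job runs as a single process on a single GPU (which matches the "Starting with 1 processes" line in the log below). A two-GPU run would presumably look something like the sketch below, assuming predict_evo2 picks up the world size from the torchrun-provided environment (RANK/WORLD_SIZE/LOCAL_RANK), which is the usual NeMo/Megatron convention; I'm not sure this is the intended launcher for this script:

```shell
# Hypothetical two-GPU launch: shard the 40b model across both H800s with
# tensor parallelism instead of loading it whole onto GPU 0.
torchrun --nproc-per-node=2 $(which predict_evo2) \
    --fasta sequences.fasta \
    --ckpt-dir models/bionemo_evo2_40b_1m \
    --output-dir predictions \
    --model-size 40b_arc_longcontext \
    --tensor-parallel-size 2 \
    --pipeline-model-parallel-size 1 \
    --context-parallel-size 1 \
    --output-log-prob-seqs --fp8
```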
Below is the error message:
[INFO | pytorch_lightning.utilities.rank_zero]: GPU available: True (cuda), used: True
[INFO | pytorch_lightning.utilities.rank_zero]: TPU available: False, using: 0 TPU cores
[INFO | pytorch_lightning.utilities.rank_zero]: HPU available: False, using: 0 HPUs
[NeMo W 2025-06-03 17:18:08 nemo_logging:405] No version folders would be created under the log folder as 'resume_if_exists' is enabled.
[NeMo W 2025-06-03 17:18:08 nemo_logging:405] "update_logger_directory" is True. Overwriting tensorboard logger "save_dir" to /tmp/tmp0k0d6stx
[INFO | pytorch_lightning.utilities.rank_zero]: ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
[NeMo W 2025-06-03 17:18:09 random:220] CPU RNG state changed within GPU RNG context
(the warning above is repeated 27 more times)
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[NeMo W 2025-06-03 17:18:11 nemo_logging:405] Could not copy Trainer's 'max_steps' to LR scheduler's 'max_steps'. If you are not using an LR scheduler, this warning can safely be ignored.
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/local/bin/predict_evo2", line 10, in <module>
[rank0]: sys.exit(main())
[rank0]: ^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/bionemo/evo2/run/predict.py", line 416, in main
[rank0]: predict(
[rank0]: File "/usr/local/lib/python3.12/dist-packages/bionemo/evo2/run/predict.py", line 407, in predict
[rank0]: trainer.predict(model, datamodule=datamodule)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 858, in predict
[rank0]: return call._call_and_handle_interrupt(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank0]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]: return function(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 897, in _predict_impl
[rank0]: results = self._run(model, ckpt_path=ckpt_path)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 957, in _run
[rank0]: self.strategy.setup(self)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/pytorch/strategies/megatron_strategy.py", line 419, in setup
[rank0]: self.setup_megatron_parallel(trainer)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/pytorch/strategies/megatron_strategy.py", line 570, in setup_megatron_parallel
[rank0]: self.init_model_parallel()
[rank0]: File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/pytorch/strategies/megatron_strategy.py", line 596, in init_model_parallel
[rank0]: self.megatron_parallel.init_model_parallel()
[rank0]: File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/megatron_parallel.py", line 635, in init_model_parallel
[rank0]: raise e
[rank0]: File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/megatron_parallel.py", line 631, in init_model_parallel
[rank0]: self.init_ddp()
[rank0]: File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/megatron_parallel.py", line 681, in init_ddp
[rank0]: dist_module = DDP(
[rank0]: ^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/megatron_parallel.py", line 894, in __init__
[rank0]: super().__init__(
[rank0]: File "/usr/local/lib/python3.12/dist-packages/megatron/core/distributed/distributed_data_parallel.py", line 246, in __init__
[rank0]: self.buffers, self.bucket_groups = _allocate_buffers_for_parameters(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/megatron/core/distributed/distributed_data_parallel.py", line 176, in _allocate_buffers_for_parameters
[rank0]: _ParamAndGradBuffer(
[rank0]: File "/usr/local/lib/python3.12/dist-packages/megatron/core/distributed/param_and_grad_buffer.py", line 618, in __init__
[rank0]: self.grad_data = torch.zeros(
[rank0]: ^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 76.59 GiB. GPU 0 has a total capacity of 79.11 GiB of which 1.35 GiB is free. Including non-PyTorch memory, this process has 77.75 GiB memory in use. Of the allocated memory 76.59 GiB is allocated by PyTorch, and 12.76 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
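As a back-of-envelope sanity check (assuming the failing torch.zeros call is Megatron's DDP gradient buffer, which allocates one value per parameter even in predict mode, and that the 40b model has roughly 40e9 parameters held in bf16 at 2 bytes each), the full unsharded buffer alone is close to the 76.59 GiB allocation in the error, which is consistent with the whole model landing on a single 80 GB card:

```python
# Rough size of a one-value-per-parameter buffer for the 40b model.
# Assumptions: ~40e9 parameters, bf16 (2 bytes per value).
n_params = 40e9
bytes_per_value = 2
buffer_gib = n_params * bytes_per_value / 2**30
print(f"{buffer_gib:.1f} GiB")  # ~74.5 GiB, close to the failed 76.59 GiB allocation
```

If this reading is right, sharding the model across both GPUs (e.g. tensor parallel size 2) would roughly halve this per-GPU buffer.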