
CUDA OOM happens even though vramfs is enabled #39

@leemgs

Hello. I want to use vramfs as swap space backed by NVIDIA GPU memory.
After reading the README.md, I mounted vramfs with a 20GB allocation.
When I ran nvidia-smi, I was happy to see that vramfs had grabbed about 20GB, as shown below.

# vramfs /tmp/vram 20G
# nvidia-smi
Tue Jun 18 13:22:00 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:21:00.0 Off |                    0 |
| N/A   40C    P0              65W / 300W |  76773MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2867      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A   1856687      C   bin/vramfs                                20892MiB | <--- 20GB for OpenCL
|    0   N/A  N/A   1906793      C   /opt/conda/bin/python3.10                 51754MiB |
|    0   N/A  N/A   1988805      C   /usr/bin/python                            2670MiB |
|    0   N/A  N/A   3729345      C   /usr/bin/python                            1418MiB |
+---------------------------------------------------------------------------------------+
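
For completeness, I believe the mount can also be sanity-checked from the filesystem side with something like this (just a sketch, output omitted):

# df -h /tmp/vram          # the FUSE filesystem should report roughly 20G total
# mount | grep /tmp/vram   # should list the vramfs FUSE mount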

Then I created a 10GB swap file at /tmp/vram/swapfile as follows.

# cd /tmp/vram
# LOOPDEV=$(losetup -f)
# truncate -s 10G swapfile # replace 10G with the target swap size; it must be smaller than the vramfs allocation (e.g. 20G)
# losetup $LOOPDEV swapfile
# mkswap $LOOPDEV
# swapon $LOOPDEV
# cat /proc/swaps
   Filename                                Type            Size            Used            Priority
   /dev/loop7                              partition       10485756        0               -3
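
One thing I was not sure about is the swap priority: the kernel assigned a negative default (-3 above). If another swap device existed on the system, I assume something like this would make the vramfs-backed swap the preferred one (a sketch, reusing $LOOPDEV from above):

# swapoff $LOOPDEV
# swapon --priority 100 $LOOPDEV   # a higher priority makes the kernel prefer this device
# cat /proc/swaps                  # the Priority column should now show 100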

I also raised the memlock and rtprio limits for my user in /etc/security/limits.conf:

# vi /etc/security/limits.conf
leemgs hard memlock unlimited
leemgs soft memlock unlimited
leemgs hard rtprio unlimited
leemgs soft rtprio unlimited
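
After logging out and back in, I assume the new limits can be verified like this (a sketch; this relies on pam_limits picking up limits.conf at login):

$ ulimit -l   # max locked memory; should report "unlimited"
$ ulimit -r   # max realtime priority; should report "unlimited"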

However, when I used the open-source project axolotl (https://github.com/OpenAccess-AI-Collective/axolotl) to run model training as shown below, I got a CUDA OOM error (torch.cuda.OutOfMemoryError: CUDA out of memory).

$ accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml
  • log messages:
 ........... Omission ....................
[2024-06-18 13:11:25,615] [DEBUG] [axolotl.load_tokenizer:216] [PID:3288778] [RANK:0] EOS: 2 / </s>
[2024-06-18 13:11:25,615] [DEBUG] [axolotl.load_tokenizer:217] [PID:3288778] [RANK:0] BOS: 1 / <s>
[2024-06-18 13:11:25,616] [DEBUG] [axolotl.load_tokenizer:218] [PID:3288778] [RANK:0] PAD: 2 / </s>
[2024-06-18 13:11:25,616] [DEBUG] [axolotl.load_tokenizer:219] [PID:3288778] [RANK:0] UNK: 0 / <unk>
[2024-06-18 13:11:25,616] [INFO] [axolotl.load_tokenizer:224] [PID:3288778] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-06-18 13:11:25,616] [DEBUG] [axolotl.train.log:61] [PID:3288778] [RANK:0] loading model and peft_config...
[2024-06-18 13:11:25,862] [INFO] [axolotl.load_model:280] [PID:3288778] [RANK:0] patching with flash attention for sample packing
[2024-06-18 13:11:25,862] [INFO] [axolotl.load_model:366] [PID:3288778] [RANK:0] patching _expand_mask
/home/guest/.local/lib/python3.10/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
[2024-06-18 13:11:32,028] [ERROR] [axolotl.load_model:591] [PID:3288778] [RANK:0] CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 79.15 GiB total capacity; 3.20 GiB already allocated; 153.94 MiB free; 3.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/data/home/guest/fine-tuning-axolotl/src/axolotl/utils/models.py", line 480, in load_model
    model = LlamaForCausalLM.from_pretrained(
  File "/home/guest/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3852, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/guest/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4286, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/guest/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 841, in _load_state_dict_into_meta_model
    set_module_quantized_tensor_to_device(model, param_name, param_device, value=param)
  File "/home/guest/.local/lib/python3.10/site-packages/transformers/integrations/bitsandbytes.py", line 128, in set_module_quantized_tensor_to_device
    new_value = value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 79.15 GiB total capacity; 3.20 GiB already allocated; 153.94 MiB free; 3.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/home/guest/fine-tuning-axolotl/src/axolotl/cli/train.py", line 49, in <module>
    fire.Fire(do_cli)
  File "/home/guest/.local/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/guest/.local/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/guest/.local/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/data/home/guest/fine-tuning-axolotl/src/axolotl/cli/train.py", line 33, in do_cli
    return do_train(parsed_cfg, parsed_cli_args)
  File "/data/home/guest/fine-tuning-axolotl/src/axolotl/cli/train.py", line 45, in do_train
    return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
  File "/data/home/guest/fine-tuning-axolotl/src/axolotl/train.py", line 65, in train
    model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
  File "/data/home/guest/fine-tuning-axolotl/src/axolotl/utils/models.py", line 592, in load_model
    raise err
  File "/data/home/guest/fine-tuning-axolotl/src/axolotl/utils/models.py", line 480, in load_model
    model = LlamaForCausalLM.from_pretrained(
  File "/home/guest/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3852, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/guest/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4286, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/guest/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 841, in _load_state_dict_into_meta_model
    set_module_quantized_tensor_to_device(model, param_name, param_device, value=param)
  File "/home/guest/.local/lib/python3.10/site-packages/transformers/integrations/bitsandbytes.py", line 128, in set_module_quantized_tensor_to_device
    new_value = value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 79.15 GiB total capacity; 3.20 GiB already allocated; 153.94 MiB free; 3.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/home/guest/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/guest/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/guest/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1023, in launch_command
    simple_launcher(args)
  File "/home/guest/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 643, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-m', 'axolotl.cli.train', 'examples/openllama-3b/lora.yml']' returned non-zero exit status 1.
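
(As an aside, the error text above suggests tuning the caching allocator. I assume that would look roughly like the following, with 128 being an arbitrary value, although I would not expect it to change the swap behaviour I am asking about:)

$ export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128   # value is a guess, per the hint in the OOM message
$ accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml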

# cat /proc/swaps
Filename                                Type            Size            Used            Priority
/dev/loop7                              partition       10485756        0               -2

As you can see, the used swap space on /dev/loop7 is still 0, which seems strange.
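
In case it helps narrow things down: my (possibly wrong) understanding is that kernel swap only pages out host memory, while the allocation that failed above was made on the device by CUDA itself, so it may never reach the swap layer at all. A rough way to check that the loop-device swap works independently of CUDA would be to create host memory pressure and watch the Used column (a sketch; 32 GiB is just a placeholder larger than my free host RAM):

$ python3 -c "buf = bytearray(32 * 1024**3); input('allocated, press Enter to release')"
# in another terminal, while the allocation is held:
$ watch -n1 cat /proc/swaps   # the Used column should grow once physical RAM is exhausted
$ vmstat 1                    # the si/so columns show swap-in/swap-out traffic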

So I was wondering: is it actually possible to use vramfs-backed swap space to relieve NVIDIA GPU memory pressure like this? Any hints or clues are welcome.
