Description
May I ask which kind of GPU is recommended for training this project? I am using an NVIDIA RTX A6000 with 47.5 GB of memory attached, but training fails with an out-of-memory error.
The full error output follows:
```
[2025-02-18 16:41:01,068][smart_tree.model.train][INFO] - Train Dataset Size: 480
[2025-02-18 16:41:01,068][smart_tree.model.train][INFO] - Validation Dataset Size: 60
[2025-02-18 16:41:01,068][smart_tree.model.train][INFO] - Test Dataset Size: 60
/home/huangyongchang/OpenProject/smart-tree/smart_tree/model/train.py:214: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
  amp_ctx = torch.cuda.amp.autocast() if cfg.fp16 else contextlib.nullcontext()
/home/huangyongchang/OpenProject/smart-tree/smart_tree/model/train.py:215: FutureWarning: torch.cuda.amp.GradScaler(args...) is deprecated. Please use torch.amp.GradScaler('cuda', args...) instead.
  scaler = torch.cuda.amp.grad_scaler.GradScaler()
Epoch:   0%|          | 0/1 [00:27<?, ?it/s]
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/model/train.py", line 232, in main
    val_tracker = eval_epoch(
  File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/model/train.py", line 74, in eval_epoch
    for sp_input, targets, mask, fn in tqdm(
  File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/model/helper.py", line 14, in get_batch
    for (feats, target_feats), coords, mask, filenames in dataloader:
  File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
    data = self._next_data()
  File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 757, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/dataset/dataset.py", line 74, in __getitem__
    cld = self.load(filename)
  File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/dataset/dataset.py", line 68, in load
    self.cache[filename] = self.load_cloud(filename).pin_memory()
  File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/data_types/cloud.py", line 112, in pin_memory
    rgb = self.rgb.pin_memory() if self.rgb is not None else None
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
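Note the traceback shows the failure happens in `dataset.py`'s `load`, which caches every loaded cloud in pinned (page-locked) host memory via `.pin_memory()`, so the memory pressure grows with the number of clouds cached rather than the model itself. As a rough sanity check, here is a back-of-the-envelope estimate of how much pinned memory caching the whole dataset could need; the per-cloud point count and channel layout below are pure assumptions for illustration, not values from the smart-tree project:

```python
# Rough estimate of pinned host memory if every cloud stays cached.
# points_per_cloud and floats_per_point are ASSUMED values, not from the repo.
points_per_cloud = 2_000_000       # assumed average points per tree scan
floats_per_point = 3 + 3 + 1       # assumed layout: xyz + rgb + one extra channel
bytes_per_float = 4                # float32
clouds = 480 + 60 + 60             # train + val + test sizes from the log above

total_gb = clouds * points_per_cloud * floats_per_point * bytes_per_float / 1e9
print(f"~{total_gb:.0f} GB of pinned host memory")  # ~34 GB under these assumptions
```

Under these (guessed) numbers the cache alone approaches the size of the card's memory, which would explain hitting "CUDA error: out of memory" during the first evaluation epoch even on a large GPU.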