RuntimeError: CUDA error: out of memory #13

Open
@away-back

Which GPU is needed to train this project? I am using an NVIDIA RTX A6000 with 47.5 GB of memory, but training still fails with an out-of-memory error.

Here is the full error output from my run:
[2025-02-18 16:41:01,068][smart_tree.model.train][INFO] - Train Dataset Size: 480
[2025-02-18 16:41:01,068][smart_tree.model.train][INFO] - Validation Dataset Size: 60
[2025-02-18 16:41:01,068][smart_tree.model.train][INFO] - Test Dataset Size: 60
/home/huangyongchang/OpenProject/smart-tree/smart_tree/model/train.py:214: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
amp_ctx = torch.cuda.amp.autocast() if cfg.fp16 else contextlib.nullcontext()
/home/huangyongchang/OpenProject/smart-tree/smart_tree/model/train.py:215: FutureWarning: torch.cuda.amp.GradScaler(args...) is deprecated. Please use torch.amp.GradScaler('cuda', args...) instead.
scaler = torch.cuda.amp.grad_scaler.GradScaler()
Epoch: 0%| | 0/1 [00:27<?, ?it/s]
Error executing job with overrides: []
Traceback (most recent call last):
File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/model/train.py", line 232, in main
val_tracker = eval_epoch(
File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/model/train.py", line 74, in eval_epoch
for sp_input, targets, mask, fn in tqdm(
File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
for obj in iterable:
File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/model/helper.py", line 14, in get_batch
for (feats, target_feats), coords, mask, filenames in dataloader:
File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
data = self._next_data()
File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 757, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/dataset/dataset.py", line 74, in __getitem__
cld = self.load(filename)
File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/dataset/dataset.py", line 68, in load
self.cache[filename] = self.load_cloud(filename).pin_memory()
File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/data_types/cloud.py", line 112, in pin_memory
rgb = self.rgb.pin_memory() if self.rgb is not None else None
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
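
One observation about the traceback: the error is raised inside `pin_memory()`, called from `self.cache[filename] = self.load_cloud(filename).pin_memory()`. Pinning allocates page-locked host RAM rather than GPU memory, and as far as I understand a failed pinned allocation can also be reported as "CUDA error: out of memory". So the limit being hit may be host memory held by the cache, not the 47.5 GB on the card. That is only a guess from the trace. A minimal sketch of a cache that keeps a bounded number of entries is below. `BoundedCache`, `max_items` and `fake_load` are hypothetical names for illustration and are not part of smart-tree. In the real dataset the loader would be whatever `load` currently calls.

```python
from collections import OrderedDict
from typing import Callable, Generic, Hashable, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


class BoundedCache(Generic[K, V]):
    """Least-recently-used cache that holds at most `max_items` entries.

    Hypothetical helper: a stand-in for an unbounded dict cache, so that
    the number of (possibly pinned) objects kept alive stays capped.
    """

    def __init__(self, loader: Callable[[K], V], max_items: int) -> None:
        if max_items < 1:
            raise ValueError("max_items must be at least 1")
        self._loader = loader
        self._max_items = max_items
        self._items: "OrderedDict[K, V]" = OrderedDict()

    def get(self, key: K) -> V:
        if key in self._items:
            # Cache hit: mark the entry as most recently used.
            self._items.move_to_end(key)
            return self._items[key]
        # Cache miss: load, store, then evict the oldest entry if over capacity.
        value = self._loader(key)
        self._items[key] = value
        if len(self._items) > self._max_items:
            self._items.popitem(last=False)
        return value

    def __len__(self) -> int:
        return len(self._items)


# Usage with a fake loader that records which files were actually loaded.
loads = []


def fake_load(name: str) -> str:
    loads.append(name)
    return name.upper()


cache = BoundedCache(fake_load, max_items=2)
for name in ["a.npz", "b.npz", "a.npz", "c.npz", "b.npz"]:
    cache.get(name)

print(loads)       # files that missed the cache, in load order
print(len(cache))  # never exceeds max_items
```

With a cap like this, an evicted file is reloaded on its next access, so it trades some loading time for a fixed upper bound on cached objects. Skipping `pin_memory()` for cached items would be another way to test whether pinned host memory is the cause.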
