Description
May I ask which kind of GPU is recommended for training this project? I am using an NVIDIA RTX A6000 with 47.5 GB of memory attached, but training fails with an out-of-memory error.
The full error output follows:
```
[2025-02-18 16:41:01,068][smart_tree.model.train][INFO] - Train Dataset Size: 480
[2025-02-18 16:41:01,068][smart_tree.model.train][INFO] - Validation Dataset Size: 60
[2025-02-18 16:41:01,068][smart_tree.model.train][INFO] - Test Dataset Size: 60
/home/huangyongchang/OpenProject/smart-tree/smart_tree/model/train.py:214: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
  amp_ctx = torch.cuda.amp.autocast() if cfg.fp16 else contextlib.nullcontext()
/home/huangyongchang/OpenProject/smart-tree/smart_tree/model/train.py:215: FutureWarning: torch.cuda.amp.GradScaler(args...) is deprecated. Please use torch.amp.GradScaler('cuda', args...) instead.
  scaler = torch.cuda.amp.grad_scaler.GradScaler()
Epoch:   0%|          | 0/1 [00:27<?, ?it/s]
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/model/train.py", line 232, in main
    val_tracker = eval_epoch(
  File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/model/train.py", line 74, in eval_epoch
    for sp_input, targets, mask, fn in tqdm(
  File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/model/helper.py", line 14, in get_batch
    for (feats, target_feats), coords, mask, filenames in dataloader:
  File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
    data = self._next_data()
  File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 757, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/dataset/dataset.py", line 74, in __getitem__
    cld = self.load(filename)
  File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/dataset/dataset.py", line 68, in load
    self.cache[filename] = self.load_cloud(filename).pin_memory()
  File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/data_types/cloud.py", line 112, in pin_memory
    rgb = self.rgb.pin_memory() if self.rgb is not None else None
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
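Note the traceback shows the failure happens in `dataset.py`'s `load`, which caches every loaded cloud in pinned (page-locked) host memory via `.pin_memory()`, so the memory pressure grows with the number of clouds cached rather than the model itself. As a rough sanity check, here is a back-of-the-envelope estimate of how much pinned memory caching the whole dataset could need; the per-cloud point count and channel layout below are pure assumptions for illustration, not values from the smart-tree project:

```python
# Rough estimate of pinned host memory if every cloud stays cached.
# points_per_cloud and floats_per_point are ASSUMED values, not from the repo.
points_per_cloud = 2_000_000       # assumed average points per tree scan
floats_per_point = 3 + 3 + 1       # assumed layout: xyz + rgb + one extra channel
bytes_per_float = 4                # float32
clouds = 480 + 60 + 60             # train + val + test sizes from the log above

total_gb = clouds * points_per_cloud * floats_per_point * bytes_per_float / 1e9
print(f"~{total_gb:.0f} GB of pinned host memory")  # ~34 GB under these assumptions
```

Under these (guessed) numbers the cache alone approaches the size of the card's memory, which would explain hitting "CUDA error: out of memory" during the first evaluation epoch even on a large GPU.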