Replies: 3 comments
-
Hello @jeaho322, have you updated your NVIDIA drivers, and are they compatible with the Torch CUDA version installed in your environment? The error message does not really point towards a memory issue.
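For example, you can check the versions from the failing environment like this (a minimal sanity check, nothing anomalib-specific) and compare the reported CUDA build against what nvidia-smi shows for the driver:

# Minimal check of the PyTorch / CUDA pairing in the environment that errors out.
import torch

print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))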
-
Hello, if you have an out-of-memory error with PatchCore, you can try training on CPU instead of GPU. You can also decrease the image size or the number of images used for training.
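As a rough sketch of switching the training to CPU (how the image size is set depends on the anomalib version, e.g. via the datamodule in older releases or via a pre-processor transform in newer ones, so it is left out here; the paths and batch size are just illustrative):

from anomalib.data import Folder
from anomalib.models import Patchcore
from anomalib.engine import Engine

# Smaller batch size and CPU accelerator to avoid filling GPU memory.
datamodule = Folder(name="dataset", normal_dir="/home/sample", train_batch_size=8)
model = Patchcore(coreset_sampling_ratio=0.05)
engine = Engine(accelerator="cpu", devices=1, max_epochs=1)
engine.fit(model=model, datamodule=datamodule)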
-
I would recommend this PR: #3105. AnomalyDINO can be considered the modern PatchCore, and with it you might get rid of the OOM issues.
-
Hi,
I’m trying to train a PatchCore model using the Folder class to create the datamodule.
Since I’m working in an offline environment, I manually loaded the backbone weights via model.model.feature_extractor.
However, during training, GPU memory usage keeps increasing within a single epoch — it accumulates batch by batch until it eventually causes an out-of-memory error.
I’ve already tried lowering the sampling_ratio, as well as reducing both the batch_size and num_neighbors, but the issue persists.
Any advice or insights would be greatly appreciated. Thanks in advance!
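For reference, the per-batch growth can be reproduced with a plain Lightning callback (a minimal sketch for monitoring only, not part of my training script):

# Hypothetical debugging callback: print allocated/reserved CUDA memory after
# training batches to confirm that usage grows with the number of batches.
import torch
from lightning.pytorch.callbacks import Callback

class CudaMemoryLogger(Callback):
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if torch.cuda.is_available() and batch_idx % 10 == 0:
            alloc = torch.cuda.memory_allocated() / 1024**2
            reserved = torch.cuda.memory_reserved() / 1024**2
            print(f"batch {batch_idx}: allocated={alloc:.0f} MiB, reserved={reserved:.0f} MiB")

# The callback can then be passed to the trainer, e.g. Engine(callbacks=[CudaMemoryLogger()], ...)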
Custom Train Code
from anomalib.data import Folder, PredictDataset
from anomalib.models import Patchcore
from anomalib.metrics import Evaluator, F1Score
from anomalib.engine import Engine
import torch, yaml
import pandas as pd

def load_yaml(path):
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

def make_datamodule(path, mode):
    cfg = load_yaml(path)
    if mode == 'train':
        datamodule = Folder(
            name=cfg['dataset']['name'],
            normal_dir=cfg['dataset']['normal_dir'],
            train_batch_size=cfg['train']['train_batch_size'],
        )
        return datamodule

def load_local_backbone_weights(feature_extractor, weight_path):
    state = torch.load(weight_path, map_location="cpu")
    if isinstance(state, dict) and "state_dict" in state:
        state = state["state_dict"]
    # Load the weights into the timm backbone; strict=False tolerates head/fc keys.
    feature_extractor.load_state_dict(state, strict=False)

if __name__ == '__main__':
Configs

dataset:
  name: dataset
  normal_dir: /home/sample
  abnormal_dir: None
train:
  train_batch_size: 32
model:
  name: patchcore
  backbone: wide_resnet50_2
  pre_trained: true
  layers:
    - layer2
    - layer3
  coreset_sampling_ratio: 0.05
project:
  seed: 42
  path: ./results/patchcore_bottle
trainer:
  accelerator: gpu
  devices: 1
  max_epochs: 1
  precision: 16
  enable_progress_bar: true
logging:
  log_graph: false
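These values map onto the Patchcore and Engine constructors roughly as follows (a sketch based on the config above, not my full script; argument names may differ slightly between anomalib versions):

# Sketch: building the model and engine from the cfg dict loaded by load_yaml().
model = Patchcore(
    backbone=cfg["model"]["backbone"],                           # wide_resnet50_2
    layers=cfg["model"]["layers"],                               # ["layer2", "layer3"]
    pre_trained=cfg["model"]["pre_trained"],
    coreset_sampling_ratio=cfg["model"]["coreset_sampling_ratio"],
)
engine = Engine(
    accelerator=cfg["trainer"]["accelerator"],
    devices=cfg["trainer"]["devices"],
    max_epochs=cfg["trainer"]["max_epochs"],
    precision=cfg["trainer"]["precision"],
)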
Error
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 215, in forward
input = module(input)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/namu/.local/lib/python3.10/site-packages/timm/models/resnet.py", line 257, in forward
x = self.conv3(x)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":1154, please report a bug to PyTorch.