Replies: 3 comments
-
Hello @jeaho322, have you updated your NVIDIA drivers, and are they compatible with the Torch CUDA version installed in your environment? The error message does not really point towards a memory issue.
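For example, you can check the versions from the failing environment like this (a minimal sanity check, nothing anomalib-specific) and compare the reported CUDA build against what nvidia-smi shows for the driver:

# Minimal check of the PyTorch / CUDA pairing in the environment that errors out.
import torch

print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))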
-
Hello, if you have an out-of-memory error with PatchCore, you can try training on CPU instead of GPU. You can also decrease the image size or the number of images used for training.
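As a rough sketch of switching the training to CPU (how the image size is set depends on the anomalib version, e.g. via the datamodule in older releases or via a pre-processor transform in newer ones, so it is left out here; the paths and batch size are just illustrative):

from anomalib.data import Folder
from anomalib.models import Patchcore
from anomalib.engine import Engine

# Smaller batch size and CPU accelerator to avoid filling GPU memory.
datamodule = Folder(name="dataset", normal_dir="/home/sample", train_batch_size=8)
model = Patchcore(coreset_sampling_ratio=0.05)
engine = Engine(accelerator="cpu", devices=1, max_epochs=1)
engine.fit(model=model, datamodule=datamodule)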
-
I would recommend this PR: #3105. AnomalyDINO can be considered the modern PatchCore, and with it you might get rid of the OOM issues.
-
Hi,
I’m trying to train a PatchCore model using the Folder class to create the datamodule.
Since I’m working in an offline environment, I manually loaded the backbone weights via model.model.feature_extractor.
However, during training, GPU memory usage keeps increasing within a single epoch — it accumulates batch by batch until it eventually causes an out-of-memory error.
I’ve already tried lowering the sampling_ratio, as well as reducing both the batch_size and num_neighbors, but the issue persists.
Any advice or insights would be greatly appreciated. Thanks in advance!
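For reference, the per-batch growth can be reproduced with a plain Lightning callback (a minimal sketch for monitoring only, not part of my training script):

# Hypothetical debugging callback: print allocated/reserved CUDA memory after
# training batches to confirm that usage grows with the number of batches.
import torch
from lightning.pytorch.callbacks import Callback

class CudaMemoryLogger(Callback):
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if torch.cuda.is_available() and batch_idx % 10 == 0:
            alloc = torch.cuda.memory_allocated() / 1024**2
            reserved = torch.cuda.memory_reserved() / 1024**2
            print(f"batch {batch_idx}: allocated={alloc:.0f} MiB, reserved={reserved:.0f} MiB")

# The callback can then be passed to the trainer, e.g. Engine(callbacks=[CudaMemoryLogger()], ...)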
Custom Train Code
from anomalib.data import Folder, PredictDataset
from anomalib.models import Patchcore
from anomalib.metrics import Evaluator, F1Score
from anomalib.engine import Engine
import torch, yaml
import pandas as pd

def load_yaml(path):
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

def make_datamodule(path, mode):
    cfg = load_yaml(path)
    if mode == 'train':
        datamodule = Folder(
            name=cfg['dataset']['name'],
            normal_dir=cfg['dataset']['normal_dir'],
            train_batch_size=cfg['train']['train_batch_size'],
        )
        return datamodule

def load_local_backbone_weights(feature_extractor, weight_path):
    state = torch.load(weight_path, map_location="cpu")
    if isinstance(state, dict) and "state_dict" in state:
        state = state["state_dict"]
    # Load the weights into the timm backbone; strict=False tolerates head/fc keys.
    feature_extractor.load_state_dict(state, strict=False)

if __name__ == '__main__':
Configs

dataset:
  name: dataset
  normal_dir: /home/sample
  abnormal_dir: None
train:
  train_batch_size: 32
model:
  name: patchcore
  backbone: wide_resnet50_2
  pre_trained: true
  layers:
    - layer2
    - layer3
  coreset_sampling_ratio: 0.05
project:
  seed: 42
  path: ./results/patchcore_bottle
trainer:
  accelerator: gpu
  devices: 1
  max_epochs: 1
  precision: 16
  enable_progress_bar: true
logging:
  log_graph: false
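These values map onto the Patchcore and Engine constructors roughly as follows (a sketch based on the config above, not my full script; argument names may differ slightly between anomalib versions):

# Sketch: building the model and engine from the cfg dict loaded by load_yaml().
model = Patchcore(
    backbone=cfg["model"]["backbone"],                           # wide_resnet50_2
    layers=cfg["model"]["layers"],                               # ["layer2", "layer3"]
    pre_trained=cfg["model"]["pre_trained"],
    coreset_sampling_ratio=cfg["model"]["coreset_sampling_ratio"],
)
engine = Engine(
    accelerator=cfg["trainer"]["accelerator"],
    devices=cfg["trainer"]["devices"],
    max_epochs=cfg["trainer"]["max_epochs"],
    precision=cfg["trainer"]["precision"],
)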
Error
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 215, in forward
input = module(input)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/namu/.local/lib/python3.10/site-packages/timm/models/resnet.py", line 257, in forward
x = self.conv3(x)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":1154, please report a bug to PyTorch.