Try tweaking the config:
```python
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
torch.backends.cuda.matmul.allow_tf32 = False
```

Dynamic code generation is usually the cause of slow compilation. You can disable the features related to it to speed up compilation, but this may slow down inference.
Disable JIT optimized execution (fusion). This can significantly speed up compilation.
```python
# Wrap your code in this context manager
with torch.jit.optimized_execution(False):
    # Do your things
    ...
```

Or disable it globally:

```python
torch.jit.set_fusion_strategy([('STATIC', 0), ('DYNAMIC', 0)])
```

Disable Triton (not suggested):
```python
config.enable_triton = False
```

When your GPU VRAM is insufficient or the image resolution is high, CUDA Graph can cause less efficient VRAM utilization and slow down inference.
```python
config.enable_cuda_graph = False
```

Triton might not work properly because it uses a cache to store compiled kernels, especially right after you upgrade stable-fast or Triton. You can try clearing the cache to fix it:

```bash
rm -rf ~/.triton
```

Even with PyTorch's own `torch.compile`, I have encountered crashes and segmentation faults.
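A slightly safer variant of the cache wipe, as a sketch: it only deletes the directory when it actually exists. The default cache path `~/.triton` is assumed here; the `TRITON_CACHE_DIR` override is Triton's documented way to relocate the cache, and we honor it when set.

```shell
# Clear the Triton kernel cache only if it is present.
# TRITON_CACHE_DIR, when set, overrides the default ~/.triton location.
CACHE_DIR="${TRITON_CACHE_DIR:-$HOME/.triton}"
if [ -d "$CACHE_DIR" ]; then
    rm -rf "$CACHE_DIR"
    echo "cleared $CACHE_DIR"
else
    echo "no Triton cache at $CACHE_DIR"
fi
```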
Such failures are usually caused by Triton, CUDA Graph or cudaMallocAsync, because these components are not stable enough yet.
You could try removing the `PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync` environment variable
and disabling Triton and CUDA Graph to fix it:
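The environment variable can also be dropped from inside Python, as a sketch. PyTorch reads `PYTORCH_CUDA_ALLOC_CONF` when it initializes CUDA, so this must run at the very top of your script, before `import torch`:

```python
import os

# Drop the async-allocator selection before torch initializes CUDA.
# Must run before `import torch`, since PyTorch reads this variable
# during CUDA initialization.
os.environ.pop("PYTORCH_CUDA_ALLOC_CONF", None)
```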
```python
config.enable_triton = False
# or
config.enable_cuda_graph = False
```

ImportError: DLL load failed while importing _C: The specified module could not be found
Make sure you have installed torch with CUDA support and that the installed version is compatible with your Python and CUDA versions.
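A small sanity check, sketched below, reports whether the installed torch build can see CUDA at all. It degrades gracefully: a broken native extension (the `_C` DLL failure above) raises `ImportError`, which is caught.

```python
def check_torch_cuda():
    """Return (torch version, CUDA version, CUDA available), or None if
    torch cannot be imported (e.g. the _C native extension fails to load)."""
    try:
        import torch
    except ImportError:
        return None  # torch missing, or its native extension failed to load
    # torch.version.cuda is None for CPU-only builds.
    return torch.__version__, torch.version.cuda, torch.cuda.is_available()

print(check_torch_cuda())
```

If this prints `None` or a CUDA version of `None`, reinstall torch from the CUDA-enabled wheel index matching your CUDA toolkit.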