Segfaults with cudnn>=9.11 for pre-Turing devices (<=sm_70) #124

Description

@h-vetinari

When preparing the pytorch v2.8 release, @mgorny ran into a bunch of segfaults. After some painful debugging against the 2.7 branch (diffing the environment against the last known passing run, recreating that passing run, then relaxing the constraints again one by one), we concluded that we need to pin cudnn <9.11; we still don't know why the segfaults occur though.

The segfaults looked like this:

........................................................................ [ 23%]
Fatal Python error: Segmentation fault

Thread 0x00007f295ffff640 (most recent call first):
  <no Python frame>

Thread 0x00007f2b48cc6640 (most recent call first):

and the pytest-xdist failure summary:

=================================== FAILURES ===================================
____________________________ test/test_autograd.py _____________________________
[gw0] linux -- Python 3.13.5 $PREFIX/bin/python3.13
worker 'gw0' crashed while running 'test/test_autograd.py::TestAutogradDeviceTypeCUDA::test_rnn_backward_to_input_but_not_parameters_cuda'
_____________________________ test/test_modules.py _____________________________
[gw1] linux -- Python 3.13.5 $PREFIX/bin/python3.13
worker 'gw1' crashed while running 'test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_BatchNorm1d_eval_mode_cuda_float32'
_______________________________ test/test_nn.py ________________________________
[gw2] linux -- Python 3.13.5 $PREFIX/bin/python3.13
worker 'gw2' crashed while running 'test/test_nn.py::TestNNDeviceTypeCUDA::test_CTCLoss_cudnn_cuda'
_____________________________ test/test_modules.py _____________________________
[gw3] linux -- Python 3.13.5 $PREFIX/bin/python3.13
worker 'gw3' crashed while running 'test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_BatchNorm1d_eval_mode_cuda_float64'
_______________________________ test/test_nn.py ________________________________
[gw4] linux -- Python 3.13.5 $PREFIX/bin/python3.13
worker 'gw4' crashed while running 'test/test_nn.py::TestNN::test_RNN_change_dropout'
================== xdist: maximum crashed workers reached: 4 ===================

Note the "maximum crashed workers reached" line: pytest-xdist aborted after four crashed workers, so there are likely many more affected tests.
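To see what else fails beyond the xdist limit, one can rerun the crashed tests individually (test IDs copied from the summary above). The snippet below is only a convenience sketch: it assumes it is run from the directory containing pytorch's test/ folder, inside the affected environment.

```python
# Convenience sketch (not part of the original report): rerun each crashed test in
# its own subprocess, so a segfault in one test doesn't hide the results of the rest.
import subprocess
import sys

CRASHED = [
    "test/test_autograd.py::TestAutogradDeviceTypeCUDA::test_rnn_backward_to_input_but_not_parameters_cuda",
    "test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_BatchNorm1d_eval_mode_cuda_float32",
    "test/test_nn.py::TestNNDeviceTypeCUDA::test_CTCLoss_cudnn_cuda",
    "test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_BatchNorm1d_eval_mode_cuda_float64",
    "test/test_nn.py::TestNN::test_RNN_change_dropout",
]

for test_id in CRASHED:
    ret = subprocess.run([sys.executable, "-m", "pytest", test_id]).returncode
    # a segfault typically shows up as a negative exit code (the terminating signal)
    print(f"{test_id}: exit code {ret}")
```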

One way for the @conda-forge/cudnn folks to test this would be to install pytorch v2.7.1 (which only carries a cudnn >=9.10.1.4,<10.0a0 constraint) and then run the test suite against a newer cudnn. For pytorch v2.8.0, you'd have to destructively alter the environment (e.g. copy a newer cudnn into $PREFIX), because the package metadata won't let the solver install a newer cudnn alongside it.
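Alternatively, the cuDNN code paths that the crashed tests touch can be poked at directly. The following is a minimal sketch and not a guaranteed reproducer: it assumes a CUDA-enabled pytorch in an environment where a newer cudnn is present (installed next to v2.7.1, or copied into $PREFIX for v2.8.0), and simply runs a cuDNN RNN backward, a BatchNorm1d eval-mode forward, and a CTC loss that should dispatch to cuDNN.

```python
# Minimal sketch (not a guaranteed reproducer): directly exercises the cuDNN paths
# that the crashed tests touch: RNN backward, BatchNorm1d in eval mode, CTCLoss.
import torch
import torch.nn.functional as F

assert torch.cuda.is_available() and torch.backends.cudnn.is_available()
dev = torch.device("cuda")

# cuDNN RNN forward + backward (cf. test_rnn_backward_to_input_but_not_parameters_cuda)
rnn = torch.nn.RNN(8, 16, batch_first=True).to(dev)
x = torch.randn(4, 10, 8, device=dev, requires_grad=True)
out, _ = rnn(x)
out.sum().backward()

# BatchNorm1d in eval mode (cf. test_cpu_gpu_parity_nn_BatchNorm1d_eval_mode_cuda_*)
bn = torch.nn.BatchNorm1d(8).to(dev).eval()
bn(torch.randn(4, 8, device=dev))

# CTC loss; int32 CPU targets and full-length inputs are the usual conditions for
# pytorch to choose the cuDNN kernel (cf. test_CTCLoss_cudnn_cuda)
log_probs = torch.randn(50, 4, 20, device=dev).log_softmax(2)
targets = torch.randint(1, 20, (4, 30), dtype=torch.int32)
input_lengths = torch.full((4,), 50, dtype=torch.int32)
target_lengths = torch.full((4,), 30, dtype=torch.int32)
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)

torch.cuda.synchronize()
print("no crash; ctc loss =", float(loss))
```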

Obviously we want to get rid of the upper bound ASAP: other feedstocks building against newer cudnn 9.x will pick up >=9.x,<10 run-exports, which makes them incompatible with a pytorch v2.8 that pins cudnn <9.11.

Sidenote: @carterbox mentioned an ABI break between pytorch v2.7.0 and v2.7.1, related to v2.7.1 having been built against the pybind v3 ABI. As far as I understand, that only affects packages building on top of pytorch, not pytorch itself, so I don't think it's the cause here; I mention it only for completeness.
