Hi,
Are grouped convolutions currently compatible with distconv/legacy DiHydrogen?
When I use a distconv-enabled grouped convolution in LBANN, it fails with the following error:
[0] [ERROR] setup algos forward: descriptors do not match:
channel: Tensor descriptor: #dims=4, type=float, dims=1x256x128x128, strides=4194304x16384x128x1
filter: Tensor descriptor: #dims=4, type=float, dims=1x256x128x128, strides=4194304x16384x128x1
weights: Filter descriptor: format=NCHW, #dims=4, type=float, dims=256x1x3x3
****************************************************************
Caught signal 6 (SIGABRT - process abort signal) on rank 0
Stack trace:
0: lbann::stack_trace::get[abi:cxx11]()
1: lbann::exception::exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
2: /home/jwilliams/lbann-builds/lbann-latest/build-full/install/lib64/liblbann.so.0.105.0(+0x28b7822) [0x7f9a39eb7822] (could not find stack frame symbol)
3: /lib64/libc.so.6(+0x3e6f0) [0x7f99c503e6f0] (could not find stack frame symbol)
4: /lib64/libc.so.6(+0x8b94c) [0x7f99c508b94c] (could not find stack frame symbol)
5: raise (demangling failed)
6: abort (demangling failed)
7: distconv::Convolution<distconv::DNNBackend<distconv::GPUDNNBackend>, float>::ensure_tensor_descriptors_conform(cudnnTensorStruct* const&, cudnnTensorStruct* const&, cudnnFilterStruct* const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
8: distconv::Convolution<distconv::DNNBackend<distconv::GPUDNNBackend>, float>::setup_algorithms_fwd(void const*, void const*, void*, unsigned long)
9: /home/jwilliams/lbann-builds/lbann-latest/build-full/install/lib64/liblbann.so.0.105.0(+0x18c0390) [0x7f9a38ec0390] (could not find stack frame symbol)
10: lbann::base_convolution_adapter<float, (hydrogen::Device)1>::fp_compute_convolution()
11: lbann::convolution_layer<float, (lbann::data_layout)1, (hydrogen::Device)1>::fp_compute()
12: lbann::data_type_layer<float, float>::forward_prop()
13: lbann::model::forward_prop(lbann::execution_mode, bool)
14: lbann::SGDTrainingAlgorithm::train_mini_batch(lbann::SGDExecutionContext&, lbann::model&, lbann::data_coordinator&, lbann::ScopeTimer)
15: lbann::SGDTrainingAlgorithm::train(lbann::SGDExecutionContext&, lbann::model&, lbann::data_coordinator&, lbann::SGDTerminationCriteria const&)
16: lbann::SGDTrainingAlgorithm::apply(lbann::ExecutionContext&, lbann::model&, lbann::data_coordinator&, lbann::execution_mode)
17: lbann::trainer::train(lbann::model*, long long, long long)
18: /home/jwilliams/lbann-builds/lbann-latest/build-full/install/bin/lbann() [0x421b7f] (could not find stack frame symbol)
19: /lib64/libc.so.6(+0x29590) [0x7f99c5029590] (could not find stack frame symbol)
20: __libc_start_main (demangling failed)
21: /home/jwilliams/lbann-builds/lbann-latest/build-full/install/bin/lbann() [0x421e55] (could not find stack frame symbol)
****************************************************************
The same grouped convolution works fine without distconv.
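One possible reading of the error, sketched in Python rather than taken from the distconv source: in cuDNN's grouped-convolution convention the filter descriptor holds the input channels per group, so a 256-channel convolution with 256 groups has filter dims 256x1x3x3. A conformance check that compares the filter's channel dimension directly against the input's 256 channels would reject that combination. The function name `descriptors_conform` and its `groups` parameter are hypothetical, and I am assuming that the "channel" and "filter" lines in the log are the input and output activation tensors.

```python
def descriptors_conform(input_dims, output_dims, filter_dims, groups=1):
    """Hypothetical check; not the actual distconv implementation.

    input_dims / output_dims are NCHW activation shapes.
    filter_dims is (K, C_per_group, R, S), the grouped-convolution
    filter layout used by cuDNN.
    """
    _, c_in, _, _ = input_dims
    _, c_out, _, _ = output_dims
    k, c_per_group, _, _ = filter_dims
    return (
        c_in % groups == 0                  # input channels split evenly
        and k % groups == 0                 # output channels split evenly
        and k == c_out                      # filter count matches output
        and c_per_group * groups == c_in    # per-group channels match input
    )


# Shapes copied from the error message above.
x = (1, 256, 128, 128)
y = (1, 256, 128, 128)
w = (256, 1, 3, 3)

print(descriptors_conform(x, y, w))              # groups ignored
print(descriptors_conform(x, y, w, groups=256))  # groups accounted for
```

With the group count ignored (the default of 1) the shapes are rejected, and with `groups=256` they are accepted, which would be consistent with the check not taking the group count into account.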
Regards,
Josh