tensor descriptors do not match for grouped convolutions (legacy) #188

@jvwilliams23

Hi,
Are grouped convolutions currently supported by distconv (legacy DiHydrogen)?
When I use a distconv-enabled grouped convolution in LBANN, I get the following error:

[0] [ERROR] setup algos forward: descriptors do not match:
 channel: Tensor descriptor: #dims=4, type=float, dims=1x256x128x128, strides=4194304x16384x128x1
 filter: Tensor descriptor: #dims=4, type=float, dims=1x256x128x128, strides=4194304x16384x128x1
 weights: Filter descriptor: format=NCHW, #dims=4, type=float, dims=256x1x3x3
****************************************************************
Caught signal 6 (SIGABRT - process abort signal) on rank 0
Stack trace:
   0: lbann::stack_trace::get[abi:cxx11]()
   1: lbann::exception::exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
   2: /home/jwilliams/lbann-builds/lbann-latest/build-full/install/lib64/liblbann.so.0.105.0(+0x28b7822) [0x7f9a39eb7822] (could not find stack frame symbol)
   3: /lib64/libc.so.6(+0x3e6f0) [0x7f99c503e6f0] (could not find stack frame symbol)
   4: /lib64/libc.so.6(+0x8b94c) [0x7f99c508b94c] (could not find stack frame symbol)
   5: raise (demangling failed)
   6: abort (demangling failed)
   7: distconv::Convolution<distconv::DNNBackend<distconv::GPUDNNBackend>, float>::ensure_tensor_descriptors_conform(cudnnTensorStruct* const&, cudnnTensorStruct* const&, cudnnFilterStruct* const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
   8: distconv::Convolution<distconv::DNNBackend<distconv::GPUDNNBackend>, float>::setup_algorithms_fwd(void const*, void const*, void*, unsigned long)
   9: /home/jwilliams/lbann-builds/lbann-latest/build-full/install/lib64/liblbann.so.0.105.0(+0x18c0390) [0x7f9a38ec0390] (could not find stack frame symbol)
  10: lbann::base_convolution_adapter<float, (hydrogen::Device)1>::fp_compute_convolution()
  11: lbann::convolution_layer<float, (lbann::data_layout)1, (hydrogen::Device)1>::fp_compute()
  12: lbann::data_type_layer<float, float>::forward_prop()
  13: lbann::model::forward_prop(lbann::execution_mode, bool)
  14: lbann::SGDTrainingAlgorithm::train_mini_batch(lbann::SGDExecutionContext&, lbann::model&, lbann::data_coordinator&, lbann::ScopeTimer)
  15: lbann::SGDTrainingAlgorithm::train(lbann::SGDExecutionContext&, lbann::model&, lbann::data_coordinator&, lbann::SGDTerminationCriteria const&)
  16: lbann::SGDTrainingAlgorithm::apply(lbann::ExecutionContext&, lbann::model&, lbann::data_coordinator&, lbann::execution_mode)
  17: lbann::trainer::train(lbann::model*, long long, long long)
  18: /home/jwilliams/lbann-builds/lbann-latest/build-full/install/bin/lbann() [0x421b7f] (could not find stack frame symbol)
  19: /lib64/libc.so.6(+0x29590) [0x7f99c5029590] (could not find stack frame symbol)
  20: __libc_start_main (demangling failed)
  21: /home/jwilliams/lbann-builds/lbann-latest/build-full/install/bin/lbann() [0x421e55] (could not find stack frame symbol)
****************************************************************

The same model runs fine without distconv.
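
For illustration, here is a minimal sketch of the channel check that the descriptors in the error above would need to pass. It is not distconv's actual code: `descriptors_conform` and its `groups` parameter are hypothetical names, and it assumes the first tensor descriptor is the input (NCHW), the second is the output (NCHW), and the filter descriptor is laid out as K x C x R x S, where C is the number of input channels per group (the cuDNN convention for grouped convolutions). Whether distconv's check takes the group count into account is an assumption here, not something confirmed from the source.

```python
def descriptors_conform(input_dims, output_dims, filter_dims, groups=1):
    """Hypothetical shape check for a (possibly grouped) 2D convolution.

    input_dims, output_dims: (N, C, H, W)
    filter_dims: (K, C_per_group, R, S)
    """
    in_channels = input_dims[1]
    out_channels = output_dims[1]
    num_filters, channels_per_group = filter_dims[0], filter_dims[1]

    # Both channel counts must split evenly across the groups.
    if in_channels % groups != 0 or out_channels % groups != 0:
        return False

    # Each filter sees only in_channels / groups input channels.
    return (num_filters == out_channels
            and channels_per_group * groups == in_channels)


# Dimensions taken from the error message above.
x_dims = (1, 256, 128, 128)
y_dims = (1, 256, 128, 128)
w_dims = (256, 1, 3, 3)

# Ignoring groups, the filter's channel dimension (1) does not match 256.
print(descriptors_conform(x_dims, y_dims, w_dims, groups=1))

# With 256 groups (a depthwise convolution), the same shapes are consistent.
print(descriptors_conform(x_dims, y_dims, w_dims, groups=256))
```

Under these assumptions, the shapes in the log are only inconsistent if the check treats the convolution as ungrouped, which would match the observation that the non-distconv path accepts them.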

Regards,
Josh
