NEConvolutionLayer - performance issue with i8 input / f32 output #1213

@alvoron

Description

NEConvolutionLayer takes longer to process the new i8 / i8 / f32 / f32 (src0 / src1 / src2 / dst) case than the f32 / f32 / f32 / f32 case.

Benchmark was run on Apple M2 Pro.
Benchmark results:
i8 / i8 / f32 / f32: 43 - 47 ms
f32 / f32 / f32 / f32: 37 - 41 ms
i8 / i8 / i32 / i8: 29 - 30 ms

Benchmark program:

#include "arm_compute/core/Error.h"
#include "arm_compute/core/TensorShape.h"
#include "arm_compute/core/utils/misc/MMappedFile.h"
#include "arm_compute/runtime/Tensor.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "tests/Utils.h"
#include "tests/NEON/Accessor.h"
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <vector>

using namespace arm_compute;

int main(int argc, char *argv[]) {
  DataLayout dl = DataLayout::NHWC;
  TensorInfo srcTensorInfo = TensorInfo(TensorShape(64, 56, 56), 1, DataType::F32, dl);
  TensorInfo weiTensorInfo = TensorInfo(TensorShape(64, 3, 3, 64), 1, DataType::F32, dl);
  TensorInfo biaTensorInfo = TensorInfo(TensorShape(64), 1, DataType::F32, dl);
  TensorInfo dstTensorInfo = TensorInfo(TensorShape(64, 56, 56), 1, DataType::F32, dl);

  // `dt` was undeclared; check the source tensor's data type instead
  if(is_data_type_quantized(srcTensorInfo.data_type())) {
    srcTensorInfo.set_quantization_info(QuantizationInfo(1.0));
  }

  PadStrideInfo strideInfo = PadStrideInfo(1, 1, 1, 1, DimensionRoundingType::FLOOR);

  auto status = NEConvolutionLayer::validate(&srcTensorInfo, &weiTensorInfo, &biaTensorInfo, &dstTensorInfo, strideInfo);
  if(status.error_code() != ErrorCode::OK) {
    std::cout << "ERROR: " << status.error_description().c_str() << std::endl;
    exit(1);
  }
  std::cout << "PASSED VALIDATION" << std::endl;

  Tensor srcTensor;
  Tensor weiTensor;
  Tensor biaTensor;
  Tensor dstTensor;
  srcTensor.allocator()->init(srcTensorInfo);
  weiTensor.allocator()->init(weiTensorInfo);
  biaTensor.allocator()->init(biaTensorInfo);
  dstTensor.allocator()->init(dstTensorInfo);

  NEConvolutionLayer conv;
  conv.configure(&srcTensor, &weiTensor, &biaTensor, &dstTensor, strideInfo);
  std::cout << "PASSED CONFIGURATION" << std::endl;

  srcTensor.allocator()->allocate();
  weiTensor.allocator()->allocate();
  biaTensor.allocator()->allocate();  // bias was initialized and freed but never allocated
  dstTensor.allocator()->allocate();

  // warm-up
  conv.run();

  std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < 100; i++) conv.run();
  std::chrono::high_resolution_clock::time_point finish = std::chrono::high_resolution_clock::now();
  uint64_t total_duration = std::chrono::duration_cast<std::chrono::microseconds>(finish - start).count();
  // total for all 100 timed runs, in microseconds
  std::cout << "time (us, 100 runs): " << total_duration << std::endl;

  srcTensor.allocator()->free();
  weiTensor.allocator()->free();
  biaTensor.allocator()->free();
  dstTensor.allocator()->free();

  return 0;
}
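
The timing loop above could be factored into a small helper so that every data type combination is measured the same way. This is only a sketch: `mean_run_ms` is a hypothetical name and not part of Arm Compute Library, and it uses `std::chrono::steady_clock` rather than `high_resolution_clock` on the assumption that a monotonic clock is preferable for interval timing. It returns the mean time of one call in milliseconds.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical helper (not an Arm Compute Library API).
// Calls `fn` `warmup` times untimed, then `iterations` times timed,
// and returns the mean wall-clock time of one timed call in milliseconds.
double mean_run_ms(const std::function<void()> &fn, std::size_t warmup, std::size_t iterations) {
  using clock = std::chrono::steady_clock;  // monotonic, suited to measuring intervals
  if (iterations == 0) {
    return 0.0;  // nothing to time; avoid dividing by zero
  }
  for (std::size_t i = 0; i < warmup; ++i) {
    fn();
  }
  const auto start = clock::now();
  for (std::size_t i = 0; i < iterations; ++i) {
    fn();
  }
  const auto finish = clock::now();
  const std::chrono::duration<double, std::milli> total = finish - start;
  return total.count() / static_cast<double>(iterations);
}
```

In the benchmark program, `mean_run_ms([&] { conv.run(); }, 1, 100)` would stand in for the warm-up call and the manual loop, and the result could be printed directly in ms.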
