The last-channel handling logic in the SSE version of ShuffleChannel causes a buffer overflow

## context
In the SSE version of ShuffleChannel, the last channel is shuffled starting from an offset of half the granularity (for example, `ptr1 += 2` when `elempack == 4`).
This causes a buffer overflow when `ptr1` is loaded in the last iteration of the for loop.

For example, consider the case `(elempack == 4) && (_group == 2 && channels % _group != 0)` in the [AVX512 optimization](https://github.com/Tencent/ncnn/blob/master/src/layer/x86/shufflechannel_x86.cpp#L602):

Initially, `ptr1` can be accessed over the range `[ptr1, ptr1+4*size)`,
and after `ptr1 += 2;` the valid range shrinks to `[ptr1, ptr1+4*size-2)`.
However, the last iteration of the for loop loads `[ptr1+4*(size-1), ptr1+4*size)` into `_p1`, which extends 2 floats past the end of the valid range and leads to a buffer overflow.
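The range arithmetic above can be checked with a small helper. This is only a sketch: `floats_read_past_end`, `kPack`, and the parameter names are hypothetical and do not appear in ncnn. It models a channel of `size` packed elements of 4 floats, a read pointer that starts `offset` floats into the channel, and a loop that loads `load_width` floats per iteration while advancing by 4.

```cpp
#include <cassert>

// Hypothetical pack width, matching elempack == 4 in the case above.
static const int kPack = 4;

// Returns how many floats the final loop iteration reads past the end of a
// channel that holds size * kPack floats. The read pointer starts `offset`
// floats into the channel, each iteration loads `load_width` floats, and the
// pointer then advances by kPack floats.
static int floats_read_past_end(int size, int offset, int load_width)
{
    assert(size > 0 && offset >= 0 && load_width > 0);
    const int channel_floats = size * kPack;
    // One past the last float touched when i == size - 1.
    const int last_load_end = offset + (size - 1) * kPack + load_width;
    const int overrun = last_load_end - channel_floats;
    return overrun > 0 ? overrun : 0;
}
```

With `offset = 2` and a full 4-float load, the helper reports an overrun of 2 floats for any positive `size`; with `offset = 0`, or with a 2-float load at `offset = 2`, it reports 0.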

**Since this causes both a buffer-overflow read (`ptr1`) and a buffer-overflow write (`outptr`), it could lead to incorrect model results.**
```
{
    const float* ptr0 = bottom_blob.channel(channels_per_group);
    const float* ptr1 = bottom_blob.channel(channels_per_group * 2);
    float* outptr = top_blob.channel(channels_per_group * 2);

    // start 2 floats into the channel, leaving 4*size-2 valid floats
    ptr1 += 2;

    for (int i = 0; i < size; i++)
    {
        __m128 _p0 = _mm_loadu_ps(ptr0);
        // when i == size-1, this 4-float load reads 2 floats past the end
        __m128 _p1 = _mm_loadu_ps(ptr1);

        __m128 _lo = _mm_unpacklo_ps(_p0, _p1);

        _mm_storeu_ps(outptr, _lo);

        ptr0 += 4;
        ptr1 += 4;
        outptr += 4;
    }
}
```
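For comparison, here is a scalar sketch of what the loop above computes. It is not the planned patch. `_mm_unpacklo_ps(_p0, _p1)` produces `{p0[0], p1[0], p0[1], p1[1]}`, so only the low two floats of each 4-float load are used, and a version that reads just those floats stays inside the channel. The function name and the `std::vector` containers are hypothetical stand-ins for the ncnn channel pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar equivalent of the SSE loop, reading only the floats that
// _mm_unpacklo_ps actually uses. `ch0` and `ch1` stand in for the two input
// channels (4 * size floats each); `out` receives 4 * size floats.
static void shuffle_last_channel_scalar(const std::vector<float>& ch0,
                                        const std::vector<float>& ch1,
                                        std::vector<float>& out)
{
    assert(ch0.size() == ch1.size() && ch0.size() % 4 == 0);
    const std::size_t size = ch0.size() / 4;
    out.resize(4 * size);
    for (std::size_t i = 0; i < size; i++)
    {
        const std::size_t base = 4 * i;
        out[base + 0] = ch0[base + 0];
        out[base + 1] = ch1[base + 2]; // ptr1 starts 2 floats into the channel
        out[base + 2] = ch0[base + 1];
        out[base + 3] = ch1[base + 3]; // highest index read is 4 * size - 1
    }
}
```

The highest index read from `ch1` is `4 * size - 1`, which is the last valid float of the channel, so no access goes out of bounds.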

**x86**
https://github.com/Tencent/ncnn/blob/master/src/layer/x86/shufflechannel_x86.cpp#L117
https://github.com/Tencent/ncnn/blob/master/src/layer/x86/shufflechannel_x86.cpp#L373
https://github.com/Tencent/ncnn/blob/master/src/layer/x86/shufflechannel_x86.cpp#L608

**arm**
https://github.com/Tencent/ncnn/blob/master/src/layer/arm/shufflechannel_arm.cpp#L118
https://github.com/Tencent/ncnn/blob/master/src/layer/arm/shufflechannel_arm.cpp#L365
https://github.com/Tencent/ncnn/blob/master/src/layer/arm/shufflechannel_arm.cpp#L599

## how to reproduce
1. Build with SIMD enabled on x86 (SSE) or arm
2. Run `./test_shufflechannel`

## more 
I will open a PR with a patch for this :)
