Description
context
In the SSE version of ShuffleChannel, the last channel is shuffled starting from an offset of half the granularity.
This causes a buffer overflow when loading from `ptr1` in the last iteration of the for loop.
For example, in the case of `(elempack == 4) && (_group == 2 && channels % _group != 0)` in the AVX512 optimization, `ptr1` can initially be accessed over the range `[ptr1, ptr1+4*size)`, and the valid range shrinks to `[ptr1, ptr1+4*size-2)` after `ptr1 += 2;`.
However, the last iteration of the for loop loads `[ptr1+4*(size-1), ptr1+4*size)` into `_p1`, which reads 2 floats past the end of the valid range and leads to a buffer overflow.
Since it causes both a buffer-overflow read (`ptr1`) and a buffer-overflow write (`outptr`), it could lead to incorrect model results.
    {
        const float* ptr0 = bottom_blob.channel(channels_per_group);
        const float* ptr1 = bottom_blob.channel(channels_per_group * 2);
        float* outptr = top_blob.channel(channels_per_group * 2);
        ptr1 += 2;
        for (int i = 0; i < size; i++)
        {
            __m128 _p0 = _mm_loadu_ps(ptr0);
            __m128 _p1 = _mm_loadu_ps(ptr1);
            __m128 _lo = _mm_unpacklo_ps(_p0, _p1);
            _mm_storeu_ps(outptr, _lo);
            ptr0 += 4;
            ptr1 += 4;
            outptr += 4;
        }
    }
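To make the index arithmetic concrete, here is a minimal sketch that replays only the `ptr1` read positions of the loop above and counts how many floats fall outside the channel. It assumes the channel holds exactly `4 * size` floats, as in the description; `count_oob_reads` is a hypothetical helper written for illustration and is not part of ncnn.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of ncnn): simulates the read positions of
// a 4-wide load loop over a channel assumed to hold 4 * size floats, where
// the read pointer starts `offset` floats into the channel. Returns the
// number of floats read at or beyond the end of the channel.
std::size_t count_oob_reads(std::size_t size, std::size_t offset)
{
    const std::size_t channel_len = 4 * size; // floats available in the channel
    std::size_t pos = offset;                 // mirrors `ptr1 += offset`
    std::size_t oob = 0;

    for (std::size_t i = 0; i < size; i++)
    {
        // A 4-float unaligned load touches pos .. pos + 3.
        for (std::size_t lane = 0; lane < 4; lane++)
        {
            if (pos + lane >= channel_len)
                oob++;
        }
        pos += 4; // mirrors `ptr1 += 4`
    }
    return oob;
}
```

With `size = 3` and `offset = 2`, the loads cover indices 2..5, 6..9, and 10..13 of a 12-float channel, so `count_oob_reads(3, 2)` returns 2, while `count_oob_reads(3, 0)` returns 0. Note that `_mm_unpacklo_ps` only consumes the lower two lanes of each operand, so the two out-of-range lanes of `_p1` are loaded but never used by the shuffle itself.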
x86
https://github.com/Tencent/ncnn/blob/master/src/layer/x86/shufflechannel_x86.cpp#L117
https://github.com/Tencent/ncnn/blob/master/src/layer/x86/shufflechannel_x86.cpp#L373
https://github.com/Tencent/ncnn/blob/master/src/layer/x86/shufflechannel_x86.cpp#L608
arm
https://github.com/Tencent/ncnn/blob/master/src/layer/arm/shufflechannel_arm.cpp#L118
https://github.com/Tencent/ncnn/blob/master/src/layer/arm/shufflechannel_arm.cpp#L365
https://github.com/Tencent/ncnn/blob/master/src/layer/arm/shufflechannel_arm.cpp#L599
how to reproduce
- Build with SSE enabled on x86 or arm
- Run ./test_shufflechannel
more
I will open a PR with a patch for this :)