Given this C code (https://godbolt.org/z/YMo1qqccT):
#include <stdbool.h>
#include <wasm_simd128.h>

bool foo(v128_t a) { return wasm_i8x16_all_true(a); }

bool bar(v128_t a) {
    v128_t zero = wasm_i8x16_splat(0);
    return __builtin_reduce_and(wasm_i8x16_ne(a, zero));
}

bool baz(v128_t a) {
    v128_t zero = wasm_i8x16_splat(0);
    return __builtin_reduce_and((a != zero));
}
I'd expect all of these to optimize to

foo:
    local.get 0
    i8x16.all_true
    end_function

or some variation of it. However, the other two variants produce much worse code.
bar:
    local.get 0
    v128.const 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
    i8x16.ne
    local.tee 0
    local.get 0
    local.get 0
    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
    v128.and
    local.tee 0
    local.get 0
    local.get 0
    i8x16.shuffle 4, 5, 6, 7, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3
    v128.and
    i32x4.extract_lane 0
    i32.const 0
    i32.ne
    end_function

baz:
    local.get 0
    v128.const 0, 0, 0, 0
    i32x4.eq
    v128.any_true
    i32.const -1
    i32.xor
    i32.const 1
    i32.and
    end_function
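One detail that may matter when comparing the three outputs: the i32x4 instructions above suggest that v128_t is treated as a vector of four 32-bit lanes, so __builtin_reduce_and in bar and baz appears to reduce over 32-bit lanes rather than bytes. Below is a rough scalar model of what each function would compute under that assumption. The vec128 type and the model_* names are hypothetical and exist only for this sketch; they are not part of wasm_simd128.h.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar stand-in for a 128-bit vector: 16 raw bytes. */
typedef struct { uint8_t b[16]; } vec128;

/* Models foo (i8x16.all_true): true iff every byte lane is nonzero. */
static bool model_foo(vec128 a) {
    for (int i = 0; i < 16; i++) {
        if (a.b[i] == 0) return false;
    }
    return true;
}

/* Models bar, ASSUMING the reduction runs over four 32-bit lanes:
   build the per-byte "not equal to zero" mask, AND the four 32-bit
   lanes of that mask together, then convert the result to bool. */
static bool model_bar(vec128 a) {
    uint8_t mask[16];
    for (int i = 0; i < 16; i++) {
        mask[i] = (a.b[i] != 0) ? 0xFF : 0x00;
    }
    uint32_t lanes[4];
    memcpy(lanes, mask, sizeof lanes);
    uint32_t acc = lanes[0] & lanes[1] & lanes[2] & lanes[3];
    return acc != 0;
}

/* Models baz, ASSUMING `a != zero` compares 32-bit lanes:
   each lane yields all-ones if nonzero, and the lane masks are ANDed. */
static bool model_baz(vec128 a) {
    uint32_t lanes[4];
    memcpy(lanes, a.b, sizeof lanes);
    uint32_t acc = 0xFFFFFFFFu;
    for (int i = 0; i < 4; i++) {
        acc &= (lanes[i] != 0) ? 0xFFFFFFFFu : 0u;
    }
    return acc != 0;
}
```

If this model is right, the three functions agree on all-zero and all-nonzero inputs but can differ on mixed ones, so the ideal lowering for baz would presumably be i32x4.all_true, and a byte-typed vector would be needed for bar to match foo exactly. Either way, the reduction should still be able to fold into a single all_true-style instruction.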
Binary size is especially important for wasm, and it looks like __builtin_reduce_and just does not optimize well (I suspect the same is true for __builtin_reduce_or).
s390x has the same limitation (#129434), so maybe some work can be shared between backends?
This came up while working on the Rust standard library, which would rather use the generic implementation of these operations than a target-specific one.