ICE at -O0 / wrong code at -O1,-O2 with BFP16-emulated MMUL<8,8,8> and shuffle_up_fill
Summary
When compiling code that uses aie::mmul<8,8,8> with -DAIE_API_EMULATE_BFLOAT16_MMUL_WITH_BFP16, the compiler:
- Crashes at -O0 with an assertion failure in AIE2PInstructionSelector::selectG_AIE_LOAD_STORE
- Generates wrong code at -O1 and -O2: element 0 of each 8-element group in the output is systematically corrupted
The bug is triggered when 9 BFP16-emulated MAC calls use shuffle_up_fill/shuffle_down_fill on activation vectors loaded from 3 different source pointers, with all 9 weight vectors pre-loaded.
Compiler version
clang version 20.0.0 (https://github.com/Xilinx/llvm-aie b169186bd072791e958d65deb239f2ac384c30c2)
Target: aie2p-none-unknown-elf
Minimal reproducer
#define NOCPP
#include <stdint.h>
#include <aie_api/aie.hpp>

using MMUL = aie::mmul<8, 8, 8, bfloat16, bfloat16>;

extern "C" {

void repro(
    bfloat16 *src0, bfloat16 *src1, bfloat16 *src2,
    bfloat16 *wts, bfloat16 *dst,
    int32_t n
) {
    auto zvec = aie::zeros<bfloat16, 64>();

    auto w0 = aie::load_v<64>(wts);
    auto w1 = aie::load_v<64>(wts + 64);
    auto w2 = aie::load_v<64>(wts + 128);
    auto w3 = aie::load_v<64>(wts + 192);
    auto w4 = aie::load_v<64>(wts + 256);
    auto w5 = aie::load_v<64>(wts + 320);
    auto w6 = aie::load_v<64>(wts + 384);
    auto w7 = aie::load_v<64>(wts + 448);
    auto w8 = aie::load_v<64>(wts + 512);

    for (int i = 1; i < n - 1; i++) {
        int off = i * 64;
        MMUL acc;

        auto a = aie::load_v<64>(src0 + off - 64);
        auto b = aie::load_v<64>(src0 + off);
        auto c = aie::load_v<64>(src0 + off + 64);
        acc.mac(aie::shuffle_up_fill(b, a, 8), w0);
        acc.mac(b, w1);
        acc.mac(aie::shuffle_down_fill(b, c, 8), w2);

        a = aie::load_v<64>(src1 + off - 64);
        b = aie::load_v<64>(src1 + off);
        c = aie::load_v<64>(src1 + off + 64);
        acc.mac(aie::shuffle_up_fill(b, a, 8), w3);
        acc.mac(b, w4);
        acc.mac(aie::shuffle_down_fill(b, c, 8), w5);

        a = aie::load_v<64>(src2 + off - 64);
        b = aie::load_v<64>(src2 + off);
        c = aie::load_v<64>(src2 + off + 64);
        acc.mac(aie::shuffle_up_fill(b, a, 8), w6);
        acc.mac(b, w7);
        acc.mac(aie::shuffle_down_fill(b, c, 8), w8);

        aie::store_v(dst + off, acc.template to_vector<bfloat16>());
    }
}

} // extern "C"
clang++ -O0 -DAIE_API_EMULATE_BFLOAT16_MMUL_WITH_BFP16 -std=c++20 \
--target=aie2p-none-unknown-elf \
-I $(python3 -c "from aie.utils.config import root_path; print(root_path())")/include \
-c repro.cc -o repro.o
Crash backtrace (-O0)
clang++: llvm/include/llvm/CodeGen/MachineOperand.h:370:
llvm::Register llvm::MachineOperand::getReg() const:
Assertion `isReg() && "This is not a register operand!"' failed.
#9 (anonymous namespace)::AIE2PInstructionSelector::selectG_AIE_LOAD_STORE(...)
#10 llvm::InstructionSelect::selectMachineFunction(...)
#11 llvm::InstructionSelect::runOnMachineFunction(...)
Wrong code behavior (-O1, -O2)
At -O1 and -O2, the code compiles but produces incorrect results on hardware (Strix Point / XDNA2). Element 0 of each 8-element output group has a systematic positive offset. Elements 1-7 are correct.
With identity weight matrices and uniform input, the error on element 0 is exactly +2.34375 at every output position.
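One way to confirm this per-element pattern on the host is to bucket the output error by position within each 8-element group. The helper below is a minimal sketch, not part of the original test harness: the name max_abs_error_per_lane is hypothetical, and it assumes the device output and a reference result have already been converted to float on the host.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side helper (an assumption, not from the original report).
// Returns the largest absolute error observed at each position within an
// 8-element output group, where the position is index % 8.
std::array<float, 8> max_abs_error_per_lane(const std::vector<float>& expected,
                                            const std::vector<float>& actual) {
    std::array<float, 8> worst{};  // value-initialized to 0.0f
    const std::size_t n = std::min(expected.size(), actual.size());
    for (std::size_t i = 0; i < n; ++i) {
        const float err = std::fabs(actual[i] - expected[i]);
        worst[i % 8] = std::max(worst[i % 8], err);
    }
    return worst;
}
```

If the miscompile behaves as described above, position 0 of the returned array would be nonzero while positions 1-7 stay at or near zero (up to bfloat16 rounding in a real run).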
Isolation
Systematic testing narrows the trigger to the combination of high register pressure + shuffles + BFP16 emulated MACs:
| Test | MACs | Shuffles | Source ptrs | Pre-loaded weights | Result |
|---|---|---|---|---|---|
| 1 MAC, no shuffle | 1 | No | 1 | 1 | PASS |
| 9 MACs, no shuffle, same src | 9 | No | 1 | 9 | PASS |
| 3 MACs, shuffles, 1 src | 3 | Yes | 1 | 3 | PASS |
| 6 MACs, shuffles, 2 srcs | 6 | Yes | 2 | 6 | PASS |
| 9 MACs, shuffles, 3 srcs | 9 | Yes | 3 | 9 | ICE at -O0 / wrong code at -O1,-O2 |
Workaround: Loading weights inside a per-source loop (3 at a time) instead of pre-loading all 9 produces correct results.
The same code pattern with mmul<4,8,8> + BFP16 emulation compiles and runs correctly at all optimization levels.
Attached files
repro-preprocessed.cpp — preprocessed source (generated by compiler crash handler)
repro-script.sh — reproduction script (generated by compiler crash handler)
conv3x3_fp16_9shuf_o0-005964.sh
conv3x3_fp16_9shuf_o0-005964.zip