ICE at -O0 / wrong code at -O1,-O2 with BFP16-emulated MMUL<8,8,8> and shuffle_up_fill #847

@RobinMelhuish

Summary

When compiling code that uses aie::mmul<8,8,8> with -DAIE_API_EMULATE_BFLOAT16_MMUL_WITH_BFP16, the compiler:

  • Crashes at -O0 with assertion failure in AIE2PInstructionSelector::selectG_AIE_LOAD_STORE
  • Generates wrong code at -O1 and -O2 — element 0 of each 8-element group in the output is systematically corrupted

The bug is triggered when 9 BFP16-emulated MAC calls, 6 of which take shuffle_up_fill/shuffle_down_fill results as operands, consume activation vectors loaded from 3 different source pointers, with all 9 weight vectors pre-loaded before the loop.

Compiler version

clang version 20.0.0 (https://github.com/Xilinx/llvm-aie b169186bd072791e958d65deb239f2ac384c30c2)
Target: aie2p-none-unknown-elf

Minimal reproducer

#define NOCPP
#include <stdint.h>
#include <aie_api/aie.hpp>

using MMUL = aie::mmul<8, 8, 8, bfloat16, bfloat16>;

extern "C" {
void repro(
    bfloat16 *src0, bfloat16 *src1, bfloat16 *src2,
    bfloat16 *wts, bfloat16 *dst,
    int32_t n
) {
    auto zvec = aie::zeros<bfloat16, 64>();

    auto w0 = aie::load_v<64>(wts);
    auto w1 = aie::load_v<64>(wts + 64);
    auto w2 = aie::load_v<64>(wts + 128);
    auto w3 = aie::load_v<64>(wts + 192);
    auto w4 = aie::load_v<64>(wts + 256);
    auto w5 = aie::load_v<64>(wts + 320);
    auto w6 = aie::load_v<64>(wts + 384);
    auto w7 = aie::load_v<64>(wts + 448);
    auto w8 = aie::load_v<64>(wts + 512);

    for (int i = 1; i < n - 1; i++) {
        int off = i * 64;
        MMUL acc;

        auto a = aie::load_v<64>(src0 + off - 64);
        auto b = aie::load_v<64>(src0 + off);
        auto c = aie::load_v<64>(src0 + off + 64);
        acc.mac(aie::shuffle_up_fill(b, a, 8), w0);
        acc.mac(b, w1);
        acc.mac(aie::shuffle_down_fill(b, c, 8), w2);

        a = aie::load_v<64>(src1 + off - 64);
        b = aie::load_v<64>(src1 + off);
        c = aie::load_v<64>(src1 + off + 64);
        acc.mac(aie::shuffle_up_fill(b, a, 8), w3);
        acc.mac(b, w4);
        acc.mac(aie::shuffle_down_fill(b, c, 8), w5);

        a = aie::load_v<64>(src2 + off - 64);
        b = aie::load_v<64>(src2 + off);
        c = aie::load_v<64>(src2 + off + 64);
        acc.mac(aie::shuffle_up_fill(b, a, 8), w6);
        acc.mac(b, w7);
        acc.mac(aie::shuffle_down_fill(b, c, 8), w8);

        aie::store_v(dst + off, acc.template to_vector<bfloat16>());
    }
}
} // extern "C"

Build command

clang++ -O0 -DAIE_API_EMULATE_BFLOAT16_MMUL_WITH_BFP16 -std=c++20 \
    --target=aie2p-none-unknown-elf \
    -I $(python3 -c "from aie.utils.config import root_path; print(root_path())")/include \
    -c repro.cc -o repro.o

Crash backtrace (-O0)

clang++: llvm/include/llvm/CodeGen/MachineOperand.h:370:
    llvm::Register llvm::MachineOperand::getReg() const:
    Assertion `isReg() && "This is not a register operand!"' failed.

 #9 (anonymous namespace)::AIE2PInstructionSelector::selectG_AIE_LOAD_STORE(...)
#10 llvm::InstructionSelect::selectMachineFunction(...)
#11 llvm::InstructionSelect::runOnMachineFunction(...)

Wrong code behavior (-O1, -O2)

At -O1 and -O2, the code compiles but produces incorrect results on hardware (Strix Point / XDNA2). Element 0 of each 8-element output group has a systematic positive offset. Elements 1-7 are correct.

With identity weight matrices and uniform input, the error on element 0 is exactly +2.34375 at every output position.
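
For anyone checking output read back from hardware, a small host-side helper along these lines can make the lane pattern described above visible. This is a sketch, not part of the original report: the function name max_abs_error_per_lane is hypothetical, and float stands in for bfloat16 values already converted on the host.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side helper (an assumption, not from the original report).
// For each lane 0..group-1, returns the largest |got - expected| observed
// across all groups of `group` consecutive elements.
std::vector<float> max_abs_error_per_lane(const std::vector<float>& got,
                                          const std::vector<float>& expected,
                                          std::size_t group = 8) {
    std::vector<float> worst(group, 0.0f);
    // Compare only the range both buffers cover.
    const std::size_t n =
        got.size() < expected.size() ? got.size() : expected.size();
    for (std::size_t i = 0; i < n; ++i) {
        const float err = std::fabs(got[i] - expected[i]);
        const std::size_t lane = i % group;
        if (err > worst[lane]) worst[lane] = err;
    }
    return worst;
}
```

With the failure described above, lane 0 of the result would be expected to show the 2.34375 offset while lanes 1-7 stay at or near zero.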

Isolation

Systematic testing narrows the trigger to the combination of high register pressure, shuffles, and BFP16-emulated MACs:

| Test | MACs | Shuffles | Source ptrs | Pre-loaded weights | Result |
|---|---|---|---|---|---|
| 1 MAC, no shuffle | 1 | No | 1 | 1 | PASS |
| 9 MACs, no shuffle, same src | 9 | No | 1 | 9 | PASS |
| 3 MACs, shuffles, 1 src | 3 | Yes | 1 | 3 | PASS |
| 6 MACs, shuffles, 2 srcs | 6 | Yes | 2 | 6 | PASS |
| 9 MACs, shuffles, 3 srcs | 9 | Yes | 3 | 9 | ICE at -O0 / wrong code at -O1,-O2 |

Workaround: Loading weights inside a per-source loop (3 at a time) instead of pre-loading all 9 produces correct results.
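
The two loop shapes can be compared on the host with a scalar model such as the one below. It is only a sketch under several assumptions that the report does not confirm: float stands in for bfloat16, each 64-element vector is treated as an 8x8 row-major tile, the accumulator starts at zero, and shift_up_fill / shift_down_fill encode my reading of what aie::shuffle_up_fill and aie::shuffle_down_fill do. All function names here are hypothetical, and conv_per_source is one possible reading of the workaround, not the exact code used.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Hypothetical host-side scalar model; float stands in for bfloat16.
using Tile = std::array<float, 64>;  // one 64-element vector as an 8x8 row-major tile

Tile load_tile(const float* p) {
    Tile t{};
    std::copy(p, p + 64, t.begin());
    return t;
}

// ASSUMED semantics of aie::shuffle_up_fill(v, fill, n): elements move up by
// n positions; the low n slots take the top n elements of fill.
Tile shift_up_fill(const Tile& v, const Tile& fill, std::size_t n) {
    Tile out{};
    for (std::size_t i = 0; i < 64; ++i)
        out[i] = (i < n) ? fill[64 - n + i] : v[i - n];
    return out;
}

// ASSUMED semantics of aie::shuffle_down_fill(v, fill, n): elements move down
// by n positions; the top n slots take the low n elements of fill.
Tile shift_down_fill(const Tile& v, const Tile& fill, std::size_t n) {
    Tile out{};
    for (std::size_t i = 0; i < 64; ++i)
        out[i] = (i + n < 64) ? v[i + n] : fill[i + n - 64];
    return out;
}

// acc += a * w, all three as 8x8 row-major tiles (assumed mmul<8,8,8> layout).
void mac8x8(Tile& acc, const Tile& a, const Tile& w) {
    for (int r = 0; r < 8; ++r)
        for (int c = 0; c < 8; ++c)
            for (int k = 0; k < 8; ++k)
                acc[r * 8 + c] += a[r * 8 + k] * w[k * 8 + c];
}

// Three MACs for one source pointer: up-shifted, centre, down-shifted.
void accumulate_source(Tile& acc, const float* src, int off,
                       const Tile& w0, const Tile& w1, const Tile& w2) {
    const Tile a = load_tile(src + off - 64);
    const Tile b = load_tile(src + off);
    const Tile c = load_tile(src + off + 64);
    mac8x8(acc, shift_up_fill(b, a, 8), w0);
    mac8x8(acc, b, w1);
    mac8x8(acc, shift_down_fill(b, c, 8), w2);
}

// Shape of the reproducer: all nine weight tiles loaded before the loop.
void conv_flat(const float* const src[3], const float* wts, float* dst, int n) {
    Tile w[9];
    for (int t = 0; t < 9; ++t) w[t] = load_tile(wts + 64 * t);
    for (int i = 1; i < n - 1; ++i) {
        const int off = i * 64;
        Tile acc{};  // zero-initialised: an assumption of this model
        for (int s = 0; s < 3; ++s)
            accumulate_source(acc, src[s], off, w[3 * s], w[3 * s + 1], w[3 * s + 2]);
        std::copy(acc.begin(), acc.end(), dst + off);
    }
}

// Shape of the workaround: three weight tiles loaded per source, inside the loop.
void conv_per_source(const float* const src[3], const float* wts, float* dst, int n) {
    for (int i = 1; i < n - 1; ++i) {
        const int off = i * 64;
        Tile acc{};  // zero-initialised: an assumption of this model
        for (int s = 0; s < 3; ++s) {
            const float* w = wts + 64 * 3 * s;
            accumulate_source(acc, src[s], off,
                              load_tile(w), load_tile(w + 64), load_tile(w + 128));
        }
        std::copy(acc.begin(), acc.end(), dst + off);
    }
}
```

Both functions perform the same MACs in the same order, so in this model they produce identical output; only the point at which the weights are loaded differs, which is the property the workaround changes on the device.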

The same code pattern with mmul<4,8,8> + BFP16 emulation compiles and runs correctly at all optimization levels.

Attached files

  • repro-preprocessed.cpp — preprocessed source (generated by compiler crash handler)
  • repro-script.sh — reproduction script (generated by compiler crash handler)

conv3x3_fp16_9shuf_o0-005964.sh

conv3x3_fp16_9shuf_o0-005964.zip
