# ICE at -O0 / wrong code at -O1,-O2 with BFP16-emulated MMUL<8,8,8> and shuffle_up_fill #847

## Summary

When compiling code that uses `aie::mmul<8,8,8>` with `-DAIE_API_EMULATE_BFLOAT16_MMUL_WITH_BFP16`, the compiler:
- **Crashes at `-O0`** with an assertion failure in `AIE2PInstructionSelector::selectG_AIE_LOAD_STORE`
- **Generates wrong code at `-O1` and `-O2`**: element 0 of each 8-element group in the output is systematically corrupted

The bug is triggered when 9 BFP16-emulated MAC calls use `shuffle_up_fill`/`shuffle_down_fill` on activation vectors loaded from 3 different source pointers, with all 9 weight vectors pre-loaded.

## Compiler version

```
clang version 20.0.0 (https://github.com/Xilinx/llvm-aie b169186bd072791e958d65deb239f2ac384c30c2)
Target: aie2p-none-unknown-elf
```

## Minimal reproducer

```cpp
#define NOCPP
#include <stdint.h>
#include <aie_api/aie.hpp>

using MMUL = aie::mmul<8, 8, 8, bfloat16, bfloat16>;

extern "C" {
void repro(
    bfloat16 *src0, bfloat16 *src1, bfloat16 *src2,
    bfloat16 *wts, bfloat16 *dst,
    int32_t n
) {
    auto zvec = aie::zeros<bfloat16, 64>();

    // Pre-load all 9 weight tiles before the loop; loading them 3 at a time avoids the bug.
    auto w0 = aie::load_v<64>(wts);
    auto w1 = aie::load_v<64>(wts + 64);
    auto w2 = aie::load_v<64>(wts + 128);
    auto w3 = aie::load_v<64>(wts + 192);
    auto w4 = aie::load_v<64>(wts + 256);
    auto w5 = aie::load_v<64>(wts + 320);
    auto w6 = aie::load_v<64>(wts + 384);
    auto w7 = aie::load_v<64>(wts + 448);
    auto w8 = aie::load_v<64>(wts + 512);

    for (int i = 1; i < n - 1; i++) {
        int off = i * 64;
        MMUL acc;

        auto a = aie::load_v<64>(src0 + off - 64);
        auto b = aie::load_v<64>(src0 + off);
        auto c = aie::load_v<64>(src0 + off + 64);
        acc.mac(aie::shuffle_up_fill(b, a, 8), w0);
        acc.mac(b, w1);
        acc.mac(aie::shuffle_down_fill(b, c, 8), w2);

        a = aie::load_v<64>(src1 + off - 64);
        b = aie::load_v<64>(src1 + off);
        c = aie::load_v<64>(src1 + off + 64);
        acc.mac(aie::shuffle_up_fill(b, a, 8), w3);
        acc.mac(b, w4);
        acc.mac(aie::shuffle_down_fill(b, c, 8), w5);

        a = aie::load_v<64>(src2 + off - 64);
        b = aie::load_v<64>(src2 + off);
        c = aie::load_v<64>(src2 + off + 64);
        acc.mac(aie::shuffle_up_fill(b, a, 8), w6);
        acc.mac(b, w7);
        acc.mac(aie::shuffle_down_fill(b, c, 8), w8);

        aie::store_v(dst + off, acc.template to_vector<bfloat16>());
    }
}
} // extern "C"
```

```bash
clang++ -O0 -DAIE_API_EMULATE_BFLOAT16_MMUL_WITH_BFP16 -std=c++20 \
    --target=aie2p-none-unknown-elf \
    -I $(python3 -c "from aie.utils.config import root_path; print(root_path())")/include \
    -c repro.cc -o repro.o
```
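To judge the `-O1`/`-O2` output, it helps to have a host-side scalar model of what the reproducer is meant to compute. The following is only a sketch under stated assumptions, not part of the original report: `float` stands in for `bfloat16`, each 64-element vector is treated as a row-major 8x8 tile, the accumulator starts at zero, and `shuffle_up_fill`/`shuffle_down_fill` are modelled as shifting by `n` elements and filling the vacated slots from the top/bottom of the fill vector. The name `repro_reference` and the helpers are hypothetical.

```cpp
#include <algorithm>
#include <array>

constexpr int kDim = 8;             // rows/cols of one tile
constexpr int kTile = kDim * kDim;  // 64 elements, assumed row-major
using Tile = std::array<float, kTile>;

// Assumed model: elements of v move up by n; low n slots take the top n of fill.
Tile shuffle_up_fill(const Tile& v, const Tile& fill, int n) {
    Tile out{};
    for (int i = 0; i < kTile; ++i)
        out[i] = (i < n) ? fill[kTile - n + i] : v[i - n];
    return out;
}

// Assumed model: elements of v move down by n; high n slots take the low n of fill.
Tile shuffle_down_fill(const Tile& v, const Tile& fill, int n) {
    Tile out{};
    for (int i = 0; i < kTile; ++i)
        out[i] = (i < kTile - n) ? v[i + n] : fill[i - (kTile - n)];
    return out;
}

// acc += a * b for row-major 8x8 tiles.
void mac(Tile& acc, const Tile& a, const Tile& b) {
    for (int r = 0; r < kDim; ++r)
        for (int c = 0; c < kDim; ++c) {
            float sum = 0.0f;
            for (int k = 0; k < kDim; ++k)
                sum += a[r * kDim + k] * b[k * kDim + c];
            acc[r * kDim + c] += sum;
        }
}

Tile load(const float* p) {
    Tile t{};
    std::copy(p, p + kTile, t.begin());
    return t;
}

// Scalar model of the reproducer loop: 3 sources x 3 taps, 9 weight tiles.
void repro_reference(const float* src0, const float* src1, const float* src2,
                     const float* wts, float* dst, int n) {
    const float* srcs[3] = {src0, src1, src2};
    for (int i = 1; i < n - 1; ++i) {
        const int off = i * kTile;
        Tile acc{};  // assumption: accumulation starts from zero
        for (int s = 0; s < 3; ++s) {
            const Tile a = load(srcs[s] + off - kTile);
            const Tile b = load(srcs[s] + off);
            const Tile c = load(srcs[s] + off + kTile);
            mac(acc, shuffle_up_fill(b, a, kDim), load(wts + (3 * s + 0) * kTile));
            mac(acc, b, load(wts + (3 * s + 1) * kTile));
            mac(acc, shuffle_down_fill(b, c, kDim), load(wts + (3 * s + 2) * kTile));
        }
        std::copy(acc.begin(), acc.end(), dst + off);
    }
}
```

Under these assumptions, nine identity weight tiles and a uniform input of value `x` give `9 * x` at every written output position, which is a convenient baseline for comparing against hardware results.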

## Crash backtrace (-O0)

```
clang++: llvm/include/llvm/CodeGen/MachineOperand.h:370:
    llvm::Register llvm::MachineOperand::getReg() const:
    Assertion `isReg() && "This is not a register operand!"' failed.

 #9 (anonymous namespace)::AIE2PInstructionSelector::selectG_AIE_LOAD_STORE(...)
#10 llvm::InstructionSelect::selectMachineFunction(...)
#11 llvm::InstructionSelect::runOnMachineFunction(...)
```

## Wrong code behavior (-O1, -O2)

At `-O1` and `-O2`, the code compiles but produces incorrect results on hardware (Strix Point / XDNA2). Element 0 of each 8-element output group has a systematic positive offset. Elements 1-7 are correct.

With identity weight matrices and uniform input, the error on element 0 is exactly `+2.34375` at every output position.
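One way to confirm that only element 0 of each group is affected is to count mismatches by position within the 8-element group. The helper below is a hypothetical host-side sketch, not code from the original test harness; `mismatches_per_lane` and the `tol` parameter are names introduced here for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

// Counts elements whose absolute error exceeds tol, bucketed by lane,
// where lane = index % 8 (the position inside an 8-element output group).
std::array<int, 8> mismatches_per_lane(const std::vector<float>& expected,
                                       const std::vector<float>& actual,
                                       float tol) {
    std::array<int, 8> counts{};
    const std::size_t n = std::min(expected.size(), actual.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (std::fabs(actual[i] - expected[i]) > tol)
            ++counts[i % 8];
    }
    return counts;
}
```

For the failure described above, the expectation is a non-zero count in lane 0 and zeros in lanes 1 through 7. The tolerance should be chosen to absorb ordinary bfloat16 rounding while staying well below the reported offset.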

## Isolation

Systematic testing narrows the trigger to the combination of high register pressure, shuffles, and BFP16-emulated MACs:

| Test | MACs | Shuffles | Source ptrs | Pre-loaded weights | Result |
|------|------|----------|-------------|-------------------|--------|
| 1 MAC, no shuffle | 1 | No | 1 | 1 | **PASS** |
| 9 MACs, no shuffle, same src | 9 | No | 1 | 9 | **PASS** |
| 3 MACs, shuffles, 1 src | 3 | Yes | 1 | 3 | **PASS** |
| 6 MACs, shuffles, 2 srcs | 6 | Yes | 2 | 6 | **PASS** |
| **9 MACs, shuffles, 3 srcs** | **9** | **Yes** | **3** | **9** | **ICE at -O0 / wrong code at -O1,-O2** |

**Workaround**: Loading weights inside a per-source loop (3 at a time) instead of pre-loading all 9 produces correct results.

The same code pattern with `mmul<4,8,8>` + BFP16 emulation compiles and runs correctly at all optimization levels.

## Attached files

- `repro-preprocessed.cpp` — preprocessed source (generated by compiler crash handler)
- `repro-script.sh` — reproduction script (generated by compiler crash handler)


[conv3x3_fp16_9shuf_o0-005964.sh](https://github.com/user-attachments/files/25986725/conv3x3_fp16_9shuf_o0-005964.sh)

[conv3x3_fp16_9shuf_o0-005964.zip](https://github.com/user-attachments/files/25986781/conv3x3_fp16_9shuf_o0-005964.zip)
