
Attention regression in ToM compared to MLPerf branch #107

@MaheshRavishankar

Description

For reproduction:

Input Model:
https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/punet.mlir

Input data:
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.0.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.1.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.2.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.3.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.4.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.5.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/punet_weights.irpa

I built IREE on main and used the TD script in https://github.com/nod-ai/sdxl-scripts/blob/shared/sdxl_on_main/int8-model/specs/attention_and_matmul_spec.mlir
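The compile and run commands below reference a few shell variables that are not defined in this issue; a minimal setup sketch (paths are illustrative, point them at wherever you downloaded the artifacts):

# Illustrative paths; adjust to your local copies.
export PUNET_MODEL=punet.mlir
export TD_SPEC=attention_and_matmul_spec.mlir   # the TD script linked above
export VMFB=punet.vmfb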

Compilation command for IREE on main:

iree-compile \
    --iree-execution-model=async-external \
    --iree-hal-target-backends=rocm \
    --iree-hip-target=gfx942 \
    --iree-hip-waves-per-eu=2 \
    --iree-codegen-gpu-native-math-precision=true \
    --iree-codegen-llvmgpu-use-vector-distribution \
    --iree-codegen-transform-dialect-library=${TD_SPEC} \
    --iree-dispatch-creation-enable-aggressive-fusion=true \
    --iree-global-opt-propagate-transposes=true \
    --iree-llvmgpu-enable-prefetch=true \
    --iree-opt-aggressively-propagate-transposes=true \
    --iree-opt-const-eval=false \
    --iree-opt-outer-dim-concat=true \
    --iree-opt-data-tiling=false \
    --iree-preprocessing-pass-pipeline="builtin.module(util.func(iree-global-opt-raise-special-ops, iree-flow-canonicalize), iree-preprocessing-transpose-convolution-pipeline,  iree-preprocessing-pad-to-intrinsics, util.func(iree-preprocessing-generalize-linalg-matmul-experimental))" \
    --iree-vm-target-truncate-unsupported-floats \
    ${PUNET_MODEL} \
    -o ${VMFB}

Run command:

iree-benchmark-module \
    --device=hip:0 \
    --device_allocator=caching \
    --function=main \
    --hip_allow_inline_execution=true \
    --hip_use_stream=true \
    --input=1x4x128x128xf16=@inference_input.0.bin \
    --input=1xf16=@inference_input.1.bin \
    --input=2x64x2048xf16=@inference_input.2.bin \
    --input=2x1280xf16=@inference_input.3.bin \
    --input=2x6xf16=@inference_input.4.bin \
    --input=1xf16=@inference_input.5.bin \
    --module=${VMFB} \
    --parameters=model=punet_weights.irpa 

For compilation on the MLPerf branch I used the same inputs/weights, but with the following (note that several flags have since been renamed on main, e.g. --iree-rocm-target-chip is now --iree-hip-target):

IREE commit: https://github.com/iree-org/iree/tree/mlperf_v4.1_20240726
TD script: https://github.com/nod-ai/sdxl-scripts/blob/mlperf_v4.1_20240726/int8-model/specs/attention_and_matmul_spec.mlir

iree-compile \
    --iree-execution-model=async-external \
    --iree-hal-target-backends=rocm \
    --iree-rocm-target-chip=gfx942 \
    --iree-rocm-waves-per-eu=2 \
    --iree-codegen-gpu-native-math-precision=true \
    --iree-codegen-llvmgpu-use-vector-distribution \
    --iree-codegen-transform-dialect-library=${TD_SPEC} \
    --iree-flow-enable-aggressive-fusion=true \
    --iree-global-opt-propagate-transposes=true \
    --iree-llvmgpu-enable-prefetch=true \
    --iree-opt-aggressively-propagate-transposes=true \
    --iree-opt-const-eval=false \
    --iree-opt-outer-dim-concat=true \
    --iree-opt-data-tiling=false \
    --iree-preprocessing-pass-pipeline="builtin.module(util.func(iree-global-opt-raise-special-ops, iree-flow-canonicalize), iree-preprocessing-transpose-convolution-pipeline, util.func(iree-preprocessing-pad-to-intrinsics), util.func(iree-preprocessing-generalize-linalg-matmul-experimental))" \
    --iree-vm-target-truncate-unsupported-floats \
    ${PUNET_MODEL} \
    -o ${VMFB}

and the same run command as above.

The following dispatches regress:

attention_48_*: 41 ms -> 53 ms
attention_146_*: 48 ms -> 56 ms
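For reference, per-dispatch timings like these can be reproduced by having iree-compile emit a standalone benchmark module per dispatch. A minimal sketch (the dumped filename below is illustrative, and this assumes the same compile flags as above):

# Emit a standalone benchmark .mlir per dispatch alongside normal compilation.
iree-compile <same flags as above> \
    --iree-hal-dump-executable-benchmarks-to=dispatch_benchmarks \
    ${PUNET_MODEL} -o ${VMFB}

# Compile and run just the dispatch of interest (filename is illustrative).
iree-compile dispatch_benchmarks/main_dispatch_48_benchmark.mlir \
    --iree-hal-target-backends=rocm --iree-hip-target=gfx942 \
    -o attention_48_benchmark.vmfb
iree-benchmark-module --device=hip:0 --module=attention_48_benchmark.vmfb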

Below are IR dumps from the MLPerf branch and from ToM for the two attention dispatches.

sdxl_mlperf_attention_48.dump.mlir.txt
sdxl_mlperf_attention_146.dump.mlir.txt
sdxl_tom_attention_48.dump.mlir.txt
sdxl_tom_attention_146.dump.mlir.txt
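A sketch of one way to regenerate this kind of per-dispatch dump (not necessarily how the attached files were produced; filenames are illustrative, and --compile-mode=hal-executable recompiles a dumped dispatch source in isolation):

# Dump each dispatch as a standalone hal.executable source.
iree-compile <same flags as above> \
    --iree-hal-dump-executable-sources-to=dispatch_sources \
    ${PUNET_MODEL} -o /dev/null

# Recompile one attention dispatch with full IR printing to capture the dump.
iree-compile dispatch_sources/main_dispatch_48.mlir \
    --compile-mode=hal-executable \
    --iree-hal-target-backends=rocm --iree-hip-target=gfx942 \
    --mlir-print-ir-after-all --mlir-disable-threading \
    -o /dev/null 2> sdxl_tom_attention_48.dump.mlir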
