
Attention regression in ToM compared to MLPerf branch #107

@MaheshRavishankar

Description

For reproduction:

Input Model:
https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/punet.mlir

Input data:
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.0.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.1.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.2.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.3.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.4.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.5.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/punet_weights.irpa

I built IREE on main and used the TD script in https://github.com/nod-ai/sdxl-scripts/blob/shared/sdxl_on_main/int8-model/specs/attention_and_matmul_spec.mlir
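The compile and run commands below reference a few shell variables that are not defined in this issue; a minimal setup sketch (paths are illustrative, point them at wherever you downloaded the artifacts):

# Illustrative paths; adjust to your local copies.
export PUNET_MODEL=punet.mlir
export TD_SPEC=attention_and_matmul_spec.mlir   # the TD script linked above
export VMFB=punet.vmfb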

Compilation command for IREE on main:

iree-compile \
    --iree-execution-model=async-external \
    --iree-hal-target-backends=rocm \
    --iree-hip-target=gfx942 \
    --iree-hip-waves-per-eu=2 \
    --iree-codegen-gpu-native-math-precision=true \
    --iree-codegen-llvmgpu-use-vector-distribution \
    --iree-codegen-transform-dialect-library=${TD_SPEC} \
    --iree-dispatch-creation-enable-aggressive-fusion=true \
    --iree-global-opt-propagate-transposes=true \
    --iree-llvmgpu-enable-prefetch=true \
    --iree-opt-aggressively-propagate-transposes=true \
    --iree-opt-const-eval=false \
    --iree-opt-outer-dim-concat=true \
    --iree-opt-data-tiling=false \
    --iree-preprocessing-pass-pipeline="builtin.module(util.func(iree-global-opt-raise-special-ops, iree-flow-canonicalize), iree-preprocessing-transpose-convolution-pipeline,  iree-preprocessing-pad-to-intrinsics, util.func(iree-preprocessing-generalize-linalg-matmul-experimental))" \
    --iree-vm-target-truncate-unsupported-floats \
    ${PUNET_MODEL} \
    -o ${VMFB}

Run command:

iree-benchmark-module \
    --device=hip:0 \
    --device_allocator=caching \
    --function=main \
    --hip_allow_inline_execution=true \
    --hip_use_stream=true \
    --input=1x4x128x128xf16=@inference_input.0.bin \
    --input=1xf16=@inference_input.1.bin \
    --input=2x64x2048xf16=@inference_input.2.bin \
    --input=2x1280xf16=@inference_input.3.bin \
    --input=2x6xf16=@inference_input.4.bin \
    --input=1xf16=@inference_input.5.bin \
    --module=${VMFB} \
    --parameters=model=punet_weights.irpa 

For compilation on the MLPerf branch I used the same inputs/weights, but with the following (note that several flags have since been renamed on main, e.g. --iree-rocm-target-chip is now --iree-hip-target):

IREE commit: https://github.com/iree-org/iree/tree/mlperf_v4.1_20240726
TD script: https://github.com/nod-ai/sdxl-scripts/blob/mlperf_v4.1_20240726/int8-model/specs/attention_and_matmul_spec.mlir

iree-compile \
    --iree-execution-model=async-external \
    --iree-hal-target-backends=rocm \
    --iree-rocm-target-chip=gfx942 \
    --iree-rocm-waves-per-eu=2 \
    --iree-codegen-gpu-native-math-precision=true \
    --iree-codegen-llvmgpu-use-vector-distribution \
    --iree-codegen-transform-dialect-library=${TD_SPEC} \
    --iree-flow-enable-aggressive-fusion=true \
    --iree-global-opt-propagate-transposes=true \
    --iree-llvmgpu-enable-prefetch=true \
    --iree-opt-aggressively-propagate-transposes=true \
    --iree-opt-const-eval=false \
    --iree-opt-outer-dim-concat=true \
    --iree-opt-data-tiling=false \
    --iree-preprocessing-pass-pipeline="builtin.module(util.func(iree-global-opt-raise-special-ops, iree-flow-canonicalize), iree-preprocessing-transpose-convolution-pipeline, util.func(iree-preprocessing-pad-to-intrinsics), util.func(iree-preprocessing-generalize-linalg-matmul-experimental))" \
    --iree-vm-target-truncate-unsupported-floats \
    ${PUNET_MODEL} \
    -o ${VMFB}

and the same run command as above.

The following dispatches regress:

attention_48_*: 41 ms -> 53 ms
attention_146_*: 48 ms -> 56 ms
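For reference, per-dispatch timings like these can be reproduced by having iree-compile emit a standalone benchmark module per dispatch. A minimal sketch (the dumped filename below is illustrative, and this assumes the same compile flags as above):

# Emit a standalone benchmark .mlir per dispatch alongside normal compilation.
iree-compile <same flags as above> \
    --iree-hal-dump-executable-benchmarks-to=dispatch_benchmarks \
    ${PUNET_MODEL} -o ${VMFB}

# Compile and run just the dispatch of interest (filename is illustrative).
iree-compile dispatch_benchmarks/main_dispatch_48_benchmark.mlir \
    --iree-hal-target-backends=rocm --iree-hip-target=gfx942 \
    -o attention_48_benchmark.vmfb
iree-benchmark-module --device=hip:0 --module=attention_48_benchmark.vmfb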

Below are IR dumps from the MLPerf branch and from ToM for the two attention dispatches.

sdxl_mlperf_attention_48.dump.mlir.txt
sdxl_mlperf_attention_146.dump.mlir.txt
sdxl_tom_attention_48.dump.mlir.txt
sdxl_tom_attention_146.dump.mlir.txt
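A sketch of one way to regenerate this kind of per-dispatch dump (not necessarily how the attached files were produced; filenames are illustrative, and --compile-mode=hal-executable recompiles a dumped dispatch source in isolation):

# Dump each dispatch as a standalone hal.executable source.
iree-compile <same flags as above> \
    --iree-hal-dump-executable-sources-to=dispatch_sources \
    ${PUNET_MODEL} -o /dev/null

# Recompile one attention dispatch with full IR printing to capture the dump.
iree-compile dispatch_sources/main_dispatch_48.mlir \
    --compile-mode=hal-executable \
    --iree-hal-target-backends=rocm --iree-hip-target=gfx942 \
    --mlir-print-ir-after-all --mlir-disable-threading \
    -o /dev/null 2> sdxl_tom_attention_48.dump.mlir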
