For reproduction.
Input Model:
https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/punet.mlir
Input data:
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.0.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.1.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.2.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.3.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.4.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.5.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/punet_weights.irpa
I built IREE on main and used the TD script in https://github.com/nod-ai/sdxl-scripts/blob/shared/sdxl_on_main/int8-model/specs/attention_and_matmul_spec.mlir
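The downloads above and the variables the commands below refer to (`${PUNET_MODEL}`, `${TD_SPEC}`, `${VMFB}`) can be set up in one place. This is a minimal sketch, not part of the original setup: the helper names `punet_urls` and `fetch_punet` are hypothetical, and the local file names are assumptions (the `.mlir` names are the basenames of the URLs above, and `punet.vmfb` is an arbitrary output name).

```shell
#!/usr/bin/env bash
# Base URL taken from the download links above.
BASE_URL="https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet"

# Print every artifact URL needed for the reproduction, one per line.
punet_urls() {
  echo "${BASE_URL}/punet.mlir"
  for i in 0 1 2 3 4 5; do
    echo "${BASE_URL}/inference_input.${i}.bin"
  done
  echo "${BASE_URL}/punet_weights.irpa"
}

# Download everything into the current directory (needs network access).
# -nc skips files that already exist locally.
fetch_punet() {
  punet_urls | while read -r url; do
    wget -nc "${url}"
  done
}

# Variables referenced by the compile and run commands.
# The local file names are assumptions; adjust them to your checkout.
PUNET_MODEL="punet.mlir"
TD_SPEC="attention_and_matmul_spec.mlir"  # local copy of the TD script
VMFB="punet.vmfb"                         # hypothetical output name
```

Source the script and call `fetch_punet` once before compiling; the TD script itself still has to be fetched from the sdxl-scripts repository for the branch being tested.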
Compilation command for IREE on main:
iree-compile \
--iree-execution-model=async-external \
--iree-hal-target-backends=rocm \
--iree-hip-target=gfx942 \
--iree-hip-waves-per-eu=2 \
--iree-codegen-gpu-native-math-precision=true \
--iree-codegen-llvmgpu-use-vector-distribution \
--iree-codegen-transform-dialect-library=${TD_SPEC} \
--iree-dispatch-creation-enable-aggressive-fusion=true \
--iree-global-opt-propagate-transposes=true \
--iree-llvmgpu-enable-prefetch=true \
--iree-opt-aggressively-propagate-transposes=true \
--iree-opt-const-eval=false \
--iree-opt-outer-dim-concat=true \
--iree-opt-data-tiling=false \
--iree-preprocessing-pass-pipeline="builtin.module(util.func(iree-global-opt-raise-special-ops, iree-flow-canonicalize), iree-preprocessing-transpose-convolution-pipeline, iree-preprocessing-pad-to-intrinsics, util.func(iree-preprocessing-generalize-linalg-matmul-experimental))" \
--iree-vm-target-truncate-unsupported-floats \
${PUNET_MODEL} \
-o ${VMFB}
Run command:
iree-benchmark-module \
--device=hip:0 \
--device_allocator=caching \
--function=main \
--hip_allow_inline_execution=true \
--hip_use_stream=true \
--input=1x4x128x128xf16=@inference_input.0.bin \
--input=1xf16=@inference_input.1.bin \
--input=2x64x2048xf16=@inference_input.2.bin \
--input=2x1280xf16=@inference_input.3.bin \
--input=2x6xf16=@inference_input.4.bin \
--input=1xf16=@inference_input.5.bin \
--module=${VMFB} \
--parameters=model=punet_weights.irpa
For compilation on the MLPerf branch I used the same inputs and weights, with the following IREE commit and TD script:
IREE commit: https://github.com/iree-org/iree/tree/mlperf_v4.1_20240726
TD script: https://github.com/nod-ai/sdxl-scripts/blob/mlperf_v4.1_20240726/int8-model/specs/attention_and_matmul_spec.mlir
iree-compile \
--iree-execution-model=async-external \
--iree-hal-target-backends=rocm \
--iree-rocm-target-chip=gfx942 \
--iree-rocm-waves-per-eu=2 \
--iree-codegen-gpu-native-math-precision=true \
--iree-codegen-llvmgpu-use-vector-distribution \
--iree-codegen-transform-dialect-library=${TD_SPEC} \
--iree-flow-enable-aggressive-fusion=true \
--iree-global-opt-propagate-transposes=true \
--iree-llvmgpu-enable-prefetch=true \
--iree-opt-aggressively-propagate-transposes=true \
--iree-opt-const-eval=false \
--iree-opt-outer-dim-concat=true \
--iree-opt-data-tiling=false \
--iree-preprocessing-pass-pipeline="builtin.module(util.func(iree-global-opt-raise-special-ops, iree-flow-canonicalize), iree-preprocessing-transpose-convolution-pipeline, util.func(iree-preprocessing-pad-to-intrinsics), util.func(iree-preprocessing-generalize-linalg-matmul-experimental))" \
--iree-vm-target-truncate-unsupported-floats \
${PUNET_MODEL} \
-o ${VMFB}
The run command is the same as above.
The following dispatches regress:
attention_48_*: 41 ms -> 53 ms
attention_146_*: 48 ms -> 56 ms
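To put the two regressions on a common scale, the relative slowdown can be computed from the timings above. A small sketch, reading each arrow as before -> after; the helper name `slowdown_pct` is hypothetical:

```shell
# Relative slowdown in percent: (after - before) / before * 100.
slowdown_pct() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (after - before) / before * 100 }'
}

slowdown_pct 41 53   # attention_48_*  -> prints 29.3
slowdown_pct 48 56   # attention_146_* -> prints 16.7
```

On these numbers attention_48_* is roughly 29% slower and attention_146_* roughly 17% slower.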
Below are the IR dumps from the MLPerf branch and from ToM (top of main) for the two attention dispatches:
sdxl_mlperf_attention_48.dump.mlir.txt
sdxl_mlperf_attention_146.dump.mlir.txt
sdxl_tom_attention_48.dump.mlir.txt
sdxl_tom_attention_146.dump.mlir.txt