[WIP] ArgCompare Benchmark Results and Performance Analysis #23609

bangtianliu · 2026-03-02T07:50:39Z

bangtianliu
Mar 2, 2026
Collaborator

Hardware: AMD MI300X (gfx942)

Refer to https://gist.github.com/bangtianliu/256ad601139f50300fd8c6aa2125eb20 for our previous synthetic benchmarking on ArgCompare.

ArgMax/ArgMin Operations Analysis in real-world AI models

The existing VD (VectorDistribute) pipeline does not yet support masking for arbitrary reduction sizes. As a workaround, we pad each reduction dimension up to the next power of two so that we can evaluate the performance of our VD implementation for ArgCompare.

Model	Op	Original Shape	Padded Shape	Reduction Size (elements)	Input Dtype	Output Dtype
LLaMA-2-7B	argmax	[1, 32000]	[1, 32768]	32,768	f16	i64
LLaMA-2-7B (batch)	argmax	[8, 32000]	[8, 32768]	32,768	f16	i64
LLaMA-3-8B	argmax	[1, 128000]	[1, 131072]	131,072	bf16	i64
LLaMA-3-8B (batch)	argmax	[32, 128000]	[32, 131072]	131,072	bf16	i64
GPT-2	argmax	[1, 50257]	[1, 65536]	65,536	f32	i64
Mistral-7B	argmax	[1, 32000]	[1, 32768]	32,768	bf16	i64
Gemma-7B	argmax	[1, 256000]	[1, 262144]	262,144	bf16	i64
Qwen-2-7B	argmax	[1, 152064]	[1, 262144]	262,144	bf16	i64
Falcon-7B	argmax	[1, 65024]	[1, 65536]	65,536	bf16	i64
Bloom-7B	argmax	[1, 250880]	[1, 262144]	262,144	bf16	i64
BERT-Base	argmax	[8, 512, 30522]	[8, 512, 32768]	32,768	f32	i64
ResNet-50 ImageNet	argmax	[32, 1000]	[32, 1024]	1,024	f32	i64
Whisper-Large	argmax	[1, 51865]	[1, 65536]	65,536	f16	i64
RecSys 100K	argmax	[256, 100000]	[256, 131072]	131,072	f32	i64
RecSys 1M	argmax	[64, 1000000]	[64, 1048576]	1,048,576	f32	i64
K-Nearest Neighbors	argmin	[1000, 50000]	[1024, 65536]	65,536	f32	i64
Vector Quantization	argmin	[1, 4096, 8192]	[1, 4096, 8192]	8,192	f32	i64
Image Retrieval	argmin	[32, 1000000]	[32, 1048576]	1,048,576	f32	i64

Refer to Real-World ArgMax/ArgMin Operations Analysis for detailed info.
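
The padded shapes in the table can be reproduced by rounding the reduction (last) dimension up to the next power of two. The sketch below is only an illustration: `next_pow2`, `pad_reduction_dim`, and `padded_fraction` are hypothetical helper names, not IREE APIs. For the padding to be a valid workaround, the extra elements would need a fill value that can never win (for example -inf for argmax and +inf for argmin); that is an assumption, since the post does not say which fill value was used.

```python
def next_pow2(n: int) -> int:
    """Return the smallest power of two that is >= n (n must be positive)."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()


def pad_reduction_dim(shape):
    """Round only the last (reduction) dimension up to a power of two."""
    *batch, reduction = shape
    return [*batch, next_pow2(reduction)]


def padded_fraction(shape):
    """Fraction of the padded reduction dimension that is padding (extra work)."""
    original = shape[-1]
    return 1.0 - original / next_pow2(original)


if __name__ == "__main__":
    # Shapes taken from the table above.
    for name, shape in [
        ("LLaMA-2-7B", [1, 32000]),
        ("GPT-2", [1, 50257]),
        ("Qwen-2-7B", [1, 152064]),
        ("Vector Quantization", [1, 4096, 8192]),
    ]:
        padded = pad_reduction_dim(shape)
        print(f"{name}: {shape} -> {padded} ({padded_fraction(shape):.1%} padding)")
```

The helper pads only the reduction dimension. The K-Nearest Neighbors row also pads the batch dimension (1000 to 1024), which this sketch does not model.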

Implementations Compared

Implementation	Description
VD	IREE's `LLVMGPUVectorDistribute` pipeline for ArgCompare. Uses subgroup shuffles and DPP (Data-Parallel Primitives) for efficient cross-lane reductions on AMD GPUs. A single workgroup processes the entire reduction, using ROCDL ballot optimizations.
VD_Split	VD with split reduction enabled. Tiles large reductions into smaller chunks that are processed by multiple workgroups in parallel, then merges the results. In these results it gives the largest gains on single-row reductions of 262K elements.
hipCUB	AMD's GPU primitive library (port of NVIDIA CUB). Uses `DeviceReduce::ArgMax/ArgMin` for single-batch and `DeviceSegmentedReduce` for batched reductions. Highly optimized with adaptive algorithms (1-pass vs 2-pass based on reduction size).
Ukernel	IREE's microkernel path using `LLVMGPUDefault` pipeline with `linalg.generic` lowered to pre-compiled bitcode (`iree_uk_amdgpu_argmax_f32i64.gfx942.bc`). Compiled with `--iree-rocm-enable-ukernels=all`. Uses workgroup_size=[64,1,1] with reduction tile size of 64.
PyTorch	`torch.argmax`/`torch.argmin` via ROCm backend. Uses PyTorch's TensorIterator-based GPU reduction kernels (not hipCUB). On ROCm, CUDA kernels are auto-converted to HIP via HIPification.
CK	AMD Composable Kernel library. Template-based GPU kernels with manual optimizations. The `DeviceReduceMultiBlock` kernel is used.

Benchmark Results

Methodology: rocprof kernel timing
Warm-up: 100 runs, Benchmark: 500 runs averaged

VD_Split is enabled for reductions ≥ 32K elements via the compiler flags `--iree-dispatch-creation-enable-split-reduction` and `--iree-preprocessing-pass-pipeline='builtin.module(iree-dispatch-creation-set-split-reduction-sizes{split-reduction-target-size=TILE})'`. This tiles the reduction into smaller chunks (e.g., 512, 1024, or 2048 elements) that are processed in parallel before a final merge. A `-` in the VD_Split column marks workloads whose reduction dimension is below that threshold.
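
The tile-then-merge scheme described above can be sketched on the host as follows. This is a minimal sequential model, not the code IREE generates: `argmax_tile` and `split_argmax` are hypothetical names, and the tie-breaking rule (lowest index wins) is an assumption chosen to keep the result deterministic.

```python
def argmax_tile(values, start, stop):
    """Return (best_value, best_index) over values[start:stop]; ties keep the lowest index."""
    best_idx = start
    for i in range(start + 1, stop):
        if values[i] > values[best_idx]:
            best_idx = i
    return values[best_idx], best_idx


def split_argmax(values, tile_size):
    """Two-stage argmax: independent per-tile partial results, then a final merge."""
    if not values:
        raise ValueError("values must be non-empty")
    if tile_size < 1:
        raise ValueError("tile_size must be positive")

    # Stage 1: each tile is independent, so on a GPU every tile could be
    # handled by its own workgroup in parallel.
    partials = [
        argmax_tile(values, start, min(start + tile_size, len(values)))
        for start in range(0, len(values), tile_size)
    ]

    # Stage 2: merge the (value, global index) pairs. Tiles are visited in
    # index order and the comparison is strict, so ties keep the lowest index.
    best_val, best_idx = partials[0]
    for val, idx in partials[1:]:
        if val > best_val:
            best_val, best_idx = val, idx
    return best_idx


if __name__ == "__main__":
    data = [(i * 7919) % 1013 for i in range(32768)]
    for tile in (512, 1024, 2048):
        print(tile, split_argmax(data, tile))
```

Carrying the global index alongside each partial value is what lets the merge stage stay a plain comparison over a short list of pairs.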

Model	Shape	Dtype	Reduction	VD (μs)	VD_Split (μs)	hipCUB (μs)	Ukernel (μs)	PyTorch (μs)	CK (μs)	Winner
LLaMA-2-7B	(32768,)	f16	32K	3.77	3.21	6.33	55.24	12.72	2477.62	VD_Split
LLaMA-3-8B	(131072,)	bf16	131K	4.29	3.40	6.61	96.98	17.11	9888.99	VD_Split
Mistral-7B	(32768,)	bf16	32K	4.25	3.37	8.93	202.34	12.41	2477.06	VD_Split
Qwen-2-7B	(262144,)	bf16	262K	22.07	3.69	7.93	191.24	19.27	19843.92	VD_Split
Gemma-7B	(262144,)	bf16	262K	22.39	3.40	6.61	206.46	19.25	19840.07	VD_Split
Bloom-7B	(262144,)	bf16	262K	22.07	3.61	6.45	200.22	19.33	19840.91	VD_Split
Falcon-7B	(65536,)	bf16	65K	4.37	3.48	16.42	405.04	20.30	4960.50	VD_Split
GPT-2	(65536,)	f32	65K	4.33	3.20	10.86	177.74	18.90	4957.73	VD_Split
Whisper-Large	(65536,)	f16	65K	4.21	3.33	10.70	174.34	20.23	4956.77	VD_Split
LLaMA-2-7B (batch)	(8, 32768)	f16	32K	6.77	4.01	36.01	57.89	12.90	2477.91	VD_Split
LLaMA-3-8B (batch)	(32, 131072)	bf16	131K	15.02	6.29	129.55	105.31	86.55	9918.99	VD_Split
ResNet-50	(32, 1024)	f32	1K	5.53	-	4.05	6.45	6.33	78.80	hipCUB
BERT-Base	(8, 512, 32768)	f32	32K	165.84	148.14	146.09	171.73	171.19	2478.11	hipCUB
Image Retrieval	(32, 1048576)	f32	1M	62.29	37.38	582.06	740.98	207.52	80248.42	VD_Split
K-NN	(1024, 65536)	f32	65K	69.10	68.38	55.72	195.81	63.34	4953.01	hipCUB
RecSys 100K	(256, 131072)	f32	131K	40.34	39.14	75.55	131.63	249.72	9911.54	VD_Split
RecSys 1M	(64, 1048576)	f32	1M	73.95	73.31	581.54	847.25	244.41	79589.01	VD_Split
VQ Codebook	(1, 4096, 8192)	f32	8K	35.81	-	29.32	47.31	51.40	620.12	hipCUB

Summary:

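VD/VD_Split is the fastest in 14/18 cases, outperforming hipCUB, PyTorch, Ukernel, and CK.
VD/VD_Split excels at batched reductions: roughly 5-20x faster than hipCUB on most multi-batch workloads (LLaMA batch, Image Retrieval, RecSys 1M), and about 1.9x faster on RecSys 100K.
Split reduction is critical for large single-row reductions: at 262K elements VD_Split is about 6x faster than VD (e.g., 3.69 μs vs 22.07 μs for Qwen-2-7B), while at 32K-131K the gain is roughly 1.2-1.4x.
hipCUB wins in 4 cases: a small reduction (ResNet-50), a high batch count with a moderate reduction size (K-NN, BERT-Base), and a small reduction with a large batch (VQ Codebook).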
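
The tallies in this summary can be re-derived from the results table. The snippet below copies the timings verbatim from the table (in μs, with `None` for the `-` entries) and picks the fastest implementation per row; the function names are hypothetical and only serve this check.

```python
NAMES = ("VD", "VD_Split", "hipCUB", "Ukernel", "PyTorch", "CK")

# (model, VD, VD_Split, hipCUB, Ukernel, PyTorch, CK), all in microseconds.
ROWS = [
    ("LLaMA-2-7B", 3.77, 3.21, 6.33, 55.24, 12.72, 2477.62),
    ("LLaMA-3-8B", 4.29, 3.40, 6.61, 96.98, 17.11, 9888.99),
    ("Mistral-7B", 4.25, 3.37, 8.93, 202.34, 12.41, 2477.06),
    ("Qwen-2-7B", 22.07, 3.69, 7.93, 191.24, 19.27, 19843.92),
    ("Gemma-7B", 22.39, 3.40, 6.61, 206.46, 19.25, 19840.07),
    ("Bloom-7B", 22.07, 3.61, 6.45, 200.22, 19.33, 19840.91),
    ("Falcon-7B", 4.37, 3.48, 16.42, 405.04, 20.30, 4960.50),
    ("GPT-2", 4.33, 3.20, 10.86, 177.74, 18.90, 4957.73),
    ("Whisper-Large", 4.21, 3.33, 10.70, 174.34, 20.23, 4956.77),
    ("LLaMA-2-7B (batch)", 6.77, 4.01, 36.01, 57.89, 12.90, 2477.91),
    ("LLaMA-3-8B (batch)", 15.02, 6.29, 129.55, 105.31, 86.55, 9918.99),
    ("ResNet-50", 5.53, None, 4.05, 6.45, 6.33, 78.80),
    ("BERT-Base", 165.84, 148.14, 146.09, 171.73, 171.19, 2478.11),
    ("Image Retrieval", 62.29, 37.38, 582.06, 740.98, 207.52, 80248.42),
    ("K-NN", 69.10, 68.38, 55.72, 195.81, 63.34, 4953.01),
    ("RecSys 100K", 40.34, 39.14, 75.55, 131.63, 249.72, 9911.54),
    ("RecSys 1M", 73.95, 73.31, 581.54, 847.25, 244.41, 79589.01),
    ("VQ Codebook", 35.81, None, 29.32, 47.31, 51.40, 620.12),
]


def winner(row):
    """Name of the implementation with the lowest time in this row."""
    timings = {name: t for name, t in zip(NAMES, row[1:]) if t is not None}
    return min(timings, key=timings.get)


def speedup_vs_hipcub(row):
    """hipCUB time divided by the better of VD and VD_Split (>1 means VD is faster)."""
    best_vd = min(t for t in row[1:3] if t is not None)
    return row[3] / best_vd


if __name__ == "__main__":
    vd_wins = sum(winner(r) in ("VD", "VD_Split") for r in ROWS)
    print(f"VD/VD_Split wins {vd_wins}/{len(ROWS)}")
    for r in ROWS:
        print(f"{r[0]:<20} winner={winner(r):<9} vs hipCUB={speedup_vs_hipcub(r):.2f}x")
```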

VectorDistribute Lowering Config

The following table shows the compiler's lowering configuration for each model when using the `LLVMGPUVectorDistribute` pipeline.

Model	Shape	Dtype	WG Size	SG Size	Subgroups	Thread Loads	Partial Red	Elem/WG	Iterations
LLaMA-2-7B	(32768,)	f16	[1024,1,1]	64	16	8	8192	8192	4
LLaMA-3-8B	(131072,)	bf16	[1024,1,1]	64	16	8	8192	8192	16
Mistral-7B	(32768,)	bf16	[1024,1,1]	64	16	8	8192	8192	4
Qwen-2-7B	(262144,)	bf16	[1024,1,1]	64	16	8	8192	8192	32
Gemma-7B	(262144,)	bf16	[1024,1,1]	64	16	8	8192	8192	32
Bloom-7B	(262144,)	bf16	[1024,1,1]	64	16	8	8192	8192	32
Falcon-7B	(65536,)	bf16	[1024,1,1]	64	16	8	8192	8192	8
GPT-2	(65536,)	f32	[1024,1,1]	64	16	4	4096	4096	16
Whisper-Large	(65536,)	f16	[1024,1,1]	64	16	8	8192	8192	8
LLaMA-2-7B (batch)	(8, 32768)	f16	[1024,1,1]	64	16	8	8192	8192	4
LLaMA-3-8B (batch)	(32, 131072)	bf16	[1024,1,1]	64	16	8	8192	8192	16
ResNet-50	(32, 1024)	f32	[256,1,1]	64	4	4	1024	1024	1
BERT-Base	(4096, 32768)	f32	[64,1,1]	64	1	4	256	256	128
Image Retrieval	(32, 1048576)	f32	[1024,1,1]	64	16	4	4096	4096	256
K-NN	(1024, 65536)	f32	[1024,1,1]	64	16	4	4096	4096	16
RecSys 100K	(256, 131072)	f32	[1024,1,1]	64	16	4	4096	4096	32
RecSys 1M	(64, 1048576)	f32	[1024,1,1]	64	16	4	4096	4096	256
VQ Codebook	(4096, 8192)	f32	[64,1,1]	64	1	4	256	256	32

Column Definitions

WG Size: Workgroup size [x,y,z] - total threads per workgroup
SG Size: Subgroup (wavefront) size - 64 for AMD GCN/CDNA
Subgroups: Number of subgroups per workgroup (WG_x / SG_Size)
Thread Loads: Elements loaded per thread (128-bit vector / element size)
Partial Red: Elements reduced per iteration (Threads × Thread_Loads)
Elem/WG: Elements processed per workgroup per iteration
Iterations: Sequential iterations needed (Reduction / Partial_Red)
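
The derived columns follow directly from these definitions. The sketch below recomputes them from the workgroup size, element width, and reduction size; `derive_config` and `LoweringConfig` are hypothetical names for illustration. The workgroup size itself is chosen by the compiler and is treated as an input here.

```python
from dataclasses import dataclass

VECTOR_BITS = 128    # width of one vectorized load per thread
SUBGROUP_SIZE = 64   # wavefront size on AMD GCN/CDNA


@dataclass(frozen=True)
class LoweringConfig:
    subgroups: int
    thread_loads: int
    partial_red: int
    iterations: int


def derive_config(wg_size_x: int, elem_bits: int, reduction: int) -> LoweringConfig:
    """Recompute the derived table columns from the column definitions."""
    thread_loads = VECTOR_BITS // elem_bits   # elements per 128-bit load
    partial_red = wg_size_x * thread_loads    # elements reduced per iteration
    return LoweringConfig(
        subgroups=wg_size_x // SUBGROUP_SIZE,
        thread_loads=thread_loads,
        partial_red=partial_red,
        iterations=reduction // partial_red,  # sequential passes over the row
    )


if __name__ == "__main__":
    print("LLaMA-2-7B (f16):", derive_config(1024, 16, 32768))
    print("GPT-2 (f32):     ", derive_config(1024, 32, 65536))
    print("BERT-Base (f32): ", derive_config(64, 32, 32768))
```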

[WIP]: Add in-depth performance analysis and evaluate additional tools (e.g., MIOpen benchmarks).

kuhar · 2026-03-08T14:33:49Z

kuhar
Mar 8, 2026
Maintainer

Thanks for putting this together. Can you add units and element types to the table headers? I assume we are measuring microseconds?

What stands out to me is that the numbers in the VD column don’t scale with the input size — we are faster with 65k input elements than with 64. I’d expect that the upper half of the table is so small that we might be measuring noise, but as we select more than one subgroup, we should be paying the cost of doing workgroup-level reductions. Would be worth double checking the final mlir / llvm ir to make sure that this is the case, and look at these workloads under threadtrace.

It’s also worth checking what the concrete problem sizes and data types are in argmax ops from real-world models, and benchmarking based on that.


bangtianliu · 2026-03-19T20:01:52Z

bangtianliu
Mar 19, 2026
Collaborator Author

Tuning Update: Forced SG/PR Sweep

SG controls the number of subgroups per workgroup (workgroup_size = SG × 64; values: 1, 2, 4, 8, 16). PR controls the partial reduction tile size (values: 256, 512, 1024, 2048, 4096). Each config was measured with 50+50 rocprof runs, and around 67 unique configs were explored via a constraint-aware fine search.

Model	Shape	hipCUB (μs)	Default VD (μs)	Tuned VD (μs)	Best SG/PR	Speedup vs hipCUB
ResNet-50	32×1024	2.95	3.85	4.54	1/256	0.65x
VQ Codebook	4096×8192	30.86	35.49	17.98	16/1024	1.72x
K-NN	1024×65536	58.24	81.96	70.30	16/4096	0.83x
BERT-Base	4096×32768	150.63	174.42	174.05	16/4096	0.87x
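
As a rough illustration of what a constrained SG/PR grid looks like, the sketch below enumerates the listed SG and PR values and filters them. The two filters (PR must be a multiple of the workgroup size, and PR must divide the reduction size) are assumptions made for this example; the post does not state the constraints used, and the roughly 67 configs it mentions indicate a finer search than this coarse grid. `candidate_configs` is a hypothetical name.

```python
from itertools import product

SUBGROUP_SIZE = 64
SG_VALUES = (1, 2, 4, 8, 16)              # subgroups per workgroup
PR_VALUES = (256, 512, 1024, 2048, 4096)  # partial reduction tile sizes


def candidate_configs(reduction: int):
    """Enumerate (SG, PR) pairs that pass two assumed validity filters."""
    configs = []
    for sg, pr in product(SG_VALUES, PR_VALUES):
        workgroup_size = sg * SUBGROUP_SIZE
        # Assumption 1: every thread gets a whole number of elements per tile.
        if pr % workgroup_size != 0:
            continue
        # Assumption 2: the tile evenly divides the (padded) reduction size.
        if reduction % pr != 0:
            continue
        configs.append((sg, pr))
    return configs


if __name__ == "__main__":
    for name, reduction in [("ResNet-50", 1024), ("K-NN", 65536)]:
        configs = candidate_configs(reduction)
        print(f"{name}: {len(configs)} candidate configs")
```

All of the best SG/PR pairs reported in the table above pass both filters, which is the only reason these particular filters were picked for the sketch.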

Tuning Update: Forced SG/PR/ST Sweep

This sweep enables split reduction for K-NN and BERT-Base and adds a third parameter: ST, the split reduction tile size (values: 256, 512, 1024, 2048).

Model	Shape	hipCUB (μs)	Default VD (μs)	Tuned VD (μs)	Best SG/PR/ST	Speedup vs hipCUB
K-NN	1024×65536	58.24	81.96	40.08	1/1024/1024	1.44x
BERT-Base	4096×32768	150.63	174.42	43.56	16/1024/2048	3.46x

Summary: After the tuning sweep, VD/VD_Split wins 17/18 cases. The only remaining gap is ResNet-50 (32×1024), which requires further investigation.

Note: Preliminary results for tracking progress. Will update with final numbers and analysis.


bangtianliu · 2026-03-25T15:45:02Z

bangtianliu
Mar 25, 2026
Collaborator Author

DPP+Ballot vs Shuffle-Only Comparison

Context: The question is whether a shuffle-only approach (without DPP + ballot) is sufficient, on the hypothesis that the performance difference may be minimal for memory-bound workloads. This comparison evaluates both approaches and guides the choice of implementation for supporting ArgCompare within the VectorDistribute pipeline.
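
To make the two approaches concrete, here is a host-side simulation of one 64-lane subgroup. It is a sketch under stated assumptions, not the lowering IREE emits: the shuffle-only variant carries a (value, index) pair through every butterfly step, while the ballot variant reduces only the value and then recovers the index with a single ballot over the lanes that hold the maximum. DPP itself is not modelled, the function names are hypothetical, and each lane's index is assumed to be its lane id.

```python
SUBGROUP_SIZE = 64


def shuffle_only_argmax(vals):
    """Butterfly (xor-shuffle) reduction that moves value and index together."""
    assert len(vals) == SUBGROUP_SIZE
    lanes = [(v, lane) for lane, v in enumerate(vals)]
    offset = SUBGROUP_SIZE // 2
    while offset >= 1:
        nxt = []
        for lane in range(SUBGROUP_SIZE):
            mine, other = lanes[lane], lanes[lane ^ offset]  # simulated shuffle
            # Keep the larger value; on ties keep the lower index.
            take_mine = mine[0] > other[0] or (mine[0] == other[0] and mine[1] <= other[1])
            nxt.append(mine if take_mine else other)
        lanes = nxt
        offset //= 2
    return lanes[0]  # after log2(64) = 6 steps every lane holds the same pair


def ballot_argmax(vals):
    """Value-only butterfly max, then one ballot to find the first matching lane."""
    assert len(vals) == SUBGROUP_SIZE
    cur = list(vals)
    offset = SUBGROUP_SIZE // 2
    while offset >= 1:
        cur = [max(cur[lane], cur[lane ^ offset]) for lane in range(SUBGROUP_SIZE)]
        offset //= 2
    best = cur[0]

    # Ballot: bit i of the mask is set if lane i holds the maximum.
    mask = 0
    for lane, v in enumerate(vals):
        if v == best:
            mask |= 1 << lane
    first_lane = (mask & -mask).bit_length() - 1  # lowest set bit
    return best, first_lane


if __name__ == "__main__":
    data = [(lane * 37) % 23 for lane in range(SUBGROUP_SIZE)]
    print(shuffle_only_argmax(data), ballot_argmax(data))
```

In this model the ballot variant exchanges only the value at each step, which is one plausible reason it could be cheaper than exchanging a value and an i64 index together.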

Standard (No Split Reduction)

Model	Reduction	Dtype	DPP (μs)	Shuffle (μs)	Shuffle / DPP
llama2_7b	32768	f16	3.77	6.29	1.67x
llama3_8b	131072	bf16	4.29	13.70	3.19x
mistral_7b	32768	bf16	4.25	7.01	1.65x
qwen2_7b	262144	bf16	22.07	22.39	1.01x
gemma_7b	262144	bf16	22.39	22.07	0.99x
bloom_7b	262144	bf16	22.07	22.59	1.02x
falcon_7b	65536	bf16	4.37	8.93	2.04x
gpt2	65536	f32	4.33	8.13	1.88x
whisper_large	65536	f16	4.21	7.49	1.78x
llama2_batch8	8x32768	f16	6.77	7.37	1.09x
llama3_batch32	32x131072	bf16	15.02	14.66	0.98x
imagenet_1k	32x1024	f32	5.53	5.89	1.07x
bert_base	8x512x32768	f32	165.84	167.41	1.01x
image_retrieval	32x1048576	f32	62.29	61.77	0.99x
knn_50k	1024x65536	f32	69.10	69.74	1.01x
recsys_100k	256x131072	f32	40.34	42.82	1.06x
recsys_1m	64x1048576	f32	73.95	73.63	1.00x
vq_codebook	1x4096x8192	f32	35.81	36.01	1.01x

Without split: DPP matters only for 1D reductions of 32K-131K elements, where it is about 1.7-3.2x faster. For 262K and for all batched/2D workloads, the two approaches are essentially identical (0.98-1.09x).

Split Reduction

Model	Reduction	Dtype	DPP Split (μs)	Shuffle Split (μs)	Shuffle / DPP
llama2_7b	32768	f16	3.21	4.85	1.51x
llama3_8b	131072	bf16	3.41	4.97	1.46x
mistral_7b	32768	bf16	3.37	4.77	1.42x
qwen2_7b	262144	bf16	3.69	4.93	1.34x
gemma_7b	262144	bf16	3.41	4.85	1.42x
bloom_7b	262144	bf16	3.61	4.85	1.34x
falcon_7b	65536	bf16	3.49	4.49	1.29x
gpt2	65536	f32	3.20	5.09	1.59x
whisper_large	65536	f16	3.33	4.93	1.48x
llama2_batch8	8x32768	f16	4.01	4.89	1.22x
llama3_batch32	32x131072	bf16	6.29	6.37	1.01x
bert_base	8x512x32768	f32	148.14	179.55	1.21x
image_retrieval	32x1048576	f32	37.38	42.30	1.13x
knn_50k	1024x65536	f32	68.38	69.46	1.02x
recsys_100k	256x131072	f32	39.14	42.42	1.08x
recsys_1m	64x1048576	f32	73.31	80.72	1.10x

(imagenet_1k and vq_codebook excluded due to the reduction dimension being smaller than the 32768 threshold)

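With split: DPP is 1.0-1.6x faster in every case. The gap is largest (1.3-1.6x) for single-row reductions and shrinks to 1.0-1.2x for batched workloads. It does not grow with reduction size because split reduction normalizes the per-tile work.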

Summary: Based on the performance results above, we should upstream a `gpu.ballot` operation to the GPU dialect, which can then be lowered to target-specific implementations (`rocdl.ballot` for AMD GPUs, and equivalent ops for SPIR-V and NVVM).

