Description
We are seeing a GEMM template selection problem in LiteRT OpenCL for blockwise INT4 models on Adreno 830.
This is primarily a performance issue, not a numerical one. The regression appears to come from LiteRT selecting the wrong GEMM family for the large projection layers.
What we validated
For the main large GEMM shapes, standalone fast-path kernels can reach about 2.8–3.1 TOPS.
Representative shapes:
- 4224 x 3840 x 11520
- 4224 x 3840 x 3840
- 4224 x 3840 x 10240
- 4224 x 10240 x 3840
These fast results correspond to the previously validated os=8 GEMM family.
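For reference, the TOPS figures in this report can be reproduced from a measured kernel time with the small sketch below. It assumes the shapes above are read as M x N x K and that one multiply-accumulate counts as 2 operations; both are assumptions about the accounting convention, and the helper names are only illustrative.

```python
# Sketch: convert a GEMM shape and a measured kernel time into effective TOPS.
# Assumptions (not confirmed by LiteRT): shapes are M x N x K, and one
# multiply-accumulate is counted as 2 operations.

def gemm_ops(m: int, n: int, k: int) -> int:
    """Total operations for one M x N x K GEMM (2 ops per MAC)."""
    return 2 * m * n * k


def effective_tops(m: int, n: int, k: int, seconds: float) -> float:
    """Effective throughput in tera-operations per second."""
    return gemm_ops(m, n, k) / seconds / 1e12


def seconds_at_tops(m: int, n: int, k: int, tops: float) -> float:
    """Kernel time that would correspond to a given TOPS figure."""
    return gemm_ops(m, n, k) / (tops * 1e12)


REPRESENTATIVE_SHAPES = [
    (4224, 3840, 11520),
    (4224, 3840, 3840),
    (4224, 3840, 10240),
    (4224, 10240, 3840),
]

if __name__ == "__main__":
    for m, n, k in REPRESENTATIVE_SHAPES:
        t_ms = seconds_at_tops(m, n, k, 3.0) * 1e3
        print(f"{m} x {n} x {k}: {gemm_ops(m, n, k):.3e} ops, {t_ms:.1f} ms at 3.0 TOPS")
```

Feeding the per-dispatch time from an OpenCL profiling event into `effective_tops` gives numbers directly comparable to the standalone results.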
What LiteRT is selecting instead
From intercepted OpenCL sources used by the current LiteRT blockwise INT4 run, the large GEMMs are not landing on the os=8 family.
The selected kernels are consistent with an os=4 family:
- if (Z * 4 >= ...)
- coord_s = mul24(Z, 4)
- accumulators r0..r3
- qcom_max_concurrent_subgroups(24)
- qcom_sub_group_constant_load4(..., 16)
- float4 / read_imagef / write_imagef
By contrast, the fast validated family is os=8:
- if (Z * 8 >= ...)
- coord_s = mul24(Z, 8)
- accumulators r0..r7
- qcom_max_concurrent_subgroups(12)
- qcom_sub_group_constant_load8(..., 32)
So LiteRT appears to be selecting the wrong GEMM template family.
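The family identification above was done by inspecting intercepted kernel sources. A minimal sketch of that check follows, assuming the markers listed above are sufficient to tell the two families apart. The function name and the regular expressions are ours and only match the textual fingerprints quoted in this report; they are not part of LiteRT.

```python
import re

# Textual fingerprints taken from the two lists above. These are heuristics
# based on the intercepted sources, not an official LiteRT classification.
OS4_MARKERS = [
    r"Z\s*\*\s*4\s*>=",
    r"mul24\(\s*Z\s*,\s*4\s*\)",
    r"qcom_max_concurrent_subgroups\(\s*24\s*\)",
    r"qcom_sub_group_constant_load4\b",
]
OS8_MARKERS = [
    r"Z\s*\*\s*8\s*>=",
    r"mul24\(\s*Z\s*,\s*8\s*\)",
    r"qcom_max_concurrent_subgroups\(\s*12\s*\)",
    r"qcom_sub_group_constant_load8\b",
]


def classify_gemm_family(source: str) -> str:
    """Return 'os=4', 'os=8', or 'unknown' for one OpenCL kernel source string."""
    os4_hits = sum(bool(re.search(p, source)) for p in OS4_MARKERS)
    os8_hits = sum(bool(re.search(p, source)) for p in OS8_MARKERS)
    if os4_hits > os8_hits:
        return "os=4"
    if os8_hits > os4_hits:
        return "os=8"
    return "unknown"  # no markers found, or an ambiguous mix


if __name__ == "__main__":
    sample = "if (Z * 4 >= dst_slices) return; int coord_s = mul24(Z, 4);"
    print(classify_gemm_family(sample))
```

Running this over every dumped program source makes it easy to confirm which family each large GEMM dispatch lands on.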
Observed effect
Under the current LiteRT execution path, effective throughput for the same large GEMMs drops far below the standalone fast-path results:
- qkv: about 0.70 TOPS
- o_proj: about 0.71 TOPS
- w2: about 0.53 TOPS
- w1/w3-class dispatches: about 0.70–1.5 TOPS
That is roughly 2x to 6x lower than the 2.8–3.1 TOPS that the standalone fast-path kernels reach.
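The size of the gap can be worked out directly from the figures above. The sketch below uses only the numbers quoted in this report; the dictionary layout and function name are illustrative.

```python
# Slowdown of the LiteRT path relative to the standalone fast-path results.
# All TOPS values are the approximate figures quoted in this report.
FAST_TOPS = (2.8, 3.1)  # standalone os=8 range (low, high)

OBSERVED_TOPS = {
    "qkv": (0.70, 0.70),
    "o_proj": (0.71, 0.71),
    "w2": (0.53, 0.53),
    "w1/w3": (0.70, 1.5),
}


def slowdown_range(observed, fast=FAST_TOPS):
    """Return (smallest, largest) slowdown factor for an observed TOPS range."""
    obs_low, obs_high = observed
    fast_low, fast_high = fast
    return fast_low / obs_high, fast_high / obs_low


if __name__ == "__main__":
    for name, observed in OBSERVED_TOPS.items():
        low, high = slowdown_range(observed)
        print(f"{name}: {low:.1f}x to {high:.1f}x slower")
```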
Request
Please check the OpenCL blockwise-INT4 GEMM selector:
- why LiteRT is choosing the os=4/sub24/load4 family here
- why it is not choosing the expected os=8/sub12/load8 fast path
- what shape/layout/epilogue conditions are gating that decision
The main issue is:
LiteRT is selecting the wrong GEMM template family for blockwise INT4, which causes the performance regression.
Additional request
This issue is difficult to debug from the outside because the OpenCL kernel selection and graph assembly logic are closed-source.
Please consider open-sourcing the OpenCL backend library, or at least the GEMM selector / kernel-template selection logic for the GPU backend.
That would make it much easier to:
- understand why LiteRT is selecting the os=4 family instead of the expected os=8 fast path
- reproduce and isolate regressions like this
- contribute fixes or validate alternative schedules on real workloads
At minimum, exposing the selector logic and template routing rules would already be very helpful.