LiteRT OpenCL selects the wrong GEMM template for blockwise INT4 on Adreno, causing major performance regression #6399

  We are seeing a GEMM template selection problem in LiteRT OpenCL for blockwise INT4 models on Adreno 830.

  This is primarily a performance issue, not a numerical one. The main regression appears to come from LiteRT selecting the wrong GEMM family for the large projection layers.

  What we validated
  For the main large GEMM shapes, standalone fast-path kernels can reach about 2.8–3.1 TOPS.

  Representative shapes:

  - 4224 x 3840 x 11520
  - 4224 x 3840 x 3840
  - 4224 x 3840 x 10240
  - 4224 x 10240 x 3840
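
  For reference, a throughput figure in TOPS can be derived from a measured kernel time roughly as sketched here. This is a minimal sketch, not LiteRT code: it assumes each shape is an M x N x K triple and counts 2 operations per multiply-accumulate, and the names `gemm_ops` and `gemm_tops` are hypothetical. The op count is the same whichever dimension is taken as M, N, or K.

```python
def gemm_ops(m: int, n: int, k: int) -> int:
    """Operation count for one GEMM, assuming 2 ops (multiply + add) per MAC."""
    return 2 * m * n * k


def gemm_tops(m: int, n: int, k: int, seconds: float) -> float:
    """Effective throughput in TOPS for a GEMM that took `seconds` to run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return gemm_ops(m, n, k) / seconds / 1e12


# Shapes from the list above, read as (M, N, K); the ordering is an assumption.
SHAPES = [
    (4224, 3840, 11520),
    (4224, 3840, 3840),
    (4224, 3840, 10240),
    (4224, 10240, 3840),
]

if __name__ == "__main__":
    for m, n, k in SHAPES:
        print(f"{m} x {n} x {k}: {gemm_ops(m, n, k) / 1e9:.1f} GOP per dispatch")
```

  Passing a measured per-dispatch time to `gemm_tops` gives a number that can be compared directly between the standalone kernels and the LiteRT run.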

  These fast results correspond to the previously validated os=8 GEMM family.

  What LiteRT is selecting instead
  From intercepted OpenCL sources used by the current LiteRT blockwise INT4 run, the large GEMMs are not landing on the os=8 family.

  The selected kernels are consistent with an os=4 family:

  - if (Z * 4 >= ...)
  - coord_s = mul24(Z, 4)
  - accumulators r0..r3
  - qcom_max_concurrent_subgroups(24)
  - qcom_sub_group_constant_load4(..., 16)
  - float4 / read_imagef / write_imagef

  By contrast, the fast validated family is os=8:

  - if (Z * 8 >= ...)
  - coord_s = mul24(Z, 8)
  - accumulators r0..r7
  - qcom_max_concurrent_subgroups(12)
  - qcom_sub_group_constant_load8(..., 32)
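
  The markers above can be checked mechanically against dumped kernel sources. The following is a minimal sketch, assuming the intercepted OpenCL source is available as a string; the function names and the regular expressions are our own illustration and only look for the identifiers quoted in this issue.

```python
import re

# Patterns for the markers quoted in this issue. Each captures one integer.
MARKERS = {
    "os": re.compile(r"mul24\(\s*Z\s*,\s*(\d+)\s*\)"),
    "subgroups": re.compile(r"qcom_max_concurrent_subgroups\(\s*(\d+)\s*\)"),
    "load_width": re.compile(r"qcom_sub_group_constant_load(\d+)\s*\("),
}


def classify_kernel(source: str) -> dict:
    """Return the first value found for each marker, or None if absent."""
    found = {}
    for name, pattern in MARKERS.items():
        match = pattern.search(source)
        found[name] = int(match.group(1)) if match else None
    return found


def family_label(found: dict) -> str:
    """Short label such as 'os=8/sub12/load8'; 'unknown' if no os marker."""
    if found["os"] is None:
        return "unknown"
    return f"os={found['os']}/sub{found['subgroups']}/load{found['load_width']}"


if __name__ == "__main__":
    # Hypothetical snippet built only from the markers listed above.
    sample = (
        "qcom_max_concurrent_subgroups(24)\n"
        "coord_s = mul24(Z, 4);\n"
        "qcom_sub_group_constant_load4(w, 16);\n"
    )
    print(family_label(classify_kernel(sample)))
```

  Running this over every intercepted kernel gives a quick per-dispatch view of which family was selected.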

  So LiteRT appears to be selecting the wrong GEMM template family.

  Observed effect
  Under the current LiteRT execution path, effective throughput for the same large GEMMs drops far below the standalone fast-path results:

  - qkv: about 0.70 TOPS
  - o_proj: about 0.71 TOPS
  - w2: about 0.53 TOPS
  - w1/w3-class dispatches: about 0.70–1.5 TOPS

  That is roughly 2x to 6x slower than the expected ~3 TOPS range.

  Request
  Please check the OpenCL blockwise-INT4 GEMM selector:

  - why LiteRT is choosing the os=4/sub24/load4 family here
  - why it is not choosing the expected os=8/sub12/load8 fast path
  - what shape/layout/epilogue conditions are gating that decision

  The main issue is:
  LiteRT is selecting the wrong GEMM template family for blockwise INT4, which causes the performance regression.

  Additional request
  This issue is difficult to debug from the outside because the OpenCL kernel selection and graph assembly logic are closed-source.

  Please consider open-sourcing the OpenCL backend library, or at least the GEMM selector / kernel-template selection logic for the GPU backend.

  That would make it much easier to:

  - understand why LiteRT is selecting the os=4 family instead of the expected os=8 fast path
  - reproduce and isolate regressions like this
  - contribute fixes or validate alternative schedules on real workloads

  At minimum, exposing the selector logic and template routing rules would already be very helpful.

