
SDXL clip encoder perf regression #37840

@jmitrovicTT


Summary

After the commit "Respect MM throttle level in minimal_matmul", the throttle is applied to the minimal_matmul operation. This affects all Linear layers in the CLIP encoder, since they run on a full 8×8 grid (64 cores), which is above the 48-core throttle threshold for WH. As a result, SDXL image generation is slower by ~8 ms.

Minimal matmul on 40 cores vs 64 cores experiment results

pytest models/experimental/stable_diffusion_xl_base/tests/test_sdxl_perf.py::test_sdxl_perf_device[test_sdxl_clip_encoder_1] models/experimental/stable_diffusion_xl_base/tests/test_sdxl_perf.py::test_sdxl_perf_device[test_sdxl_clip_encoder_2]

64 cores + throttle:

  • encoder_1: AVG DEVICE KERNEL DURATION [ns] 14663277.0 is outside of expected range (12915873.57, 13309250.429999998)
  • encoder_2: AVG DEVICE KERNEL DURATION [ns] 70378029.0 is outside of expected range (62319927.74, 64863598.26)

40 cores (no throttle, since this is below the threshold):

  • encoder_1: AVG DEVICE KERNEL DURATION [ns] 11939170.0 is outside of expected range (12915873.57, 13309250.429999998)
  • encoder_2: AVG DEVICE KERNEL DURATION [ns] 58080141.0 is outside of expected range (62319927.74, 64863598.26)

64 cores, no throttle:

  • encoder_1: "AVG DEVICE KERNEL DURATION [ns]": 13163609.0
  • encoder_2: "AVG DEVICE KERNEL DURATION [ns]": 63109927.0
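
To make the comparison easier to read, here is a small sketch that converts the numbers above into deltas against the unthrottled 64-core run. The values are copied from the results in this issue; the helper name `delta_report` is made up for illustration, and the script assumes the second 40-core measurement (58080141.0) belongs to encoder_2, since its expected range matches encoder_2.

```python
# Average device kernel durations in ns, copied from the results above.
# Baseline: 64 cores, no throttle.
baseline = {"encoder_1": 13163609.0, "encoder_2": 63109927.0}

runs = {
    "64 cores + throttle": {"encoder_1": 14663277.0, "encoder_2": 70378029.0},
    # Assumption: the second 40-core value is encoder_2 (its range matches).
    "40 cores": {"encoder_1": 11939170.0, "encoder_2": 58080141.0},
}


def delta_report(run):
    """Return {encoder: (delta in ms, delta in percent)} relative to baseline."""
    report = {}
    for name, duration_ns in run.items():
        diff_ns = duration_ns - baseline[name]
        report[name] = (diff_ns / 1e6, 100.0 * diff_ns / baseline[name])
    return report


if __name__ == "__main__":
    for label, run in runs.items():
        report = delta_report(run)
        for name, (diff_ms, pct) in report.items():
            print(f"{label:>20} {name}: {diff_ms:+.2f} ms ({pct:+.1f}%)")
        total_ms = sum(diff_ms for diff_ms, _ in report.values())
        print(f"{label:>20} total: {total_ms:+.2f} ms")
```

A positive delta means slower than the unthrottled 64-core run, a negative delta means faster.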

Potential solutions

  • Use a smaller core grid (≤48 cores) for CLIP encoder matmuls to stay below the throttle threshold. Initial testing with 8×5 (40 cores) shows it actually outperforms the unthrottled 64-core setup (encoder_1: 11.94 ms vs 13.16 ms, encoder_2: 58.08 ms vs 63.11 ms).

  • Pass an explicit throttle_level=0 via compute_kernel_config for CLIP encoder operations, since we haven't observed hangs without throttle on these shapes.
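
For the first option, a rough sketch of how candidate grids could be enumerated. This is plain Python, not the ttnn API: the function names are hypothetical, and it assumes a grid avoids the throttle when its core count is at most 48, as the "≤48 cores" wording above suggests. Only 8×5 has actually been measured here; the other candidates are untested.

```python
# Throttle threshold for WH as stated in this issue.
# Assumption: grids with at most this many cores are not throttled.
WH_THROTTLE_THRESHOLD = 48


def stays_below_threshold(cols, rows, threshold=WH_THROTTLE_THRESHOLD):
    """True if a cols x rows core grid has no more cores than the threshold."""
    return cols * rows <= threshold


def candidate_grids(max_cols=8, max_rows=8, threshold=WH_THROTTLE_THRESHOLD):
    """List (cols, rows) grids that fit the device and the threshold,
    largest core count first."""
    grids = [
        (cols, rows)
        for cols in range(1, max_cols + 1)
        for rows in range(1, max_rows + 1)
        if stays_below_threshold(cols, rows, threshold)
    ]
    return sorted(grids, key=lambda g: g[0] * g[1], reverse=True)


if __name__ == "__main__":
    print("8x8 avoids throttle:", stays_below_threshold(8, 8))
    print("8x5 avoids throttle:", stays_below_threshold(8, 5))
    print("largest candidates:", candidate_grids()[:5])
```

Whether a larger unthrottled grid such as 8×6 beats 8×5 would need its own measurement with the perf test above.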
