Open
Summary
After the commit "Respect MM throttle level in minimal_matmul", the throttle is applied to the minimal_matmul operation. This affects all Linear layers in the CLIP encoder, since they run on a full 8×8 grid (64 cores), which is above the 48-core throttle threshold for WH. As a result, SDXL image generation is slowed down by ~8 ms.
Minimal matmul on 40 cores vs 64 cores experiment results
pytest models/experimental/stable_diffusion_xl_base/tests/test_sdxl_perf.py::test_sdxl_perf_device[test_sdxl_clip_encoder_1] models/experimental/stable_diffusion_xl_base/tests/test_sdxl_perf.py::test_sdxl_perf_device[test_sdxl_clip_encoder_2]
64 cores + throttle:
- encoder_1: AVG DEVICE KERNEL DURATION [ns] 14663277.0 is outside of expected range (12915873.57, 13309250.429999998)
- encoder_2: AVG DEVICE KERNEL DURATION [ns] 70378029.0 is outside of expected range (62319927.74, 64863598.26)
40 cores (no throttle, since this is below the threshold):
- encoder_1: AVG DEVICE KERNEL DURATION [ns] 11939170.0 is outside of expected range (12915873.57, 13309250.429999998)
- encoder_2: AVG DEVICE KERNEL DURATION [ns] 58080141.0 is outside of expected range (62319927.74, 64863598.26)
64 cores, no throttle:
- encoder_1: AVG DEVICE KERNEL DURATION [ns] 13163609.0
- encoder_2: AVG DEVICE KERNEL DURATION [ns] 63109927.0
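To make the comparison easier to read, here is a small sketch that turns the durations reported above into deltas against the 64-core, no-throttle run. The numbers are copied from the results in this issue; the helper name `delta_ms` is only illustrative, and the second 40-core value is assumed to belong to encoder_2 because its expected range matches encoder_2.

```python
# Average device kernel durations in ns, copied from the results above.
baseline = {"encoder_1": 13163609.0, "encoder_2": 63109927.0}    # 64 cores, no throttle
throttled = {"encoder_1": 14663277.0, "encoder_2": 70378029.0}   # 64 cores + throttle
# Assumption: the second 40-core line is encoder_2 (its expected range matches encoder_2).
cores_40 = {"encoder_1": 11939170.0, "encoder_2": 58080141.0}    # 40 cores


def delta_ms(run, ref):
    """Per-encoder difference (run - ref) in milliseconds; positive means slower."""
    return {name: (run[name] - ref[name]) / 1e6 for name in ref}


throttle_cost = delta_ms(throttled, baseline)
cores_40_gain = delta_ms(cores_40, baseline)

for label, deltas in (("64 cores + throttle", throttle_cost), ("40 cores", cores_40_gain)):
    total = sum(deltas.values())
    per_encoder = ", ".join(f"{name}: {value:+.2f} ms" for name, value in deltas.items())
    print(f"{label} vs 64 cores no throttle -> {per_encoder}, total: {total:+.2f} ms")
```

Under these numbers the throttled run is slower on both encoders, while the 40-core run is faster than the unthrottled 64-core run on both.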
Potential solutions
- Use a smaller core grid (≤48 cores) for CLIP encoder matmuls to stay below the throttle threshold. Initial testing with 8×5 (40 cores) shows it actually outperforms the 64-core setup.
- Pass explicit throttle_level=0 via compute_kernel_config for CLIP encoder operations, since we haven't observed hangs without throttle on these shapes.
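As a rough illustration of the first option, the sketch below checks whether a candidate core grid would cross the throttle threshold. It is plain Python and does not call any ttnn API; the constant and the function names are hypothetical, and the rule "throttle applies when the core count is above 48" is an assumption taken from the wording of this issue.

```python
# Assumption from this issue: on WH the throttle kicks in above 48 cores.
THROTTLE_CORE_THRESHOLD = 48


def core_count(grid):
    """Number of cores in an (x, y) grid."""
    x, y = grid
    return x * y


def is_throttled(grid, threshold=THROTTLE_CORE_THRESHOLD):
    """True if the grid uses more cores than the assumed throttle threshold."""
    return core_count(grid) > threshold


for grid in [(8, 8), (8, 6), (8, 5)]:
    state = "throttled" if is_throttled(grid) else "not throttled"
    print(f"{grid[0]}x{grid[1]} = {core_count(grid)} cores -> {state}")
```

With this rule, the current 8×8 grid is throttled, while 8×5 (the tested configuration) and 8×6 (exactly 48 cores) would stay at or below the threshold.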