Support collective matmul optimization in mp #8855

Open · wants to merge 4 commits into master
Conversation

yaochengji (Collaborator) commented:
This PR:

  • adds a new environment variable, ENABLE_COLLECTIVE_MATMUL_IN_MP, that turns on the XLA config required for the collective matmul optimization on TPU
  • adds channel_id and use_global_device_ids arguments to xm.all_gather and xm.reduce_scatter, which are also required to enable collective matmul (a usage sketch follows below)
  • adds a utility function to find the best ring order for v5p x 8 and v6 x 8
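
A minimal usage sketch of the new knobs, assuming they land as described in the bullets above. The point at which the environment variable is read and the exact keyword defaults are assumptions here, not confirmed by this PR:

```python
# Sketch only: argument names follow the PR description; the merged
# signatures and defaults may differ.
import os

# New in this PR: enables the XLA config required for collective matmul.
# Assumed to be read at process start, so set it before importing torch_xla.
os.environ["ENABLE_COLLECTIVE_MATMUL_IN_MP"] = "1"

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(128, 128, device=device)

# channel_id and use_global_device_ids are the new arguments this PR adds;
# both are needed so XLA can rewrite the all-gather + matmul pattern into
# a collective matmul.
gathered = xm.all_gather(x, dim=0, channel_id=1, use_global_device_ids=True)
```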

@tengyifei (Collaborator) left a comment:

question: do you think it's possible to test that collective matmul is actually triggered? Maybe we could look for the "decomposed_reduce_scatter_while" etc. signatures in the optimized HLO from the XLA dump?
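
One way to do such a check, as a sketch: setting XLA_FLAGS with --xla_dump_to is standard XLA debugging practice, but the dump path, file-name pattern, and the exact signature string below are assumptions rather than code from this PR.

```python
# Sketch of the HLO-dump check suggested above. The dump directory,
# file-name pattern, and signature string are assumptions.
import glob
import os

# Must be set before XLA initializes (i.e., before importing torch_xla).
os.environ["XLA_FLAGS"] = "--xla_dump_to=/tmp/hlo_dump"

# ... run the model step that should trigger collective matmul ...

found = any(
    "decomposed_reduce_scatter_while" in open(path).read()
    for path in glob.glob("/tmp/hlo_dump/*after_optimizations*.txt")
)
assert found, "collective matmul was not triggered in the optimized HLO"
```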

@yaochengji (Collaborator, Author) replied:
> question: do you think it's possible to test that collective matmul is actually triggered? Maybe we could look for the "decomposed_reduce_scatter_while" etc. signatures in the optimized HLO from the XLA dump?

Thanks for your review, Yifei! I need to wait for tomorrow's libtpu nightly to enable collective matmul.

@tengyifei (Collaborator) replied:

> I need to wait for tomorrow's libtpu nightly to enable collective matmul.

Gotcha. I think it would be useful to add a test once we roll in a new libtpu; otherwise it's hard to tell whether this worked. LGTM otherwise.
