Open
Description
🚀 The feature, motivation and pitch
Currently, int4_xpu_layout assumes that int_data is a 2D tensor, but in the MoE case int_data is 3D. As a result, calling _convert_weight_to_int4pack fails with an assertion error:
_convert_weight_to_int4pack_xpu : expect weight to be 2D tensor.
Solution
The relevant code in Int4XPULayout is at:
https://github.com/pytorch/ao/blob/1493b15f65917477ce37abb94365b262cc3b1d95/torchao/dtypes/uintx/int4_xpu_layout.py#L255-L260
We need the same logic as the CUDA layout, implemented along these lines:
def quant_2d(int_data_2d):
    ...
if int_data.dim() == 3:  # MoE case
    ...
else:  # normal case
    ...
See the CUDA code at:
https://github.com/pytorch/ao/blob/418593c0e903f2b76072cc75a3010b3ef5396a20/torchao/dtypes/uintx/tensor_core_tiled_layout.py#L289
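To make the proposed dispatch concrete, here is a minimal sketch of the 2D/3D branching described above. It is an illustration under stated assumptions, not the actual torchao implementation: pack_per_expert and toy_pack_2d are hypothetical names, and toy_pack_2d is only a stand-in for the real 2D packing step (it does not call the XPU kernel). The sketch also assumes the expert dimension comes first, and it does not cover any reshaping of scale and zero_point that a real layout would likely need.

```python
import torch


def pack_per_expert(int_data, pack_2d):
    """Apply a 2D packing function to a 2D weight, or to each expert of a 3D weight.

    `pack_2d` is a hypothetical stand-in for the backend-specific 2D packing step.
    """
    if int_data.dim() == 3:  # MoE case: assumed shape (num_experts, N, K)
        packed = [pack_2d(int_data[e]) for e in range(int_data.shape[0])]
        return torch.stack(packed, dim=0)
    # normal case
    assert int_data.dim() == 2, "expected a 2D or 3D weight"
    return pack_2d(int_data)


def toy_pack_2d(w):
    # Stand-in only: packs two 4-bit values into one byte along the last dim.
    # This is NOT the XPU int4 packing kernel.
    assert w.dim() == 2
    return ((w[:, ::2] << 4) | w[:, 1::2]).to(torch.uint8)


dense = torch.randint(0, 16, (8, 16), dtype=torch.int32)
moe = torch.randint(0, 16, (4, 8, 16), dtype=torch.int32)
print(pack_per_expert(dense, toy_pack_2d).shape)  # torch.Size([8, 8])
print(pack_per_expert(moe, toy_pack_2d).shape)    # torch.Size([4, 8, 8])
```

Because the 2D packing function only ever sees one expert slice at a time, the existing 2D assertion in the kernel would no longer be triggered by MoE weights.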
Affected Test Cases
Four test cases are affected in total:
cd ao/test/quantization
python -m pytest -sv -k test_int4wo test_moe_quant.py
FAILED test_moe_quant.py::TestMoEQuantCompile::test_int4wo_base_0_single_token
FAILED test_moe_quant.py::TestMoEQuantCompile::test_int4wo_base_1_multiple_tokens
FAILED test_moe_quant.py::TestMoEQuantCompile::test_int4wo_fake_dim_0_single_token
FAILED test_moe_quant.py::TestMoEQuantCompile::test_int4wo_fake_dim_1_multiple_tokens