Commit 7392aec
fix: route per-channel FP8 MoE to CompressedTensorsFp8MoEMethod
Per-channel (per_Token) FP8 quantization needs the per-channel weight scale
allocation [E, N, 1] that CompressedTensorsFp8MoEMethod provides, while
Fp8MoEMethod only allocates scalar-per-expert scales [E, 2]/[E].

- Add a dispatch case for quant_dtype == fp8 + quant_type == per_Token that
  routes to CompressedTensorsFp8MoEMethod
- Fix _load_per_channel_weight_scale to unsqueeze 1D checkpoint scales to
  match the 2D [N, 1] buffer shape

1 parent 097b7a8 · commit 7392aec
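The two changes above can be sketched as follows. This is a hypothetical, self-contained illustration, not the actual vLLM source: the dispatcher, the `load_per_channel_weight_scale` helper, and the `E`/`N` dimensions are assumptions modeled on the commit message, and NumPy stands in for the real tensor library.

```python
import numpy as np

E, N = 4, 8  # assumed expert count and per-expert output channels


def choose_moe_method(quant_dtype: str, quant_type: str) -> str:
    """Route per-channel FP8 MoE to the method that allocates [E, N, 1]
    weight scales; everything else keeps the scalar-per-expert path."""
    if quant_dtype == "fp8" and quant_type == "per_Token":
        return "CompressedTensorsFp8MoEMethod"
    return "Fp8MoEMethod"


def load_per_channel_weight_scale(buffer: np.ndarray, expert_id: int,
                                  loaded: np.ndarray) -> None:
    """Copy a checkpoint scale into the [E, N, 1] buffer, unsqueezing a
    1D [N] checkpoint tensor to the 2D [N, 1] slice shape first."""
    if loaded.ndim == 1:
        loaded = loaded[:, None]  # [N] -> [N, 1]
    assert loaded.shape == buffer[expert_id].shape
    buffer[expert_id] = loaded


# Per-channel FP8 weight scales: one scale per output channel per expert.
per_channel_scales = np.ones((E, N, 1), dtype=np.float32)
ckpt_scale = np.arange(N, dtype=np.float32)  # checkpoint stores 1D [N] scales
load_per_channel_weight_scale(per_channel_scales, 0, ckpt_scale)
```

Without the unsqueeze step, assigning a 1D `[N]` scale into a `[N, 1]` slot would either broadcast incorrectly or fail, which is the loader bug the second bullet fixes.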
1 file changed: +11 −0 lines changed

Diff (file contents not captured in this extract): 7 lines inserted at new line numbers 1987-1993 and 4 lines inserted at new line numbers 2110-2113.