Description
When performing activation equalization, we have a co-optimize
flag that is enabled in the LLM examples.
The idea is the following:
During activation equalization with FX, the equalization scaling factors are merged into the previous and subsequent layers.
Although the subsequent layer contributes to the computation of the scale factors, the previous one has no impact on their final value. As a result, the previous layer can be heavily disrupted by the equalization process.
To alleviate this, the co-optimize flag allows the previous layer's contribution to be weighted into the SmoothQuant scale computation.
This is very experimental and might require a more in-depth analysis.
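Below is a minimal PyTorch sketch of the idea, written against a generic SmoothQuant-style formula rather than the actual implementation: the function names, the `beta` blending parameter, and the geometric-mean rule for mixing in the previous layer's weight ranges are assumptions for illustration only.

```python
import torch

def smoothquant_scales(act_absmax, w_next, alpha=0.5):
    # Standard SmoothQuant-style scales: only the activation statistics and the
    # subsequent layer's weights enter the formula; the previous layer merely
    # absorbs 1/s on its output channels without influencing s.
    w_next_absmax = w_next.abs().amax(dim=0)      # per input channel of next layer
    return act_absmax.pow(alpha) / w_next_absmax.pow(1.0 - alpha)

def co_optimized_scales(act_absmax, w_next, w_prev, alpha=0.5, beta=0.5):
    # Illustrative co-optimized variant: the previous layer's per-output-channel
    # weight ranges are blended into the denominator, so dividing its outputs by
    # s cannot blow up (or crush) its weight ranges unchecked. The geometric-mean
    # blend controlled by `beta` is an assumption, not the rule used by the flag.
    w_next_absmax = w_next.abs().amax(dim=0)      # next layer, per input channel
    w_prev_absmax = w_prev.abs().amax(dim=1)      # previous layer, per output channel
    blended = w_next_absmax.pow(beta) * w_prev_absmax.pow(1.0 - beta)
    return act_absmax.pow(alpha) / blended.pow(1.0 - alpha)

def fold_scales(prev, nxt, s):
    # Fold the scales into both layers: the previous layer divides its output
    # channels by s, the subsequent layer multiplies its input channels by s.
    with torch.no_grad():
        prev.weight.div_(s.unsqueeze(1))
        if prev.bias is not None:
            prev.bias.div_(s)
        nxt.weight.mul_(s.unsqueeze(0))

# Toy usage: two linear layers sharing a 16-channel activation.
prev, nxt = torch.nn.Linear(8, 16), torch.nn.Linear(16, 32)
act_absmax = torch.rand(16) + 0.1                 # stand-in for calibration stats
fold_scales(prev, nxt, co_optimized_scales(act_absmax, nxt.weight, prev.weight))
```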
To reproduce, quantize OPT-125m with activation equalization (FX), with and without the co-optimize flag.
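A rough sketch of the comparison, assuming a CLI entrypoint for the LLM examples; the script path and flag names below are placeholders and may differ from the actual example's argument parser.

```python
import subprocess

# Baseline: activation equalization with FX graph mode, co-optimization off.
# "llm_examples/main.py", "--act-equalization" and "--co-optimize" are
# placeholder names; check the LLM example's argument parser for the real ones.
base_cmd = [
    "python", "llm_examples/main.py",
    "--model", "facebook/opt-125m",
    "--act-equalization", "fx",
]
subprocess.run(base_cmd, check=True)

# Same quantization run with the co-optimize behaviour enabled.
subprocess.run(base_cmd + ["--co-optimize"], check=True)
```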