Hi,
first of all, thank you for the very interesting paper!
I believe there is a mistake in the code that calculates the FLOPs for Mixtral. Mixtral is an 8x7B MoE model with 2 active experts per token.
Therefore, in the forward pass, only roughly 1/4 of the non-embedding parameters are actually used for computation, yet the model_size dict (which is used in all FLOP calculations) contains the full count of ~47B parameters (llm-threat-model/aggregate_csv.py, line 12 at commit 3de6c6a):

    "mixtral_8x7b": 46702792704,

This likely leads to inflated FLOP counts for results using the MoE model (e.g., PAIR, AutoDAN).
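
To make the gap concrete, here is a rough sketch of how the total and per-token active parameter counts could be derived. The class and function names (`MoEConfig`, `param_counts`) are hypothetical and not part of the repository, and the default config values are my assumption based on the publicly released Mixtral-8x7B configuration. It keeps the same convention as the existing entry (embeddings and output head included) so that the total can be compared against the number in the dict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    # Assumed values, taken from the public Mixtral-8x7B config (not from the repo).
    hidden_size: int = 4096
    intermediate_size: int = 14336
    num_layers: int = 32
    num_heads: int = 32
    num_kv_heads: int = 8
    vocab_size: int = 32000
    num_experts: int = 8
    experts_per_token: int = 2


def param_counts(cfg: MoEConfig) -> tuple[int, int]:
    """Return (total_params, active_params_per_token) for a Mixtral-style MoE."""
    head_dim = cfg.hidden_size // cfg.num_heads
    kv_dim = head_dim * cfg.num_kv_heads

    # Attention: q_proj and o_proj are hidden x hidden; k_proj and v_proj use grouped KV heads.
    attn = 2 * cfg.hidden_size * cfg.hidden_size + 2 * cfg.hidden_size * kv_dim
    # One expert: three bias-free projections (gate, up, down).
    expert = 3 * cfg.hidden_size * cfg.intermediate_size
    # Router: one linear layer scoring every expert.
    router = cfg.hidden_size * cfg.num_experts
    # Two RMSNorm weight vectors per layer.
    norms = 2 * cfg.hidden_size

    shared_per_layer = attn + router + norms
    # Input embedding plus untied output head, plus the final norm.
    outside_layers = 2 * cfg.vocab_size * cfg.hidden_size + cfg.hidden_size

    total = cfg.num_layers * (shared_per_layer + cfg.num_experts * expert) + outside_layers
    # Only the routed experts contribute to per-token compute.
    active = cfg.num_layers * (shared_per_layer + cfg.experts_per_token * expert) + outside_layers
    return total, active


if __name__ == "__main__":
    total, active = param_counts(MoEConfig())
    print(f"total:  {total:,}")
    print(f"active: {active:,}")
    print(f"ratio:  {total / active:.2f}x")
```

Under these assumptions the total reproduces the 46702792704 in the dict, while the active count comes out at about 12.9B, so FLOP estimates based on the full count would be roughly 3.6x too high. The exact factor depends on whether the embedding lookup should count toward FLOPs at all, which I have not tried to settle here.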
Best
Tim