Incorrect active parameter count for MoE models #1

@TimFelixBeyer

Hi,

first of all, thank you for the very interesting paper!

I believe there is a mistake in the code that calculates the FLOPs for Mixtral. Mixtral is an 8x7B MoE model with 2 active experts per token.
Therefore, in the forward pass, only 2 of the 8 expert feed-forward blocks in each layer are used for a given token (roughly 13B active parameters, since the attention weights are shared across experts), yet the model_size dict (which is used in all FLOP calculations) contains the full count of ~47B parameters:

"mixtral_8x7b": 46702792704,

This likely leads to inflated FLOP counts for results using the MoE model (e.g., PAIR, AutoDAN).
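
For reference, here is a rough sketch of how I would count total vs. active parameters. The config values are my assumption, taken from the publicly released Mixtral-8x7B config rather than from this repo, and `MoEConfig` and `count_params` are hypothetical helpers, not functions from your code. With these values the total reproduces the number in the model_size dict exactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    # Assumed values from the public Mixtral-8x7B config (not from this repo).
    vocab_size: int = 32000
    hidden_size: int = 4096
    intermediate_size: int = 14336
    num_layers: int = 32
    num_heads: int = 32
    num_kv_heads: int = 8
    num_experts: int = 8
    experts_per_token: int = 2


def count_params(cfg: MoEConfig, experts_used: int) -> int:
    """Count parameters when `experts_used` experts per layer are included."""
    h = cfg.hidden_size
    kv_dim = (h // cfg.num_heads) * cfg.num_kv_heads
    attn = 2 * h * h + 2 * h * kv_dim        # q/o projections plus k/v projections (GQA)
    expert = 3 * h * cfg.intermediate_size   # gate, up and down projections of one expert
    router = h * cfg.num_experts             # routing layer, always evaluated in full
    norms = 2 * h                            # two RMSNorm weight vectors per layer
    per_layer = attn + experts_used * expert + router + norms
    embeddings = 2 * cfg.vocab_size * h      # untied input embedding and LM head
    final_norm = h
    return cfg.num_layers * per_layer + embeddings + final_norm


cfg = MoEConfig()
total_params = count_params(cfg, cfg.num_experts)
active_params = count_params(cfg, cfg.experts_per_token)

# The total matches the "mixtral_8x7b" entry quoted above.
assert total_params == 46702792704

print(f"total:  {total_params:,}")
print(f"active: {active_params:,}")
print(f"ratio:  {total_params / active_params:.2f}")
```

If the FLOP estimates scale linearly with the model_size entry, using the active count instead would reduce the Mixtral numbers by roughly 3.6x. Whether embedding and LM head weights should be included is a separate choice that depends on the FLOP formula you use.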

Best
Tim
