Conversation

@ashwinb (Contributor) commented on Apr 5, 2025

As the title says. Details in README.

ashwinb and others added 4 commits April 5, 2025 10:28
MERGED TILL 2d03d1d7686f6efc0488b20b7cf5ea7ca2c8ae12
Updating readme to have llama cookbook link for marketing comm + HF updates
@facebook-github-bot added the CLA Signed label Apr 5, 2025
@ashwinb merged commit 5fdf831 into main Apr 5, 2025 (1 of 2 checks passed)
@ashwinb deleted the release-0.2.0 branch April 5, 2025 18:53

class MoE(torch.nn.Module):
    """
    This EC implementation is modified from the original EC module.

Even though 2b2e5b2 tried to clean this docstring up, the current state in main is not correct:

Tensors used in this module are annotated with the suffixes that indicate the shape of the tensor.
Several commonly used annotations include:
- a: bsz*slen
- E: number of experts
- e: number of local experts per ep (n_experts/ep)
- D: hidden dimension
- d: D/tp
- F: model dimension
- G: number of tokens per expert (a * capacity_factor / E)
- g: number of tokens per expert per TP rank (i.e., G/TP)
Examples:
x_aD [a, D]
routed_in_etG_D [et*G, D]
x_eGD: [e, G, D]
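
To make the convention concrete, here is a minimal sketch of the shapes these (expert-choice) annotations imply. All sizes below are hypothetical, chosen only for illustration; the point is that under EC every expert takes a fixed G tokens, so the routed tensor's leading dimension is E * G regardless of the router's decisions.

    import torch

    # Hypothetical sizes, purely for illustration.
    bsz, slen, D, E = 2, 8, 16, 4
    capacity_factor = 1.0

    a = bsz * slen                          # a: bsz * slen
    G = int(a * capacity_factor / E)        # G: tokens per expert under EC

    x_aD = torch.randn(a, D)                # x_aD: [a, D]
    routed_in_EG_D = torch.randn(E * G, D)  # routed_in_EG_D: [E*G, D]
    print(x_aD.shape, routed_in_EG_D.shape)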

  1. I think that capacity_factor and top_k in MoEArgs should have been combined into a single parameter. The comment for capacity_factor incorrectly describes EC, not TC (see the sketch after this list).

    class MoEArgs(BaseModel):
        num_experts: int = -1
        capacity_factor: float = 1.0  # capacity factor determines how many tokens each expert can choose
        auto_scale_F: bool = (  # noqa: N815
            True  # if true, rescales hidden_dim such that number of activated params is same as equivalent dense layer
        )
        top_k: int = 1

  2. The annotations G and g as described are for EC, not TC, so annotations like routed_in_EG_D are incorrect. Also, the token dimension is not sharded coming out of the experts' forward pass, so the annotation routed_out_eg_D is incorrect.

    routed_out_eg_D = self.experts(routed_in_EG_D.detach())
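
To make both points concrete, here is a minimal token-choice sketch. The sizes and the ToyExperts module are hypothetical stand-ins, not this repo's code: under TC the number of token-to-expert assignments is a * top_k, a per-expert capacity bound naturally combines top_k with capacity_factor, and the experts' forward is a per-assignment map [N, D] -> [N, D], so the token dimension of the output matches the input rather than being sharded by the experts.

    import math
    import torch
    import torch.nn as nn

    class ToyExperts(nn.Module):
        """Toy stand-in for an experts module: a per-assignment map [N, D] -> [N, D]."""

        def __init__(self, num_experts: int, dim: int):
            super().__init__()
            self.w = nn.Parameter(torch.randn(num_experts, dim, dim) * 0.02)

        def forward(self, x_ND: torch.Tensor, expert_ids_N: torch.Tensor) -> torch.Tensor:
            # Each assignment is multiplied by its expert's weight; N is preserved.
            return torch.einsum("nd,ndf->nf", x_ND, self.w[expert_ids_N])

    # Hypothetical sizes, purely for illustration.
    a, top_k, E, D = 16, 2, 4, 8
    capacity_factor = 1.0

    # Point 1: under TC a per-expert capacity bound combines top_k and capacity_factor.
    capacity = math.ceil(a * top_k * capacity_factor / E)  # max assignments one expert keeps

    # Point 2: the experts' forward keeps the token/assignment dimension N = a * top_k.
    N = a * top_k
    x_ND = torch.randn(N, D)
    expert_ids_N = torch.randint(0, E, (N,))
    out_ND = ToyExperts(E, D)(x_ND, expert_ids_N)
    assert out_ND.shape == x_ND.shape  # token dimension unchanged by the experts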

Since this is the reference implementation, I hope that some more care can be taken in polishing it. Previously, repos like torchtitan used the reference impl as the starting point for their own model definitions, and others in the open-source community may read it to learn from as well. Thanks!
