feat: introduce llama4 support #299
Conversation
MERGED TILL 2d03d1d7686f6efc0488b20b7cf5ea7ca2c8ae12
Updating readme to have llama cookbook link for marketing comm + HF updates
```python
class MoE(torch.nn.Module):
    """
    This EC implementation is modified from the original EC module.
```
Even though 2b2e5b2 tried to clean this docstring up, the current state in main is not correct:
llama-models/models/llama4/moe.py
Lines 104 to 118 in 63172b3
```
Tensors used in this module are annotated with the suffixes that indicate the shape of the tensor.
Several commonly used annotations include:
- a: bsz*slen
- E: number of experts
- e: number of local experts per ep (n_experts/ep)
- D: hidden dimension
- d: D/tp
- F: model dimension
- G: number of tokens per expert (a * capacity_factor / E)
- g: number of tokens per expert per TP rank (i.e., G/TP)
Examples:
x_aD [a, D]
routed_in_etG_D [et*G, D]
x_eGD: [e, G, D]
```
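For concreteness, here is what those suffixes evaluate to under the docstring's own formulas. The sizes below are toy numbers made up for this comment, not anything from the repo:

```python
# Toy sizes, purely to illustrate the suffix conventions above.
bsz, slen = 2, 2048
a = bsz * slen                    # a: bsz * slen
E, ep, tp = 16, 4, 2
e = E // ep                       # e: local experts per EP group (n_experts / ep)
D = 5120                          # D: hidden dimension
d = D // tp                       # d: D / tp
capacity_factor = 1.0
G = int(a * capacity_factor / E)  # G: tokens per expert -- note this is the EC formula
g = G // tp                       # g: tokens per expert per TP rank

# e.g. x_aD has shape [a, D] and x_eGD has shape [e, G, D]
```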
I think that `capacity_factor` and `top_k` in `MoEArgs` should have been combined to be the same thing. The comment for `capacity_factor` incorrectly describes EC, not TC.
llama-models/models/llama4/args.py
Lines 29 to 35 in 63172b3
```python
class MoEArgs(BaseModel):
    num_experts: int = -1
    capacity_factor: float = 1.0  # capacity factor determines how many tokens each expert can choose
    auto_scale_F: bool = (  # noqa: N815
        True  # if true, rescales hidden_dim such that number of activated params is same as equivalent dense layer
    )
    top_k: int = 1
```
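To illustrate why these two fields describe different schemes, here is a minimal sketch with generic routing code written for this comment (toy sizes and made-up names, not the repo's implementation):

```python
import torch

a, E, top_k, capacity_factor = 64, 4, 1, 1.0
scores = torch.randn(a, E)  # router logits for each token/expert pair

# Expert choice (what the capacity_factor comment describes): each expert
# chooses a fixed G = a * capacity_factor / E tokens.
G = int(a * capacity_factor / E)
ec_chosen = scores.t().topk(G, dim=-1).indices  # [E, G], fixed per-expert count

# Token choice (what top_k describes): each token chooses its top_k experts,
# so the per-expert token count is data dependent, not a fixed G.
tc_chosen = scores.topk(top_k, dim=-1).indices  # [a, top_k]
print(torch.bincount(tc_chosen.flatten(), minlength=E))  # varies with the input
```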
The annotations `G` and `g` as described are for EC, not TC. This means that annotations like `routed_in_EG_D` are incorrect. The token dimension is not sharded coming out of the experts forward, so the annotation `routed_out_eg_D` is incorrect.
llama-models/models/llama4/moe.py
Line 201 in 63172b3
```python
routed_out_eg_D = self.experts(routed_in_EG_D.detach())
```
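To spell out that last point, here is a minimal shape sketch with a hypothetical grouped-MLP forward (made-up names and sizes, not the actual Experts module): the token dimension coming out of the experts is the same one that went in, so the output annotation should match the input's.

```python
import torch

E, G, D, F = 4, 16, 32, 64
w_in = torch.randn(E, D, F)   # per-expert up projection
w_out = torch.randn(E, F, D)  # per-expert down projection

routed_in_EG_D = torch.randn(E * G, D)
h_EGF = torch.bmm(routed_in_EG_D.view(E, G, D), w_in)  # [E, G, F]
out_EGD = torch.bmm(torch.relu(h_EGF), w_out)          # [E, G, D]
routed_out_EG_D = out_EGD.view(E * G, D)               # token dimension unchanged
```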
Since this is the reference implementation, I hope that some more care can be taken in polishing it. Previously, repos like torchtitan used the reference impl as the starting point for their own model definitions, and others in open source may read it to learn from as well. Thanks!
As the title says. Details in README.