Conversation

@ashwinb (Contributor) commented on Apr 5, 2025

As the title says. Details in README.

ashwinb and others added 4 commits April 5, 2025 10:28
MERGED TILL 2d03d1d7686f6efc0488b20b7cf5ea7ca2c8ae12
Updating readme to have llama cookbook link for marketing comm + HF updates
@facebook-github-bot added the CLA Signed label Apr 5, 2025
@ashwinb merged commit 5fdf831 into main Apr 5, 2025 (1 of 2 checks passed)
@ashwinb deleted the release-0.2.0 branch April 5, 2025 18:53

class MoE(torch.nn.Module):
    """
    This EC implementation is modified from the original EC module.

Even though 2b2e5b2 tried to clean this docstring up, the current state in main is not correct:

Tensors used in this module are annotated with the suffixes that indicate the shape of the tensor.
Several commonly used annotations include:
- a: bsz*slen
- E: number of experts
- e: number of local experts per ep (n_experts/ep)
- D: hidden dimension
- d: D/tp
- F: model dimension
- G: number of tokens per expert (a * capacity_factor / E)
- g: number of tokens per expert per TP rank (i.e., G/TP)
Examples:
x_aD [a, D]
routed_in_etG_D [et*G, D]
x_eGD: [e, G, D]
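
To make the convention concrete, here is a minimal sketch of the shapes these (expert-choice) annotations imply. All sizes below are hypothetical, chosen only for illustration; the point is that under EC every expert takes a fixed G tokens, so the routed tensor's leading dimension is E * G regardless of the router's decisions.

    import torch

    # Hypothetical sizes, purely for illustration.
    bsz, slen, D, E = 2, 8, 16, 4
    capacity_factor = 1.0

    a = bsz * slen                          # a: bsz * slen
    G = int(a * capacity_factor / E)        # G: tokens per expert under EC

    x_aD = torch.randn(a, D)                # x_aD: [a, D]
    routed_in_EG_D = torch.randn(E * G, D)  # routed_in_EG_D: [E*G, D]
    print(x_aD.shape, routed_in_EG_D.shape)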

  1. I think that capacity_factor and top_k in MoEArgs should have been combined into a single parameter. The comment for capacity_factor incorrectly describes EC, not TC (see the sketch after this list).

    class MoEArgs(BaseModel):
        num_experts: int = -1
        capacity_factor: float = 1.0  # capacity factor determines how many tokens each expert can choose
        auto_scale_F: bool = (  # noqa: N815
            True  # if true, rescales hidden_dim such that number of activated params is same as equivalent dense layer
        )
        top_k: int = 1

  2. The annotations G and g as described are for EC, not TC, so annotations like routed_in_EG_D are incorrect. Also, the token dimension is not sharded coming out of the experts' forward pass, so the annotation routed_out_eg_D is incorrect.

    routed_out_eg_D = self.experts(routed_in_EG_D.detach())
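
To make both points concrete, here is a minimal token-choice sketch. The sizes and the ToyExperts module are hypothetical stand-ins, not this repo's code: under TC the number of token-to-expert assignments is a * top_k, a per-expert capacity bound naturally combines top_k with capacity_factor, and the experts' forward is a per-assignment map [N, D] -> [N, D], so the token dimension of the output matches the input rather than being sharded by the experts.

    import math
    import torch
    import torch.nn as nn

    class ToyExperts(nn.Module):
        """Toy stand-in for an experts module: a per-assignment map [N, D] -> [N, D]."""

        def __init__(self, num_experts: int, dim: int):
            super().__init__()
            self.w = nn.Parameter(torch.randn(num_experts, dim, dim) * 0.02)

        def forward(self, x_ND: torch.Tensor, expert_ids_N: torch.Tensor) -> torch.Tensor:
            # Each assignment is multiplied by its expert's weight; N is preserved.
            return torch.einsum("nd,ndf->nf", x_ND, self.w[expert_ids_N])

    # Hypothetical sizes, purely for illustration.
    a, top_k, E, D = 16, 2, 4, 8
    capacity_factor = 1.0

    # Point 1: under TC a per-expert capacity bound combines top_k and capacity_factor.
    capacity = math.ceil(a * top_k * capacity_factor / E)  # max assignments one expert keeps

    # Point 2: the experts' forward keeps the token/assignment dimension N = a * top_k.
    N = a * top_k
    x_ND = torch.randn(N, D)
    expert_ids_N = torch.randint(0, E, (N,))
    out_ND = ToyExperts(E, D)(x_ND, expert_ids_N)
    assert out_ND.shape == x_ND.shape  # token dimension unchanged by the experts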

Since this is the reference implementation, I hope that some more care can be taken in polishing it. Previously, repos like torchtitan used the reference impl as the starting point for their own model definitions, and others in the open-source community may read it to learn from as well. Thanks!
