This PR adds two new pieces of functionality: LoRA and MoE adapters conditioned on task-specific embeddings. These are meant to be learned during finetuning.
- **LoRA adapters**: a drop-in replacement for the attention output projection weights, conditioned on downstream tasks. This is meant to be used during finetuning. From this paper it seems like finetuning only these projection weights gets you most of the way there compared to full finetuning. Given a task embedding `E` for each downstream task, the `TaskLoRALinear` layer computes a LoRA update (i.e. two matrices `A: D x r`, `B: r x D`) whose product is added to the original projection weights. Both `A` and `B` are computed via an MLP on top of `E`. This MLP is shared across tasks but different per layer, and is learned directly during finetuning.
- **MoE adapters**: instead of computing `FFN(x)` in each Transformer block, adds a soft MoE adapter so that the pre-LayerScale output is `Linear(FFN(x) + MoE(x))`. Optionally, MoE adapters compute expert combine weights (i.e., deciding which experts to use per token) by conditioning on batch-level task embeddings instead of on token-level embeddings.
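To make the LoRA mechanism concrete, here is a minimal PyTorch sketch of a task-conditioned `TaskLoRALinear`. The class name matches the PR, but every shape, the MLP architecture, and the hypernetwork layout are assumptions for illustration, not the actual implementation:

```python
import torch
import torch.nn as nn


class TaskLoRALinear(nn.Module):
    """Linear projection whose weight gets a task-conditioned low-rank
    (LoRA) update. Shapes and the hyper-MLP layout are assumptions."""

    def __init__(self, dim: int, rank: int, task_emb_dim: int):
        super().__init__()
        self.dim, self.rank = dim, rank
        # original projection weights (would typically be frozen during finetuning)
        self.proj = nn.Linear(dim, dim)
        # per-layer MLP (shared across tasks) mapping a task embedding E
        # to the flattened LoRA factors A (dim x rank) and B (rank x dim)
        self.hyper = nn.Sequential(
            nn.Linear(task_emb_dim, 4 * task_emb_dim),
            nn.GELU(),
            nn.Linear(4 * task_emb_dim, 2 * dim * rank),
        )

    def forward(self, x: torch.Tensor, task_emb: torch.Tensor) -> torch.Tensor:
        # task_emb: (task_emb_dim,) embedding of the current downstream task
        ab = self.hyper(task_emb)
        A = ab[: self.dim * self.rank].view(self.dim, self.rank)
        B = ab[self.dim * self.rank:].view(self.rank, self.dim)
        delta = A @ B  # (dim, dim) low-rank update to the projection weight
        # equivalent to applying (W + delta) since nn.Linear computes x @ W.t()
        return self.proj(x) + x @ delta.t()


# toy usage with assumed sizes
dim, rank, task_emb_dim = 16, 4, 8
layer = TaskLoRALinear(dim, rank, task_emb_dim)
out = layer(torch.randn(2, 5, dim), torch.randn(task_emb_dim))
print(tuple(out.shape))  # (2, 5, 16)
```

Because the hyper-MLP rather than `A` and `B` themselves is the learned object, a single set of per-layer parameters can produce a different low-rank update for every task embedding.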
The actual MoE implementation is mostly from this reference implementation and can be found in `helios.nn.moe`.

To accommodate these changes, there are a few extra arguments added to `EncoderConfig`, `Encoder`, `FlexiHeliosBase`, etc., all the way down to the base `helios.nn.attention.Attention` layers. Additionally, there is a new argument `task_emb` added to the forward pass of `Encoder`. I considered subclassing `Encoder` (i.e. something like `EncoderWithTaskEmbeds`) but decided it was simpler to just add a few extra arguments directly.