Note: this is all based on the assumption that the torchtune tensor_parallelism implementation applies sequence parallelism. If that's not the case, please let me know.
The standard torchtune implementation of tensor parallelism works well; I've been able to apply it smoothly to the 70B Llamas. However, I'm looking to implement a version that works for Qwen LoRA. Any clues as to how to correctly divide up the LoRA adapters and the weight matrices they are applied to?
Referencing the equation $W' = W + BA$, I was thinking something along the lines of: "If W is row-divided, then row-divide B. If W is column-divided, then column-divide A." Or do I just want to apply the tensor parallelism to W and leave A and B alone?
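To make the split I have in mind concrete, here's a quick single-process sanity check of the algebra in plain PyTorch. Nothing here is torchtune-specific; the shapes and the `tp` count are just illustrative, and `chunk` stands in for the actual sharding:

```python
import torch

torch.manual_seed(0)
in_dim, out_dim, rank, tp = 16, 12, 4, 2

W = torch.randn(out_dim, in_dim)   # base weight, y = x @ W.T
A = torch.randn(rank, in_dim)      # LoRA A (rank x in_dim)
B = torch.randn(out_dim, rank)     # LoRA B (out_dim x rank)
x = torch.randn(3, in_dim)

full = x @ (W + B @ A).T

# Row-split case: W and B are split along their rows (the output dim), A is
# replicated. Each "rank" produces a slice of the output, and concatenating
# the slices recovers the full result.
W_rows = W.chunk(tp, dim=0)
B_rows = B.chunk(tp, dim=0)
row_split = torch.cat(
    [x @ (Wi + Bi @ A).T for Wi, Bi in zip(W_rows, B_rows)], dim=-1
)
assert torch.allclose(full, row_split, atol=1e-5)

# Column-split case: W and A are split along their columns (the input dim), B is
# replicated. Each "rank" sees only its slice of the input and produces a partial
# sum over the full output dim; since B(sum_i A_i x_i) = sum_i B A_i x_i, the
# reduction that already follows an input-dim-sharded matmul also sums the LoRA
# contributions.
W_cols = W.chunk(tp, dim=1)
A_cols = A.chunk(tp, dim=1)
x_cols = x.chunk(tp, dim=-1)
col_split = sum(
    xi @ (Wi + B @ Ai).T for xi, Wi, Ai in zip(x_cols, W_cols, A_cols)
)
assert torch.allclose(full, col_split, atol=1e-4)
```

If that algebra is the right mental model, then the plan would be: shard B wherever W's output dim is sharded, shard A wherever W's input dim is sharded, and keep the other adapter replicated. But I'd like to confirm that before wiring it into the recipe.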
I'm trying to work with longer sequences here without offloading activations, so the benefits of sequence parallelism are what I am most interested in.