1. Motivation
Training on sequences of length 16k currently fails with out-of-memory (OOM) errors (see #112). To support sequences of 100k+ tokens, we need efficient context parallelism (CP). According to https://arxiv.org/abs/2405.07719, USP, a hybrid of Ulysses and Ring Attention, outperforms either approach used alone, which makes it our top choice.
2. Proposal
Integrate USP into SpecForge. This hybrid approach combines:
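Ulysses: Shards attention heads across devices with all-to-all communication, which offers better performance; its parallel degree is capped by the number of attention heads
Ring Attention: Passes key/value blocks between devices in a ring, which enables longer sequence lengths because its parallel degree is not tied to the head count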
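As a rough sketch of how the two schemes could be combined, the snippet below splits the context-parallel ranks into a 2D layout: Ulysses groups over contiguous ranks and Ring Attention groups across them. The function name `build_usp_groups`, its parameters, and the rank layout are illustrative assumptions, not existing SpecForge or USP APIs; a real integration would create the corresponding communication groups from these rank lists.

```python
from typing import List, Tuple


def build_usp_groups(
    world_size: int, ulysses_degree: int, num_heads: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Hypothetical helper: split CP ranks into Ulysses and Ring groups.

    Assumed layout: contiguous ranks form a Ulysses group (all-to-all),
    and ranks with the same position in each Ulysses group form a Ring group.
    """
    if ulysses_degree < 1 or world_size % ulysses_degree != 0:
        raise ValueError("ulysses_degree must evenly divide world_size")
    # Ulysses shards attention heads, so its degree must divide the head count.
    if num_heads % ulysses_degree != 0:
        raise ValueError("ulysses_degree must evenly divide num_heads")

    ring_degree = world_size // ulysses_degree

    # One Ulysses group per ring position: [0..U-1], [U..2U-1], ...
    ulysses_groups = [
        list(range(r * ulysses_degree, (r + 1) * ulysses_degree))
        for r in range(ring_degree)
    ]
    # One Ring group per Ulysses position: ranks strided by ulysses_degree.
    ring_groups = [
        list(range(u, world_size, ulysses_degree))
        for u in range(ulysses_degree)
    ]
    return ulysses_groups, ring_groups


if __name__ == "__main__":
    ulysses, ring = build_usp_groups(world_size=8, ulysses_degree=4, num_heads=32)
    print("Ulysses groups:", ulysses)
    print("Ring groups:", ring)
```

With 8 ranks, `ulysses_degree=4` leaves a ring degree of 2, so each rank belongs to exactly one group of each kind. Raising the ring degree is what lets the total parallel degree grow past the head-count limit of Ulysses alone.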
3. Expected Benefits
Enable 100k+ sequence training without OOM
Maintain computational efficiency
Preserve model accuracy at scale