Tiny-Megatron is a minimalistic, educational re-implementation of the Megatron-LM library for distributed deep learning. This project provides clean, understandable implementations of various parallelism strategies used in large-scale language model training.
- Tensor Parallelism (TP): Split individual layers across multiple devices
- Data Parallelism (DP): Replicate model across devices, shard data batches
- 2D Hybrid Parallelism: Combine TP and DP for scalable multi-GPU training
- Custom Neural Network Modules: Optimized implementations of Linear, Embedding, LayerNorm
- Automatic Kernel Selection: Runtime auto-tuner for optimal performance
- Flexible Parallel Context: Easy configuration of multi-dimensional parallelism
- Wrapper-Based Design: Non-intrusive parallelization of existing models
- Clean, Readable Code: Well-documented implementations for learning
- Modular Architecture: Each parallelism strategy is independently implemented
- Complete Examples: Full training scripts demonstrating each approach
```
Tiny-Megatron/
├── tiny_megatron/core/          # Core Library
│   ├── dist/                    # Distributed Parallelism
│   │   ├── tp/                  # • Tensor Parallelism (TP)
│   │   ├── dp/                  # • Data Parallelism (DP)
│   │   ├── hybrid/              # • 2D Hybrid Parallelism (TP + DP)
│   │   └── utils/               # • Communication utilities
│   ├── module/                  # Custom NN Modules
│   │   ├── linear.py            # • Optimized Linear layers
│   │   ├── embedding.py         # • Embedding layers
│   │   ├── normalization.py     # • LayerNorm implementation
│   │   └── ops/                 # • Low-level operations
│   └── autotuner/               # Performance Optimization
│       └── runtime_tuner.py     # • Automatic kernel selection
│
└── example/                     # Training Examples
    ├── model.py                 # • GPT-2 model implementation
    ├── tp/train.py              # • Tensor parallelism demo
    ├── dp/train.py              # • Data parallelism demo
    └── hybrid/train.py          # • 2D hybrid parallelism demo
```
| Component | Purpose | Key Files |
|---|---|---|
| Distributed Parallelism | Core parallel strategies | dist/{tp,dp,hybrid}/ |
| Custom Modules | Optimized NN building blocks | module/{linear,embedding}.py |
| ParallelContext | Multi-dimensional coordination | dist/utils/comm.py |
| Auto-tuner | Performance optimization | autotuner/runtime_tuner.py |
| Examples | Complete training demos | example/{tp,dp,hybrid}/ |
- Python 3.8+
- PyTorch 2.0+ with CUDA support
- NCCL for multi-GPU communication
```bash
git clone https://github.com/liangyuwang/Tiny-Megatron.git
cd Tiny-Megatron
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install tqdm
```

```bash
# Split model layers across 2 GPUs
torchrun --nproc_per_node=2 example/tp/train.py

# Replicate model, distribute data batches
torchrun --nproc_per_node=2 example/dp/train.py

# Combine TP and DP: TP=2 x DP=2
torchrun --nproc_per_node=4 example/hybrid/train.py
```

```python
import torch
from tiny_megatron.core import ParallelContext, apply_tensor_parallel
from example.model import GPT2Model, GPTConfig

# Initialize distributed environment
# ... (distributed setup code)

# Create model and parallel context
config = GPTConfig()
model = GPT2Model(config).cuda()

# Configure parallelism
parallel_config = {"tp": 2}  # Use 2 GPUs for tensor parallelism
context = ParallelContext(parallel_config)

# Apply tensor parallelism
tp_config = {
    "column_linear_names": ["attn.c_attn", "mlp.c_fc"],
    "row_linear_names": ["attn.c_proj", "mlp.c_proj"],
}
model = apply_tensor_parallel(
    model=model,
    parallel_context=context,
    tp_config=tp_config,
)

# Train normally
optimizer = torch.optim.AdamW(model.parameters())
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()
    optimizer.step()
```

```python
from tiny_megatron.core import ParallelContext, apply_hybrid_parallel

# Configure 2D parallelism for 4 GPUs
parallel_config = {
    "tp": 2,  # 2-way tensor parallelism
    "dp": 2,  # 2-way data parallelism
}
context = ParallelContext(parallel_config)

# Apply 2D hybrid parallelism
tp_config = {
    "column_linear_names": ["attn.c_attn", "mlp.c_fc"],
    "row_linear_names": ["attn.c_proj", "mlp.c_proj"],
}
model = apply_hybrid_parallel(
    model=model,
    parallel_context=context,
    tp_config=tp_config,
)
```

- Column Parallel: Split weight matrices column-wise (e.g., attention projections)
- Row Parallel: Split weight matrices row-wise (e.g., MLP layers)
- Communication: All-gather for activations, all-reduce for gradients
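The column/row composition can be checked numerically. The sketch below is a minimal NumPy illustration with no GPUs or real collectives — the "all-reduce" is just a Python sum over the per-rank partial results:

```python
# Illustration: column-parallel then row-parallel linear layers compose so
# that a single all-reduce (here, a plain sum) reconstructs the output.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))      # activations
W1 = rng.standard_normal((8, 16))    # first linear, split column-wise
W2 = rng.standard_normal((16, 8))    # second linear, split row-wise

# Column parallel: each of 2 "ranks" holds half of W1's columns.
W1_shards = np.split(W1, 2, axis=1)
# Row parallel: each rank holds the matching half of W2's rows.
W2_shards = np.split(W2, 2, axis=0)

# Each rank computes its partial result locally with no communication...
partials = [(X @ w1) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]
# ...and one all-reduce (a sum) yields the full output.
Y_parallel = sum(partials)

Y_reference = (X @ W1) @ W2
assert np.allclose(Y_parallel, Y_reference)
```

This is why the column-then-row pairing (`c_attn`/`c_proj`, `c_fc`/`c_proj`) is used: the intermediate activation never needs to be gathered, only the final output is reduced.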
- Model Replication: Same model on each device
- Data Sharding: Different data batches per device
- Gradient Synchronization: All-reduce after backward pass
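The gradient synchronization step can also be verified in miniature. This NumPy sketch stands in for real DP: two "ranks" hold identical weights, each computes a gradient on its batch shard, and averaging (the all-reduce) recovers the full-batch gradient:

```python
# Illustration: averaged per-shard gradients equal the full-batch gradient.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 2))                  # identical replica weights
X = rng.standard_normal((8, 4))
Y = rng.standard_normal((8, 2))

# Shard the batch across 2 "ranks" (data sharding).
shards = zip(np.split(X, 2), np.split(Y, 2))
# Each rank computes a local MSE gradient on its shard...
grads = [2 * x.T @ (x @ W - y) / len(x) for x, y in shards]
# ...then an all-reduce averages them, keeping replicas in sync.
g_sync = sum(grads) / 2

g_full = 2 * X.T @ (X @ W - Y) / len(X)
assert np.allclose(g_sync, g_full)
```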
- Combined Strategy: Tensor Parallelism (TP) + Data Parallelism (DP)
- Flexible Configuration: Support various TP and DP combinations
- Efficient Scaling: Optimal resource utilization for medium-scale training
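For intuition, a 4-GPU TP=2 x DP=2 job can be pictured as a rank grid. The mapping below assumes a Megatron-style ordering where TP is the fastest-varying dimension; the actual `ParallelContext` layout may differ:

```python
# Hypothetical rank layout for TP=2 x DP=2 (4 global ranks).
tp_size, dp_size = 2, 2

# Assumption: TP is the inner (fastest-varying) dimension.
grid = {rank: (rank % tp_size, rank // tp_size)   # (tp_rank, dp_rank)
        for rank in range(tp_size * dp_size)}

for rank, (tp_rank, dp_rank) in grid.items():
    print(f"global rank {rank} -> tp_rank {tp_rank}, dp_rank {dp_rank}")
```

Ranks sharing a `dp_rank` split the same layer's weights (TP group); ranks sharing a `tp_rank` hold identical shards and average gradients (DP group).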
Central coordination for multi-dimensional parallelism:
```python
context = ParallelContext({
    "tp": tensor_parallel_size,
    "dp": data_parallel_size,
})
```

Optimized implementations with built-in parallelism support:

- `Linear`: Matrix multiplication with automatic kernel selection
- `Embedding`: Token/position embeddings
- `LayerNorm`: Layer normalization
Automatic selection of optimal kernels:
```python
tuner = RuntimeAutoTuner(
    warmup_iterations=10,
    measure_iterations=100,
)
```

```bash
export MASTER_ADDR=localhost
export MASTER_PORT=29500
export WORLD_SIZE=4
export LOCAL_RANK=0
```

```python
parallel_config = {
    "tp": 2,  # Tensor parallel size
    "dp": 2,  # Data parallel size
}
```

Each parallelism strategy includes a complete training example:

- `example/tp/train.py`: Tensor parallelism with GPT-2
- `example/dp/train.py`: Data parallelism training
- `example/hybrid/train.py`: 2D hybrid parallelism demo
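For intuition, the warmup-then-measure selection loop that a runtime auto-tuner performs can be sketched in plain Python. This is illustrative only — `autotune`, `fast_sum`, and `slow_sum` are hypothetical names, not the library's API:

```python
import time

def autotune(candidates, *args, warmup=3, iters=10):
    """Time each candidate kernel and return the fastest (illustrative)."""
    best_fn, best_t = None, float("inf")
    for fn in candidates:
        for _ in range(warmup):          # warm-up: exclude one-time costs
            fn(*args)
        start = time.perf_counter()
        for _ in range(iters):           # measure steady-state runtime
            fn(*args)
        elapsed = time.perf_counter() - start
        if elapsed < best_t:
            best_fn, best_t = fn, elapsed
    return best_fn

def fast_sum(xs):
    return sum(xs)

def slow_sum(xs):
    time.sleep(0.002)                    # artificially slow variant
    return sum(xs)

best = autotune([slow_sum, fast_sum], list(range(100)))
```

A real tuner caches the winner per input shape so the measurement cost is paid once.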
- ✅ Tensor Parallelism (TP): Column and row parallelism for linear layers
- ✅ Data Parallelism (DP): Standard gradient synchronization
- ✅ 2D Hybrid Parallelism: TP + DP combinations
To maintain code simplicity and readability, we are currently focusing on TP and DP implementations. Future releases will include:
- 🚧 Pipeline Parallelism (PP): Layer-wise model partitioning
- 🚧 ZeRO Optimizer States: Memory-efficient optimizer state sharding
- 🚧 Expert Parallelism (EP): Mixture-of-experts model scaling
- 🚧 Sequence Parallelism (SP): Sequence-dimension parallelism for long contexts
- 🚧 5D Hybrid Parallelism: TP + EP + SP + DP (ZeRO) + PP combinations
These advanced strategies will be added incrementally while maintaining the educational and minimalistic nature of the codebase.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Megatron-LM: Original Megatron library
- Tiny-FSDP: Minimalistic PyTorch FSDP re-implementation
- Tiny-DeepSpeed: Minimalistic DeepSpeed re-implementation
If you use Tiny-Megatron in your research, please cite:
```bibtex
@misc{tiny-megatron,
  title={Tiny-Megatron: A Minimalistic Re-implementation of Megatron-LM},
  author={Liangyu Wang},
  year={2024},
  url={https://github.com/liangyuwang/Tiny-Megatron}
}
```