
Tiny-Megatron

Tiny-Megatron is a minimalistic, educational re-implementation of the Megatron-LM library for distributed deep learning. This project provides clean, understandable implementations of various parallelism strategies used in large-scale language model training.

πŸš€ Features

Multiple Parallelism Strategies

  • Tensor Parallelism (TP): Split individual layers across multiple devices
  • Data Parallelism (DP): Replicate model across devices, shard data batches
  • 2D Hybrid Parallelism: Combine TP + DP for effective scalability

Core Components

  • Custom Neural Network Modules: Optimized implementations of Linear, Embedding, LayerNorm
  • Automatic Kernel Selection: Runtime auto-tuner for optimal performance
  • Flexible Parallel Context: Easy configuration of multi-dimensional parallelism
  • Wrapper-Based Design: Non-intrusive parallelization of existing models
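The wrapper-based idea can be sketched without any framework: walk a model's named submodules and swap only the targeted ones for parallel variants, leaving the model definition untouched. The sketch below is a hypothetical illustration (plain dicts standing in for `torch.nn.Module`), not Tiny-Megatron's actual API:

```python
# Minimal sketch of non-intrusive, wrapper-based parallelization.
# A real implementation would walk torch.nn.Module.named_modules();
# here the model is just a dict of {submodule name: layer description}.

def apply_wrappers(model, target_names, wrap):
    """Return a copy of `model` with the targeted layers wrapped."""
    return {
        name: wrap(layer) if name in target_names else layer
        for name, layer in model.items()
    }

model = {
    "attn.c_attn": "Linear",
    "attn.c_proj": "Linear",
    "ln_1": "LayerNorm",
}

# Wrap only the attention projections; everything else is untouched.
wrapped = apply_wrappers(
    model,
    target_names={"attn.c_attn", "attn.c_proj"},
    wrap=lambda layer: f"Parallel({layer})",
)
```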

Educational Focus

  • Clean, Readable Code: Well-documented implementations for learning
  • Modular Architecture: Each parallelism strategy is independently implemented
  • Complete Examples: Full training scripts demonstrating each approach

πŸ“ Project Structure

Tiny-Megatron/
β”œβ”€β”€ tiny_megatron/core/             # πŸ—οΈ Core Library
β”‚   β”œβ”€β”€ dist/                       # Distributed Parallelism
β”‚   β”‚   β”œβ”€β”€ tp/                     # β€’ Tensor Parallelism (TP)
β”‚   β”‚   β”œβ”€β”€ dp/                     # β€’ Data Parallelism (DP)
β”‚   β”‚   β”œβ”€β”€ hybrid/                 # β€’ 2D Hybrid Parallelism (TP + DP)
β”‚   β”‚   └── utils/                  # β€’ Communication utilities
β”‚   β”œβ”€β”€ module/                     # Custom NN Modules
β”‚   β”‚   β”œβ”€β”€ linear.py               # β€’ Optimized Linear layers
β”‚   β”‚   β”œβ”€β”€ embedding.py            # β€’ Embedding layers  
β”‚   β”‚   β”œβ”€β”€ normalization.py        # β€’ LayerNorm implementation
β”‚   β”‚   └── ops/                    # β€’ Low-level operations
β”‚   └── autotuner/                  # Performance Optimization
β”‚       └── runtime_tuner.py        # β€’ Automatic kernel selection
β”‚
β”œβ”€β”€ example/                        # πŸš€ Training Examples
β”‚   β”œβ”€β”€ model.py                    # β€’ GPT-2 model implementation
β”‚   β”œβ”€β”€ tp/train.py                 # β€’ Tensor parallelism demo
β”‚   β”œβ”€β”€ dp/train.py                 # β€’ Data parallelism demo  
β”‚   └── hybrid/train.py             # β€’ 2D hybrid parallelism demo

🎯 Key Components

| Component | Purpose | Key Files |
|---|---|---|
| Distributed Parallelism | Core parallel strategies | `dist/{tp,dp,hybrid}/` |
| Custom Modules | Optimized NN building blocks | `module/{linear,embedding}.py` |
| ParallelContext | Multi-dimensional coordination | `dist/utils/comm.py` |
| Auto-tuner | Performance optimization | `autotuner/runtime_tuner.py` |
| Examples | Complete training demos | `example/{tp,dp,hybrid}/` |

πŸ› οΈ Installation

Prerequisites

  • Python 3.8+
  • PyTorch 2.0+ with CUDA support
  • NCCL for multi-GPU communication

Setup

git clone https://github.com/liangyuwang/Tiny-Megatron.git
cd Tiny-Megatron
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install tqdm

🎯 Quick Start

1. Tensor Parallelism (2 GPUs)

# Split model layers across 2 GPUs
torchrun --nproc_per_node=2 example/tp/train.py

2. Data Parallelism (2 GPUs)

# Replicate model, distribute data batches
torchrun --nproc_per_node=2 example/dp/train.py

3. 2D Hybrid Parallelism (4 GPUs)

# Combine TP and DP: TP=2 x DP=2
torchrun --nproc_per_node=4 example/hybrid/train.py

πŸ’‘ Usage Examples

Basic Tensor Parallelism

import torch
from tiny_megatron.core import ParallelContext, apply_tensor_parallel
from example.model import GPT2Model, GPTConfig

# Initialize distributed environment
# ... (distribution setup code)

# Create model and parallel context
config = GPTConfig()
model = GPT2Model(config).cuda()

# Configure parallelism
parallel_config = {"tp": 2}  # Use 2 GPUs for tensor parallelism
context = ParallelContext(parallel_config)

# Apply tensor parallelism
tp_config = {
    "column_linear_names": ["attn.c_attn", "mlp.c_fc"],
    "row_linear_names": ["attn.c_proj", "mlp.c_proj"]
}
model = apply_tensor_parallel(
    model=model, 
    parallel_context=context,
    tp_config=tp_config
)

# Train normally
optimizer = torch.optim.AdamW(model.parameters())
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()
    optimizer.step()

2D Hybrid Parallelism

from tiny_megatron.core import ParallelContext, apply_hybrid_parallel

# Configure 2D parallelism for 4 GPUs
parallel_config = {
    "tp": 2,  # 2-way tensor parallelism  
    "dp": 2   # 2-way data parallelism
}

context = ParallelContext(parallel_config)

# Apply 2D hybrid parallelism
tp_config = {
    "column_linear_names": ["attn.c_attn", "mlp.c_fc"],
    "row_linear_names": ["attn.c_proj", "mlp.c_proj"]
}
model = apply_hybrid_parallel(
    model=model,
    parallel_context=context,
    tp_config=tp_config
)

πŸ—οΈ Architecture Overview

Parallelism Strategies

Tensor Parallelism (TP)

  • Column Parallel: Split weight matrices column-wise (e.g., attention projections)
  • Row Parallel: Split weight matrices row-wise (e.g., MLP layers)
  • Communication: all-gather concatenates column-parallel outputs; all-reduce sums row-parallel partial outputs (with the matching collectives for gradients in the backward pass)
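The two sharding schemes can be checked numerically without any framework: a column-parallel split concatenates per-shard outputs, while a row-parallel split sums per-shard partial outputs. This is a framework-free sketch of the math (plain Python lists, hypothetical helper names), not the library's implementation:

```python
# Sketch: column- vs row-parallel sharding of a linear layer, and how
# each recombines to the full matmul result on one process.

def matmul(a, b):
    """Plain Python matrix multiply: a is m x k, b is k x n."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

x = [[1.0, 2.0]]                      # activation, shape 1 x 2
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]            # full weight, shape 2 x 4
full = matmul(x, w)

# Column parallel: each "rank" holds a column slice of W; outputs are
# concatenated along the feature dimension (the all-gather step).
w_col0 = [row[:2] for row in w]
w_col1 = [row[2:] for row in w]
col_out = [matmul(x, w_col0)[0] + matmul(x, w_col1)[0]]   # list concat

# Row parallel: each "rank" holds a row slice of W plus the matching
# input slice; partial outputs are summed (the all-reduce step).
p0 = matmul([[x[0][0]]], [w[0]])
p1 = matmul([[x[0][1]]], [w[1]])
row_out = [[a + b for a, b in zip(p0[0], p1[0])]]
```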

Data Parallelism (DP)

  • Model Replication: Same model on each device
  • Data Sharding: Different data batches per device
  • Gradient Synchronization: All-reduce after backward pass
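The gradient synchronization step above can be simulated on a single process: each replica computes gradients on its own micro-batch, then an all-reduce averages them so every replica applies the identical update. A framework-free sketch (the function name is illustrative, not a Tiny-Megatron API):

```python
# Sketch of data-parallel gradient synchronization: average the
# per-replica gradients, as torch.distributed.all_reduce would across
# real processes.

def all_reduce_mean(grads_per_rank):
    """Simulate an all-reduce (mean) across replicas on one process."""
    world_size = len(grads_per_rank)
    n_params = len(grads_per_rank[0])
    summed = [sum(g[i] for g in grads_per_rank) for i in range(n_params)]
    return [s / world_size for s in summed]

local_grads = [
    [0.5, -1.0, 2.0],   # gradients from rank 0's micro-batch
    [1.5,  0.0, 1.0],   # gradients from rank 1's micro-batch
]
synced = all_reduce_mean(local_grads)  # identical on every rank
```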

2D Hybrid Parallelism

  • Combined Strategy: Tensor Parallelism (TP) + Data Parallelism (DP)
  • Flexible Configuration: Support various TP and DP combinations
  • Efficient Scaling: Better hardware utilization than either strategy alone for medium-scale training
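How the flat launcher ranks map onto a 2D (DP, TP) device grid can be sketched in a few lines. Placing TP ranks adjacently is a common Megatron-style convention so that tensor-parallel collectives stay on fast intra-node links; the function name here is illustrative, not the library's API:

```python
# Sketch: map a flat rank onto a 2D (dp, tp) grid, TP-major, so that
# consecutive ranks form a tensor-parallel group.

def grid_coords(rank, tp_size, dp_size):
    """Return this rank's position in the (dp, tp) grid."""
    assert 0 <= rank < tp_size * dp_size
    return {"dp_rank": rank // tp_size, "tp_rank": rank % tp_size}

# 4 GPUs, TP=2 x DP=2: ranks {0,1} form one TP group, {2,3} the other.
layout = [grid_coords(r, tp_size=2, dp_size=2) for r in range(4)]
```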

Key Components

ParallelContext

Central coordination for multi-dimensional parallelism:

context = ParallelContext({
    "tp": tensor_parallel_size,
    "dp": data_parallel_size
})

Custom Modules

Optimized implementations with built-in parallelism support:

  • Linear: Matrix multiplication with automatic kernel selection
  • Embedding: Token/position embeddings
  • LayerNorm: Layer normalization

Runtime Auto-Tuner

Automatic selection of optimal kernels:

tuner = RuntimeAutoTuner(
    warmup_iterations=10,
    measure_iterations=100
)
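The idea behind a runtime auto-tuner can be sketched with the standard library: time each candidate implementation after a short warm-up and keep the fastest. This is a simplified stand-in for `RuntimeAutoTuner`, not its actual code, and the candidate functions are illustrative:

```python
# Sketch of timing-based kernel selection: benchmark each candidate
# and return the name of the fastest one.
import time

def pick_fastest(candidates, args, warmup=3, iters=50):
    """Time each candidate on `args`; return the fastest one's name."""
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        for _ in range(warmup):            # warm up caches before timing
            fn(*args)
        start = time.perf_counter()
        for _ in range(iters):
            fn(*args)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name

def loop_sum(xs):
    total = 0
    for v in xs:
        total += v
    return total

# Two "kernels" for the same operation; the tuner picks one at runtime.
choice = pick_fastest({"builtin": sum, "loop": loop_sum}, ([1] * 10_000,))
```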

πŸ”§ Configuration

Environment Variables

export MASTER_ADDR=localhost
export MASTER_PORT=29500
export WORLD_SIZE=4
export LOCAL_RANK=0
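Launchers such as `torchrun` export these variables into each worker process; a sketch of how they are typically read back (hypothetical helper, with the defaults shown above):

```python
# Sketch: collect the rendezvous settings a launcher exports.
import os

def read_dist_env():
    """Read the distributed environment, falling back to local defaults."""
    return {
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
    }

env = read_dist_env()
```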

Parallel Configuration

parallel_config = {
    "tp": 2,    # Tensor parallel size
    "dp": 2,    # Data parallel size
}
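A useful invariant of this configuration is that the product of the parallel sizes must equal the number of launched ranks (e.g. `tp=2, dp=2` needs exactly 4 processes). A sketch of that check, with an illustrative function name:

```python
# Sketch: validate that tp * dp covers every launched rank.

def validate_parallel_config(config, world_size):
    """Raise if the parallel sizes do not multiply out to world_size."""
    tp = config.get("tp", 1)
    dp = config.get("dp", 1)
    if tp * dp != world_size:
        raise ValueError(
            f"tp ({tp}) x dp ({dp}) = {tp * dp}, but world size is {world_size}"
        )
    return tp, dp

tp, dp = validate_parallel_config({"tp": 2, "dp": 2}, world_size=4)
```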

πŸ“š Examples

Each parallelism strategy includes complete training examples:

  • example/tp/train.py: Tensor parallelism with GPT-2
  • example/dp/train.py: Data parallelism training
  • example/hybrid/train.py: 2D hybrid parallelism demo

πŸ›£οΈ Roadmap

Currently Supported

  • βœ… Tensor Parallelism (TP): Column and row parallelism for linear layers
  • βœ… Data Parallelism (DP): Standard gradient synchronization
  • βœ… 2D Hybrid Parallelism: TP + DP combinations

Future Plans

To maintain code simplicity and readability, we are currently focusing on TP and DP implementations. Future releases will include:

  • πŸ”„ Pipeline Parallelism (PP): Layer-wise model partitioning
  • πŸ”„ ZeRO Optimizer States: Memory-efficient optimizer state sharding
  • πŸ”„ Expert Parallelism (EP): Mixture-of-experts model scaling
  • πŸ”„ Sequence Parallelism (SP): Sequence dimension parallelism for long contexts
  • πŸ”„ 5D Hybrid Parallelism: TP + EP + SP + DP (ZeRO) + PP combinations

These advanced strategies will be added incrementally while maintaining the educational and minimalistic nature of the codebase.

πŸ“„ License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

πŸ“– Citation

If you use Tiny-Megatron in your research, please cite:

@misc{tiny-megatron,
    title={Tiny-Megatron: A Minimalistic Re-implementation of Megatron-LM},
    author={Liangyu Wang},
    year={2024},
    url={https://github.com/liangyuwang/Tiny-Megatron}
}
