PyTorch Distributed Training Framework

Production-ready framework for distributed deep learning training with Ray and Horovod backends. Optimized for computer vision and time series tasks on Kubernetes clusters.

Features

Multiple Backends: Ray and Horovod support for flexible distributed training
Vision Transformers: Optimized ViT implementations for defect detection
Temporal Networks: TCN and Transformer architectures for sequence modeling
Auto-scaling: Dynamic batch sizing and gradient accumulation
Mixed Precision: Automatic FP16 training with 30% speedup
Comprehensive Monitoring: Prometheus metrics and W&B integration
Production Ready: Checkpointing, fault tolerance, and inference optimization
Kubernetes Native: Ready-to-deploy configurations for K8s clusters

Performance Benchmarks

Metric	Value	Notes
Accuracy	94%	Defect detection on industrial dataset
Training Speed	2.3x faster	8x V100 GPUs vs single GPU baseline
Scalability	Linear up to 16 nodes	Tested on AWS P3/P4 instances
Inference Latency	<200ms P95	Optimized for real-time applications
Memory Efficiency	35% reduction	Mixed precision + gradient accumulation

Architecture Overview

Core Components

┌─────────────────────────────────────────────────────────────┐
│                    PyTorch Distributed Training             │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐  │
│  │   Data Layer    │    │  Training Core  │    │  Backend Layer  │  │
│  │                 │    │                 │    │                 │  │
│  │ • Dataset Loaders│    │ • Trainer       │    │ • Ray Backend   │  │
│  │ • Preprocessing │    │ • Metrics       │    │ • Horovod       │  │
│  │ • Augmentation  │    │ • Checkpointing │    │ • DDP Support   │  │
│  └─────────────────┘    └─────────────────┘    └─────────────────┘  │
│           │                       │                       │         │
│           └───────────────────────┼───────────────────────┘         │
│                                   │                                 │
│                      ┌─────────────────     ┐                      │
│                      │  Model Zoo           │                      │
│                      │                      │                      │
│                      │ • Vision Transformers│                      │
│                      │ • Temporal Networks  │                      │
│                      │ • Custom Models      │                      │
│                      └─────────────────     ┘                      │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐  │
│  │   Observability │    │   Production    │    │   Deployment    │  │
│  │                 │    │                 │    │                 │  │
│  │ • Prometheus    │    │ • Fault Tolerance│    │ • Kubernetes    │  │
│  │ • Weights & Biases│   │ • Auto-scaling  │    │ • Docker        │  │
│  │ • Logging       │    │ • Mixed Precision│    │ • Cloud        │  │
│  └─────────────────┘    └─────────────────┘    └─────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Design Decisions

Component	Choice	Rationale
Orchestration	Ray + Horovod	Ray for simplicity, Horovod for performance
Precision	Mixed FP16/FP32	30% speedup with minimal accuracy loss
Checkpointing	Async + Compression	Fault tolerance + storage efficiency
Monitoring	Prometheus + W&B	Production observability + experiment tracking
Scaling	Dynamic batching	Optimal GPU utilization across configurations

Quick Start

Prerequisites

System Requirements

Python: 3.8, 3.9, 3.10, 3.11
PyTorch: 2.0+ (with CUDA support for GPU training)
CUDA: 11.0+ (recommended for GPU training)
Memory: 8GB+ RAM (16GB+ recommended for large models)
Storage: 50GB+ free space (for datasets and checkpoints)

Optional Dependencies

Kubernetes: 1.19+ (for production deployment)
Docker: 20.0+ (for containerized deployment)
NVIDIA GPU: With CUDA support (for GPU acceleration)
AWS/Azure/GCP: Cloud environment for scalable training

Supported Platforms

Linux: Ubuntu 18.04+, CentOS 7+, RHEL 7+
macOS: 10.15+ (CPU only)
Windows: 10+ (CPU only, experimental)

Network Requirements

Bandwidth: 10Gbps+ for multi-node training
Latency: <1ms between nodes (recommended)
Ports: 22 (SSH), 6379 (Redis), 9090 (Prometheus)

Installation

PyPI Installation (Recommended)

pip install pytorch-distributed-training

Development Installation

# Clone the repository
git clone https://github.com/lucien-vallois/pytorch-distributed-training.git
cd pytorch-distributed-training

# Install core dependencies
pip install -r requirements.txt

# For development (includes testing, linting, docs tools)
pip install -r requirements-dev.txt

# Install in editable mode for development
pip install -e .

Docker Installation

# Build the container
docker build -t pytorch-distributed-training .

# Run with GPU support
docker run --gpus all -it pytorch-distributed-training

# Run with custom configuration
docker run --gpus all \
  -v $(pwd)/data:/data \
  -v $(pwd)/checkpoints:/checkpoints \
  pytorch-distributed-training \
  python examples/train_vit.py --data-dir /data --checkpoint-dir /checkpoints

Kubernetes Deployment

# Apply the training job
kubectl apply -f k8s/

# Monitor training progress
kubectl logs -f job/pytorch-distributed-training

# Scale the deployment
kubectl scale deployment pytorch-distributed-training --replicas=4

Conda Environment

# Create conda environment
conda create -n pytorch-distributed python=3.9
conda activate pytorch-distributed

# Install PyTorch (choose your CUDA version)
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# Install the framework
pip install pytorch-distributed-training

Training Vision Transformer

from src.trainer import create_trainer, TrainingConfig
from src.models.vision_transformer import vit_base_patch16_224

# Configure training
config = TrainingConfig(
    backend="ray",
    batch_size=32,
    learning_rate=1e-3,
    max_epochs=100,
    mixed_precision=True
)

# Create model and trainer
model = vit_base_patch16_224(num_classes=2)
trainer = create_trainer("ray", config, model)
trainer.setup_distributed()

# Train
trainer.fit(train_loader, val_loader)

Distributed Training with Horovod

# Single node, multiple GPUs
horovodrun -np 4 -H localhost:4 python examples/train_vit.py --backend horovod

# Multi-node training
horovodrun -np 16 -H server1:8,server2:8 python examples/train_vit.py --backend horovod

Ray-based Training

# Local multi-GPU
python examples/train_vit.py --backend ray

# Ray cluster (auto-scaling)
ray submit cluster.yaml examples/train_vit.py --backend ray

Models

Vision Transformer

from src.models.vision_transformer import vit_base_patch16_224

model = vit_base_patch16_224(num_classes=10)

Temporal Convolutional Network

from src.models.temporal_network import TCNClassifier, create_temporal_model

# TCN for classification
model = TCNClassifier(input_size=64, num_classes=5)

# Sequence-to-sequence TCN
model = create_temporal_model("seq2seq_tcn", input_size=10, output_size=1)

Data Loading

Defect Detection Dataset

from src.data.dataset_loader import create_defect_detection_loaders

train_loader, val_loader = create_defect_detection_loaders(
    data_dir="/path/to/dataset",
    batch_size=32,
    distributed=True
)

Time Series Dataset

from src.data.dataset_loader import create_time_series_loaders

train_loader, val_loader = create_time_series_loaders(
    csv_path="data/timeseries.csv",
    feature_cols=["feature1", "feature2"],
    target_col="target",
    sequence_length=100
)

Monitoring & Metrics

Weights & Biases Integration

config = TrainingConfig(use_wandb=True)
# Automatic logging of all metrics

Prometheus Metrics

from src.utils.metrics import MetricsTracker

tracker = MetricsTracker(use_prometheus=True)
# Expose /metrics endpoint for monitoring

Checkpointing

Automatic Checkpointing

from src.utils.checkpointing import CheckpointManager

manager = CheckpointManager("./checkpoints")
# Automatic saving every N steps

Resume Training

trainer.load_checkpoint("checkpoints/checkpoint_10000_*.pt")
trainer.fit(train_loader, val_loader)  # Continues from checkpoint

Inference

Distributed Inference

from examples.distributed_inference import DistributedInferenceEngine

engine = DistributedInferenceEngine("model.pt")
results = engine.predict_dataset(data_loader)

print(f"Accuracy: {results['accuracy']:.4f}")
print(f"Throughput: {results['throughput_samples_per_sec']:.2f} samples/sec")

Production Patterns

Based on techniques used in:

Aerospace defect detection systems
Real-time prediction pipelines
Federated learning across distributed GPUs

Roadmap

v1.1.0 (Next Release)

Multi-framework Support: Add support for TensorFlow/Keras models
Advanced Scheduling: Implement custom learning rate schedulers
Model Quantization: Automatic model quantization for inference
Experiment Tracking: Enhanced integration with MLflow
Profiling Tools: Built-in performance profiling utilities

v1.2.0 (Future)

Federated Learning: Support for federated training scenarios
Edge Deployment: Optimized for edge devices and IoT
AutoML Integration: Automated hyperparameter optimization
Model Serving: Built-in model serving capabilities
Cloud Integration: Native support for major cloud providers

v2.0.0 (Long-term)

Multi-modal Training: Support for vision + language models
Reinforcement Learning: RL training capabilities
Distributed Inference: Large-scale inference pipelines
Real-time Training: Streaming data training support

Kubernetes Deployment

Training Job

apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  parallelism: 4
  completions: 1
  template:
    spec:
      containers:
      - name: trainer
        image: your/training-image
        command: ["horovodrun", "-np", "4", "python", "train.py"]
        resources:
          limits:
            nvidia.com/gpu: 1

Inference Service

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: inference
        image: your/inference-image
        ports:
        - containerPort: 8080
        resources:
          requests:
            nvidia.com/gpu: 1

Configuration

All training parameters can be configured via the TrainingConfig class:

config = TrainingConfig(
    backend="horovod",
    batch_size=64,
    learning_rate=1e-3,
    weight_decay=1e-4,
    max_epochs=200,
    gradient_accumulation_steps=2,
    mixed_precision=True,
    use_wandb=True,
    checkpoint_dir="./checkpoints"
)

Development

Local Development Setup

# Clone and setup
git clone https://github.com/lucien-vallois/pytorch-distributed-training.git
cd pytorch-distributed-training

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -r requirements-dev.txt
pip install -e .

# Run tests
pytest tests/ -v

# Run linting
black src/ tests/
isort src/ tests/
flake8 src/ tests/
mypy src/

Code Quality Standards

Testing: 85%+ code coverage required
Linting: Black formatting, Flake8 compliance
Types: Full mypy type hints
Documentation: All public APIs documented
Security: Bandit security scanning

Contributing

We welcome contributions! Please see our Contributing Guide for details.

Quick Start for Contributors

Fork the repository
Create a feature branch: git checkout -b feature/amazing-feature
Make your changes and add tests
Run the full test suite: pytest tests/ --cov=src
Format code: black src/ tests/ && isort src/ tests/
Commit your changes: git commit -m 'Add amazing feature'
Push to the branch: git push origin feature/amazing-feature
Open a Pull Request

Development Setup

# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
pytest tests/ -v --cov=src --cov-report=html

# Run linting
flake8 src/ tests/
black src/ tests/ --check

# Run type checking
mypy src/

Code Quality

Testing: 90%+ code coverage required
Linting: Black formatting, Flake8 compliance
Types: Full mypy type hints
Documentation: All public APIs documented

Documentation

Detailed documentation is available in the docs/ directory. Key topics:

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

PyTorch team for the excellent deep learning framework
Ray and Horovod communities for distributed training backends
Open-source contributors who made this possible

Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Documentation: Read the Docs

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.github/workflows		.github/workflows
docs		docs
examples		examples
k8s		k8s
src		src
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
SECURITY.md		SECURITY.md
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt
setup.py		setup.py

Folders and files

Latest commit

History

Repository files navigation

PyTorch Distributed Training Framework

Features

Performance Benchmarks

Architecture Overview

Core Components

Design Decisions

Quick Start

Prerequisites

System Requirements

Optional Dependencies

Supported Platforms

Network Requirements

Installation

PyPI Installation (Recommended)

Development Installation

Docker Installation

Kubernetes Deployment

Conda Environment

Training Vision Transformer

Distributed Training with Horovod

Ray-based Training

Models

Vision Transformer

Temporal Convolutional Network

Data Loading

Defect Detection Dataset

Time Series Dataset

Monitoring & Metrics

Weights & Biases Integration

Prometheus Metrics

Checkpointing

Automatic Checkpointing

Resume Training

Inference

Distributed Inference

Production Patterns

Roadmap

v1.1.0 (Next Release)

v1.2.0 (Future)

v2.0.0 (Long-term)

Kubernetes Deployment

Training Job

Inference Service

Configuration

Development

Local Development Setup

Code Quality Standards

Contributing

Quick Start for Contributors

Development Setup

Code Quality

Documentation

License

Acknowledgments

Support

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages