A modular and configurable deep learning training pipeline for text classification tasks, optimized for GPU training with PyTorch and Hugging Face Transformers.
This pipeline provides a flexible framework for training transformer-based models on text classification tasks. It features a clean, modular architecture with configuration management using Hydra, making it easy to experiment with different models, datasets, and training parameters.
- Modular Architecture: Separated components for data loading, model definition, training, and utilities
- Hydra Configuration: Hierarchical configuration system for easy experiment management
- GPU Support: Optimized for CUDA-enabled GPU training with gradient accumulation
- Docker Support: Containerized setup with NVIDIA CUDA runtime for reproducible environments
- Weights & Biases Integration: Optional logging and experiment tracking
- Flexible Model Support: Easy integration with Hugging Face transformers
```
gpu-training-pipeline/
├── train.py              # Main training script
├── requirements.txt      # Python dependencies
├── Dockerfile            # Docker configuration for GPU training
├── configs/              # Hydra configuration files
│   ├── config.yaml       # Main configuration
│   ├── data/             # Dataset configurations (e.g., IMDB)
│   ├── model/            # Model configurations (e.g., DistilBERT)
│   ├── trainer/          # Training configurations (CPU/GPU)
│   └── task/             # Task-specific configurations
├── scripts/              # Utility scripts
│   └── train_local.sh    # Local training script
└── src/                  # Source code
    ├── dataloaders/      # Data loading and preprocessing
    ├── models/           # Model architectures
    ├── trainers/         # Training loop implementation
    └── utils/            # Utility functions (logging, seeding)
```
- Python 3.8+
- PyTorch
- Transformers
- Datasets
- Hydra-core
- Accelerate
- Optional: Weights & Biases for experiment tracking
Install dependencies:

```bash
pip install -r requirements.txt
```

Run training with the default configuration:

```bash
python train.py
```

Override specific parameters:

```bash
python train.py model=distilbert data=imdb trainer=gpu trainer.epochs=5
```

Build and run with Docker:

```bash
docker build -t gpu-training-pipeline .
docker run --gpus all gpu-training-pipeline
```

The pipeline uses Hydra for hierarchical configuration management. Key configuration groups:
- model: Model architecture (e.g., DistilBERT, BERT, RoBERTa)
- data: Dataset configuration (name, batch size, preprocessing)
- trainer: Training parameters (learning rate, epochs, device)
- task: Task-specific settings (classification, regression, etc.)
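Taken together, these groups are composed through a Hydra defaults list in the main config. The fragment below is an illustrative sketch of what `configs/config.yaml` might look like, not the repository's actual file (key names like `seed` and `output_dir` are assumptions):

```yaml
# configs/config.yaml -- illustrative sketch, not the repository's actual file
defaults:
  - model: distilbert
  - data: imdb
  - trainer: gpu
  - task: classification

seed: 42          # assumed top-level key for reproducibility
output_dir: outputs
```

Each entry in `defaults` selects one YAML file from the matching `configs/<group>/` directory, which is what makes `model=distilbert data=imdb` overrides possible on the command line.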
Example configuration override:

```bash
python train.py \
  model.model_name=bert-base-uncased \
  data.batch_size=32 \
  trainer.learning_rate=2e-5 \
  trainer.epochs=3
```

- Text classification data loading with tokenization
- Support for Hugging Face datasets
- Configurable batch size and preprocessing
- Base model abstraction
- Text classification model with customizable heads
- Support for freezing base model layers
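Freezing the base model reduces to disabling gradients on its parameters so only the classification head trains. A minimal sketch of the pattern (the class name and constructor are illustrative, not the repository's actual model code):

```python
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    """Illustrative classification head on top of a transformer encoder."""
    def __init__(self, base_model, hidden_size, num_labels, freeze_base=False):
        super().__init__()
        self.base = base_model
        self.head = nn.Linear(hidden_size, num_labels)
        if freeze_base:
            for param in self.base.parameters():
                param.requires_grad = False  # exclude base from optimization

    def forward(self, input_ids, attention_mask):
        out = self.base(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token representation
        return self.head(cls)

# Freeze a stand-in base module; a real run would pass an AutoModel
# instance and its config.hidden_size (nn.Linear here is only a placeholder
# for the freezing demo, not a usable encoder).
base = nn.Linear(768, 768)
clf = TextClassifier(base, hidden_size=768, num_labels=2, freeze_base=True)
print(sum(p.requires_grad for p in clf.parameters()))  # 2: only head weight+bias
```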
- Training loop with evaluation
- Gradient accumulation support
- Learning rate scheduling with warmup
- Checkpointing and logging
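Gradient accumulation and warmup scheduling combine in the inner loop roughly as follows. This is a simplified sketch of the pattern, not the repository's `Trainer` (the function name and defaults are assumptions); it scales each loss by the accumulation factor and only steps the optimizer every `accum_steps` batches:

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def train_epoch(model, loader, loss_fn, accum_steps=4, lr=2e-5,
                warmup_steps=10, total_steps=100):
    """One epoch with gradient accumulation and linear warmup (sketch)."""
    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps,
        num_training_steps=total_steps)
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        # Scale so the accumulated gradient matches a large-batch update
        loss = loss_fn(model(x), y) / accum_steps
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```

The effective batch size becomes `batch_size * accum_steps`, which is how large-batch training fits on a single GPU.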
- Deterministic seeding for reproducibility
- Weights & Biases integration
- Rich console output
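Deterministic seeding generally means seeding every RNG source a run touches. A common sketch of such a utility (the function name is illustrative; the actual helper in `src/utils/` may differ):

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 42):
    """Seed all common RNG sources for reproducible runs (illustrative)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # safe no-op without a GPU
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.backends.cudnn.deterministic = True  # trade speed for determinism
    torch.backends.cudnn.benchmark = False

seed_everything(123)
a = torch.rand(3)
seed_everything(123)
b = torch.rand(3)
print(torch.equal(a, b))  # True
```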
To add a new dataset:

- Create a new YAML file in `configs/data/`
- Specify dataset name, fields, and preprocessing parameters

To add a new model:

- Create a new YAML file in `configs/model/`
- Optionally extend base model classes in `src/models/`
Extend the `Trainer` class in `src/trainers/trainer.py` to implement custom training procedures.
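The repository's `Trainer` API is not reproduced here, so the sketch below only illustrates the subclassing pattern with a minimal stand-in base class; the real class and its hook names (e.g. `training_step`) may differ:

```python
# Minimal stand-in for the Trainer in src/trainers/trainer.py; the actual
# class and hook names may differ -- this only shows the override pattern.
class Trainer:
    def training_step(self, batch):
        raise NotImplementedError

    def fit(self, batches):
        return [self.training_step(b) for b in batches]

class CustomTrainer(Trainer):
    """Hypothetical custom trainer overriding the per-batch step."""
    def training_step(self, batch):
        # Custom logic (e.g. mixup, adversarial steps) would go here;
        # this toy version just reports the batch size.
        return len(batch)

print(CustomTrainer().fit([[1, 2], [3]]))  # [2, 1]
```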
See the main repository for license information.