The Indic LLM Toolkit includes a comprehensive monitoring and logging framework that tracks system resources, training metrics, and inference performance in real-time.
- Real-time System Monitoring: CPU, GPU, memory, disk, and network usage
- Training Metrics: Loss, learning rate, gradient norms, throughput
- Inference Metrics: Latency percentiles, throughput, success rates
- Alert System: Automatic alerts for anomalies and threshold violations
- Multiple Backends: TensorBoard, Weights & Biases, Prometheus
- Interactive Dashboard: Streamlit-based real-time visualization
- Report Generation: Automated HTML reports with key insights
To add monitoring to a HuggingFace `Trainer`, wrap it with `create_monitoring_trainer`:

```python
from monitoring.training_monitor import create_monitoring_trainer
from transformers import Trainer

# Create a monitored trainer class
MonitoredTrainer = create_monitoring_trainer(Trainer)

# Use it like a normal HuggingFace Trainer
trainer = MonitoredTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

trainer.train()  # Monitoring happens automatically
```

For custom PyTorch training loops, use `PyTorchMonitor` directly:

```python
from monitoring.training_monitor import PyTorchMonitor
with PyTorchMonitor("configs/monitoring_config.yaml") as monitor:
    for epoch in range(num_epochs):
        monitor.set_epoch(epoch)

        for batch in train_loader:
            loss = train_step(batch)
            monitor.log_step(
                loss=loss.item(),
                learning_rate=optimizer.param_groups[0]['lr'],
                gradient_norm=get_grad_norm(model)
            )

            # Check for critical alerts
            if monitor.should_stop():
                break

        # Log evaluation metrics
        eval_metrics = evaluate(model, val_loader)
        monitor.log_eval(eval_metrics)
```
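The loop above calls `get_grad_norm`, which is not defined in the snippet; a minimal sketch (plain PyTorch, computing the total L2 norm of all parameter gradients after `backward()`) is:

```python
import torch

def get_grad_norm(model: torch.nn.Module) -> float:
    """Total L2 norm of all parameter gradients (illustrative helper)."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().norm(2).item() ** 2
    return total ** 0.5
```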
For inference workloads, `MetricsLogger` tracks per-request latency, throughput, and success rates:

```python
import time

from monitoring.metrics_tracker import MetricsLogger

logger = MetricsLogger("configs/monitoring_config.yaml")
logger.start_monitoring()
# Log inference requests
for request in requests:
    start_time = time.time()
    response = model.generate(request.input)
    latency_ms = (time.time() - start_time) * 1000

    logger.log_inference_request(
        request_id=request.id,
        latency_ms=latency_ms,
        tokens_generated=len(response),
        success=True
    )
```

Edit `configs/monitoring_config.yaml` to customize monitoring behavior:

```yaml
# Output directory for logs and reports
output_dir: ./logs/monitoring
# Enable/disable backends
tensorboard:
  enabled: true

wandb:
  enabled: false
  project: your-project-name

prometheus:
  enabled: true
  port: 8000

# Alert thresholds
system_alerts:
  cpu_percent: 85.0
  memory_percent: 80.0
  gpu_memory_percent: 90.0

alert_thresholds:
  loss_spike_ratio: 5.0
  learning_rate_min: 1e-8
  gradient_norm_max: 10.0
```

Launch the interactive monitoring dashboard:
```bash
streamlit run monitoring/dashboard.py -- --config configs/monitoring_config.yaml
```

The dashboard provides:
- Real-time system resource graphs
- Training loss and learning rate curves
- Inference latency and throughput charts
- Active alerts and alert history
To inspect the logged metrics in TensorBoard:

```bash
tensorboard --logdir ./logs/tensorboard
```

For Prometheus and Grafana:

- Metrics are exposed at `http://localhost:8000`
- Configure Prometheus to scrape this endpoint
- Import the provided Grafana dashboard
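A Prometheus scrape job for this endpoint might look like the following sketch (the job name and scrape interval are placeholders, not toolkit defaults):

```yaml
# prometheus.yml (sketch): scrape the toolkit's metrics endpoint on port 8000
scrape_configs:
  - job_name: indic-llm-monitoring   # arbitrary job name
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8000"]
```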
Metrics are automatically synced if W&B is enabled in the config.
Custom metrics can be logged alongside the standard ones during training:

```python
# Log custom metrics during training
monitor.log_step(
    loss=loss,
    custom_metric=value,
    indic_accuracy=indic_acc
)
```

GPU memory usage for a specific code region can be tracked with `GPUMemoryTracker`:

```python
from monitoring.training_monitor import GPUMemoryTracker
with GPUMemoryTracker(logger, "forward_pass"):
    outputs = model(inputs)
    # Automatically logs GPU memory delta
```

Function execution time can be logged with the `monitor_function` decorator:

```python
from monitoring.training_monitor import monitor_function
@monitor_function(logger, "data_preprocessing")
def preprocess_data(data):
    # Function execution time is automatically logged
    processed_data = ...  # preprocessing logic goes here
    return processed_data
```

The monitoring system can send alerts via:
- Email (configure SMTP settings)
- Slack webhooks
- Log files
Configure these in `monitoring_config.yaml`:

```yaml
email_alerts:
  enabled: true
  smtp_host: smtp.gmail.com
  smtp_port: 587
  username: your-email@gmail.com
  password: your-app-password
  to: alerts@your-team.com
```

Best practices:

- Start Early: Enable monitoring from the beginning of training
- Set Appropriate Thresholds: Adjust alert thresholds based on your hardware
- Monitor Gradients: Track gradient norms to detect training instabilities
- Use Checkpoints: Save model checkpoints when metrics improve (see the sketch after this list)
- Regular Reports: Generate reports after each training run
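For the checkpointing practice, one minimal pattern (plain PyTorch, not a toolkit API; `train_one_epoch` is a hypothetical helper and the `"loss"` key is an assumption about `evaluate`'s output) is:

```python
import torch

best_eval_loss = float("inf")

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)        # hypothetical training helper
    eval_metrics = evaluate(model, val_loader)  # as in the PyTorchMonitor example

    # Save a checkpoint only when the monitored metric improves
    if eval_metrics["loss"] < best_eval_loss:
        best_eval_loss = eval_metrics["loss"]
        torch.save(model.state_dict(), f"checkpoint_epoch{epoch}.pt")
```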
Troubleshooting:

If memory usage is too high:

- Check for memory leaks in data loading
- Enable gradient checkpointing
- Reduce batch size
If metrics are not appearing:

- Verify the metrics log file is being written
- Check file permissions
- Ensure the config path is correct
If GPU metrics are missing:

- Install GPUtil: `pip install GPUtil`
- Verify CUDA is properly installed
- Check GPU visibility: `nvidia-smi`
The monitoring system is designed to have minimal impact:
- System metrics: ~0.1% CPU overhead
- Training metrics: <1ms per step
- Dashboard: Separate process, no training impact
For distributed training, metrics are automatically aggregated:
```python
# Metrics from all ranks are collected
if args.distributed:
    dist_config = DistributedConfig(...)
    # Monitoring works seamlessly across nodes
```
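The aggregation itself is handled inside the toolkit; conceptually, per-rank scalars are reduced before logging, roughly like this sketch (assumes `torch.distributed` is already initialized; this is not the toolkit's actual implementation):

```python
import torch
import torch.distributed as dist

def aggregate_scalar(value: float) -> float:
    """Average a scalar metric across all ranks (illustrative only)."""
    tensor = torch.tensor([value], dtype=torch.float32)
    if torch.cuda.is_available():
        tensor = tensor.cuda()
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    return (tensor / dist.get_world_size()).item()

# Typically only rank 0 writes the aggregated value to the logs
loss_avg = aggregate_scalar(loss.item())
if dist.get_rank() == 0:
    monitor.log_step(loss=loss_avg)
```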
Export metrics for further analysis:

```python
# Generate CSV export
logger.export_metrics_csv("metrics_export.csv")

# Generate detailed report
report_path = logger.generate_report()
```
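The CSV export can then be analyzed with standard tooling, for example pandas (a quick sketch, assuming the export was written to the path used above):

```python
import pandas as pd

# Load the exported metrics and print summary statistics
df = pd.read_csv("metrics_export.csv")
print(df.columns.tolist())  # inspect which metric columns were exported
print(df.describe())
```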
For issues or questions:

- Check logs in `./logs/monitoring/metrics.log`
- Review the dashboard for system status
- Consult the alert history for past issues