Monitoring and Logging Guide for Indic LLM Toolkit

Overview

The Indic LLM Toolkit includes a comprehensive monitoring and logging framework that tracks system resources, training metrics, and inference performance in real time.

Features

  • Real-time System Monitoring: CPU, GPU, memory, disk, and network usage
  • Training Metrics: Loss, learning rate, gradient norms, throughput
  • Inference Metrics: Latency percentiles, throughput, success rates
  • Alert System: Automatic alerts for anomalies and threshold violations
  • Multiple Backends: TensorBoard, Weights & Biases, Prometheus
  • Interactive Dashboard: Streamlit-based real-time visualization
  • Report Generation: Automated HTML reports with key insights

Quick Start

1. Basic Training Integration

from monitoring.training_monitor import create_monitoring_trainer
from transformers import Trainer

# Create a monitored trainer
MonitoredTrainer = create_monitoring_trainer(Trainer)

# Use it like a normal HuggingFace Trainer
trainer = MonitoredTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

trainer.train()  # Monitoring happens automatically

2. PyTorch Training Loop

from monitoring.training_monitor import PyTorchMonitor

with PyTorchMonitor("configs/monitoring_config.yaml") as monitor:
    for epoch in range(num_epochs):
        monitor.set_epoch(epoch)
        
        for batch in train_loader:
            loss = train_step(batch)
            
            monitor.log_step(
                loss=loss.item(),
                learning_rate=optimizer.param_groups[0]['lr'],
                gradient_norm=get_grad_norm(model)
            )
            
            # Check for critical alerts
            if monitor.should_stop():
                break
        
        # Log evaluation metrics
        eval_metrics = evaluate(model, val_loader)
        monitor.log_eval(eval_metrics)
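
The loop above assumes a get_grad_norm helper, which is not part of the toolkit; a minimal sketch using standard PyTorch, to be called after backward():

import torch

def get_grad_norm(model):
    # Total L2 norm across all parameter gradients
    norms = [p.grad.detach().norm(2) for p in model.parameters() if p.grad is not None]
    return torch.norm(torch.stack(norms), 2).item() if norms else 0.0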

3. Inference Monitoring

import time

from monitoring.metrics_tracker import MetricsLogger

logger = MetricsLogger("configs/monitoring_config.yaml")
logger.start_monitoring()

# Log inference requests
for request in requests:
    start_time = time.time()
    
    response = model.generate(request.input)
    
    latency_ms = (time.time() - start_time) * 1000
    logger.log_inference_request(
        request_id=request.id,
        latency_ms=latency_ms,
        tokens_generated=len(response),
        success=True
    )
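
To keep success rates meaningful, failed requests should be logged as well. A minimal sketch, reusing the same log_inference_request call shown above with success=False on errors:

start_time = time.time()
try:
    response = model.generate(request.input)
    tokens, success = len(response), True
except Exception:
    tokens, success = 0, False

latency_ms = (time.time() - start_time) * 1000
logger.log_inference_request(
    request_id=request.id,
    latency_ms=latency_ms,
    tokens_generated=tokens,
    success=success
)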

Configuration

Edit configs/monitoring_config.yaml to customize monitoring behavior:

# Output directory for logs and reports
output_dir: ./logs/monitoring

# Enable/disable backends
tensorboard:
  enabled: true
  
wandb:
  enabled: false
  project: your-project-name
  
prometheus:
  enabled: true
  port: 8000

# Alert thresholds
system_alerts:
  cpu_percent: 85.0
  memory_percent: 80.0
  gpu_memory_percent: 90.0
  
alert_thresholds:
  loss_spike_ratio: 5.0
  learning_rate_min: 1e-8
  gradient_norm_max: 10.0

Running the Dashboard

Launch the interactive monitoring dashboard:

streamlit run monitoring/dashboard.py -- --config configs/monitoring_config.yaml

The dashboard provides:

  • Real-time system resource graphs
  • Training loss and learning rate curves
  • Inference latency and throughput charts
  • Active alerts and alert history

Viewing Metrics

TensorBoard

tensorboard --logdir ./logs/tensorboard

Prometheus + Grafana

  1. Metrics are exposed at http://localhost:8000
  2. Configure Prometheus to scrape this endpoint (a sample scrape config follows this list)
  3. Import the provided Grafana dashboard
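
A minimal prometheus.yml scrape entry for the endpoint above (the job name is illustrative):

scrape_configs:
  - job_name: indic-llm-toolkit
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:8000']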

Weights & Biases

Metrics are automatically synced if W&B is enabled in the config.

Advanced Features

Custom Metrics

# Log custom metrics during training
monitor.log_step(
    loss=loss,
    custom_metric=value,
    indic_accuracy=indic_acc
)

GPU Memory Tracking

from monitoring.training_monitor import GPUMemoryTracker

with GPUMemoryTracker(logger, "forward_pass"):
    outputs = model(inputs)
# Automatically logs GPU memory delta

Function Monitoring

from monitoring.training_monitor import monitor_function

@monitor_function(logger, "data_preprocessing")
def preprocess_data(data):
    # Function execution time is automatically logged
    processed_data = ...  # your preprocessing logic goes here
    return processed_data

Alerts and Notifications

The monitoring system can send alerts via:

  • Email (configure SMTP settings)
  • Slack webhooks
  • Log files

Configure in monitoring_config.yaml:

email_alerts:
  enabled: true
  smtp_host: smtp.gmail.com
  smtp_port: 587
  username: your-email@gmail.com
  password: your-app-password
  to: alerts@your-team.com
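
For Slack, a sketch of an analogous block; the key names (slack_alerts, webhook_url) are assumptions, so check them against the options documented in monitoring_config.yaml:

slack_alerts:
  enabled: true
  webhook_url: https://hooks.slack.com/services/XXX/YYY/ZZZ  # your incoming webhook URL
  channel: '#training-alerts'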

Best Practices

  1. Start Early: Enable monitoring from the beginning of training
  2. Set Appropriate Thresholds: Adjust alert thresholds based on your hardware
  3. Monitor Gradients: Track gradient norms to detect training instabilities
  4. Use Checkpoints: Save model checkpoints when metrics improve (see the sketch after this list)
  5. Regular Reports: Generate reports after each training run
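
For practice 4, a minimal checkpoint-on-improvement sketch in plain PyTorch (not a toolkit API); evaluate, model, and val_loader are the objects from the training-loop example above, and the checkpoint path is hypothetical:

import torch

best_eval_loss = float('inf')

# After each evaluation, keep the best-performing weights
eval_metrics = evaluate(model, val_loader)
if eval_metrics['eval_loss'] < best_eval_loss:
    best_eval_loss = eval_metrics['eval_loss']
    torch.save(model.state_dict(), 'checkpoints/best_model.pt')  # hypothetical path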

Troubleshooting

High Memory Usage

  • Check for memory leaks in data loading
  • Enable gradient checkpointing
  • Reduce batch size

Dashboard Not Updating

  • Verify the metrics log file is being written
  • Check file permissions
  • Ensure the config path is correct

Missing GPU Metrics

  • Install GPUtil: pip install GPUtil (a quick check follows this list)
  • Verify CUDA is properly installed
  • Check GPU visibility: nvidia-smi
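
A quick Python-level check that GPUtil can enumerate the GPUs:

import GPUtil

# Prints one line per visible GPU; an empty result means no GPUs were detected
for gpu in GPUtil.getGPUs():
    print(gpu.id, gpu.name, f"{gpu.memoryUsed}/{gpu.memoryTotal} MB")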

Performance Impact

The monitoring system is designed to have minimal impact:

  • System metrics: ~0.1% CPU overhead
  • Training metrics: <1ms per step
  • Dashboard: Separate process, no training impact

Integration with Distributed Training

For distributed training, metrics are automatically aggregated:

# Metrics from all ranks are collected
if args.distributed:
    dist_config = DistributedConfig(...)
    # Monitoring works seamlessly across nodes

Exporting Data

Export metrics for further analysis:

# Generate CSV export
logger.export_metrics_csv("metrics_export.csv")

# Generate detailed report
report_path = logger.generate_report()

Support

For issues or questions:

  • Check logs in ./logs/monitoring/metrics.log
  • Review the dashboard for system status
  • Consult the alert history for past issues