
This project is part of the CPE393 coursework on the topic of the MLOps final project.

chogerlate/baq


Bangkok Air Quality (BAQ) Forecasting

A comprehensive machine learning pipeline for PM2.5 air quality forecasting in Bangkok, Thailand. This project provides end-to-end capabilities for data processing, model training, evaluation, and deployment using multiple ML algorithms, including LSTM, Random Forest, and XGBoost. For more information, see the presentation and the project report in docs/.

MLOps Architecture

[Figure: MLOps architecture diagram]

Tech Stack

[Figure: tech stack overview]

🔗 Related GitHub Repositories for the BAQ Project

Here are the main repositories that make up the BAQ project, covering everything from data pipelines to APIs, experiments, and frontend interfaces.


🏠 Main Repository

  • Purpose: Central codebase and project orchestration
  • URL: chogerlate/baq

⛓️ DAG & Airflow Repository


⚙️ FastAPI Backend

  • Purpose: API services for model inference and system integration
  • URL: Saranunt/baq-api

🎨 Streamlit Frontend


🧪 Model Experimentation

🌟 Features

Core Capabilities

  • Multi-Model Support: LSTM (deep learning), Random Forest, and XGBoost models
  • Advanced Data Processing: Comprehensive preprocessing pipeline with feature engineering
  • Time Series Forecasting: Single-step and multi-step PM2.5 predictions
  • Experiment Tracking: Integration with Weights & Biases (W&B) for MLOps
  • Model Monitoring: Automated performance monitoring and data drift detection
  • Configuration Management: Hydra-based configuration with YAML files
  • Artifact Management: Model and processor serialization with versioning

Data Processing Features

  • Temporal Feature Engineering: Cyclical time encoding, lag features, rolling statistics
  • Domain-Specific Features: AQI tier classification, weekend/night indicators
  • Robust Data Cleaning: Missing value imputation, outlier handling, seasonal median filling
  • Weather Code Encoding: Categorical weather condition processing
  • Data Validation: Comprehensive quality checks and drift detection

Model Training & Evaluation

  • Cross-Validation: Time series aware validation strategies
  • Performance Metrics: MAE, RMSE, MAPE, R², accuracy calculations
  • Visualization: Prediction plots, performance comparisons, monitoring dashboards
  • Hyperparameter Optimization: Configurable model parameters
  • Early Stopping: Intelligent training termination for deep learning models
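
Time-series-aware validation keeps every validation fold strictly after its training data. The sketch below shows one common way to do that, an expanding-window split. The helper name `expanding_window_splits` and its parameters are illustrative assumptions, not code from this repository, which may use a different strategy.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(
    n_samples: int, n_splits: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, val_indices) pairs in chronological order.

    Hypothetical helper: each fold trains on everything before the
    validation block, so no future observation leaks into training.
    """
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for k in range(1, n_splits + 1):
        train_end = fold_size * k
        val_end = min(train_end + fold_size, n_samples)
        yield list(range(train_end)), list(range(train_end, val_end))


# Ten hourly observations split into four folds of growing training size.
splits = list(expanding_window_splits(n_samples=10, n_splits=4))
```

Each later fold reuses all earlier data for training, which mirrors how a forecasting model is retrained as new observations arrive.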

📁 Repository Structure

baq/
├── 📄 README.md                           # Project documentation
├── 📄 pyproject.toml                      # Project configuration and dependencies
├── 📄 requirements.txt                    # Python dependencies
├── 📄 .env-example                        # Environment variables template
├── 📄 PERFORMANCE_RESTORATION_SUMMARY.md  # Performance analysis documentation
│
├── 📁 configs/                            # Configuration files
│   └── 📄 config.yaml                     # Main configuration file
│
├── 📁 src/baq/                            # Main source code package
│   ├── 📄 __init__.py                     # Package initialization
│   ├── 📄 py.typed                        # Type checking marker
│   ├── 📄 run.py                          # Main entry point
│   │
│   ├── 📁 core/                           # Core functionality
│   │   ├── 📄 evaluation.py               # Model evaluation metrics
│   │   └── 📄 inference.py                # Prediction and forecasting logic
│   │
│   ├── 📁 data/                           # Data processing modules
│   │   ├── 📄 processing.py               # Main data preprocessing pipeline
│   │   └── 📄 utils.py                    # Data utility functions
│   │
│   ├── 📁 models/                         # Model implementations
│   │   └── 📄 lstm.py                     # LSTM model architecture
│   │
│   ├── 📁 steps/                          # Pipeline steps
│   │   ├── 📄 load_data.py                # Data loading step
│   │   ├── 📄 process.py                  # Data processing step
│   │   ├── 📄 train.py                    # Model training step
│   │   ├── 📄 evaluate.py                 # Model evaluation step
│   │   ├── 📄 monitoring_report.py        # Performance monitoring
│   │   └── 📄 save_artifacts.py           # Artifact saving step
│   │
│   ├── 📁 pipelines/                      # ML pipelines
│   ├── 📁 utils/                          # Utility functions
│   ├── 📁 scripts/                        # Automation scripts
│   └── 📁 action_files/                   # Action configurations
│
├── 📁 data/                               # Data storage
├── 📁 notebooks/                          # Jupyter notebooks
│   ├── 📄 experiment.ipynb                # Experimentation notebook
│   ├── 📄 api_call.ipynb                  # API testing notebook
│   ├── 📄 wandb.ipynb                     # W&B integration examples
│   └── 📄 test_module.ipynb               # Module testing
│
├── 📁 outputs/                            # Pipeline outputs
├── 📁 wandb/                              # Weights & Biases artifacts
├── 📁 docs/                               # Documentation
└── 📁 .github/                            # GitHub workflows

🚀 Quick Start

Prerequisites

  • Python 3.10+
  • pip or uv package manager
  • Optional: AWS S3 access for data storage
  • Optional: Weights & Biases account for experiment tracking

Installation

  1. Clone the repository:
git clone <repository-url>
cd baq
  2. Install dependencies:
# Using pip
pip install -r requirements.txt

# Using uv (recommended)
uv sync
  3. Set up environment variables:
cp .env-example .env
# Edit .env with your configurations
  4. Configure Weights & Biases (optional):
wandb login

Basic Usage

Run the complete training pipeline:

python src/baq/run.py

Run with custom configuration:

python src/baq/run.py model.model_type=lstm training.epochs=100
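
Hydra command-line overrides use dotted keys to address nested configuration values. The sketch below imitates that idea on a plain dictionary to show what the command above changes. `apply_overrides` is a hypothetical helper written for illustration only; it is not Hydra's implementation, and Hydra's real override grammar is richer. The base values are placeholders, not the project's defaults.

```python
import ast
from copy import deepcopy
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Return a copy of `config` with dotted `key=value` overrides applied."""
    result = deepcopy(config)
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})
        try:
            # Interpret Python literals such as numbers; otherwise keep the string.
            node[leaf] = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            node[leaf] = raw_value
    return result


# Toy config with placeholder values, not the project's real defaults.
base = {"model": {"model_type": "random_forest"}, "training": {"epochs": 10}}
updated = apply_overrides(base, ["model.model_type=lstm", "training.epochs=100"])
```

The original dictionary is left untouched, so the same base configuration can be reused across runs with different overrides.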

⚙️ Configuration

The project uses Hydra for configuration management. Main configuration file: configs/config.yaml

Key Configuration Sections

Model Configuration

model:
  model_type: "random_forest"  # Options: "random_forest", "xgboost", "lstm"
  random_forest:
    model_params:
      n_estimators: 50
      max_depth: 10
  lstm:
    model_params:
      n_layers: 2
      hidden_size: 512
      dropout: 0.2
    training_params:
      learning_rate: 0.001
      batch_size: 64
      epochs: 100

Training Configuration

training:
  forecast_horizon: 24
  sequence_length: 24
  target_column: "pm2_5_(μg/m³)"
  test_size: 0.2
  random_state: 42
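
One plausible reading of `sequence_length: 24` and `forecast_horizon: 24` is that each training sample uses 24 past hourly values to predict the following 24. The sketch below builds such windows and holds out the most recent `test_size` fraction. `make_windows` and `chronological_split` are hypothetical names, and the chronological hold-out is an assumption; the repository's own windowing and splitting may differ.

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[float], sequence_length: int, forecast_horizon: int
) -> Tuple[List[List[float]], List[List[float]]]:
    """Slice a series into (input window, target window) pairs."""
    inputs, targets = [], []
    last_start = len(series) - sequence_length - forecast_horizon
    for start in range(last_start + 1):
        split = start + sequence_length
        inputs.append(list(series[start:split]))
        targets.append(list(series[split:split + forecast_horizon]))
    return inputs, targets


def chronological_split(samples: list, test_size: float) -> Tuple[list, list]:
    """Hold out the most recent `test_size` fraction without shuffling."""
    n_test = int(round(len(samples) * test_size))
    cut = len(samples) - n_test
    return samples[:cut], samples[cut:]


hourly_pm25 = [float(v) for v in range(72)]  # synthetic stand-in for real data
X, y = make_windows(hourly_pm25, sequence_length=24, forecast_horizon=24)
X_train, X_test = chronological_split(X, test_size=0.2)
```

Three days of hourly data yield 25 overlapping samples here, of which the last five form the test set.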

Experiment Tracking

wandb:
  tags: ["pm2.5", "forecasting", "air-quality"]
  log_model: true
  register_model: false

🔧 Data Processing Pipeline

Input Data Format

The pipeline expects weather and air quality data with temporal features:

  • Meteorological: Temperature, humidity, pressure, wind speed, precipitation
  • Environmental: Soil conditions, UV index, visibility
  • Air Quality: PM2.5 historical values and derived features
  • Temporal: Timestamps for time series analysis

Feature Engineering

The TimeSeriesDataProcessor creates comprehensive features:

  1. Temporal Features:

    • Hour, day, month, day of week
    • Weekend/night indicators
    • Cyclical encoding (sin/cos transformations)
  2. Lag Features:

    • PM2.5 values from 1, 3, 6, 12, 24 hours ago
    • Rolling means and standard deviations
  3. Domain-Specific Features:

    • AQI tier classification (0-5 based on PM2.5 levels)
    • Weather code encoding
  4. Data Quality:

    • Missing value imputation
    • Outlier detection and handling
    • Seasonal median filling
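
The temporal and lag features listed above can be sketched with the standard library alone. This is an illustration of the techniques, not the `TimeSeriesDataProcessor` implementation: the function names, the definition of "night" as 22:00 to 05:59, and the choice to exclude the current value from the rolling mean are all assumptions.

```python
import math
from datetime import datetime
from statistics import mean
from typing import Dict, List, Optional, Sequence


def temporal_features(ts: datetime) -> Dict[str, float]:
    """Cyclical hour encoding plus weekend/night flags for one timestamp."""
    angle = 2 * math.pi * ts.hour / 24
    return {
        "hour_sin": math.sin(angle),
        "hour_cos": math.cos(angle),
        "is_weekend": float(ts.weekday() >= 5),
        # Assumed definition of "night": 22:00-05:59 local time.
        "is_night": float(ts.hour >= 22 or ts.hour < 6),
    }


def lag(values: Sequence[float], k: int) -> List[Optional[float]]:
    """Value from k steps earlier; None where no history exists yet."""
    return ([None] * k + list(values))[: len(values)]


def rolling_mean(values: Sequence[float], window: int) -> List[Optional[float]]:
    """Mean of the previous `window` values, excluding the current one."""
    return [
        mean(values[i - window:i]) if i >= window else None
        for i in range(len(values))
    ]


pm25 = [10.0, 12.0, 14.0, 16.0, 18.0]  # made-up hourly readings
lag1 = lag(pm25, 1)
roll3 = rolling_mean(pm25, 3)
feats = temporal_features(datetime(2024, 1, 6, 22))  # a Saturday, 22:00
```

The sine and cosine pair places hour 23 next to hour 0, which a raw integer hour does not. Excluding the current value from the rolling window keeps the target out of its own features.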

🤖 Model Training

Supported Models

1. LSTM (Long Short-Term Memory)

  • Use Case: Complex temporal patterns, long-term dependencies
  • Architecture: Dual-layer LSTM with dropout regularization
  • Features: Early stopping, learning rate scheduling, model checkpointing
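
Early stopping halts training once the validation loss stops improving. The class below is a minimal, framework-independent sketch of the patience rule; the name `EarlyStopping`, its parameters, and the loss values are assumptions for illustration, and the repository most likely relies on its deep learning framework's own callback instead.

```python
from typing import Optional


class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss: Optional[float] = None
        self.best_epoch = -1
        self.bad_epochs = 0

    def should_stop(self, epoch: int, val_loss: float) -> bool:
        if self.best_loss is None or val_loss < self.best_loss - self.min_delta:
            # New best: remember it and reset the patience counter.
            self.best_loss, self.best_epoch = val_loss, epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]  # made-up values
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.should_stop(epoch, loss):
        stopped_at = epoch
        break
```

Training stops before the final, lower loss is reached, which shows the trade-off: a larger `patience` tolerates plateaus at the cost of more epochs.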

2. Random Forest

  • Use Case: Robust baseline, feature importance analysis
  • Features: Ensemble learning, handles non-linear relationships

3. XGBoost

  • Use Case: High performance, gradient boosting
  • Features: Advanced regularization, efficient training

Training Process

  1. Data Loading: Load raw weather and air quality data
  2. Preprocessing: Apply feature engineering and scaling
  3. Model Training: Train selected model with configured parameters
  4. Evaluation: Calculate performance metrics on test set
  5. Artifact Saving: Save trained model and preprocessors
  6. Monitoring: Generate performance and drift reports

📊 Evaluation & Monitoring

Performance Metrics

  • MAE (Mean Absolute Error): Average prediction error
  • RMSE (Root Mean Square Error): Penalizes large errors
  • MAPE (Mean Absolute Percentage Error): Relative error percentage
  • R² (Coefficient of Determination): Explained variance
  • Accuracy: 1 - MAPE
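
The metrics above can be computed as follows. This is a sketch using the standard definitions, with MAPE expressed as a fraction so that accuracy is `1 - MAPE`; the function name is hypothetical and `src/baq/core/evaluation.py` may handle edge cases, such as zero true values, differently.

```python
import math
from typing import Dict, Sequence


def regression_metrics(
    y_true: Sequence[float], y_pred: Sequence[float]
) -> Dict[str, float]:
    """MAE, RMSE, MAPE (as a fraction), R² and accuracy = 1 - MAPE."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    # MAPE is undefined when a true value is 0; this sketch skips those points.
    pct = [abs(e / t) for e, t in zip(errors, y_true) if t != 0]
    mape = sum(pct) / len(pct)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "mape": mape, "r2": r2, "accuracy": 1 - mape}


metrics = regression_metrics([10.0, 20.0, 40.0], [12.0, 18.0, 40.0])
```

Because MAPE divides by the true value, it grows quickly when PM2.5 readings are near zero, so it is worth reading alongside MAE and RMSE.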

Forecasting Types

  • Single-Step: Predict next time step
  • Multi-Step: Predict multiple future time steps
  • Iterative Forecasting: Use predictions as inputs for future steps
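
Iterative forecasting feeds each prediction back into the input window to produce the next step. The sketch below shows the loop with a toy averaging model; `iterative_forecast` and `mean_model` are hypothetical names, and the signatures of the repository's own `single_step_forecasting()` and `multi_step_forecasting()` are not shown in this README.

```python
from typing import Callable, List, Sequence


def iterative_forecast(
    predict_next: Callable[[Sequence[float]], float],
    history: Sequence[float],
    sequence_length: int,
    horizon: int,
) -> List[float]:
    """Roll a one-step model forward `horizon` steps."""
    window = list(history[-sequence_length:])
    forecasts = []
    for _ in range(horizon):
        next_value = predict_next(window)
        forecasts.append(next_value)
        # Slide the window: drop the oldest value, append the prediction.
        window = window[1:] + [next_value]
    return forecasts


def mean_model(window: Sequence[float]) -> float:
    """Toy one-step model: predicts the mean of its input window."""
    return sum(window) / len(window)


forecast = iterative_forecast(
    mean_model, [10.0, 20.0, 30.0], sequence_length=2, horizon=3
)
```

Since later steps depend on earlier predictions rather than observations, errors can accumulate over the horizon.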

Monitoring Features

  • Data Drift Detection: Statistical tests for distribution changes
  • Performance Tracking: Metric trends over time
  • Feature Importance: Model interpretability analysis
  • Visualization: Prediction plots, residual analysis
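
One statistical test often used for drift detection is the two-sample Kolmogorov-Smirnov test, which measures the largest gap between the empirical distributions of a reference sample and a current sample. The sketch below computes only the statistic. Whether this project uses this test is an assumption, the function names are hypothetical, and the 0.3 threshold is an arbitrary placeholder.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], current: Sequence[float]) -> float:
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def has_drifted(
    reference: Sequence[float], current: Sequence[float], threshold: float = 0.3
) -> bool:
    # The 0.3 threshold is an arbitrary placeholder, not a project setting.
    return ks_statistic(reference, current) > threshold


train_pm25 = [12.0, 15.0, 18.0, 20.0, 22.0, 25.0]  # made-up reference sample
same = ks_statistic(train_pm25, train_pm25)
shifted = ks_statistic(train_pm25, [v + 30.0 for v in train_pm25])
```

A statistic near 0 means the two samples look alike, while a value near 1 means they barely overlap.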

🔬 Experiment Tracking

Weights & Biases Integration

  • Experiment Logging: Automatic metric and parameter tracking
  • Model Versioning: Artifact management and model registry
  • Visualization: Interactive plots and dashboards
  • Collaboration: Team experiment sharing

Logged Information

  • Model hyperparameters and architecture
  • Training and validation metrics
  • Feature importance scores
  • Prediction visualizations
  • Data quality reports

🛠️ Development

Project Structure Principles

  • Modular Design: Separate concerns into focused modules
  • Configuration-Driven: Hydra-based parameter management
  • Type Safety: Type hints and py.typed marker
  • Testing: Comprehensive test coverage (notebooks for experimentation)
  • Documentation: Detailed docstrings and examples

Key Modules

src/baq/data/processing.py

  • TimeSeriesDataProcessor: Main preprocessing pipeline
  • Features: Data cleaning, feature engineering, scaling, validation
  • Methods: fit_transform(), transform(), inverse_transform_target()

src/baq/models/lstm.py

  • LSTMForecaster: Deep learning model implementation
  • Features: Configurable architecture, callbacks, early stopping
  • Methods: fit(), predict(), model checkpointing

src/baq/core/inference.py

  • Forecasting Functions: Single-step and multi-step prediction
  • Features: Model-agnostic interface, sequence handling
  • Methods: single_step_forecasting(), multi_step_forecasting()

Adding New Models

  1. Implement model class in src/baq/models/
  2. Add configuration section in config.yaml
  3. Update training logic in src/baq/steps/train.py
  4. Add evaluation support in src/baq/steps/evaluate.py
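
The steps above could be wired together with a small registry that maps a `model.model_type` value to a class and passes it the matching `model_params`. This is a sketch under assumptions: `MODEL_REGISTRY`, `register_model`, `build_model`, and the toy `NaiveLastValue` model are all hypothetical, and the actual dispatch in `src/baq/steps/train.py` may look different.

```python
from typing import Any, Callable, Dict

MODEL_REGISTRY: Dict[str, Callable[..., Any]] = {}


def register_model(name: str):
    """Decorator that maps a `model.model_type` value to a model class."""
    def decorator(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator


@register_model("naive_last_value")
class NaiveLastValue:
    """Toy model used only for this example: predicts the last seen target."""

    def __init__(self, scale: float = 1.0):
        self.scale = scale
        self.last_value = None

    def fit(self, X, y):
        self.last_value = y[-1]
        return self

    def predict(self, X):
        return [self.last_value * self.scale for _ in X]


def build_model(model_cfg: Dict[str, Any]):
    """Instantiate the model named by `model_type` with its `model_params`."""
    model_type = model_cfg["model_type"]
    if model_type not in MODEL_REGISTRY:
        raise ValueError(f"unknown model_type: {model_type!r}")
    params = model_cfg.get(model_type, {}).get("model_params", {})
    return MODEL_REGISTRY[model_type](**params)


# Mirrors the `model:` layout shown in the configuration section.
cfg = {
    "model_type": "naive_last_value",
    "naive_last_value": {"model_params": {"scale": 1.0}},
}
model = build_model(cfg).fit(X=[[1.0], [2.0]], y=[5.0, 7.0])
predictions = model.predict([[3.0], [4.0]])
```

With this layout, adding a model means writing one class and one config section, without editing a chain of conditionals.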

🚀 Deployment

Model Artifacts

  • Model Files: Serialized trained models (.h5, .joblib)
  • Preprocessors: Fitted scalers and encoders (.joblib)
  • Metadata: Training configuration and metrics (.json)

Integration Options

  • Batch Prediction: Process historical data in batches
  • Real-time API: Deploy models as REST APIs
  • Scheduled Jobs: Automated retraining and prediction
  • Cloud Deployment: AWS, GCP, Azure integration

CI/CD Strategy

Our project implements full workflow orchestration and a CI/CD pipeline focused on performance validation and industry best practices for responsible deployment. You can see this process in #21, which demonstrates our CI/CD strategy.

Overview

  • CI/CD on Code Change:
    Continuous Integration (CI) and Continuous Deployment (CD) are triggered automatically upon every code change. This ensures that our codebase remains robust and testable, with unit and integration tests running on each commit or pull request.

  • Human-in-the-Loop Model Promotion:
    While code changes follow an automated pipeline, model deployments require team approval before going live. This step ensures that model performance is verified by humans and aligns with business objectives before release.

Deployment Flow

We streamline deployment by leveraging Weights & Biases (W&B) and cloud-native practices:

  • The latest model artifact is automatically loaded by our cloud infrastructure on startup.
  • There is no need to rebuild or redeploy Docker images for every model update.
  • This decouples model deployment from application builds, allowing for faster iterations and rollback capabilities.
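
The decoupling described above can be sketched as a service that resolves a model alias at startup rather than baking the model into the image. Everything here is a stand-in: `ModelHolder`, the `fetch` callable, and the in-memory `fake_registry` are hypothetical, and real code would download the artifact from W&B inside `fetch`.

```python
from typing import Callable, Dict, Optional


class ModelHolder:
    """Resolves a model alias at startup instead of at image build time."""

    def __init__(self, fetch: Callable[[str], object], alias: str = "latest"):
        self._fetch = fetch
        self._alias = alias
        self.model: Optional[object] = None

    def startup(self) -> None:
        """Load whatever the alias currently points to."""
        self.reload()

    def reload(self, alias: Optional[str] = None) -> None:
        """Swap models without a rebuild, e.g. roll back to a pinned version."""
        self._alias = alias or self._alias
        self.model = self._fetch(self._alias)


# Stand-in for an artifact store; real code would download from W&B here.
fake_registry: Dict[str, str] = {"latest": "model-v3", "v2": "model-v2"}
holder = ModelHolder(fetch=fake_registry.__getitem__)
holder.startup()
loaded_at_startup = holder.model
holder.reload("v2")
rolled_back = holder.model
```

Promoting a model then only requires moving the alias in the registry and restarting or reloading the service.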

Benefits

  • ✅ Ensures high model performance before deployment
  • ✅ Reduces deployment time and resource overhead
  • ✅ Encourages responsible ML practices with human validation
  • ✅ Simplifies infrastructure with dynamic model loading

This approach balances automation with accountability, aligning with real-world MLOps best practices.

🔍 Troubleshooting

Common Issues

  1. Data Loading Errors

    • Check file paths in config.yaml
    • Verify data format and column names
    • Ensure proper datetime indexing
  2. Memory Issues

    • Reduce batch size for LSTM training
    • Use data chunking for large datasets
    • Monitor memory usage during processing
  3. Model Performance

    • Check feature engineering pipeline
    • Verify target column name format
    • Review hyperparameter settings
  4. W&B Connection Issues

    • Verify API key: wandb login
    • Check internet connectivity
    • Review project permissions

Performance Optimization

  • Feature Selection: Use domain knowledge for feature engineering
  • Hyperparameter Tuning: Grid search or Bayesian optimization
  • Data Quality: Ensure clean, consistent input data
  • Model Selection: Choose appropriate algorithm for data characteristics

📈 Performance Improvements

Recent performance restoration includes:

  • Enhanced Feature Engineering: AQI tiers, cyclical encoding, weekend/night indicators
  • Robust Data Processing: Better column handling, weather code encoding
  • Improved Target Handling: Multiple column name format support
  • Extended Rolling Windows: Additional temporal feature scales

See PERFORMANCE_RESTORATION_SUMMARY.md for detailed analysis.

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/new-feature
  3. Make changes and add tests
  4. Update documentation as needed
  5. Submit a pull request

Development Guidelines

  • Follow PEP 8 style guidelines
  • Add type hints to new functions
  • Include docstrings with examples
  • Test changes with different model types
  • Update configuration documentation

Note: This project is designed for educational and research purposes in air quality forecasting. For production use, additional validation and testing are recommended.
