A comprehensive machine learning pipeline for PM2.5 air quality forecasting in Bangkok, Thailand. This project provides end-to-end capabilities for data processing, model training, evaluation, and deployment using multiple ML algorithms, including LSTM, Random Forest, and XGBoost. For more information, see the presentation and report in docs/
Here are the main repositories that make up the BAQ project, covering everything from data pipelines to APIs, experiments, and frontend interfaces.
| Repository | Purpose |
| --- | --- |
| chogerlate/baq | Central codebase and project orchestration |
| Saranunt/baq-airflow | Airflow DAGs for ETL and scheduled workflows |
| Saranunt/baq-api | API services for model inference and system integration |
| tawayahc/baq-frontend | Interactive web UI for exploring model outputs and results |
| tawayahc/baq-experiment | Notebooks, training scripts, and experimental ML workflows |
- Multi-Model Support: LSTM (deep learning), Random Forest, and XGBoost models
- Advanced Data Processing: Comprehensive preprocessing pipeline with feature engineering
- Time Series Forecasting: Single-step and multi-step PM2.5 predictions
- Experiment Tracking: Integration with Weights & Biases (W&B) for MLOps
- Model Monitoring: Automated performance monitoring and data drift detection
- Configuration Management: Hydra-based configuration with YAML files
- Artifact Management: Model and processor serialization with versioning
- Temporal Feature Engineering: Cyclical time encoding, lag features, rolling statistics
- Domain-Specific Features: AQI tier classification, weekend/night indicators
- Robust Data Cleaning: Missing value imputation, outlier handling, seasonal median filling
- Weather Code Encoding: Categorical weather condition processing
- Data Validation: Comprehensive quality checks and drift detection
- Cross-Validation: Time series aware validation strategies
- Performance Metrics: MAE, RMSE, MAPE, R², accuracy calculations
- Visualization: Prediction plots, performance comparisons, monitoring dashboards
- Hyperparameter Optimization: Configurable model parameters
- Early Stopping: Intelligent training termination for deep learning models
```
baq/
├── README.md                            # Project documentation
├── pyproject.toml                       # Project configuration and dependencies
├── requirements.txt                     # Python dependencies
├── .env-example                         # Environment variables template
├── PERFORMANCE_RESTORATION_SUMMARY.md   # Performance analysis documentation
│
├── configs/                             # Configuration files
│   └── config.yaml                      # Main configuration file
│
├── src/baq/                             # Main source code package
│   ├── __init__.py                      # Package initialization
│   ├── py.typed                         # Type checking marker
│   ├── run.py                           # Main entry point
│   │
│   ├── core/                            # Core functionality
│   │   ├── evaluation.py                # Model evaluation metrics
│   │   └── inference.py                 # Prediction and forecasting logic
│   │
│   ├── data/                            # Data processing modules
│   │   ├── processing.py                # Main data preprocessing pipeline
│   │   └── utils.py                     # Data utility functions
│   │
│   ├── models/                          # Model implementations
│   │   └── lstm.py                      # LSTM model architecture
│   │
│   ├── steps/                           # Pipeline steps
│   │   ├── load_data.py                 # Data loading step
│   │   ├── process.py                   # Data processing step
│   │   ├── train.py                     # Model training step
│   │   ├── evaluate.py                  # Model evaluation step
│   │   ├── monitoring_report.py         # Performance monitoring
│   │   └── save_artifacts.py            # Artifact saving step
│   │
│   ├── pipelines/                       # ML pipelines
│   ├── utils/                           # Utility functions
│   ├── scripts/                         # Automation scripts
│   └── action_files/                    # Action configurations
│
├── data/                                # Data storage
├── notebooks/                           # Jupyter notebooks
│   ├── experiment.ipynb                 # Experimentation notebook
│   ├── api_call.ipynb                   # API testing notebook
│   ├── wandb.ipynb                      # W&B integration examples
│   └── test_module.ipynb                # Module testing
│
├── outputs/                             # Pipeline outputs
├── wandb/                               # Weights & Biases artifacts
├── docs/                                # Documentation
└── .github/                             # GitHub workflows
```
- Python 3.10+
- pip or uv package manager
- Optional: AWS S3 access for data storage
- Optional: Weights & Biases account for experiment tracking
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd baq
  ```

- Install dependencies:

  ```bash
  # Using pip
  pip install -r requirements.txt

  # Using uv (recommended)
  uv sync
  ```

- Set up environment variables:

  ```bash
  cp .env-example .env
  # Edit .env with your configurations
  ```

- Configure Weights & Biases (optional):

  ```bash
  wandb login
  ```

Run the complete training pipeline:

```bash
python src/baq/run.py
```

Run with custom configuration:

```bash
python src/baq/run.py model.model_type=lstm training.epochs=100
```

The project uses Hydra for configuration management. Main configuration file: `configs/config.yaml`
```yaml
model:
  model_type: "random_forest"  # Options: "random_forest", "xgboost", "lstm"
  random_forest:
    model_params:
      n_estimators: 50
      max_depth: 10
  lstm:
    model_params:
      n_layers: 2
      hidden_size: 512
      dropout: 0.2
    training_params:
      learning_rate: 0.001
      batch_size: 64
      epochs: 100

training:
  forecast_horizon: 24
  sequence_length: 24
  target_column: "pm2_5_(μg/m³)"
  test_size: 0.2
  random_state: 42

wandb:
  tags: ["pm2.5", "forecasting", "air-quality"]
  log_model: true
  register_model: false
```

The pipeline expects weather and air quality data with temporal features:
- Meteorological: Temperature, humidity, pressure, wind speed, precipitation
- Environmental: Soil conditions, UV index, visibility
- Air Quality: PM2.5 historical values and derived features
- Temporal: Timestamps for time series analysis
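As a sketch, loading such data might look like this; the file path and column handling are illustrative, not the project's actual schema (the real loader lives in `src/baq/steps/load_data.py`):

```python
import pandas as pd

# Hypothetical input file containing weather and air quality measurements.
df = pd.read_csv("data/bangkok_weather_aqi.csv")
df["timestamp"] = pd.to_datetime(df["timestamp"])   # parse timestamps
df = df.set_index("timestamp").sort_index()         # datetime index for time series ops
```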
The TimeSeriesDataProcessor creates comprehensive features (a short sketch follows this list):

- Temporal Features:
  - Hour, day, month, day of week
  - Weekend/night indicators
  - Cyclical encoding (sin/cos transformations)
- Lag Features:
  - PM2.5 values from 1, 3, 6, 12, 24 hours ago
  - Rolling means and standard deviations
- Domain-Specific Features:
  - AQI tier classification (0-5 based on PM2.5 levels)
  - Weather code encoding
- Data Quality:
  - Missing value imputation
  - Outlier detection and handling
  - Seasonal median filling
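A minimal sketch of these feature families, assuming a DataFrame `df` with a datetime index and a `pm2_5` column (names are illustrative; the actual pipeline lives in `src/baq/data/processing.py`, and the AQI tier here uses US EPA PM2.5 breakpoints purely as an illustration):

```python
import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Temporal features from the datetime index
    out["hour"] = out.index.hour
    out["day_of_week"] = out.index.dayofweek
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)
    out["is_night"] = ((out["hour"] >= 22) | (out["hour"] < 6)).astype(int)
    # Cyclical encoding so hour 23 and hour 0 end up close in feature space
    out["hour_sin"] = np.sin(2 * np.pi * out["hour"] / 24)
    out["hour_cos"] = np.cos(2 * np.pi * out["hour"] / 24)
    # Lag features at the horizons listed above
    for lag in (1, 3, 6, 12, 24):
        out[f"pm2_5_lag_{lag}"] = out["pm2_5"].shift(lag)
    # Rolling statistics over a 24-hour window
    out["pm2_5_roll_mean_24"] = out["pm2_5"].rolling(24).mean()
    out["pm2_5_roll_std_24"] = out["pm2_5"].rolling(24).std()
    # AQI tier 0-5 (US EPA PM2.5 breakpoints, shown here as an illustration)
    bins = [-np.inf, 12, 35.4, 55.4, 150.4, 250.4, np.inf]
    out["aqi_tier"] = pd.cut(out["pm2_5"], bins=bins, labels=False)
    return out
```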
- LSTM:
  - Use Case: Complex temporal patterns, long-term dependencies
  - Architecture: Dual-layer LSTM with dropout regularization (sketched below)
  - Features: Early stopping, learning rate scheduling, model checkpointing
- Random Forest:
  - Use Case: Robust baseline, feature importance analysis
  - Features: Ensemble learning, handles non-linear relationships
- XGBoost:
  - Use Case: High performance, gradient boosting
  - Features: Advanced regularization, efficient training
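As an illustration of the dual-layer architecture, here is a minimal Keras sketch using the example config values above; the feature count and loss are assumptions, and the actual `LSTMForecaster` in `src/baq/models/lstm.py` may differ:

```python
# A minimal sketch, assuming Keras; not the project's exact model.
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm(sequence_length: int = 24, n_features: int = 32,
               hidden_size: int = 512, dropout: float = 0.2) -> keras.Model:
    model = keras.Sequential([
        keras.Input(shape=(sequence_length, n_features)),  # n_features is a placeholder
        layers.LSTM(hidden_size, return_sequences=True),   # first LSTM layer
        layers.Dropout(dropout),                           # dropout regularization
        layers.LSTM(hidden_size),                          # second LSTM layer
        layers.Dropout(dropout),
        layers.Dense(1),                                   # next-step PM2.5 value
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mae")
    return model
```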
1. Data Loading: Load raw weather and air quality data
2. Preprocessing: Apply feature engineering and scaling
3. Model Training: Train selected model with configured parameters
4. Evaluation: Calculate performance metrics on test set
5. Artifact Saving: Save trained model and preprocessors
6. Monitoring: Generate performance and drift reports
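A conceptual sketch of how these six stages chain together. The step modules exist under `src/baq/steps/`, but the exact function names and signatures below are assumptions:

```python
from baq.steps.load_data import load_data
from baq.steps.process import process
from baq.steps.train import train
from baq.steps.evaluate import evaluate
from baq.steps.save_artifacts import save_artifacts
from baq.steps.monitoring_report import monitoring_report

def run_pipeline(cfg):
    raw = load_data(cfg)                                # 1. raw weather + air quality data
    train_set, test_set, processor = process(raw, cfg)  # 2. feature engineering + scaling
    model = train(train_set, cfg)                       # 3. fit the configured model
    metrics = evaluate(model, test_set, cfg)            # 4. test-set performance metrics
    save_artifacts(model, processor, metrics, cfg)      # 5. persist model + preprocessors
    monitoring_report(metrics, cfg)                     # 6. performance and drift report
    return metrics
```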
- MAE (Mean Absolute Error): Average prediction error
- RMSE (Root Mean Square Error): Penalizes large errors
- MAPE (Mean Absolute Percentage Error): Relative error percentage
- R² (Coefficient of Determination): Explained variance
- Accuracy: 1 - MAPE (sketched with the other metrics below)
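A quick sketch of these metrics using scikit-learn and NumPy; the project's `evaluation.py` may compute them differently:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mape = np.mean(np.abs((y_true - y_pred) / y_true))  # assumes y_true has no zeros
    return {
        "mae": mae,
        "rmse": rmse,
        "mape": mape,
        "r2": r2_score(y_true, y_pred),
        "accuracy": 1 - mape,                           # "accuracy" as defined above
    }
```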
- Single-Step: Predict next time step
- Multi-Step: Predict multiple future time steps
- Iterative Forecasting: Use predictions as inputs for future steps
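A sketch of the iterative strategy: each prediction is written back into the input window before predicting the next step. The window layout (target in column 0) is an assumption, and in practice exogenous weather features would also need updating:

```python
import numpy as np

def iterative_forecast(model, last_window: np.ndarray, horizon: int = 24) -> np.ndarray:
    """last_window: (sequence_length, n_features); assumes the target is column 0.

    `model` is any fitted model that predicts one step from a
    (1, sequence_length, n_features) window, e.g. the LSTM.
    """
    window = last_window.copy()
    preds = []
    for _ in range(horizon):
        y_hat = float(model.predict(window[np.newaxis, ...]).ravel()[0])
        preds.append(y_hat)
        next_row = window[-1].copy()
        next_row[0] = y_hat                          # feed prediction back as the target
        window = np.vstack([window[1:], next_row])   # slide the window forward one step
    return np.array(preds)
```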
- Data Drift Detection: Statistical tests for distribution changes (see the sketch after this list)
- Performance Tracking: Metric trends over time
- Feature Importance: Model interpretability analysis
- Visualization: Prediction plots, residual analysis
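For drift detection, one common choice (an illustration, not necessarily the project's exact test) is a per-feature two-sample Kolmogorov-Smirnov test between the training reference data and recent production data:

```python
import pandas as pd
from scipy.stats import ks_2samp

def detect_drift(reference: pd.DataFrame, current: pd.DataFrame,
                 alpha: float = 0.05) -> dict:
    """Flag features whose current distribution differs from the reference."""
    drifted = {}
    for col in reference.columns:
        stat, p_value = ks_2samp(reference[col].dropna(), current[col].dropna())
        if p_value < alpha:   # reject "same distribution" at significance level alpha
            drifted[col] = {"ks_stat": float(stat), "p_value": float(p_value)}
    return drifted
```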
- Experiment Logging: Automatic metric and parameter tracking
- Model Versioning: Artifact management and model registry
- Visualization: Interactive plots and dashboards
- Collaboration: Team experiment sharing
Each run logs:

- Model hyperparameters and architecture
- Training and validation metrics
- Feature importance scores
- Prediction visualizations
- Data quality reports
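A minimal W&B logging sketch mirroring the items above; the project name, metric keys, and file path are placeholders:

```python
import wandb

run = wandb.init(
    project="baq-pm25-forecasting",                 # hypothetical project name
    tags=["pm2.5", "forecasting", "air-quality"],   # tags from the example config
    config={"model_type": "lstm", "epochs": 100},   # hyperparameters to track
)
for epoch in range(100):
    metrics = {"train/mae": 0.0, "val/mae": 0.0}    # replace with real values
    wandb.log(metrics, step=epoch)                  # per-epoch metric tracking
artifact = wandb.Artifact("pm25-model", type="model")
artifact.add_file("outputs/model.h5")               # serialized model file
run.log_artifact(artifact)                          # versioned model artifact
run.finish()
```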
- Modular Design: Separate concerns into focused modules
- Configuration-Driven: Hydra-based parameter management
- Type Safety: Type hints and py.typed marker
- Testing: Comprehensive test coverage (notebooks for experimentation)
- Documentation: Detailed docstrings and examples
- TimeSeriesDataProcessor: Main preprocessing pipeline
  - Features: Data cleaning, feature engineering, scaling, validation
  - Methods: `fit_transform()`, `transform()`, `inverse_transform_target()`
- LSTMForecaster: Deep learning model implementation
  - Features: Configurable architecture, callbacks, early stopping
  - Methods: `fit()`, `predict()`, model checkpointing
- Forecasting Functions: Single-step and multi-step prediction
  - Features: Model-agnostic interface, sequence handling
  - Methods: `single_step_forecasting()`, `multi_step_forecasting()`
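A hypothetical end-to-end usage of these components, given `train_df`, `test_df`, and a fitted `model`; the class, method, and function names come from this README, but the import paths and signatures are assumptions based on the repository layout:

```python
from baq.core.inference import multi_step_forecasting
from baq.data.processing import TimeSeriesDataProcessor

processor = TimeSeriesDataProcessor(target_column="pm2_5_(μg/m³)")
train_features = processor.fit_transform(train_df)   # fit scalers/encoders on train only
test_features = processor.transform(test_df)         # reuse the fitted state on test data

preds_scaled = multi_step_forecasting(model, test_features, horizon=24)
preds = processor.inverse_transform_target(preds_scaled)   # back to μg/m³
```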
- Implement model class in `src/baq/models/`
- Add configuration section in `config.yaml`
- Update training logic in `src/baq/steps/train.py` (see the sketch after this list)
- Add evaluation support in `src/baq/steps/evaluate.py`
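To illustrate the training-logic update, a dispatch on `cfg.model.model_type` in `src/baq/steps/train.py` might look like the following sketch; the actual implementation may differ, and the `xgboost` config section is assumed (it is not shown in the sample config above):

```python
def train(train_set, cfg):
    model_type = cfg.model.model_type
    if model_type == "random_forest":
        from sklearn.ensemble import RandomForestRegressor
        model = RandomForestRegressor(**cfg.model.random_forest.model_params)
    elif model_type == "xgboost":
        from xgboost import XGBRegressor
        model = XGBRegressor(**cfg.model.xgboost.model_params)  # assumed config section
    elif model_type == "lstm":
        from baq.models.lstm import LSTMForecaster
        model = LSTMForecaster(**cfg.model.lstm.model_params)
    else:
        raise ValueError(f"Unknown model_type: {model_type}")
    X_train, y_train = train_set
    model.fit(X_train, y_train)
    return model
```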
- Model Files: Serialized trained models (.h5, .joblib)
- Preprocessors: Fitted scalers and encoders (.joblib)
- Metadata: Training configuration and metrics (.json)
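A sketch of saving and reloading this artifact layout with joblib and json; paths and helper names are illustrative, and Keras models would instead use `model.save(...)` to produce the `.h5` file:

```python
import json
import joblib

def save_artifacts(model, processor, metadata: dict, out_dir: str = "outputs") -> None:
    joblib.dump(model, f"{out_dir}/model.joblib")           # trained sklearn/XGBoost model
    joblib.dump(processor, f"{out_dir}/processor.joblib")   # fitted scalers and encoders
    with open(f"{out_dir}/metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)                    # training config and metrics

def load_artifacts(out_dir: str = "outputs"):
    model = joblib.load(f"{out_dir}/model.joblib")
    processor = joblib.load(f"{out_dir}/processor.joblib")
    return model, processor
```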
- Batch Prediction: Process historical data in batches
- Real-time API: Deploy models as REST APIs
- Scheduled Jobs: Automated retraining and prediction
- Cloud Deployment: AWS, GCP, Azure integration
Our project fully develops workflow orchestration and implements a CI/CD pipeline with a focus on performance validation and industry best practices for responsible deployment. **You can check this process in #21**, which demonstrates our CI/CD strategy.
- **CI/CD on Code Change:** Continuous Integration (CI) and Continuous Deployment (CD) are triggered automatically on every code change. This ensures that our codebase remains robust and testable, with unit and integration tests running on each commit or pull request.
- **Human-in-the-Loop Model Promotion:** While code changes follow an automated pipeline, model deployments require team approval before going live. This step ensures that model performance is verified by humans and aligns with business objectives before release.
We streamline deployment by leveraging Weights & Biases (W&B) and cloud-native practices:
- The latest model artifact is automatically loaded by our cloud infrastructure on startup.
- There is no need to rebuild or redeploy Docker images for every model update.
- This decouples model deployment from application builds, allowing for faster iterations and rollback capabilities.
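A sketch of what that startup loading can look like with the W&B API; the entity, project, and artifact names are placeholders:

```python
import wandb

api = wandb.Api()
# Pull the most recently promoted model without rebuilding the service image.
artifact = api.artifact("my-team/baq-pm25-forecasting/pm25-model:latest")
model_dir = artifact.download()   # local directory containing the model file
# ...load the model from model_dir with the appropriate framework and serve it.
```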
- ✅ Ensures high model performance before deployment
- ✅ Reduces deployment time and resource overhead
- ✅ Encourages responsible ML practices with human validation
- ✅ Simplifies infrastructure with dynamic model loading
This approach balances automation with accountability, aligning with real-world MLOps best practices.
- Data Loading Errors
  - Check file paths in `config.yaml`
  - Verify data format and column names
  - Ensure proper datetime indexing
- Memory Issues
  - Reduce batch size for LSTM training
  - Use data chunking for large datasets
  - Monitor memory usage during processing
- Model Performance
  - Check feature engineering pipeline
  - Verify target column name format
  - Review hyperparameter settings
- W&B Connection Issues
  - Verify API key: `wandb login`
  - Check internet connectivity
  - Review project permissions
- Feature Selection: Use domain knowledge for feature engineering
- Hyperparameter Tuning: Grid search or Bayesian optimization
- Data Quality: Ensure clean, consistent input data
- Model Selection: Choose appropriate algorithm for data characteristics
Recent performance restoration includes:
- Enhanced Feature Engineering: AQI tiers, cyclical encoding, weekend/night indicators
- Robust Data Processing: Better column handling, weather code encoding
- Improved Target Handling: Multiple column name format support
- Extended Rolling Windows: Additional temporal feature scales
See PERFORMANCE_RESTORATION_SUMMARY.md for detailed analysis.
- Fork the repository
- Create a feature branch: `git checkout -b feature/new-feature`
- Make changes and add tests
- Update documentation as needed
- Submit a pull request
- Follow PEP 8 style guidelines
- Add type hints to new functions
- Include docstrings with examples
- Test changes with different model types
- Update configuration documentation
Note: This project is designed for educational and research purposes in air quality forecasting. For production use, additional validation and testing are recommended.

