This project builds a predictive model for CO₂ concentrations recorded across Prince Edward Island (PEI) and New Brunswick (NB) using environmental sensor data. The notebook ingests the combined dataset, cleans it, selects province-specific predictors, and benchmarks two modeling pipelines (Linear Regression and XGBoost with five-fold cross-validation) to quantify how well each province’s emissions can be reconstructed.
- Source: `dataset.xlsx`, sheet `Combined`, which aggregates 528 time-stamped observations of soil (moisture, temperature, EC) and ambient (air temperature, humidity, wind speed, dew point, precipitation) variables plus greenhouse-gas concentrations.
- Columns include `Date`, `Sr. #`, the sensor readings listed above, `CO2`, `N2O`, `CH4`, `H2O`, and two empty placeholder columns. The notebook appends a `Province` label (first 408 entries → PEI, next 120 → NB) before dropping the irrelevant metadata columns.
- After dropping the unused columns and removing detected CO₂ outliers, the modeling dataset contains 516 rows and 13 columns (the eight sensor features, `CO2`, the three other gases, and `Province`).
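The province labeling described above can be sketched as follows. This is a minimal illustration, not the notebook's code: the helper name `label_provinces` is hypothetical, and the demo frame is a synthetic stand-in for the real sheet.

```python
import pandas as pd


def label_provinces(df: pd.DataFrame, n_pei: int = 408) -> pd.DataFrame:
    """Return a copy with a Province column: first n_pei rows PEI, the rest NB."""
    out = df.copy()
    out["Province"] = ["PEI" if i < n_pei else "NB" for i in range(len(out))]
    return out


# In the notebook the frame would come from the spreadsheet, for example:
#   raw = pd.read_excel("dataset.xlsx", sheet_name="Combined")
# Here a synthetic 528-row frame (placeholder values) keeps the sketch runnable.
demo = pd.DataFrame({"CO2": range(528)})
labeled = label_provinces(demo)
print(labeled["Province"].value_counts())
```

Labeling by row position only works while the sheet keeps its original row order, so the label should be added before any sorting or filtering.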
- Numeric inspection uses the Z-score routine to count outliers per feature for both provinces, which guided the choice of cleanup strategy.
- CO₂ outliers were trimmed differently per province: an IQR-based filter removed 11 PEI points, while a Z-score threshold (|z| > 2.5) removed 1 NB point. The cleaned partitions were concatenated for downstream modeling.
- Remaining rows are split again by province so that the modeling loop trains on PEI and NB independently while reusing the same pipeline helpers.
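The two province-specific filters can be sketched as below. The helper names are hypothetical, the IQR multiplier of 1.5 is an assumption (the usual convention, not stated in this README), and the Z-score uses the population standard deviation; only the |z| > 2.5 threshold comes from the text above. The demo values are synthetic.

```python
import pandas as pd


def iqr_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """True for values inside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is assumed."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values.between(q1 - k * iqr, q3 + k * iqr)


def zscore_mask(values: pd.Series, threshold: float = 2.5) -> pd.Series:
    """True for values whose absolute Z-score is at most the threshold."""
    z = (values - values.mean()) / values.std(ddof=0)
    return z.abs() <= threshold


# Synthetic CO2 readings with one obvious outlier per province.
demo = pd.DataFrame({
    "Province": ["PEI"] * 8 + ["NB"] * 10,
    "CO2": [10, 11, 12, 11, 10, 12, 11, 40,
            5, 6, 5, 6, 5, 6, 5, 6, 5, 30],
})

pei = demo[demo["Province"] == "PEI"]
nb = demo[demo["Province"] == "NB"]
pei_clean = pei[iqr_mask(pei["CO2"])]    # IQR filter for PEI
nb_clean = nb[zscore_mask(nb["CO2"])]    # Z-score filter for NB

# Concatenate the cleaned partitions for downstream modeling.
cleaned = pd.concat([pei_clean, nb_clean], ignore_index=True)
print(len(demo), "->", len(cleaned), "rows")
```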
- Candidate features: `Soil Mositure`, `Soil Temperature`, `Soil EC`, `Air Temperature [°C]`, `Precipitation [mm]`, `Relative Humidity [%]`, `Wind Speed [m/s]`, `Dew Point [°C]`.
- Selected predictors (per the correlation and visual inspection steps):
  - PEI: `Dew Point [°C]`, `Air Temperature [°C]`, `Soil Temperature`
  - NB: `Air Temperature [°C]`, `Dew Point [°C]`, `Soil Temperature`
- The notebook also maintains `CO2` as the target and keeps the remaining gases (`N2O`, `CH4`, `H2O`) for contextual analysis and potential multi-target extensions.
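One way the correlation step could look is sketched below: rank each candidate feature by its absolute Pearson correlation with `CO2`. The helper `rank_by_correlation` is hypothetical and the data are synthetic; the notebook's actual selection also relied on visual inspection, which this does not reproduce.

```python
import numpy as np
import pandas as pd


def rank_by_correlation(df: pd.DataFrame, target: str = "CO2") -> pd.Series:
    """Absolute Pearson correlation of each numeric column with the target, descending."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)


# Synthetic example: CO2 depends on air temperature, dew point tracks air
# temperature, and wind speed is unrelated noise.
rng = np.random.default_rng(0)
air = rng.normal(15.0, 5.0, size=200)
demo = pd.DataFrame({
    "Air Temperature [°C]": air,
    "Dew Point [°C]": air + rng.normal(0.0, 0.5, size=200),
    "Wind Speed [m/s]": rng.normal(3.0, 1.0, size=200),
    "CO2": 1.5 * air + rng.normal(0.0, 0.5, size=200),
})

ranking = rank_by_correlation(demo)
print(ranking)
```

Running the same ranking separately on the PEI and NB partitions would give province-specific orderings of the candidates.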
- Both Linear Regression and XGBoost regressors are evaluated with five-fold cross-validation (`KFold(n_splits=5, shuffle=True, random_state=42)`), ensuring every split reports MAE, MSE, RMSE, and R².
- Custom helpers collect predictions, record the averaged metric ± standard deviation, and plot measured vs. predicted / residual diagnostics to assess calibration per province and model.
- Column names are sanitized (square brackets removed) before XGBoost training, because XGBoost rejects feature names that contain `[`, `]`, or `<`.
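The evaluation loop and the column sanitization can be sketched as below. The helpers `sanitize_columns` and `cross_validate` are hypothetical names, the data are synthetic, and the spread is the sample standard deviation across folds (an assumption; the notebook's helpers may compute it differently). `LinearRegression` is used to keep the sketch light, and an `xgboost.XGBRegressor` instance could be passed in its place.

```python
import re

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold


def sanitize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Strip '[', ']' and '<' from column names."""
    return df.rename(columns=lambda c: re.sub(r"[\[\]<]", "", c).strip())


def cross_validate(model, X: pd.DataFrame, y: pd.Series,
                   n_splits: int = 5, seed: int = 42) -> pd.DataFrame:
    """Fit on each training fold and return per-metric mean and std over folds."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rows = []
    for train_idx, test_idx in kf.split(X):
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        pred = model.predict(X.iloc[test_idx])
        truth = y.iloc[test_idx]
        mse = mean_squared_error(truth, pred)
        rows.append({
            "MAE": mean_absolute_error(truth, pred),
            "MSE": mse,
            "RMSE": float(np.sqrt(mse)),
            "R2": r2_score(truth, pred),
        })
    folds = pd.DataFrame(rows)
    return pd.DataFrame({"mean": folds.mean(), "std": folds.std()})


# Synthetic stand-in for one province's selected predictors.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(100, 3)),
    columns=["Dew Point [°C]", "Air Temperature [°C]", "Soil Temperature"],
)
y = pd.Series(2.0 * X.iloc[:, 0] - X.iloc[:, 1] + 0.5 * X.iloc[:, 2]
              + rng.normal(0.0, 0.1, size=100), name="CO2")

X = sanitize_columns(X)
summary = cross_validate(LinearRegression(), X, y)
print(summary.round(3))
```

Calling `cross_validate` once per province and model yields one row of the results table for each combination.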
| Model | Province | MAE | MSE | RMSE | R² |
|---|---|---|---|---|---|
| Linear Regression | PEI | 0.93 ± 0.10 | 1.38 ± 0.29 | 1.17 ± 0.13 | 0.40 ± 0.09 |
| Linear Regression | NB | 0.73 ± 0.08 | 0.89 ± 0.20 | 0.94 ± 0.10 | 0.64 ± 0.10 |
| XGBoost | PEI | 0.73 ± 0.06 | 1.05 ± 0.22 | 1.02 ± 0.10 | 0.54 ± 0.11 |
| XGBoost | NB | 0.90 ± 0.13 | 1.21 ± 0.23 | 1.09 ± 0.11 | 0.51 ± 0.11 |
- Interpretation: Linear Regression already captures NB’s behavior fairly well (R² 0.64) but generalizes less well for PEI (R² 0.40). XGBoost lifts PEI’s R² to 0.54 but lowers NB’s to 0.51, a drop of about the same size as PEI’s gain. This suggests that nonlinear interactions help where the data volume is richer (PEI) but may overfit the smaller NB partition.
- Install the core dependencies (the notebook already imports these packages):
  `python -m pip install --user pandas numpy scipy matplotlib seaborn scikit-learn xgboost notebook openpyxl`
- Launch the notebook server and open the analysis:
  `jupyter notebook FinalProject.ipynb`
- Re-run the cells to regenerate figures or tune the model (the dataset file must stay next to the notebook).
- Structure:
.
├── data/ # Raw and processed spreadsheets (e.g., dataset.xlsx)
├── notebooks/ # The analyzed notebook(s) such as FinalProject.ipynb
├── README.md
└── scripts/ # Standalone helpers / conversions if needed
- Future work to consider:
  - Merge the NB and PEI partitions into a single multi-output model with province-aware features.
  - Tune the XGBoost hyperparameters (grid or random search) and explore regularized linear models.
  - Add a deployment-ready pipeline (data validation, serialization, scoring) or a dashboard to monitor predictions.