This project develops machine learning models to predict key dissolved gas concentrations (primarily methane and hydrogen) at the Southern Hydrate Ridge methane seep site using multimodal geophysical data from the Ocean Observatories Initiative (OOI) Regional Cabled Array (RCA). By mapping robust physical measurements (seismometers, pressure sensors, and an acoustic Doppler current profiler) to residual gas analyzer–derived concentrations, we aim to fill instrument downtime and data gaps in seep chemistry. This README provides a concise technical overview; see docs/writeup.md for the full narrative description and figures.
Southern Hydrate Ridge, located offshore Oregon, is a methane seep site of significant biogeochemical importance. The site features active methane venting, gas hydrate deposits, and complex fluid flow dynamics. Understanding temporal variations in methane (CH₄) and hydrogen (H₂) concentrations is crucial for:
- Quantifying methane flux to the ocean and atmosphere
- Understanding subsurface biogeochemical processes
- Characterizing seep dynamics and episodic venting events
- Assessing climate-relevant greenhouse gas emissions
However, mass spectrometer instruments require frequent maintenance and calibration, leading to data gaps. This project addresses these gaps by training models on reliable geophysical proxies.
All data are collected from the OOI RCA infrastructure at Southern Hydrate Ridge:
-
Seismic Data
- Station: OO.HYS14
- Channels: HHZ, HHN, HHE (200 Hz sampling rate)
- Three-component broadband seismometer data
- Sensitive to ground motion, tremor, and fluid flow dynamics
-
Acoustic Data
- Instrument: ADCP (Acoustic Doppler Current Profiler)
- Dataset: RS01SUM2-MJ01B-12-ADCPSK101
- Measures bubble plume velocity and water column dynamics
- Hourly measurements of current profiles
-
Pressure Data
- Bottom pressure recorders (BPR)
- Sensitive to tidal loading and seasonality trends
- Used to detrend average bubble plume velocity
- Residual Gas Analyzer Data
- Methane (CH₄) concentration (nM/L)
- Hydrogen (H₂) concentration (nM/L)
- Hydrogen Sulfide (H₂S) concentration (mM/L)
- Nitrogen (N₂) concentration (nM/L)
- Oxygen (O₂) concentration (nM/L)
- Carbon dioxide (CO₂) concentration (nM/L)
- Derived from partial pressure measurements using Henry's Law
- Ex:
methane_concentration_2017.csv
All seismic are aligned to 22-second time windows corresponding to RGA measurement timestamps, while ADCP measurements are on an hour long basis.
For each 22-second window:
- Download: Query IRIS FDSN web services for three-component data
- Detrend: Remove linear trends using ObsPy
- Taper: Apply 5% Hanning window to edges
- Highpass Filter: Highpass filter at 2 Hz cutoff
- Removes microseism noise and long-period ocean loading
- Focuses on signals related to fluid flow and bubble dynamics
- Quality Control: Check for gaps, NaN values, and data quality
- Load NetCDF: Read hourly ADCP velocity data
- Temporal Extraction: For each 22-second window, extract all ADCP measurements - may replicate many measurements across windows due to hourly binning constraint
- Bin Averaging: If multiple depth bins present, compute mean across bins
- Unit Conversion: Convert m/s to cm/s where appropriate
All chmeical concentrations are calculated from mass spectrometer partial pressure measurements using Henry's Law:
C = K_H × P_partial
Where:
- C = dissolved gas concentration (nM/L)
- K_H = Henry's Law constant (temperature and depth dependent - unique for each concentration, courtesy of NOAA)
- P_partial = partial pressure measured by mass spectrometer
This project initially implemented two complementary modeling strategies, each suited to different aspects of the data:
Training Data Format: Extracted statistical and spectral features
Feature Extraction (per 22-second window):
Seismic Features (per component: Z, N, E):
- Mean amplitude
- Spectral features: top dominant frequencies and their power
- Average frequency and amplitude across all components
- Total: 11
Acoustic Features:
- Mean velocity in 22-second window
- Maximum velocity
- Minimum velocity
- Median velocity
Model Architecture:
- Algorithm: Random Forest Regressor (scikit-learn)
- Configuration:
- 800 trees
- Max depth: 12
- Min samples leaf: 2
- Max features: 'sqrt'
- Validation: 5-fold cross-validation
Advantages:
- Interpretable feature importances
- Handles non-linear relationships
- Robust to outliers
- Fast training and inference
Training Data Format: Raw time series arrays (post-filtering)
Data Structure (per 22-second window):
- Seismic: 3 channels × 4400 samples (200 Hz × 22 sec)
- Acoustic: Variable samples (hourly measurements interpolated)
Model Architecture:
- Framework: PyTorch
- Architecture: 1D CNN Regressor
- Validation: 5-fold cross-validation
Advantages:
- Captures temporal patterns and phase relationships
- Learns hierarchical features automatically
- Better for sequential dependencies
- No manual feature engineering required
- Training set: 70% of data (further split in 5-fold CV)
- Validation sets: 15% of data, 5 folds for cross-validation
- Test set: 15% held-out data for final evaluation
- RMSE (Root Mean Squared Error): Prediction accuracy
- R² (Coefficient of Determination): Explained variance
- MAE (Mean Absolute Error): Average prediction error
- Cross-validation R²: 90%
- Test set R²: ~90%
- Test set residual: ~85%
Top Feature Importances:
- Dominant spectral power from seismic East component (E_power1) - 39%
- Dominant spectral frequency from seismic East component (E_freq1) - 10%
- Average amplitude from seismic East component (E_mean) - 9%
- Median bubble plume velocity from acoustic velocity profiler (adcp_median) - 9%
- Mean bubble plume velocity from acoustic velocity profiler (adcp_mean) - 9%
The CNN model resulted in poor performance on raw time series data, while the RF showed robust performance. We opted to use the RF model in our final analysis.
MLGEO_2026_Methane_Seeps/
├── notebooks/ # Analysis and modeling notebooks
│ ├── random_forest_features.ipynb # Feature-based RF model
│ ├── time_series_cnn.ipynb # CNN on raw time series
│ ├── shr_seismicity*.ipynb # Seismic and short-duration event exploration
│ ├── MSDataToCSV.py # MASSPA data-to-CSV utilities
│ └── ... # Additional data pull and figure notebooks
├── figures/ # Exported figures for writeup/docs
├── docs/
│ ├── README.md # This file (project overview)
│ ├── writeup.md # Full scientific writeup
│ ├── environment.yml # Conda environment
│ └── requirements.txt # pip requirements
│ ├── MSDataCollectorV1/ # Raw data collection utilities
│ ├── pyproject.toml # Python project metadata
│ └── uv.lock # Locked dependency versions
## Dependencies
This project’s environment is defined in `docs/environment.yml` (Conda) and `docs/requirements.txt` (pip). The `pyproject.toml` in `MLGEO_2026_Methane_Seeps/docs/` provides a lightweight dependency specification for the subproject.
### Recommended Python
- Python >= 3.8
### Core Packages
- Scientific computing: `numpy`, `pandas`, `scipy`
- Seismic and NetCDF I/O: `obspy`, `netCDF4`, `cftime`, `xarray`
- Machine learning: `scikit-learn`, `torch`/`pytorch`, `torchvision`
- Visualization: `matplotlib`, `seaborn`
- Notebooks and tooling: `jupyter`, `ipykernel`, `notebook`, `jupyterlab`, `tqdm`, `statsmodels`
- Web and utilities: `requests`, `bs4`, `datetime`
To reproduce the environment with Conda:
```bash
conda env create -f docs/environment.yml
conda activate mlgeo-methane-seeps
Or with pip (inside a virtual environment):
pip install -r docs/requirements.txt- IRIS FDSN Web Services: Seismic waveform data
- Client:
obspy.clients.fdsn.Client("IRIS") - Networks: OO (Ocean Observatories Initiative)
- Client:
- Continuous prediction of CH₄ concentrations during instrument downtime
- Improved temporal resolution for flux calculations
- Better characterization of episodic venting events
- Correlation between seismic tremor and chemical release
- Relationship between bottom currents and plume dispersal
- Pressure-driven fluid migration patterns
- Real-time prediction of elevated methane concentrations
- Detection of anomalous venting events
- Integration with autonomous sampling strategies
- Temporal Coverage: Training data limited to 2017 when all instruments operational
- Spatial Coverage: Single seismic station (HYS14)
- ADCP Sampling: Hourly measurements limit temporal resolution
- Seasonal Bias: May not capture full range of environmental conditions
- Multi-year Training: Incorporate data from 2014-2020
- Multi-station Ensemble: Use all HYS1* stations, even at 8Hz sampling, to expand coverage for low-frequency events.
- Additional Features:
- Temperature and salinity from CTD sensors
- Tidal phase and magnitude
- Transfer Learning: Apply models to other seep sites
- Real-time Deployment: Implement models in OOI cyberinfrastructure
- Physics-informed ML: Incorporate fluid dynamics constraints
Christina Stuhl David Lovett Michael Hemmett Isaac Olson
- Ocean Observatories Initiative (OOI): Data infrastructure and access
- EarthScope Consortium: Seismic waveform services
- University of Washington Earth and Space Sciences: Computational resources
- Dr. Akshay Mehra and Henry Yuan: Guidance and project feedback
MacLeod, L. M. F., & Wilcock, W. S. D. (2025). Nonseismic short-duration events offshore Cascadia: Characteristics and potential origin. Seismological Research Letters, 96(2A), 706–720. https://doi.org/10.1785/0220240367
Marcon, Y., Kelley, D. S., Thornton, B., Manalang, D., & Bohrmann, G. (2021). Variability of natural methane bubble release at Southern Hydrate Ridge. Geochemistry, Geophysics, Geosystems, 22, e2021GC009894. https://doi.org/10.1029/2021GC009894
Philip, B. T., A. R.Denny, E. A.Solomon, and D. S. Kelley (2016). Time-series measurements of bubble plume variability and water column methane distribution above Southern Hydrate Ridge, Oregon, Geochem. Geophys. Geosyst., 17, 1182–1196, doi:10.1002/2016GC006250.
Reeburgh, W. S. (2007). Oceanic methane biogeochemistry. Chemical Reviews, 107(2), 486–513. https://doi.org/10.1021/cr050362v
Römer, M., Sahling, H., Pape, T., Bahr, A., Feseker, T., Wintersteller, P., & Bohrmann, G. (2014). Microbial abundance and diversity patterns associated with sediments and carbonates from methane seep environments of Hydrate Ridge, Oregon. Frontiers in Marine Science, 1, Article 44. https://doi.org/10.3389/fmars.2014.00044
Sahling, H., Galkin, S. V., Salyuk, A., Greinert, J., Foerstel, H., Piepenburg, D., & Suess, E. (2002). Macrofaunal community structure and sulfide flux at gas hydrate deposits from the Cascadia convergent margin, NE Pacific. Marine Ecology Progress Series, 231, 121–138.
Treude, T., Boetius, A., Knittel, K., Wallmann, K., & Jørgensen, B. B. (2003). Anaerobic oxidation of methane above gas hydrates at Hydrate Ridge, NE Pacific Ocean. Marine Ecology Progress Series, 264, 1–14.
Tryon, M. D., Brown, K. M., & Torres, M. E. (2001). Complex flow patterns through Hydrate Ridge and their impact on seep biota. Geophysical Research Letters, 28(15), 2863–2866. https://doi.org/10.1029/2000GL012566
Luff, R., Wallmann, K., Aloisi, G., et al. (2008). Miniaturized biosignature analysis reveals implications for the formation of cold seep carbonates at Hydrate Ridge (off Oregon, USA). Biogeosciences, 5, 731–741.
Ocean Observatories Initiative. "Regional Cabled Array." https://oceanobservatories.org/
MIT open license
Last Updated: March 2026
Project Status: Not currently under development