Getting started: Developers (in development)
To get started, first clone the csiem-data repository, then access the data-lake and data-warehouse folders from storage.
git clone git@github.com:AquaticEcoDynamics/csiem-data
The repository is approximately 2 GB in size.
For more information on access, see https://github.com/AquaticEcoDynamics/csiem-data/wiki/Access
Passwords are stored in Dropbox.
The data lake is approximately 80 GB in size.
The data warehouse is approximately 90 GB in size.
Co-locate the data folders with the cloned repository:
csiem-data/
    code/
    data-governance/
    data-mapping/
    data-warehouse/
    data-lake/
The CSIEM data pipeline consists of a modular process that ingests, filters, and transforms environmental monitoring data from the data lake into a structured data warehouse. The output is a collection of .csv, .mat and .parquet files organized by agency and suitable for MATLAB-based visualization tools such as MARVL.
The pipeline requires the following .mat metadata files, typically located in csiem-data/code/actions/:
- agency.mat
- varkey.mat
- sitekey.mat
These contain reference mappings for agency codes, variable keys, and site identifiers. Raw environmental observations are sourced from the data-lake/ folder in CSV or .mat format.
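Before running the pipeline, it can help to confirm that the folders and metadata files described above are in place. The following is a minimal sketch, assuming the data folders sit inside csiem-data/ and the metadata files are in code/actions/; the function name `missing_paths` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Folder and file names are taken from the text above. Placing data-lake/ and
# data-warehouse/ inside the repository root is an assumption.
EXPECTED_DIRS = ["code", "data-governance", "data-mapping", "data-warehouse", "data-lake"]
METADATA_FILES = ["agency.mat", "varkey.mat", "sitekey.mat"]


def missing_paths(repo_root):
    """Return the expected folders and metadata files that are absent."""
    root = Path(repo_root)
    wanted = [root / name for name in EXPECTED_DIRS]
    wanted += [root / "code" / "actions" / name for name in METADATA_FILES]
    return [path for path in wanted if not path.exists()]


if __name__ == "__main__":
    # Run from the directory that contains the csiem-data clone.
    for path in missing_paths("csiem-data"):
        print(f"missing: {path}")
```

An empty result means everything named in this section was found; it says nothing about whether the files themselves are complete.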
The core processing steps are:
- Load raw time series from data-lake/AGENCY_NAME/*.csv
- Drop missing or invalid entries
- Harmonize timestamps, variable names, and units
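The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline's actual code: the column names (`time`, `variable`, `value`, `unit`), the timestamp format, and both lookup tables are hypothetical, and the real variable mappings come from varkey.mat.

```python
import csv
import math
from datetime import datetime
from pathlib import Path

# Hypothetical lookup tables for illustration only. In the real pipeline the
# variable mappings come from varkey.mat, whose contents are not shown here.
VARIABLE_NAMES = {"Temp": "Temperature", "Sal": "Salinity", "Chla": "Chlorophyll-a"}
# Source unit -> (target unit, multiplication factor).
UNIT_CONVERSIONS = {"degC": ("degC", 1.0), "PSU": ("PSU", 1.0), "ug/L": ("mg/L", 0.001)}


def clean_rows(lines, time_format="%d/%m/%Y %H:%M"):
    """Yield harmonized records from an iterable of CSV text lines."""
    for row in csv.DictReader(lines):
        try:
            stamp = datetime.strptime(row["time"].strip(), time_format)
            value = float(row["value"])
            name = VARIABLE_NAMES[row["variable"].strip()]
            unit, factor = UNIT_CONVERSIONS[row["unit"].strip()]
        except (KeyError, ValueError, AttributeError, TypeError):
            continue  # drop rows with missing, unparsable, or unmapped fields
        if not math.isfinite(value):
            continue  # drop NaN and infinite readings
        yield {
            "time": stamp.isoformat(),  # one timestamp format for all sources
            "variable": name,           # one name per variable
            "value": value * factor,    # value expressed in the target unit
            "unit": unit,
        }


def load_agency(lake_root, agency):
    """Clean every CSV file under lake_root/agency and return one list."""
    records = []
    for csv_path in sorted(Path(lake_root, agency).glob("*.csv")):
        with csv_path.open(newline="") as handle:
            records.extend(clean_rows(handle))
    return records
```

A call such as `load_agency("data-lake", "AGENCY_NAME")` would return the cleaned records for one agency. In this sketch, rows with an unmapped variable or unit are dropped rather than passed through; a real pipeline might prefer to log them instead.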
Each agency-specific output contains three synchronized versions of the data:
- data.csv – standard delimited format
- data.parquet – optimized for analytical processing and columnar access
- data.mat – compatible with MATLAB-based tools such as MARVL
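Because the three files are meant to stay in step, a quick check that every agency folder holds all of them can catch a partial run. The sketch below assumes a layout of data-warehouse/AGENCY_NAME/data.*, which the text above does not spell out, and the function name `incomplete_outputs` is hypothetical. It checks only that the files exist, not that their contents match.

```python
from pathlib import Path

# File names are taken from the list above.
OUTPUT_NAMES = ("data.csv", "data.parquet", "data.mat")


def incomplete_outputs(warehouse_root):
    """Map each agency folder name to the output files it lacks."""
    report = {}
    for agency_dir in sorted(Path(warehouse_root).iterdir()):
        if not agency_dir.is_dir():
            continue  # ignore stray files at the top level
        missing = [name for name in OUTPUT_NAMES if not (agency_dir / name).is_file()]
        if missing:
            report[agency_dir.name] = missing
    return report
```

An empty dictionary means every agency folder contains all three formats.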
Aquatic EcoDynamics