Getting started : DEVELOPERS (In development)
To get started, first clone the csiem-data repository, then access the data-lake and data-warehouse folders from storage.
git clone git@github.com:AquaticEcoDynamics/csiem-data
The repository is approximately 2 GB in size.
For more information see https://github.com/AquaticEcoDynamics/csiem-data/wiki/Access
Passwords are stored in Dropbox.
Data lake is approximately 80GB in size
Data warehouse is approximately 90GB in size
Co-locate the data-lake and data-warehouse folders with the rest of the repository, so that the layout is:
csiem-data/
  code/
  data-governance/
  data-mapping/
  data-warehouse/
  data-lake/
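Before running anything, it can help to confirm that the folders above sit side by side. The sketch below is a hypothetical helper, not part of the repository: the folder names come from the layout above, while the function name `missing_folders` and the assumption that it runs from the directory containing csiem-data/ are illustrative only.

```python
from pathlib import Path

# Folder names taken from the layout shown above.
EXPECTED_FOLDERS = [
    "code",
    "data-governance",
    "data-mapping",
    "data-warehouse",
    "data-lake",
]


def missing_folders(root):
    """Return the expected folders that are not present under root."""
    root = Path(root)
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains csiem-data/.
    missing = missing_folders("csiem-data")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("Layout looks complete.")
```

An empty result means all five folders were found; anything listed still needs to be downloaded or moved into place.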
The CSIEM data pipeline consists of a modular process that ingests, filters, and transforms environmental monitoring data from the data lake into a structured data warehouse. The output is a collection of .csv, .mat and .parquet files organized by agency and suitable for MATLAB-based visualization tools such as MARVL.
The pipeline requires the following .mat files for metadata, typically located in csiem-data/code/actions/:
- agency.mat
- varkey.mat
- sitekey.mat
These contain reference mappings for agency codes, variable keys, and site identifiers. Raw environmental observations are sourced from the data-lake/ folder in CSV or .mat format.
The core processing steps are:
- Load raw time series from data-lake/AGENCY_NAME/*.csv.
- Drop missing or invalid entries.
- Harmonize timestamps, variable names, and units.
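The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the column names (`time`, `variable`, `value`), the two timestamp formats, and the `VARIABLE_MAP` lookup are hypothetical stand-ins for the real agency files and the mappings held in varkey.mat. Unit conversion is left out for brevity.

```python
import csv
import io
import math
from datetime import datetime

# Hypothetical raw-name -> standard-name mapping (the real keys live in varkey.mat).
VARIABLE_MAP = {"temp": "Temperature", "Temp (C)": "Temperature", "sal": "Salinity"}

# Timestamp formats assumed for illustration only.
TIME_FORMATS = ["%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M"]


def parse_time(text):
    """Try each assumed format; return None if none of them match."""
    for fmt in TIME_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


def clean_rows(lines):
    """Yield harmonized rows from an iterable of CSV lines, skipping bad ones."""
    for row in csv.DictReader(lines):
        when = parse_time(row.get("time") or "")
        name = VARIABLE_MAP.get((row.get("variable") or "").strip())
        try:
            value = float(row.get("value") or "")
        except ValueError:
            value = math.nan
        # Drop rows with an unparseable time, unknown variable, or non-finite value.
        if when is None or name is None or not math.isfinite(value):
            continue
        yield {"time": when.isoformat(sep=" "), "variable": name, "value": value}


if __name__ == "__main__":
    raw = io.StringIO(
        "time,variable,value\n"
        "2020-01-02 03:04:05,temp,21.5\n"
        "02/01/2020 03:04,sal,\n"
    )
    for cleaned in clean_rows(raw):
        print(cleaned)
```

In this sketch a row is discarded as soon as any of its three fields fails to parse, which keeps the output uniform at the cost of silently losing partial records; a real run would likely log what was dropped.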
Each agency-specific output contains three synchronized versions of the data:
- data.csv – standard delimited format
- data.parquet – optimized for analytical processing and columnar access
- data.mat – compatible with MATLAB-based tools such as MARVL
