Getting started: DEVELOPERS (In development)

Overview

To get started, first clone the csiem-data repository, then download the data lake and data warehouse from storage.

Sourcing repository

git clone git@github.com:AquaticEcoDynamics/csiem-data

The repository is approximately 2 GB in size.

Accessing data from storage

For more information, see https://github.com/AquaticEcoDynamics/csiem-data/wiki/Access

The access passwords are kept in Dropbox.

The data lake is approximately 80 GB in size.

The data warehouse is approximately 90 GB in size.

Organising your working folder

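Co-locate the data folders (data-lake and data-warehouse) with the repository folders, so that the working folder looks like this: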

csiem-data/
  code/
  data-governance/
  data-mapping/
  data-warehouse/
  data-lake/
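
Before running anything, it can help to confirm that the working folder matches this layout. The following is a minimal sketch, not part of the csiem-data code: the helper name `missing_dirs` is hypothetical, and it assumes the working folder is called csiem-data and sits in the current directory.

```python
from pathlib import Path

# Folder names copied from the layout above.
EXPECTED_DIRS = ("code", "data-governance", "data-mapping", "data-warehouse", "data-lake")


def missing_dirs(root):
    """Return the expected sub-folders that are absent under `root` (hypothetical helper)."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumption: the working folder is ./csiem-data; adjust the path as needed.
    missing = missing_dirs("csiem-data")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("Working folder layout looks complete.")
```

If data-lake or data-warehouse is reported missing, the download from storage has probably not been placed inside the repository folder yet.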

Pipeline process

The CSIEM data pipeline consists of a modular process that ingests, filters, and transforms environmental monitoring data from the data lake into a structured data warehouse. The output is a collection of .csv, .mat and .parquet files organized by agency and suitable for MATLAB-based visualization tools such as MARVL.

Input Files Required

The pipeline requires the following .mat metadata files, typically located in csiem-data/code/actions/:

agency.mat
varkey.mat
sitekey.mat

These contain reference mappings for agency codes, variable keys, and site identifiers. Raw environmental observations are sourced from the data-lake/ folder in CSV or .mat format.

Change data/code

Workflow Overview

The core processing steps are:

Read raw data:

Load raw time series from data-lake/AGENCY_NAME/*.csv.

Clean and filter:

Drop missing or invalid entries
Harmonize timestamps, variable names, and units
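
The read and clean steps above can be sketched in a few lines of Python. This is an illustration only and is not taken from the csiem-data code. The column names (Date, Variable, Value, Units), the timestamp formats, and the two lookup tables are assumptions. In the real pipeline the variable mappings come from varkey.mat.

```python
import csv
import math
from datetime import datetime

# Hypothetical lookups for illustration; the real mappings live in varkey.mat.
VARIABLE_NAMES = {"temp": "Temperature", "sal": "Salinity"}
# Source unit -> (target unit, multiplier). Assumed example conversion.
UNIT_CONVERSIONS = {"ug/L": ("mg/L", 0.001)}
# Assumed timestamp layouts; extend to match the agency's raw files.
TIMESTAMP_FORMATS = ("%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M")


def read_raw_csv(path):
    """Read one raw CSV file into a list of dicts keyed by column header."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def parse_timestamp(text):
    """Try each known format in turn; return a datetime, or None if none match."""
    for fmt in TIMESTAMP_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


def clean_rows(rows):
    """Drop missing or invalid entries, then harmonize timestamps, names and units."""
    cleaned = []
    for row in rows:
        when = parse_timestamp(row.get("Date") or "")
        try:
            value = float(row.get("Value") or "")
        except ValueError:
            value = math.nan
        # Skip rows with an unparseable timestamp or a non-finite value.
        if when is None or not math.isfinite(value):
            continue
        unit = (row.get("Units") or "").strip()
        if unit in UNIT_CONVERSIONS:
            unit, factor = UNIT_CONVERSIONS[unit]
            value *= factor
        raw_name = (row.get("Variable") or "").strip()
        name = VARIABLE_NAMES.get(raw_name.lower(), raw_name)
        cleaned.append(
            {"Date": when.isoformat(sep=" "), "Variable": name, "Value": value, "Units": unit}
        )
    return cleaned
```

A typical call would be `clean_rows(read_raw_csv(path))` for each file matched by data-lake/AGENCY_NAME/*.csv. Rows that fail either check are dropped silently here. A production version would probably log them instead.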

Output Structure

Each agency-specific output contains three synchronized versions of the data:

data.csv – standard delimited format
data.parquet – optimized for analytical processing and columnar access
data.mat – compatible with MATLAB-based tools such as MARVL
