
Getting started : DEVELOPERS (In development)

santi727G edited this page Jul 24, 2025 · 1 revision

Overview

To get started, first clone the csiem-data repository, then download the data-lake and data-warehouse folders from storage.

Sourcing repository

git clone git@github.com:AquaticEcoDynamics/csiem-data

The repository is approximately 2 GB in size.

Accessing data from storage

For more information see https://github.com/AquaticEcoDynamics/csiem-data/wiki/Access

Passwords are kept in Dropbox.

The data lake is approximately 80 GB in size.

The data warehouse is approximately 90 GB in size.

Organising your working folder

Co-locate the data-lake and data-warehouse folders with the repository folders, so that the working folder looks like this:

csiem-data/
  code/
  data-governance/
  data-mapping/
  data-warehouse/
  data-lake/
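
A quick way to confirm the layout above before running anything is to check that each expected folder is present. The following is a minimal sketch, not part of the repository: the function name missing_dirs is hypothetical, and it assumes the script is run from the folder that contains csiem-data/.

```python
from pathlib import Path

# Folder names taken from the layout shown above.
EXPECTED_DIRS = (
    "code",
    "data-governance",
    "data-mapping",
    "data-warehouse",
    "data-lake",
)


def missing_dirs(root):
    """Return the expected sub-folders that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumes the working folder is named csiem-data, as in the layout above.
    missing = missing_dirs("csiem-data")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("Working folder layout looks complete.")
```

An empty result means all five folders are in place; any names returned are the folders still to be cloned or downloaded.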

Pipeline process

The CSIEM data pipeline consists of a modular process that ingests, filters, and transforms environmental monitoring data from the data lake into a structured data warehouse. The output is a collection of .csv, .mat and .parquet files organized by agency and suitable for MATLAB-based visualization tools such as MARVL.

Input Files Required

The pipeline requires the following .mat files for metadata, typically located in csiem-data/code/actions/:

  • agency.mat

  • varkey.mat

  • sitekey.mat

These contain reference mappings for agency codes, variable keys, and site identifiers. Raw environmental observations are sourced from the data-lake/ folder in CSV or .mat format.

Change data/code

Workflow Overview

The core processing steps are:

Read raw data:

Load raw time series from data-lake/AGENCY_NAME/*.csv.

Clean and filter:

  • Drop missing or invalid entries
  • Harmonize timestamps, variable names, units
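
The clean-and-filter step can be illustrated with a short Python sketch. This is not the pipeline's actual code. The column names (time, variable, value), the timestamp formats, and the alias table are all assumptions made for the example. In the real pipeline the variable mapping comes from varkey.mat. Unit conversion is left out to keep the example short.

```python
import csv
import io
import math
from datetime import datetime

# Hypothetical raw-to-standard variable names (the real mapping is in varkey.mat).
VARIABLE_ALIASES = {"temp": "Temperature", "sal": "Salinity"}

# Hypothetical timestamp layouts that raw agency files might use.
TIME_FORMATS = ("%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M")


def parse_time(text):
    """Return a 'YYYY-MM-DD HH:MM:SS' string, or None if no format matches."""
    for fmt in TIME_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).isoformat(sep=" ")
        except ValueError:
            continue
    return None


def clean_rows(rows):
    """Drop missing or invalid entries; harmonize timestamps and variable names."""
    cleaned = []
    for row in rows:
        time = parse_time(row.get("time") or "")
        try:
            value = float(row.get("value") or "")
        except ValueError:
            value = math.nan
        # Skip rows with an unreadable timestamp or a missing/non-numeric value.
        if time is None or math.isnan(value):
            continue
        name = (row.get("variable") or "").strip()
        cleaned.append(
            {"time": time, "variable": VARIABLE_ALIASES.get(name, name), "value": value}
        )
    return cleaned


# Small in-memory example standing in for a file under data-lake/AGENCY_NAME/.
raw = io.StringIO(
    "time,variable,value\n"
    "2020-01-05 10:00:00,temp,21.5\n"
    "05/01/2020 11:00,sal,35.1\n"
    "not a date,temp,20.0\n"
    "2020-01-05 12:00:00,temp,\n"
)
rows = clean_rows(csv.DictReader(raw))
for row in rows:
    print(row)
```

Of the four sample rows, the two with a bad timestamp or an empty value are dropped, and the remaining two share one timestamp format and use the standardized variable names.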

Output Structure

Each agency-specific output contains three synchronized versions of the data:

  • data.csv – standard delimited format
  • data.parquet – optimized for analytical processing and columnar access
  • data.mat – compatible with MATLAB-based tools such as MARVL
