Matt Hipsey edited this page Oct 4, 2025 · 10 revisions

Repository Access

The csiem-data GitHub repository hosts the import and processing code, keys and catalogues, but not the bulk data. To access the repository, use a client such as GitHub Desktop, or clone it from the command line:

git clone git@github.com:SEAF-CS/csiem-data

Once you have the main repository, follow the steps below to source the data and include it in the csiem-data directory.


Data Storage

All raw and processed data is stored on Pawsey Supercomputing Centre's Acacia S3 storage, within the "csiem" bucket. Raw data is stored in the data-lake directory, and all processed data is stored in the data-warehouse directory.
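As a rough sketch of how this layout looks from an S3 client, the snippet below lists the contents of a directory in the bucket with the AWS CLI. It assumes the AWS CLI is installed and that a profile holding your Access Key ID and Secret already exists. The profile name `csiem` and the helper name `csiem_ls` are hypothetical, and the endpoint is derived from the host name and port given in the WinSCP instructions under Data Access.

```shell
# Assumptions: AWS CLI installed; the "csiem" profile name is hypothetical.
CSIEM_ENDPOINT="https://projects.pawsey.org.au"  # WinSCP host name, port 443
CSIEM_BUCKET="csiem"
CSIEM_PROFILE="csiem"

# List the contents of a directory (S3 prefix) inside the csiem bucket.
csiem_ls() {
  aws --profile "$CSIEM_PROFILE" --endpoint-url "$CSIEM_ENDPOINT" \
    s3 ls "s3://$CSIEM_BUCKET/$1"
}

# Usage:
#   csiem_ls data-lake/        # raw data
#   csiem_ls data-warehouse/   # processed data
```

Wrapping the call in a function keeps the endpoint and bucket in one place, so the same helper works for both directories.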

Data Lake

A Data Lake is a centralised store of raw, disparate datasets, whether structured or unstructured. Centralising and cataloguing this raw data allows customised ETL processes to be built for each analytics use case, rather than forcing the data into a one-size-fits-all structure.

Data Warehouse

The Data Warehouse is the store of boutique, customised data products produced through the batch ETL (Extract, Transform and Load) pipelines. Each folder within the data-warehouse directory on GitHub contains processed data in different formats, based upon end user requirements.

All data usage outside of the ETL pipelines must be carried out from products within the warehouse. This ensures both the efficiency and repeatability of any script or data product produced downstream, and provides a consistent data pathway for data validation.

There are currently four separate directories within the data-warehouse:

  • csv
  • mat
  • parquet
  • data-images
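One possible way to pull a single warehouse directory into a local clone is `aws s3 sync`, sketched below. This is an illustration under stated assumptions, not the project's prescribed workflow: the AWS CLI, the `csiem` profile name, the helper name `csiem_fetch` and the local destination path `csiem-data/data-warehouse/` are all assumptions, and the endpoint is derived from the WinSCP host name under Data Access.

```shell
# Assumptions: AWS CLI installed; profile name and local path are hypothetical.
CSIEM_ENDPOINT="https://projects.pawsey.org.au"

# Download one warehouse directory (csv, mat, parquet or data-images).
# Any extra arguments are passed through to "aws s3 sync", e.g. --dryrun.
csiem_fetch() {
  case "${1:-}" in
    csv|mat|parquet|data-images) product="$1"; shift ;;
    *) echo "unknown warehouse directory: ${1:-}" >&2; return 1 ;;
  esac
  aws --profile csiem --endpoint-url "$CSIEM_ENDPOINT" \
    s3 sync "s3://csiem/data-warehouse/$product/" \
    "csiem-data/data-warehouse/$product/" "$@"
}

# Usage (preview the transfer first, then run it for real):
#   csiem_fetch csv --dryrun
#   csiem_fetch csv
```

Running with `--dryrun` first shows which files would be transferred without downloading anything.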

CSV

The csv directory contains data that has been imported directly from the data lake and standardised against the variable catalogue. Data is separated by site, variable and, in some cases, sampling campaign.

MAT

The mat directory contains the following MATLAB binary (.mat) files:

  • seaf.mat: All data variables in the units found in the variable catalogue
  • cockburn.mat: Subset of the variables required for the TFV-AED model, converted into the units in the TFV catalogue.

data-images

A collection of raw data summary sheets for each ingested site/variable.


Data Access

An Access Key ID and Secret are required to access the data. They can be obtained by contacting Brendan Busch ([email protected]).

Data can be accessed in a variety of ways, as the Pawsey storage is based on Amazon's S3 storage protocol. Below is a how-to for WinSCP.

WinSCP

Open WinSCP.

A login window should pop up. If it does not, select Session -> Start New Session.

Select new site -> File Protocol -> Amazon S3

In the Host Name field, enter: projects.pawsey.org.au

Port Number 443 (should be default)

Enter the Access Key ID and Secret access key below (see Brendan Busch for access).

In the Advanced menu -> Environment -> S3

Change URL style to ‘Path’

To create a bucket, follow the instructions at the link below, noting the naming conventions:

https://support.pawsey.org.au/documentation/display/US/Acacia+-+Introduction
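Because the storage speaks the S3 protocol, the WinSCP settings above should carry over to other S3 clients. As a minimal sketch, assuming the AWS CLI is installed, the function below stores the credentials and the path-style URL setting in a named profile. The profile name `csiem` and the function name `csiem_configure` are hypothetical, and the endpoint URL is simply the WinSCP host name over HTTPS (port 443).

```shell
# Assumption: AWS CLI installed; the "csiem" profile name is hypothetical.
# Arguments: $1 = Access Key ID, $2 = Secret access key.
# Note: values passed on the command line may end up in your shell history.
csiem_configure() {
  aws configure set aws_access_key_id "$1" --profile csiem
  aws configure set aws_secret_access_key "$2" --profile csiem
  # Equivalent of setting WinSCP's URL style to 'Path'
  aws configure set s3.addressing_style path --profile csiem
}

# Usage, followed by a quick connectivity check against the csiem bucket:
#   csiem_configure "<access-key-id>" "<secret-access-key>"
#   aws --profile csiem --endpoint-url https://projects.pawsey.org.au s3 ls s3://csiem/
```

The endpoint is passed on each call rather than stored in the profile, since `--endpoint-url` is accepted by every AWS CLI command.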
