Access
The csiem-data GitHub repository hosts the import and processing code, keys and catalogues, but does not host the bulk data storage. To access the repository, use a client such as GitHub Desktop, or clone from the command line:
```
git clone git@github.com:SEAF-CS/csiem-data
```
Once you have the main repository, follow the steps below to source the data and include it in the csiem-data directory.
All raw and processed data is stored on Pawsey Supercomputing Centre's Acacia S3 storage, within the "csiem" bucket. Raw data is stored in the data-lake directory, and all processed data is stored in the data-warehouse directory.
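Because Acacia speaks the standard S3 protocol, the store can also be queried programmatically. Below is a minimal sketch using Python's boto3, assuming the endpoint and path-style addressing described in the WinSCP instructions further down; the placeholder credentials are stand-ins for the keys described below.

```python
import boto3
from botocore.config import Config

# Connect to Pawsey's Acacia object store (S3-compatible).
# Path-style addressing matches the WinSCP setting described below.
s3 = boto3.client(
    "s3",
    endpoint_url="https://projects.pawsey.org.au",
    aws_access_key_id="YOUR_ACCESS_KEY_ID",       # obtain from Brendan Busch
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
    config=Config(s3={"addressing_style": "path"}),
)

# List the top-level directories of the csiem bucket
# (expected: data-lake/ and data-warehouse/).
resp = s3.list_objects_v2(Bucket="csiem", Delimiter="/")
for prefix in resp.get("CommonPrefixes", []):
    print(prefix["Prefix"])
```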
A Data Lake is simply a centralised store of raw, disparate datasets, whether structured or unstructured. Centralising and cataloguing this raw data allows customised ETL processes to be constructed and tailored to each analytics use case, rather than forcing the data into a one-size-fits-all structure.
The Data Warehouse is the store of boutique, customised data products produced through the batch ETL (Extract, Transform and Load) pipelines. Each folder within the data-warehouse directory contains processed data in a different format, based upon end-user requirements.
All data usage outside of the ETL pipelines must draw on products within the warehouse. This ensures the efficiency and repeatability of any script or data product produced downstream, and provides a consistent data pathway for data validation.
There are currently four separate directories within the data-warehouse:
- csv
- mat
- parquet
- data-images
The csv directory contains data that has been imported directly from the data lake and standardised against the variable catalogue. Data is separated by site, variable and, in some cases, sampling campaign.
The mat directory contains Matlab mat binary files:
- seaf.mat: All data variables, in the units found in the variable catalogue
- cockburn.mat: Subset of variables required for the TFV-AED model, converted into the units in the TFV catalogue
The data-images directory contains a collection of raw data summary sheets for each ingested site/variable.
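To illustrate pulling a warehouse product down for analysis, the sketch below downloads seaf.mat and reads it with scipy. The object key is an assumption about how files are laid out under data-warehouse/mat/; adjust it to the actual layout in the bucket.

```python
import boto3
from botocore.config import Config
from scipy.io import loadmat

s3 = boto3.client(
    "s3",
    endpoint_url="https://projects.pawsey.org.au",
    aws_access_key_id="YOUR_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
    config=Config(s3={"addressing_style": "path"}),
)

# Download the all-variables Matlab product from the warehouse.
# The object key assumes a data-warehouse/mat/seaf.mat layout on Acacia.
s3.download_file("csiem", "data-warehouse/mat/seaf.mat", "seaf.mat")

data = loadmat("seaf.mat")  # dict mapping variable names to arrays
print(sorted(k for k in data if not k.startswith("__")))
```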
An Access Key ID and Secret Access Key are required to access the data. They can be obtained by contacting Brendan Busch ([email protected]).
Data can be accessed in a variety of ways, as the Pawsey storage is based upon Amazon's S3 storage protocol. Below is a how-to for WinSCP.
1. Open WinSCP.
2. A login window should appear; if not, select Session -> Start New Session.
3. Select New Site -> File Protocol -> Amazon S3.
4. For Host Name, enter: projects.pawsey.org.au
5. Leave the Port Number as 443 (the default).
6. Enter the Access Key ID and Secret Access Key (see Brendan Busch for access).
7. In the Advanced menu, go to Environment -> S3.
8. Change the URL style to 'Path'.
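WinSCP is only one client; the boto3 client configured in the first sketch above can perform the same operations, for example minting a time-limited download link to share a file. The object key here is hypothetical.

```python
# Generate a presigned URL valid for one hour, reusing the s3 client
# configured in the earlier sketch.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "csiem", "Key": "data-warehouse/csv/example.csv"},
    ExpiresIn=3600,
)
print(url)
```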
To create a bucket, follow the instructions at the link below, noting the naming conventions:
https://support.pawsey.org.au/documentation/display/US/Acacia+-+Introduction
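Bucket creation can also be done programmatically with the same boto3 client from the first sketch; the bucket name below is hypothetical and must satisfy the naming conventions in the Pawsey documentation linked above.

```python
# Create a new bucket (name must satisfy the Acacia naming rules),
# reusing the s3 client configured in the earlier sketch.
s3.create_bucket(Bucket="my-new-bucket")

# Confirm the bucket now exists.
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```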