
Commit 5f6e33b

Update README.md
1 parent ed79974 commit 5f6e33b

File tree: 1 file changed, +47 -27 lines


README.md

Lines changed: 47 additions & 27 deletions
@@ -1,4 +1,4 @@
-# AODN Cloud Optimised Conversion
+# AODN (Australian Ocean Data Network) Cloud Optimised library
 
 ![Build Status](https://github.com/aodn/aodn_cloud_optimised/actions/workflows/build.yml/badge.svg)
 ![Test Status](https://github.com/aodn/aodn_cloud_optimised/actions/workflows/test-mamba.yml/badge.svg)
@@ -8,8 +8,7 @@
 [![Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/aodn/aodn_cloud_optimised/)
 [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/aodn/aodn_cloud_optimised/main?filepath=notebooks)
 
-
-A tool designed to convert IMOS NetCDF and CSV files into Cloud Optimised formats such as Zarr and Parquet
+The AODN Cloud Optimised library converts oceanographic datasets from [IMOS (Integrated Marine Observing System)](https://imos.org.au/) / [AODN (Australian Ocean Data Network)](https://portal.aodn.org.au/) into cloud-optimised formats such as [Zarr](https://zarr.readthedocs.io/) (for gridded multidimensional data) and [Parquet](https://parquet.apache.org/docs/) (for tabular data).
 
 ## Documentation
 
@@ -19,35 +18,56 @@ Visit the documentation on [ReadTheDocs](https://aodn-cloud-optimised.readthedoc
 
 ## Key Features
 
-* Conversion of CSV/NetCDF to Cloud Optimised format (Zarr/Parquet)
-* YAML configuration approach with parent and child YAML configuration if multiple dataset are very similar (i.e. Radar ACORN, GHRSST, see [config](https://github.com/aodn/aodn_cloud_optimised/tree/main/aodn_cloud_optimised/config/dataset))
-* Generic handlers for most dataset ([GenericParquetHandler](https://github.com/aodn/aodn_cloud_optimised/blob/main/aodn_cloud_optimised/lib/GenericParquetHandler.py), [GenericZarrHandler](https://github.com/aodn/aodn_cloud_optimised/blob/main/aodn_cloud_optimised/lib/GenericZarrHandler.py)).
-* Specific handlers can be written and inherits methods from a generic handler ([Argo handler](https://github.com/aodn/aodn_cloud_optimised/blob/main/aodn_cloud_optimised/lib/ArgoHandler.py), [Mooring Timseries Handler](https://github.com/aodn/aodn_cloud_optimised/blob/main/aodn_cloud_optimised/lib/AnmnHourlyTsHandler.py))
-* Clustering capability:
-  * Local dask cluster
-  * Remote Coiled cluster
-  * driven by configuration/can be easily overwritten
-  * Zarr: gridded dataset are done in batch and in parallel with xarray.open_mfdataset
-  * Parquet: tabular files are done in batch and in parallel as independent task, done with future
-* Reprocessing:
-  * Zarr: reprocessing is achieved by writting to specific regions with slices. Non-contigous regions are handled
-  * Parquet: reprocessing is done via pyarrow internal overwritting function, but can also be forced in case an input file has significantly changed
-* Chunking:
-  * Parquet: to facilitate the query of geospatial data, polygon and timestamp slices are created as partitions
-  * Zarr: done via dataset configuration
-* Metadata:
-  * Parquet: Metadata is created as a sidecar _metadata.parquet file
-* Unittesting of module: Very close to integration testing, local cluster is used to create cloud optimised files
+### Data Conversion
+- Convert **CSV** or **NetCDF** (single or multidimensional) to **Zarr** or **Parquet**.
+- **Dataset configuration:** YAML-based configuration with inheritance, allowing similar datasets to share settings.
+  Example: [Radar ACORN](https://github.com/aodn/aodn_cloud_optimised/tree/main/aodn_cloud_optimised/config/dataset), [GHRSST](https://www.ghrsst.org/).
+- Semi-automatic creation of dataset configuration: [ReadTheDocs guide](https://aodn-cloud-optimised.readthedocs.io/en/latest/development/dataset-configuration.html#create-dataset-configuration-semi-automatic).
+- Generic handlers for standard datasets:
+  [GenericParquetHandler](https://github.com/aodn/aodn_cloud_optimised/blob/main/aodn_cloud_optimised/lib/GenericParquetHandler.py),
+  [GenericZarrHandler](https://github.com/aodn/aodn_cloud_optimised/blob/main/aodn_cloud_optimised/lib/GenericZarrHandler.py)
+- Custom handlers can inherit from generic handlers:
+  [Argo handler](https://github.com/aodn/aodn_cloud_optimised/blob/main/aodn_cloud_optimised/lib/ArgoHandler.py),
+  [Mooring Timeseries Handler](https://github.com/aodn/aodn_cloud_optimised/blob/main/aodn_cloud_optimised/lib/AnmnHourlyTsHandler.py)
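
For orientation, a minimal sketch of the conversion idea, using plain xarray and pandas rather than the library's own handlers; the file paths are placeholders:

```python
# Sketch only: plain xarray/pandas stand-ins for what the generic
# handlers automate. "input.nc" and the output paths are placeholders.
import xarray as xr

ds = xr.open_dataset("input.nc")                 # source NetCDF file

# gridded, multidimensional data -> Zarr
ds.to_zarr("output.zarr", mode="w")

# tabular data -> Parquet (flat DataFrame; needs pyarrow installed)
ds.to_dataframe().reset_index().to_parquet("output.parquet")
```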
+
+### Clustering & Parallel Processing
+- Supports local **Dask cluster** and remote clusters:
+  - [Coiled](https://aodn-cloud-optimised.readthedocs.io/en/latest/development/dataset-configuration.html#coiled-cluster-configuration)
+  - [EC2](https://aodn-cloud-optimised.readthedocs.io/en/latest/development/dataset-configuration.html#ec2-cluster-configuration)
+  - Fargate cluster
+- Cluster behaviour is configuration-driven and can be easily overridden.
+- Automatic restart of the remote cluster upon Dask failure.
+- **Zarr:** Gridded datasets are processed in batch and in parallel using [`xarray.open_mfdataset`](https://xarray.pydata.org/en/stable/generated/xarray.open_mfdataset.html); see the first sketch below.
+- **Parquet:** Tabular files are processed in batch and in parallel as independent tasks, implemented with `concurrent.futures.Future`.
+- **S3 / S3-compatible storage:** Supports AWS S3 and S3-compatible endpoints (e.g., MinIO, LocalStack) with configurable input/output buckets and authentication via `s3fs` and `boto3`; see the second sketch below.
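
A minimal sketch of the batch pattern described above, assuming plain Dask and xarray rather than the library's cluster configuration; the file list and worker count are placeholders:

```python
# Sketch: open many NetCDF files in parallel on a local Dask cluster,
# then write one consolidated Zarr store. Paths are placeholders.
from dask.distributed import Client
import xarray as xr

client = Client(n_workers=4)                      # local Dask cluster

ds = xr.open_mfdataset(
    ["day1.nc", "day2.nc", "day3.nc"],            # batch of gridded files
    combine="by_coords",
    parallel=True,                                # open files as Dask tasks
)
ds.to_zarr("output.zarr", mode="w")

client.close()
```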
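And a hedged sketch of pointing `s3fs` at an S3-compatible endpoint such as MinIO; the endpoint URL, bucket name and credentials are placeholders, not the library's configuration keys:

```python
# Sketch: talk to an S3-compatible endpoint (e.g. a local MinIO) via s3fs.
# Endpoint, bucket and credentials below are placeholders.
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(
    key="minioadmin",
    secret="minioadmin",
    client_kwargs={"endpoint_url": "http://localhost:9000"},
)
store = s3fs.S3Map(root="my-bucket/output.zarr", s3=fs)  # bucket as a key-value store
xr.open_dataset("input.nc").to_zarr(store, mode="w")
```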
+
+### Reprocessing
+- **Zarr:** Reprocessing is achieved by writing to specific slices, including non-contiguous regions (see the sketch below).
+- **Parquet:** Reprocessing uses PyArrow's internal overwriting; it can also be forced when an input file has changed significantly.
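
A minimal sketch of a Zarr region write with plain xarray, assuming an existing store; `recompute` is a hypothetical reprocessing step:

```python
# Sketch: overwrite one time slice of an existing Zarr store in place.
# "output.zarr" is a placeholder and recompute() is hypothetical.
import xarray as xr

ds = xr.open_zarr("output.zarr")
patch = recompute(ds.isel(time=slice(0, 24)))     # hypothetical reprocessing
# every variable written to a region must span the region dimension,
# so drop coordinates that do not depend on "time"
patch = patch.drop_vars([c for c in patch.coords if "time" not in patch[c].dims])
patch.to_zarr("output.zarr", region={"time": slice(0, 24)})
```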
+
+### Chunking & Partitioning
+- Improves performance for querying and parallel processing.
+- **Parquet:** Partitioned by polygon and timestamp slices (see the sketch below). [Issue reference](https://github.com/aodn/aodn_cloud_optimised/issues/240)
+- **Zarr:** Chunking is defined in the dataset configuration.
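
For illustration, a sketch of partitioned Parquet output with plain PyArrow; the column names `polygon` and `timestamp` and the values are illustrative, not the library's actual partition keys:

```python
# Sketch: write a directory-partitioned Parquet dataset so queries can
# prune whole partitions. Column names and values are illustrative only.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "timestamp": ["2024-01", "2024-01", "2024-02"],
    "polygon": ["A", "B", "A"],
    "temperature": [18.2, 17.9, 18.5],
})
pq.write_to_dataset(
    pa.Table.from_pandas(df),
    root_path="dataset.parquet",
    partition_cols=["timestamp", "polygon"],      # directory-style partitions
)
```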
+
+### Dynamic Variable Definition
+See the [documentation](https://aodn-cloud-optimised.readthedocs.io/en/latest/development/dataset-configuration.html#adding-variables-dynamically):
+- Global attribute -> variable
+- Variable attribute -> variable
+- Filename part -> variable
+- ...
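
As a rough illustration of the idea (not the library's configuration syntax), deriving variables from a filename part and a global attribute with plain xarray; the filename, regex and attribute names are made up:

```python
# Sketch: promote a filename token and a global attribute to variables.
# Filename, regex and attribute names are hypothetical examples.
import re
import xarray as xr

filename = "IMOS_example_20240101.nc"
ds = xr.open_dataset(filename)
ds["file_date"] = re.search(r"(\d{8})", filename).group(1)   # filename part -> variable
ds["site_code"] = ds.attrs.get("site_code", "unknown")       # global attribute -> variable
```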
+
+### Metadata
+- **Parquet:** Metadata is stored as a sidecar `_metadata.parquet` file for faster queries and schema discovery (see the sketch below).
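
A hedged sketch of the sidecar idea with plain PyArrow; note that PyArrow's own convention is a file named `_metadata`, while the `_metadata.parquet` name above is this library's:

```python
# Sketch: collect row-group metadata while writing a dataset, then emit a
# sidecar file that readers can use for fast query planning.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"timestamp": ["2024-01", "2024-02"],
                  "temperature": [18.2, 18.5]})
collector = []
pq.write_to_dataset(table, root_path="dataset.parquet",
                    metadata_collector=collector)
# sidecar with the schema plus every row group's statistics
pq.write_metadata(table.schema, "dataset.parquet/_metadata",
                  metadata_collector=collector)
```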
 
 
 # Quick Guide
 ## Installation
 
 Requirements:
 * Python >= 3.10.14
-* AWS SSO to push files to S3
-* An account on [Coiled](https://cloud.coiled.io/) for remote clustering (Optional)
-
+* AWS SSO configured for pushing files to S3
+* Optional: [Coiled](https://cloud.coiled.io/) account for remote clustering
 
 ### Automatic installation of the latest wheel release
 ```bash
@@ -62,8 +82,8 @@ See [ReadTheDocs - Dev](https://aodn-cloud-optimised.readthedocs.io/en/latest/de
 ## Usage
 See [ReadTheDocs - Usage](https://aodn-cloud-optimised.readthedocs.io/en/latest/usage.html)
 
-## Notebooks
+## Getting Started - Notebooks
 [![Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/aodn/aodn_cloud_optimised/)
 [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/aodn/aodn_cloud_optimised/main?filepath=notebooks)
 
-A curated list of Jupyter [Notebooks](https://github.com/aodn/aodn_cloud_optimised/blob/main/notebooks/) ready to be loaded in Google Colab and Binder. Click on the badge above
+A curated list of Jupyter [Notebooks](https://github.com/aodn/aodn_cloud_optimised/blob/main/notebooks/) ready to be loaded in Google Colab and Binder, where users can explore IMOS/AODN datasets converted to cloud-optimised formats. Click on a badge above to get started.
