The AODN Cloud Optimised library converts oceanographic datasets from [IMOS (Integrated Marine Observing System)](https://imos.org.au/) / [AODN (Australian Ocean Data Network)](https://portal.aodn.org.au/) into cloud-optimised formats such as [Zarr](https://zarr.readthedocs.io/) (for gridded multidimensional data) and [Parquet](https://parquet.apache.org/docs/) (for tabular data).
## Documentation
Visit the documentation on [ReadTheDocs](https://aodn-cloud-optimised.readthedocs.io/).
## Key Features
### Data Conversion
- Convert **CSV** or **NetCDF** (single or multidimensional) to **Zarr** or **Parquet**.
- **Dataset configuration:** YAML-based configuration with parent/child inheritance, so very similar datasets (e.g. ACORN radar, GHRSST) share settings; see [config](https://github.com/aodn/aodn_cloud_optimised/tree/main/aodn_cloud_optimised/config/dataset). A toy inheritance sketch follows this list.
- **Generic handlers** cover most datasets: [GenericParquetHandler](https://github.com/aodn/aodn_cloud_optimised/blob/main/aodn_cloud_optimised/lib/GenericParquetHandler.py), [GenericZarrHandler](https://github.com/aodn/aodn_cloud_optimised/blob/main/aodn_cloud_optimised/lib/GenericZarrHandler.py).
- **Specific handlers** can be written that inherit from a generic handler: [Argo handler](https://github.com/aodn/aodn_cloud_optimised/blob/main/aodn_cloud_optimised/lib/ArgoHandler.py), [Mooring Timeseries handler](https://github.com/aodn/aodn_cloud_optimised/blob/main/aodn_cloud_optimised/lib/AnmnHourlyTsHandler.py).
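
The inheritance is essentially a dictionary overlay. Below is a toy sketch of the idea, assuming hypothetical `radar_parent.yaml` / `radar_site.yaml` files; it is not the library's actual configuration loader:

```python
# Toy sketch of parent/child YAML inheritance. File names are hypothetical
# and this is not the library's actual configuration loader.
import yaml


def merge(parent: dict, child: dict) -> dict:
    """Overlay child settings onto the parent, recursing into nested dicts."""
    merged = dict(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


with open("radar_parent.yaml") as f:
    parent_cfg = yaml.safe_load(f)
with open("radar_site.yaml") as f:
    child_cfg = yaml.safe_load(f)

config = merge(parent_cfg, child_cfg)  # child keys win, the rest is inherited
```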

### Clustering

- Local Dask cluster or remote [Coiled](https://cloud.coiled.io/) cluster.
- Cluster behaviour is configuration-driven and can be easily overridden.
- Automatic restart of the remote cluster upon Dask failure.
- **Zarr:** Gridded datasets are processed in batches and in parallel using [`xarray.open_mfdataset`](https://xarray.pydata.org/en/stable/generated/xarray.open_mfdataset.html).
- **Parquet:** Tabular files are processed in batches and in parallel as independent tasks, implemented with `concurrent.futures`.
- **S3 / S3-compatible storage support:** AWS S3 and S3-compatible endpoints (e.g. MinIO, LocalStack) are supported, with configurable input/output buckets and authentication via `s3fs` and `boto3`. A sketch of these batch patterns follows this list.
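
A rough sketch of the batch patterns above; the bucket names, the endpoint URL, and `convert_one` are hypothetical, and the real logic lives in the generic handlers:

```python
# Sketch of the batch patterns above. Bucket names, the endpoint URL and
# convert_one() are hypothetical; this is not the library's handler code.
from concurrent.futures import ThreadPoolExecutor

import s3fs
import xarray as xr

# S3 / S3-compatible storage: point s3fs at AWS or at a MinIO/LocalStack URL.
fs = s3fs.S3FileSystem(endpoint_url="http://localhost:9000")

# Zarr: open a batch of gridded NetCDF files in parallel, write one store.
nc_files = fs.glob("input-bucket/gridded/*.nc")
ds = xr.open_mfdataset(
    [fs.open(p) for p in nc_files],
    engine="h5netcdf",
    combine="by_coords",
    parallel=True,  # file opening is distributed over the Dask cluster
)
ds.to_zarr(s3fs.S3Map("output-bucket/dataset.zarr", s3=fs), mode="w")


# Parquet: every tabular file is an independent task tracked by a future.
def convert_one(path: str) -> str:
    ...  # read one CSV/NetCDF file, write one Parquet piece
    return path


csv_files = fs.glob("input-bucket/tabular/*.csv")
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(convert_one, p) for p in csv_files]
    results = [f.result() for f in futures]
```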
### Reprocessing
- **Zarr:** Reprocessing is achieved by writing to specific regions with slices; non-contiguous regions are handled (see the sketch after this list).
- **Parquet:** Reprocessing uses PyArrow's internal overwriting; it can also be forced when an input file has changed significantly.
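
For Zarr, the key mechanism is xarray's `region=` argument to `to_zarr`, which rewrites only the targeted slice of the store. A minimal sketch, assuming a `time` dimension and a hypothetical `reprocess_one_file` helper:

```python
# Sketch of rewriting one region of an existing Zarr store in place.
# The store path, "time" dimension and reprocess_one_file() are hypothetical.
import xarray as xr

store = "s3://output-bucket/dataset.zarr"
existing = xr.open_zarr(store)

updated = reprocess_one_file("reprocessed_input.nc")  # hypothetical helper

# Locate where the reprocessed data sits along the store's time axis.
start = int((existing.time == updated.time[0]).argmax())
region = {"time": slice(start, start + updated.sizes["time"])}

# Variables that do not span "time" must be dropped before a region write.
updated = updated.drop_vars(
    [v for v in updated.variables if "time" not in updated[v].dims]
)
updated.to_zarr(store, region=region)  # only this slice is rewritten
```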
### Chunking & Partitioning
- Chunking and partitioning improve performance for querying and parallel processing.
- **Parquet:** Partitioned by polygon and timestamp slices to speed up geospatial and temporal queries ([issue #240](https://github.com/aodn/aodn_cloud_optimised/issues/240)); see the sketch after this list.
- **Zarr:** Chunking is defined in the dataset configuration.
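
To make the partitioning idea concrete, here is a minimal PyArrow sketch; the column values and paths are illustrative, not the library's exact scheme:

```python
# Sketch of Hive-style partitioning: partition columns become directories,
# so spatial/temporal filters can skip whole partitions without reading them.
# Column values and paths are illustrative, not the library's exact scheme.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table(
    {
        "TIME": [1.0, 2.0, 3.0],
        "TEMP": [14.2, 14.5, 14.1],
        "polygon": ["A", "A", "B"],     # precomputed spatial cell
        "timestamp": [1672531200] * 3,  # coarse time-slice bucket
    }
)

pq.write_to_dataset(
    table,
    root_path="dataset.parquet",
    partition_cols=["polygon", "timestamp"],
)
# Layout: dataset.parquet/polygon=A/timestamp=1672531200/<piece>.parquet
```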
### Dynamic Variable Definition
Variables can be created dynamically from several sources; see the [documentation](https://aodn-cloud-optimised.readthedocs.io/en/latest/development/dataset-configuration.html#adding-variables-dynamically). A sketch follows the list below.
- Global attribute -> variable
- Variable attribute -> variable
- Filename part -> variable
- ...
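
A sketch of the idea, promoting metadata to ordinary variables so it survives aggregation and can be queried; the attribute names, filename pattern, and `TIME`/`TEMP` variables below are assumptions, not the library's configuration keys:

```python
# Sketch of deriving variables dynamically. Attribute names, the filename
# pattern and the TIME/TEMP variables are assumptions for illustration.
import re

import xarray as xr

path = "IMOS_SITE1_20230101.nc"  # hypothetical input file
ds = xr.open_dataset(path)
n = ds.sizes["TIME"]

# Global attribute -> variable (broadcast along TIME)
ds["site_code"] = ("TIME", [ds.attrs.get("site_code", "UNKNOWN")] * n)

# Variable attribute -> variable
ds["temp_units"] = ("TIME", [ds["TEMP"].attrs.get("units", "")] * n)

# Filename part -> variable
date_token = re.search(r"_(\d{8})\.nc$", path).group(1)
ds["file_date"] = ("TIME", [date_token] * n)
```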
### Metadata
- **Parquet:** Metadata is stored as a sidecar `_metadata.parquet` file for faster queries and schema discovery.
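
For instance, a reader can inspect the sidecar to discover what a dataset contains without scanning any data files (the dataset path below is hypothetical):

```python
# Sketch: discover what a dataset contains from its metadata sidecar
# without touching the data files. The dataset path is hypothetical.
import pyarrow.parquet as pq

sidecar = pq.read_table("dataset.parquet/_metadata.parquet")
print(sidecar.schema)       # structure of the sidecar itself
print(sidecar.to_pandas())  # e.g. column names, types, dataset attributes
```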

### Testing

- Unit tests are close to integration tests: a local cluster is used to create cloud-optimised files.
# Quick Guide
## Installation
Requirements:
* Python >= 3.10.14
* AWS SSO configured for pushing files to S3
* Optional: [Coiled](https://cloud.coiled.io/) account for remote clustering
### Automatic installation of the latest wheel release
See [ReadTheDocs - Dev](https://aodn-cloud-optimised.readthedocs.io/en/latest/development/) for the installation command and development setup.
## Usage
See [ReadTheDocs - Usage](https://aodn-cloud-optimised.readthedocs.io/en/latest/usage.html)
A curated list of Jupyter [Notebooks](https://github.com/aodn/aodn_cloud_optimised/blob/main/notebooks/) is ready to be loaded in Google Colab or Binder, letting users explore IMOS/AODN datasets converted to cloud-optimised formats. Click on the badge above.
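
As a taste of what reading a converted dataset looks like (the bucket, dataset names, and partition column are hypothetical; the notebooks show the real paths):

```python
# Sketch of reading cloud-optimised data. Bucket/dataset names and the
# "timestamp" partition column are hypothetical; see the notebooks for
# real, working examples.
import pandas as pd
import xarray as xr

# Parquet (tabular): pandas + pyarrow, pushing a partition filter down to S3.
df = pd.read_parquet(
    "s3://example-bucket/mooring_timeseries.parquet/",
    filters=[("timestamp", ">=", 1672531200)],
    storage_options={"anon": True},
)

# Zarr (gridded): opened lazily; only the slices you select are downloaded.
ds = xr.open_zarr(
    "s3://example-bucket/sst_gridded.zarr/",
    storage_options={"anon": True},
)
subset = ds.sel(time="2023-01")
```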