A Java-based application for performing temporal aggregation and/or regridding of NetCDF files.
ncAggregate is part of the eReefs Platform developed by the Australian Institute of Marine Science (AIMS), and operates primarily in conjunction with the Job Planner and ncAnimate. The Job Planner plans the work to be performed by ncAggregate and ncAnimate, while ncAnimate creates visualisations from raw NetCDF files and from products generated by ncAggregate.
The work to be performed by ncAggregate as part of the AIMS eReefs Platform is defined as Product Definitions in the eReefs Definitions repository.
IMPORTANT!
The AIMS Knowledge Systems team is progressively open sourcing the infrastructure of the AIMS eReefs Platform. Some related repositories mentioned in this and other README files may not yet be generally available to the public. If you would like early access to other parts of the system as a collaborator, please contact the Knowledge Systems team at [email protected].
```
.
|-- env
|   |-- dev                 <-- Scripts useful for development/testing.
|
|-- src
|   |-- main
|   |   |-- java            <-- Source code for the application.
|   |   |-- resources       <-- Resources included in the packaged application.
|   |-- test
|   |   |-- java            <-- Unit test cases.
|   |   |-- resources       <-- Resources referenced by test cases.
|
|-- cloudformation.yaml     <-- Definition of AWS assets.
|-- Dockerfile              <-- Definition of the Docker image to wrap the packaged application.
|-- Jenkinsfile             <-- The Jenkins job definition for this project.
|-- maven-settings.xml      <-- Maven-specific settings.
|-- pom.xml                 <-- The Maven definition file for this project.
|-- README.md               <-- This file.
```
ncAggregate is a Java-based application that can be executed from the command-line (see "Java-based execution") on any computer with the necessary pre-requisites installed (currently Java 8 and NetCDF libraries).
During normal operation within the AIMS eReefs Platform, ncAggregate and its pre-requisites are packaged as a Docker image that executes within an AWS ECS cluster. This Docker image can also be used locally to execute ncAggregate from the command-line (see "Docker-based execution") without installing the pre-requisites (though the Docker runtime must be installed).
During normal operations, ncAggregate expects to obtain detailed instructions from a MongoDB database (see "MongoDB-based repository"). This is the configuration recommended for Production use, though a simple file-based solution (see "File-based repository") can be used for development. The database/repository is populated from the ereefs-definitions project and the Job Planner.
Furthermore, ncAggregate can operate as a stand-alone regridding utility (see "Stand-alone regridding").
Note: This scenario requires access to system components that have not yet been open-sourced.
- Start a MongoDB instance, such as the one provided by the ereefs-vm project.
- Populate the MongoDB instance with Product Definitions. Refer to the ereefs-definitions project.
- Populate the MongoDB instance with Metadata from downloaded files. This is normally performed by ereefs-download-manager.
- Build the list of Tasks by running ereefs-job-planner.
Note: This scenario requires access to system components that have not yet been open-sourced.
- Create a directory to contain the files. The scripts in this project expect this directory to be `/data/ereefs/filedb` by default, but this can be overridden.
- Populate the DB directory with Product Definitions. This is normally performed by ereefs-definitions.
- Populate the DB directory with Metadata from downloaded files. This is normally performed by ereefs-download-manager.
- Build the list of Tasks by running ereefs-job-planner.
Note: the pre-requisites (Java 8 and NetCDF libraries) must be installed before ncAggregate can be run from the command-line as a Java-based application.
Package the ncAggregate code and dependencies as a JAR file using the following script:
```
$ <project root>/env/dev/maven-package.sh
```

WARNING: Maven requires access to packages hosted on GitHub. Unfortunately, GitHub requires credentials even though the projects are public. You will need to set the `GITHUB_USERNAME` and `GITHUB_TOKEN` environment variables before executing `maven-package.sh`.
To run ncAggregate as a Java application against a file-based database at `/data/ereefs/filedb`:

```
$ <project root>/env/dev/java-run-filedb.sh
```

To run ncAggregate as a Java application against a file-based database at `/my-data`:

```
$ <project root>/env/dev/java-run-filedb.sh -d /my-data
```

To run ncAggregate as a Java application against a MongoDB-based database:

```
$ <project root>/env/dev/java-run-mongodb.sh
```

Package the ncAggregate code and dependencies in a Docker image using the following scripts:

```
$ <project root>/env/dev/maven-package.sh
$ <project root>/env/dev/docker-build-image.sh
```

To run ncAggregate as a Docker container against a file-based database at `/data/ereefs/filedb`:

```
$ <project root>/env/dev/docker-run-filedb.sh -t <task id>
```

To run ncAggregate as a Docker container against a file-based database at `/my-data`:

```
$ <project root>/env/dev/docker-run-filedb.sh -d /my-data -t <task id>
```

To run ncAggregate as a Docker container against a MongoDB-based database:

```
$ <project root>/env/dev/docker-run-mongodb.sh -t <task id>
```

Regridding is the process of converting a dataset from one grid to another. The motivation may be to convert a dataset from a curvilinear grid to a rectilinear grid, or to resample a dataset at a different grid resolution. ncAggregate supports regridding a curvilinear grid to a rectilinear grid at a customisable resolution. The mechanics of the regridding process are described in detail in the document titled Technical Guide to Derived Products from CSIRO eReefs Models.
ncAggregate accepts the following regridding parameters:
- `input` - The input directory containing the NetCDF files to be regridded. Note that all NetCDF files in this directory must use the same grid.
- `output` - The output directory for the generated NetCDF files.
- `cache` - A file (and location) for caching the calculations used to map an input grid to an output grid. These calculations are the most intensive part of regridding, so caching allows for faster regridding in future.
- `resolution` - The resolution (in decimal degrees) of the output grid. The default value is "0.03".
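The `cache` parameter exists because computing the input-to-output grid mapping is the expensive part of regridding. As a rough illustration only (this is not ncAggregate's actual algorithm, and all class and method names here are hypothetical), a nearest-neighbour index map from a curvilinear input grid to a regular output grid could be built like this:

```java
import java.util.Arrays;

// Illustrative sketch only: all names are hypothetical, and this is not
// ncAggregate's actual algorithm. It builds a nearest-neighbour index map
// from a curvilinear input grid (flattened 1-D lat/lon arrays) to a regular
// output grid -- the kind of mapping the "cache" parameter stores so it is
// not recomputed on every run.
public class RegridMapperSketch {

    // For each output cell, record the index of the nearest input cell
    // (nearest by squared lat/lon distance; adequate for small regions).
    static int[] buildMapping(double[] outLat, double[] outLon,
                              double[] inLat, double[] inLon) {
        int[] mapping = new int[outLat.length * outLon.length];
        int k = 0;
        for (double lat : outLat) {
            for (double lon : outLon) {
                int best = -1;
                double bestDist = Double.MAX_VALUE;
                for (int i = 0; i < inLat.length; i++) {
                    double dLat = inLat[i] - lat;
                    double dLon = inLon[i] - lon;
                    double dist = dLat * dLat + dLon * dLon;
                    if (dist < bestDist) {
                        bestDist = dist;
                        best = i;
                    }
                }
                mapping[k++] = best;
            }
        }
        return mapping;
    }

    public static void main(String[] args) {
        // A tiny "curvilinear" grid, flattened to 1-D: 4 slightly skewed cells.
        double[] inLat = {-18.001, -18.002, -18.031, -18.029};
        double[] inLon = {146.001, 146.031, 146.002, 146.028};
        // A regular 2x2 output grid at 0.03-degree spacing (the documented default).
        double[] outLat = {-18.00, -18.03};
        double[] outLon = {146.00, 146.03};
        System.out.println(Arrays.toString(buildMapping(outLat, outLon, inLat, inLon)));
        // -> [0, 1, 2, 3]
    }
}
```

Once such a map exists, applying it to each timestep of each variable is a cheap array lookup, which is why caching it to a file (e.g. `regrid-mapper.dat`) pays off across runs.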
The file `<project root>/env/dev/docker-run-regrid.sh` shows an example configuration for regridding. The key section of the script is:

```
docker run \
    -u $(id -u):$(id -g) \
    --name "ereefs-ncaggregate" \
    --memory=7.5GB \
    -v `pwd`:/data/ \
    --env EXECUTION_ENVIRONMENT=regrid \
    ereefs-ncaggregate --regrid --input=/data/orig --output=/data/out --cache=/data/regrid-mapper.dat
```
Notes:
- `docker run` - execution is best performed via Docker, requiring the Docker image to be built first:

  ```
  $ <project root>/env/dev/maven-package.sh
  $ <project root>/env/dev/docker-build-image.sh
  ```

- ``-v `pwd`:/data/`` - maps the current directory (`pwd`) to `/data` in the Docker container. This is necessary for ncAggregate to access any files outside of the Docker container.
- `--regrid` - instructs ncAggregate to perform a regrid.
- `--input=/data/orig` - identifies the input directory. The `/data` prefix matches the mapping to the current directory above, so input files are read from the `orig` sub-directory of the current directory.
- `--output=/data/out` - identifies the output directory. Again, the `/data` prefix matches the mapping above, so output files are written to the `out` sub-directory of the current directory.
- `--cache=/data/regrid-mapper.dat` - identifies the name (and location) of the regridding map. Again, the `/data` prefix matches the mapping above, so the cache file is called `regrid-mapper.dat` and will be created in the current directory.
- No resolution has been specified, so ncAggregate uses the default resolution.
Please follow the Standard Github workflow when working on this project.
The domain objects used by ncAggregate are defined in eReefs POJO.
The basic workflow of ncAggregate is as follows:
- Create the output NetCDF file shell.
- Execute the processing pipeline to populate the output NetCDF file.
- Upload the resulting output NetCDF file to S3.
- Populate a Metadata record for the NetCDF file to the database.
Data Extraction Tasks are supported via the ExtractionSitesBuilderTask and the SiteBasedSummaryAccumulatorImpl classes. This borrows directly from the Zone-based summary logic (see ZoneBasedSummaryAccumulatorImpl).
For each site, ncAggregate increases the size of a search box by a specified step size until it
either finds the specified minimum number of neighbours, or it reaches a specified maximum number
of increases and stops. See the ExtractionSitesBuilderTask for more information.
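The expanding search-box behaviour described above can be sketched as follows. This is an illustrative approximation, not the actual `ExtractionSitesBuilderTask` code; the names, parameters, and the simple lat/lon box test are all assumptions.

```java
// Illustrative sketch (hypothetical names) of the expanding search-box logic:
// grow a square box around a site by a fixed step until at least
// minNeighbours grid cells fall inside it, or the maximum number of
// expansions is reached.
public class SearchBoxSketch {

    // Count grid cells whose lat/lon fall within +/- halfWidth of the site.
    static int countNeighbours(double siteLat, double siteLon, double halfWidth,
                               double[] cellLat, double[] cellLon) {
        int count = 0;
        for (int i = 0; i < cellLat.length; i++) {
            if (Math.abs(cellLat[i] - siteLat) <= halfWidth
                    && Math.abs(cellLon[i] - siteLon) <= halfWidth) {
                count++;
            }
        }
        return count;
    }

    // Expand the box by stepSize up to maxExpansions times, returning the
    // half-width at which minNeighbours was reached, or -1 if the expansion
    // limit was hit first.
    static double findSearchBox(double siteLat, double siteLon,
                                double stepSize, int minNeighbours, int maxExpansions,
                                double[] cellLat, double[] cellLon) {
        double halfWidth = 0.0;
        for (int expansion = 0; expansion < maxExpansions; expansion++) {
            halfWidth += stepSize;
            if (countNeighbours(siteLat, siteLon, halfWidth, cellLat, cellLon) >= minNeighbours) {
                return halfWidth;
            }
        }
        return -1.0; // Gave up: maximum number of expansions reached.
    }

    public static void main(String[] args) {
        double[] cellLat = {-18.01, -18.05, -18.09};
        double[] cellLon = {146.01, 146.05, 146.09};
        // Needs 2 neighbours; the second cell is ~0.05 degrees away, so the
        // box must expand three times (to a half-width of about 0.06).
        System.out.println(findSearchBox(-18.0, 146.0, 0.02, 2, 10, cellLat, cellLon));
    }
}
```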
The ApplicationContextBuilder class contains all parameters supported/expected by ncAggregate. The following parameters can only be set via environment variables:
| Env Variable | Description |
|---|---|
| EXECUTION_ENVIRONMENT | The unique prefix for keys in the parameter store. (mandatory) |
| TASK_ID | The unique Id for the Task to be processed. (mandatory) |
| DB_TYPE | The type of database to use: "file" indicates a file-based database. (optional; default is MongoDB) |
| DB_PATH | The path to the root of a file-based database. Mandatory if DB_TYPE is file. |
The following parameters can be set by either environment variables or via the AWS Parameter Store:
| Env Variable | Parameter Store | Description |
|---|---|---|
| MONGODB_HOST | /global/mongodb/host | The name (or IP address) of the MongoDB host. |
| MONGODB_PORT | /global/mongodb/port | The port on which the MongoDB host is listening. Normal value is 27017. |
| MONGODB_DB | /global/mongodb/db | The isolated "schema" of the database. Normal value is "ereefs". |
| MONGODB_USER_ID | /ncAggregate/mongodb/userid | Application-specific user. Normal value is "ncaggregate". |
| MONGODB_PASSWORD | /ncAggregate/mongodb/password | Corresponding password. Value is randomly generated by initialisation script in ereefs-definitions project. |
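As a rough sketch of how the two configuration sources in the table above might interact (the actual precedence rules live in `ApplicationContextBuilder`; the class, method names, and exact key-prefixing behaviour here are hypothetical), a parameter could be resolved by checking the environment variable first, then falling back to a Parameter Store key prefixed with `EXECUTION_ENVIRONMENT`:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of resolving a parameter such as MONGODB_HOST from
// either an environment variable or the AWS Parameter Store. The real
// precedence lives in ApplicationContextBuilder; this only illustrates the
// two sources described in the tables above.
public class ConfigResolverSketch {

    static String resolve(String envVar, String paramStoreKey,
                          Map<String, String> env, Map<String, String> paramStore) {
        // Environment variable wins when set (assumed precedence).
        String fromEnv = env.get(envVar);
        if (fromEnv != null && !fromEnv.isEmpty()) {
            return fromEnv;
        }
        // Otherwise look up the Parameter Store key, prefixed with the
        // execution environment, e.g. "test" + "/global/mongodb/host".
        String prefix = env.containsKey("EXECUTION_ENVIRONMENT")
                ? env.get("EXECUTION_ENVIRONMENT") : "";
        return paramStore.get(prefix + paramStoreKey);
    }

    public static void main(String[] args) {
        Map<String, String> env = new HashMap<>();
        env.put("EXECUTION_ENVIRONMENT", "test");
        Map<String, String> store = new HashMap<>();
        store.put("test/global/mongodb/host", "mongodb.internal");
        // No MONGODB_HOST env var is set, so the Parameter Store value is used.
        System.out.println(resolve("MONGODB_HOST", "/global/mongodb/host", env, store));
        // -> mongodb.internal
    }
}
```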
The ereefs-vm project provides a virtual machine for Windows-based developers.
ncAggregate uses several shared libraries which are available via GitHub Packages for Maven. While Jenkins uses the maven-settings.xml file to provide access to the libraries, developers need to take the following steps for a local setup.
1. Create a personal access token with `read:packages` permission only. Give the token a description like `Github Package access for Maven`.
2. Copy the maven-settings.xml file to `~/.m2/settings.xml` if necessary and replace the `GITHUB_USERNAME` and `GITHUB_TOKEN` placeholders with the values from step 1.
All test cases are run in a Maven Docker container.
```
$ <project root>/env/dev/maven-test.sh
```

Before ncAggregate can be built, the pre-requisites must be completed.
To build and package ncAggregate:
```
# Package as a Java application.
$ <project root>/env/dev/maven-package.sh

# Then package as a Docker image.
$ <project root>/env/dev/docker-build-image.sh
```

Deployment is performed by Jenkins (see `<project root>/Jenkinsfile`). Any branch can be deployed to the TEST environment, but only the `production` branch can be deployed to the PROD environment.
- An Amazon Web Services (AWS) Elastic Container Registry (ECR) repository named `ereefs-netcdf-aggregator` to which the Jenkins Continuous Integration (CI) server will publish Docker images. The URI for this ECR repository should be captured in the `ECR_URL` parameter in the `Jenkinsfile` file of this project.
- An AWS Identity and Access Management (IAM) Group named `ecr-jenkins-publishers` with the `AmazonEC2ContainerRegistryPowerUser` policy attached.
- An AWS IAM User named `ecr-jenkins-publisher` who is a member of the `ecr-jenkins-publishers` group.
- A Credential entry named `ereefs-ecr-jenkins-publisher` in the Jenkins CI server for the AWS IAM User `ecr-jenkins-publisher`. The name of this Jenkins credential should be captured in the `ECR_CREDENTIALS` parameter in the `Jenkinsfile` file of this project.