open-AIMS/ereefs-netcdf-aggregator

Performs temporal aggregation of the eReefs curvilinear NetCDF files and optionally projects them to a regular grid.


eReefs ncAggregate

A Java-based application for performing temporal aggregation and/or regridding of NetCDF files.

ncAggregate is part of the eReefs Platform developed by the Australian Institute of Marine Science (AIMS), and operates primarily in conjunction with the Job Planner🔒 and ncAnimate. The Job Planner plans the work to be performed by ncAggregate and ncAnimate, while ncAnimate creates visualisations from raw NetCDF files and from products generated by ncAggregate.

The work to be performed by ncAggregate as part of the AIMS eReefs Platform is defined as Product Definitions in the eReefs Definitions🔒 repository.


IMPORTANT!

The AIMS Knowledge Systems team is progressively open-sourcing the infrastructure of the AIMS eReefs Platform. Some related repositories mentioned in this and other README files may not yet be generally available to the public. If you would like early access to other parts of the system as a collaborator, please contact the Knowledge Systems team at [email protected].


Repository overview

.
|-- env
|   |-- dev               <-- Scripts useful for development/testing.
|
|-- src
|   |-- main
|       |-- java          <-- Source code for the application.
|       |-- resources     <-- Resources included in the packaged application.
|   |-- test
|       |-- java          <-- Unit test cases.
|       |-- resources     <-- Resources referenced by test cases.
|
|-- cloudformation.yaml   <-- Definition of AWS assets.
|-- Dockerfile            <-- Definition of the Docker image to wrap the packaged application.
|-- Jenkinsfile           <-- The Jenkins job definition for this project.
|-- maven-settings.xml    <-- Maven-specific settings.
|-- pom.xml               <-- The Maven definition file for this project.
|-- README.md             <-- This file.

Execution

ncAggregate is a Java-based application that can be executed from the command-line (see "Java-based execution") on any computer with the necessary pre-requisites installed (currently Java 8 and NetCDF libraries).

During normal operation within the AIMS eReefs Platform, ncAggregate and its pre-requisites are packaged as a Docker image that executes within an AWS ECS Cluster. This Docker image can also be used locally to execute ncAggregate from the command-line (see "Docker-based execution") without installing the pre-requisites (though the Docker runtime must be installed).

During normal operations, ncAggregate expects to obtain detailed instructions from a MongoDB database (see "MongoDB-based repository"). This is the recommended configuration for Production use, though a simple file-based solution (see "File-based repository") can be used for development. The database/repository is populated from the ereefs-definitions🔒 project and the Job Planner.

Furthermore, ncAggregate can operate as a stand-alone regridding utility (see "Stand-alone regridding").

MongoDB-based repository

Note: This scenario requires access to system components that have not yet been open-sourced.

  1. Start a MongoDB instance, such as that provided by the ereefs-vm🔒 project.
  2. Populate the MongoDB instance with Product Definitions. Refer to the ereefs-definitions🔒 project.
  3. Populate the MongoDB instance with Metadata from downloaded files. This is normally performed by ereefs-download-manager.
  4. Build the list of Tasks by running ereefs-job-planner🔒.

File-based repository

Note: This scenario requires access to system components that have not yet been open-sourced.

  1. Create a directory to contain the files. The scripts in this project expect this directory to be /data/ereefs/filedb by default, but this can be overridden.
  2. Populate the DB directory with Product Definitions. This is normally performed by ereefs-definitions🔒.
  3. Populate the DB directory with Metadata from downloaded files. This is normally performed by ereefs-download-manager.
  4. Build the list of Tasks by running ereefs-job-planner🔒.

Java-based execution

Note: the pre-requisites (Java 8 and NetCDF libraries) must be installed before ncAggregate can be run from the command-line as a Java-based application.

Package the ncAggregate code and dependencies as a JAR file using the following script:

$ <project root>/env/dev/maven-package.sh

WARNING: Maven requires access to packages hosted on GitHub. Unfortunately, GitHub requires credentials even for public packages. You will need to set the GITHUB_USERNAME and GITHUB_TOKEN environment variables before executing maven-package.sh.

To run ncAggregate as a Java application against a file-based database at /data/ereefs/filedb:

$ <project root>/env/dev/java-run-filedb.sh

To run ncAggregate as a Java application against a file-based database at /my-data:

$ <project root>/env/dev/java-run-filedb.sh -d /my-data

To run ncAggregate as a Java application against a MongoDB-based database:

$ <project root>/env/dev/java-run-mongodb.sh

Docker-based execution

Package the ncAggregate code and dependencies in a Docker image using the following scripts:

$ <project root>/env/dev/maven-package.sh
$ <project root>/env/dev/docker-build-image.sh

To run ncAggregate as a Docker container against a file-based database at /data/ereefs/filedb:

$ <project root>/env/dev/docker-run-filedb.sh -t <task id>

To run ncAggregate as a Docker container against a file-based database at /my-data:

$ <project root>/env/dev/docker-run-filedb.sh -d /my-data -t <task id>

To run ncAggregate as a Docker container against a MongoDB-based database:

$ <project root>/env/dev/docker-run-mongodb.sh -t <task id>

Stand-alone Regridding

Regridding is the process of converting a dataset from one grid to another. The motivation may be to convert a dataset from a curvilinear grid to a rectilinear grid, or to resample a dataset to a different grid resolution. ncAggregate supports regridding a curvilinear grid to a rectilinear grid at a customisable resolution. The mechanics of the regridding process are described in detail in the document titled Technical Guide to Derived Products from CSIRO eReefs Models.
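To illustrate the general idea, the following is a minimal, hypothetical sketch of curvilinear-to-rectilinear regridding using a cached nearest-neighbour mapping. All class and method names here are invented for illustration; the actual mechanics used by ncAggregate are described in the technical guide referenced above.

```java
import java.util.Arrays;

/**
 * Sketch (not the ncAggregate implementation) of regridding via a
 * nearest-neighbour mapping from curvilinear cell centres to the
 * centres of a regular output grid.
 */
public class RegridSketch {

    /**
     * For each output cell centre, find the index of the nearest input
     * cell centre. This is the expensive step, which is why ncAggregate
     * caches the resulting mapping to a file for reuse.
     */
    static int[] buildMapping(double[] inLat, double[] inLon,
                              double[] outLat, double[] outLon) {
        int[] mapping = new int[outLat.length];
        for (int o = 0; o < outLat.length; o++) {
            int best = -1;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < inLat.length; i++) {
                double dLat = inLat[i] - outLat[o];
                double dLon = inLon[i] - outLon[o];
                double dist = dLat * dLat + dLon * dLon;
                if (dist < bestDist) {
                    bestDist = dist;
                    best = i;
                }
            }
            mapping[o] = best;
        }
        return mapping;
    }

    /** Applying a cached mapping is cheap: a simple gather of values. */
    static double[] regrid(double[] inValues, int[] mapping) {
        double[] out = new double[mapping.length];
        for (int o = 0; o < mapping.length; o++) {
            out[o] = inValues[mapping[o]];
        }
        return out;
    }

    public static void main(String[] args) {
        // Two cell centres on a (trivially small) curvilinear input grid.
        double[] inLat = {-18.0, -18.5};
        double[] inLon = {147.0, 147.5};
        // Two cell centres on the regular output grid.
        double[] outLat = {-18.0, -18.5};
        double[] outLon = {147.0, 147.5};
        int[] mapping = buildMapping(inLat, inLon, outLat, outLon);
        double[] result = regrid(new double[]{26.1, 27.3}, mapping);
        System.out.println(Arrays.toString(result));
    }
}
```

This shows why the cache parameter matters: the mapping depends only on the two grids, so it can be computed once and reused for every file and every time step that shares the same input grid.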

ncAggregate accepts the following regridding parameters:

  • input - The input directory containing the NetCDF files to be regridded. Note that all NetCDF files in this directory must use the same grid.
  • output - The output directory for the generated NetCDF files.
  • cache - A file (and location) for caching the calculations used to map an input grid to an output grid. These calculations are the most intensive part of regridding, so caching allows for faster regridding in future.
  • resolution - The resolution (in decimal degrees) of the output grid. Default value is "0.03".

Example script

The file <project root>/env/dev/docker-run-regrid.sh shows an example configuration for regridding. The key section of the script is

docker run \
    -u $(id -u):$(id -g) \
    --name "ereefs-ncaggregate" \
    --memory=7.5GB \
    -v `pwd`:/data/ \
    --env EXECUTION_ENVIRONMENT=regrid \
    ereefs-ncaggregate --regrid --input=/data/orig --output=/data/out --cache=/data/regrid-mapper.dat

Notes:

  • docker run - execution is best performed via Docker, requiring the Docker image to be built using the following commands:
$ <project root>/env/dev/maven-package.sh
$ <project root>/env/dev/docker-build-image.sh
  • -v `pwd`:/data/ - maps the current directory (`pwd`) to /data in the Docker container. This is necessary for ncAggregate to access files outside of the Docker container.
  • --regrid - instructs ncAggregate to perform a regrid.
  • --input=/data/orig - identifies the input directory. The /data prefix matches the volume mapping above, so input files are read from the orig sub-directory of the current directory.
  • --output=/data/out - identifies the output directory. Again, the /data prefix matches the volume mapping above, so output files are written to the out sub-directory of the current directory.
  • --cache=/data/regrid-mapper.dat - identifies the name (and location) of the regridding map. Again, the /data prefix matches the volume mapping above, so the cache file is called regrid-mapper.dat and is created in the current directory.
  • A resolution has not been specified, allowing ncAggregate to use the default resolution.

Development

Guidelines

Please follow the standard GitHub workflow when working on this project.

Background

The domain objects used by ncAggregate are defined in eReefs POJO.

Workflow

The basic workflow of ncAggregate is as follows:

  1. Create the output NetCDF file shell.
  2. Execute the processing pipeline to populate the output NetCDF file.
  3. Upload the resulting output NetCDF file to S3.
  4. Write a Metadata record for the NetCDF file to the database.

Significant classes/interfaces

See Technical Explanations.

Data Extraction sites

Data Extraction Tasks are supported via the ExtractionSitesBuilderTask and the SiteBasedSummaryAccumulatorImpl classes. This borrows directly from the Zone-based summary logic (see ZoneBasedSummaryAccumulatorImpl).

For each site, ncAggregate increases the size of a search box by a specified step size until it either finds the specified minimum number of neighbours, or it reaches a specified maximum number of increases and stops. See the ExtractionSitesBuilderTask for more information.
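The expanding-search-box idea described above can be sketched as follows. This is a hypothetical illustration (class and parameter names are invented); refer to ExtractionSitesBuilderTask for the actual logic.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of an expanding search box around an extraction site: grow the
 * box by a fixed step until enough neighbouring cells are found, or a
 * maximum number of increases is reached.
 */
public class SiteNeighbourSearch {

    static List<Integer> findNeighbours(double siteLat, double siteLon,
                                        double[] cellLat, double[] cellLon,
                                        double stepDegrees,
                                        int minNeighbours, int maxIncreases) {
        List<Integer> neighbours = new ArrayList<>();
        double halfWidth = 0;
        for (int attempt = 0; attempt <= maxIncreases; attempt++) {
            halfWidth += stepDegrees;   // grow the box by the step size
            neighbours.clear();
            for (int i = 0; i < cellLat.length; i++) {
                if (Math.abs(cellLat[i] - siteLat) <= halfWidth
                        && Math.abs(cellLon[i] - siteLon) <= halfWidth) {
                    neighbours.add(i);
                }
            }
            if (neighbours.size() >= minNeighbours) {
                break;  // enough neighbours found; stop growing
            }
        }
        return neighbours;
    }

    public static void main(String[] args) {
        double[] cellLat = {-18.01, -18.10, -18.50};
        double[] cellLon = {147.01, 147.10, 147.50};
        // The first box (0.05 deg) catches one cell; the second (0.10 deg)
        // catches two, satisfying the minimum, so the search stops there.
        List<Integer> n = findNeighbours(-18.0, 147.0, cellLat, cellLon,
                                         0.05, 2, 5);
        System.out.println(n.size());
    }
}
```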

Parameters

The ApplicationContextBuilder class contains all parameters supported/expected by ncAggregate. The following parameters can only be set via environment variables:

| Env Variable | Description |
| --- | --- |
| EXECUTION_ENVIRONMENT | The unique prefix for keys in the parameter store. (mandatory) |
| TASK_ID | The unique Id of the Task to be processed. (mandatory) |
| DB_TYPE | The type of database to use: "file" indicates a file-based database. (optional, default is MongoDB) |
| DB_PATH | The path to the root of a file-based database. (mandatory if DB_TYPE is "file") |
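The mandatory/optional rules above could be resolved along the following lines. This is a hypothetical sketch, not the ApplicationContextBuilder implementation; all names are invented for illustration.

```java
import java.util.Map;

/**
 * Sketch of resolving environment-only parameters with defaults and
 * mandatory checks, mirroring the table above.
 */
public class EnvParamsSketch {

    static String resolve(Map<String, String> env, String key,
                          String defaultValue, boolean mandatory) {
        String value = env.getOrDefault(key, defaultValue);
        if (mandatory && value == null) {
            throw new IllegalStateException(key + " must be set");
        }
        return value;
    }

    public static void main(String[] args) {
        // A fixed map stands in for System.getenv() in this sketch.
        Map<String, String> env = Map.of(
            "EXECUTION_ENVIRONMENT", "test",
            "TASK_ID", "task-42",
            "DB_TYPE", "file",
            "DB_PATH", "/data/ereefs/filedb");

        String executionEnv = resolve(env, "EXECUTION_ENVIRONMENT", null, true);
        String taskId = resolve(env, "TASK_ID", null, true);
        // DB_TYPE defaults to MongoDB; DB_PATH is only mandatory for "file".
        String dbType = resolve(env, "DB_TYPE", "mongodb", false);
        String dbPath = resolve(env, "DB_PATH", null, dbType.equals("file"));
        System.out.println(dbType + " " + dbPath);
    }
}
```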

The following parameters can be set by either environment variables or via the AWS Parameter Store:

| Env Variable | Parameter Store | Description |
| --- | --- | --- |
| MONGODB_HOST | /global/mongodb/host | The name (or IP address) of the MongoDB host. |
| MONGODB_PORT | /global/mongodb/port | The port on which the MongoDB host is listening. Normal value is 27017. |
| MONGODB_DB | /global/mongodb/db | The isolated "schema" of the database. Normal value is "ereefs". |
| MONGODB_USER_ID | /ncAggregate/mongodb/userid | Application-specific user. Normal value is "ncaggregate". |
| MONGODB_PASSWORD | /ncAggregate/mongodb/password | Corresponding password. Value is randomly generated by an initialisation script in the ereefs-definitions project. |

Virtual Machine

The ereefs-vm🔒 project provides a virtual machine for Windows-based developers.

Pre-requisites

ncAggregate uses several shared libraries that are available via GitHub Packages for Maven. While Jenkins uses the maven-settings.xml file to provide access to these libraries, developers need to take the following steps for a local setup.

  1. Create a personal access token with read:packages permission only. Give the token a description like GitHub Package access for Maven.

  2. Copy the maven-settings.xml file to ~/.m2/settings.xml if necessary and replace the GITHUB_USERNAME and GITHUB_TOKEN placeholders with the values from step 1.

Testing

All test cases are run in a Maven Docker container.

$ <project root>/env/dev/maven-test.sh

Build/Package

Before ncAggregate can be built, the pre-requisites must be completed.

To build and package ncAggregate:

# Package as a Java application.
$ <project root>/env/dev/maven-package.sh

# Then package as a Docker image.
$ <project root>/env/dev/docker-build-image.sh

Deployment

Deployment is performed by Jenkins (see <project root>/Jenkinsfile). Any branch can be deployed to the TEST environment, but only the production branch can be deployed to the PROD environment.

Pre-requisites

  • An Amazon Web Services (AWS) Elastic Container Registry (ECR) repository named ereefs-netcdf-aggregator to which the Jenkins Continuous Integration (CI) server will publish Docker images. The URI for this ECR repository should be captured in the ECR_URL parameter in the Jenkinsfile of this project.
  • An AWS Identity and Access Management (IAM) Group named ecr-jenkins-publishers with the AmazonEC2ContainerRegistryPowerUser policy attached.
  • An AWS IAM User named ecr-jenkins-publisher who is a member of the ecr-jenkins-publishers group.
  • A Credential entry named ereefs-ecr-jenkins-publisher in the Jenkins CI server for the AWS IAM User ecr-jenkins-publisher. The name of this Jenkins credential should be captured in the ECR_CREDENTIALS parameter in the Jenkinsfile file of this project.
