boettiger-lab/data-workflows


Data Workflows

Processing workflows for producing cloud-native geospatial datasets on the NRP Nautilus Kubernetes cluster.

Published Datasets

Browse the full catalog in STAC Browser:

radiantearth.github.io/stac-browser → Boettiger Lab Datasets

Datasets are hosted on NRP Nautilus S3 storage (s3-west.nrp-nautilus.io).
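
Because the endpoint serves plain HTTPS, any published object can be fetched without S3 credentials. A minimal sketch of the URL pattern, using a hypothetical bucket and key (not real published datasets):

```shell
# Assemble the public URL for an object; bucket and key below are
# hypothetical placeholders.
ENDPOINT="https://s3-west.nrp-nautilus.io"
BUCKET="public-mydata"
KEY="mydata/mylayer.parquet"
URL="${ENDPOINT}/${BUCKET}/${KEY}"
echo "${URL}"
# Download with any HTTP client, e.g.:
#   curl -fsSL "${URL}" -o mylayer.parquet
```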

This repo contains no code — just configuration (k8s YAML), documentation (STAC metadata), and instructions. All processing is done by the cng-datasets CLI tool running inside Kubernetes pods.

How It Works

  1. You run cng-datasets workflow on your laptop — it generates Kubernetes Job YAML files
  2. You kubectl apply those files — the cluster does all the processing
  3. Outputs land on S3: GeoParquet, PMTiles, and H3-indexed hex parquet

You never process data locally. Your laptop just generates YAML and runs kubectl.
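
The generated manifests are ordinary files, so they can be inspected and validated before anything touches the cluster. A sketch, using a hypothetical output directory matching the Quick Start:

```shell
# Directory produced by `cng-datasets workflow --output-dir ...`
# (path is hypothetical, matching the Quick Start example).
DIR="catalog/mydata/k8s/mylayer"
echo "${DIR}"
# Review what will be submitted before applying:
#   ls "${DIR}"
#   kubectl apply --dry-run=client -f "${DIR}/workflow.yaml"
```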

Quick Start

# Install the CLI (one-time)
pip install cng-datasets

# Generate a processing pipeline for a dataset
cng-datasets workflow \
  --dataset my-dataset \
  --source-url https://example.com/data.gdb \
  --bucket public-mydata \
  --layer MyLayer \
  --h3-resolution 10 \
  --parent-resolutions "9,8,0" \
  --hex-memory 32Gi \
  --max-completions 200 \
  --max-parallelism 50 \
  --output-dir catalog/mydata/k8s/mylayer

# One-time RBAC setup (only needed once per cluster/namespace, likely already done)
kubectl apply -f catalog/mydata/k8s/mylayer/workflow-rbac.yaml

# Apply workflow (per dataset)
kubectl apply -f catalog/mydata/k8s/mylayer/configmap.yaml \
              -f catalog/mydata/k8s/mylayer/workflow.yaml

# Monitor
kubectl get jobs | grep my-dataset
kubectl logs job/my-dataset-workflow

That's it. The workflow orchestrates: bucket setup → convert to GeoParquet → PMTiles + H3 hex (parallel) → repartition.
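
For long runs it can help to block on completion rather than polling `kubectl get jobs`. A sketch using `kubectl wait`, with the job name derived from the hypothetical dataset name above:

```shell
DATASET="my-dataset"            # hypothetical name from the Quick Start
JOB="job/${DATASET}-workflow"
echo "${JOB}"
# Block until the orchestrator job succeeds (or the timeout expires):
#   kubectl wait --for=condition=complete "${JOB}" --timeout=24h
# Stream its logs in the meantime:
#   kubectl logs -f "${JOB}"
```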

Detailed Instructions

Repository Structure

catalog/
  <dataset>/
    k8s/           # Generated Kubernetes job YAML
    stac/          # README.md and stac-collection.json for the dataset
    *.ipynb        # Any exploratory notebooks (optional)

Each dataset gets a directory under catalog/. The k8s YAML is generated by cng-datasets workflow and applied with kubectl. STAC metadata is created after processing completes.
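
Adding a new dataset starts with the directory skeleton described above; a minimal sketch, using a hypothetical dataset name:

```shell
# Create the layout described above ("newdata" is a hypothetical name).
DATASET="newdata"
mkdir -p "catalog/${DATASET}/k8s" "catalog/${DATASET}/stac"
ls "catalog/${DATASET}"
```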

CLI Reference

See the cng-datasets README for full CLI documentation.

Key commands:

Command                  What it does                   Where it runs
cng-datasets workflow    Generates k8s job YAML         Your laptop
kubectl apply -f ...     Submits jobs to the cluster    Your laptop
kubectl get jobs         Monitors job status            Your laptop
Everything else          Processing, S3 uploads, etc.   Kubernetes pods

Infrastructure

  • Cluster: NRP Nautilus, namespace biodiversity
  • S3: Ceph object storage (S3-compatible, not AWS)
  • Public endpoint: https://s3-west.nrp-nautilus.io/<bucket>/<path>
  • Secrets: aws and rclone-config are pre-configured in the namespace

See .github/copilot-instructions.md for detailed infrastructure context.
