boettiger-lab/data-workflows


Data Workflows

Processing workflows for producing cloud-native geospatial datasets on the NRP Nautilus Kubernetes cluster.

Published Datasets

Browse the full catalog in STAC Browser:

radiantearth.github.io/stac-browser → Boettiger Lab Datasets

Datasets are hosted on NRP Nautilus S3 storage (s3-west.nrp-nautilus.io).
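
Because the endpoint serves plain HTTPS, any published object can be fetched without S3 credentials. A minimal sketch of the URL pattern, using a hypothetical bucket and key (not real published datasets):

```shell
# Assemble the public URL for an object; bucket and key below are
# hypothetical placeholders.
ENDPOINT="https://s3-west.nrp-nautilus.io"
BUCKET="public-mydata"
KEY="mydata/mylayer.parquet"
URL="${ENDPOINT}/${BUCKET}/${KEY}"
echo "${URL}"
# Download with any HTTP client, e.g.:
#   curl -fsSL "${URL}" -o mylayer.parquet
```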

This repo contains no code — just configuration (k8s YAML), documentation (STAC metadata), and instructions. All processing is done by the cng-datasets CLI tool running inside Kubernetes pods.

How It Works

  1. You run cng-datasets workflow on your laptop — it generates Kubernetes Job YAML files
  2. You kubectl apply those files — the cluster does all the processing
  3. Outputs land on S3: GeoParquet, PMTiles, and H3-indexed hex parquet

You never process data locally. Your laptop just generates YAML and runs kubectl.
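
The generated manifests are ordinary files, so they can be inspected and validated before anything touches the cluster. A sketch, using a hypothetical output directory matching the Quick Start:

```shell
# Directory produced by `cng-datasets workflow --output-dir ...`
# (path is hypothetical, matching the Quick Start example).
DIR="catalog/mydata/k8s/mylayer"
echo "${DIR}"
# Review what will be submitted before applying:
#   ls "${DIR}"
#   kubectl apply --dry-run=client -f "${DIR}/workflow.yaml"
```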

Quick Start

# Install the CLI (one-time)
pip install cng-datasets

# Generate a processing pipeline for a dataset
cng-datasets workflow \
  --dataset my-dataset \
  --source-url https://example.com/data.gdb \
  --bucket public-mydata \
  --layer MyLayer \
  --h3-resolution 10 \
  --parent-resolutions "9,8,0" \
  --hex-memory 32Gi \
  --max-completions 200 \
  --max-parallelism 50 \
  --output-dir catalog/mydata/k8s/mylayer

# One-time RBAC setup (only needed once per cluster/namespace, likely already done)
kubectl apply -f catalog/mydata/k8s/mylayer/workflow-rbac.yaml

# Apply workflow (per dataset)
kubectl apply -f catalog/mydata/k8s/mylayer/configmap.yaml \
              -f catalog/mydata/k8s/mylayer/workflow.yaml

# Monitor
kubectl get jobs | grep my-dataset
kubectl logs job/my-dataset-workflow

That's it. The workflow orchestrates: bucket setup → convert to GeoParquet → PMTiles + H3 hex (parallel) → repartition.
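
For long runs it can help to block on completion rather than polling `kubectl get jobs`. A sketch using `kubectl wait`, with the job name derived from the hypothetical dataset name above:

```shell
DATASET="my-dataset"            # hypothetical name from the Quick Start
JOB="job/${DATASET}-workflow"
echo "${JOB}"
# Block until the orchestrator job succeeds (or the timeout expires):
#   kubectl wait --for=condition=complete "${JOB}" --timeout=24h
# Stream its logs in the meantime:
#   kubectl logs -f "${JOB}"
```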

Detailed Instructions

Repository Structure

catalog/
  <dataset>/
    k8s/           # Generated Kubernetes job YAML
    stac/          # README.md and stac-collection.json for the dataset
    *.ipynb        # Any exploratory notebooks (optional)

Each dataset gets a directory under catalog/. The k8s YAML is generated by cng-datasets workflow and applied with kubectl. STAC metadata is created after processing completes.
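
Adding a new dataset starts with the directory skeleton described above; a minimal sketch, using a hypothetical dataset name:

```shell
# Create the layout described above ("newdata" is a hypothetical name).
DATASET="newdata"
mkdir -p "catalog/${DATASET}/k8s" "catalog/${DATASET}/stac"
ls "catalog/${DATASET}"
```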

CLI Reference

See the cng-datasets README for full CLI documentation.

Key commands:

Command                  What it does                   Where it runs
cng-datasets workflow    Generates k8s job YAML         Your laptop
kubectl apply -f ...     Submits jobs to the cluster    Your laptop
kubectl get jobs         Monitors job status            Your laptop
Everything else          Processing, S3 uploads, etc.   Kubernetes pods

Infrastructure

  • Cluster: NRP Nautilus, namespace biodiversity
  • S3: Ceph object storage (S3-compatible, not AWS)
  • Public endpoint: https://s3-west.nrp-nautilus.io/<bucket>/<path>
  • Secrets: aws and rclone-config are pre-configured in the namespace

See .github/copilot-instructions.md for detailed infrastructure context.
