
Commit 4505e0e

Merge pull request #128 from Sage-Bionetworks-Workflows/dpe-1467-datacite
[DPE-1467] Add `datacite` module and test framework
2 parents 9b39522 + b2ab070 commit 4505e0e

File tree

15 files changed: +2254 additions, −33 deletions


.github/workflows/release.yml

Lines changed: 20 additions & 0 deletions

```diff
@@ -6,7 +6,27 @@ on:
   workflow_dispatch:
 
 jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.10'
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -r requirements-dev.txt
+
+      - name: Run tests
+        run: |
+          python -m pytest tests/ -v --tb=short
+
   ghcr-publish:
+    needs: test
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
```

.github/workflows/validate.yml

Lines changed: 29 additions & 0 deletions

New file:

```yaml
name: Validate

on: # Run on new commits and PR openings
  push:
    branches:
      - '**'
  pull_request:
    types: [opened, reopened]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements-dev.txt

      - name: Run tests
        run: |
          python -m pytest tests/ -v --tb=short
```

CONTRIBUTING.md

Lines changed: 115 additions & 19 deletions

````diff
@@ -1,27 +1,125 @@
 # Contribution Guidelines
 
-## Infrastructure
+## Development
 
-We have both dev and prod Airflow servers, although the dev server is not always running and there may not be feature parity between dev and prod (e.g., not all prod secrets have analogues in dev):
+### Environment
 
-* `airflow-dev`: Hosted in the `dnt-dev` AWS account.
-* `airflow-prod`: Hosted in the `dpe-prod` AWS account. Deployed using OpenTofu. Only accessible via [port forwarding](https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/#forward-a-local-port-to-a-port-on-the-pod).
-  * Deployed from the `main` branch in this repository.
+The development environment breaks down into two categories: Infrastructure and Code. This is because the repo contains both:
 
-Please see [Connecting to AWS EKS](https://sagebionetworks.jira.com/wiki/spaces/DPE/pages/3389325317/Connecting+to+AWS+EKS+Kubernetes+K8s+cluster) on Confluence if you want to interface with the EKS/Kubernetes cluster. Otherwise, for local development you will likely only be interested in using AWS Secrets Manager as a backend for Airflow Secrets.
+* Airflow DAG **code** (the workflows), which need appropriate Python environments.
+* Configuration files which construct the services that make up Airflow (the **infrastructure** which these workflows run upon).
 
-There is a helper script in this repository for accessing this Airflow server.
+#### Infrastructure
 
-## Development
+The Airflow infrastructure is containerized and orchestrated using Docker Compose for local development. See the [README](./README.md) for instructions on how to set up the development environment. The following files define and configure the Airflow environment:
+
+##### Core Infrastructure Files
+
+* `docker-compose.yaml` - Orchestrates the multi-container Airflow setup, including:
+  * Airflow webserver, scheduler, and workers
+  * PostgreSQL database (metadata storage)
+  * Redis (message broker for CeleryExecutor)
+  * Container networking, volumes, and health checks
+
+* `Dockerfile` - Builds the custom Airflow Docker image:
+  * All Python DAGs run within this environment
+
+* `config/airflow.cfg` - Airflow configuration file that controls:
+  * Scheduler behavior and intervals
+  * Executor settings (CeleryExecutor)
+  * Secrets backend configuration (AWS Secrets Manager)
+  * Logging, security, and other operational settings
+
+##### Development Environment Files
+
+* `.devcontainer/devcontainer.json` - VS Code Dev Container configuration for GitHub Codespaces and local development:
+  * Configures the development environment, defines VS Code extensions to install, and sets up port forwarding and environment variables.
+
+* `.env.example` - Template for environment variables used by Docker Compose:
+  * Note that this is not necessarily the preferred way to pass runtime configuration settings
+  * Can include Airflow connection strings, AWS credentials for secrets backend, etc.
+
+When making changes to infrastructure files (Dockerfile, docker-compose.yaml, config files), you'll need to rebuild the containers to see your changes take effect. (See code example in "Integration Testing" section).
+
+#### Code
+
+Python dependencies are managed in requirement files.
+
+Any python packages needed for DAG tasks or the DAGs themselves belongs in [requirements-airflow.txt](./requirements-airflow.txt).
+
+Any python packages needed for development, including running tests, belongs in [requirements-dev.txt](./requirements-dev.txt).
+
+### Structure
+
+We have structured this repo such that DAG _task_ logic ought to be separate from DAG logic. This makes testing of DAGs as a whole easier, since we can separately test task logic and DAG logic. This breaks down into three directories:
+
+- `src/` - This is where DAG task logic belongs. Code is organized as packages that can be imported by DAGs as needed.
+- `dags/` - This is where DAG logic belongs.
+- `tests/` - Unit tests for both the DAG task logic (packages in `src/`) and the DAGs themselves (`dags/`) belongs here. See the "Testing" Section below for more information.
+
+There is one additional directory where workflows can be found, although it is not part of the current framework for managing DAGs and their task logic.
+
+- `local/` - (DEPRECATED). Project-specific scripts and utilities.
+
+### Testing
+
+Testing breaks down into two categories: formal testing via unit tests and relatively informal testing via integration tests.
+
+#### Unit Testing
+
+Unit tests can be found in `tests/`. We use `pytest` as part of a Github actions workflow to automatically run tests when new commits are pushed to a branch. Tests can also be run locally, provided you are working in the appropriate development environment (See [README.md](./README.md) for instruction on how to set up the dev environment).
+
+```
+python -m pytest tests/ -v --tb=short
+```
+
+Because of the wide variety of use-cases which this repo supports, we further divide tests into subdirectories within `tests/` depending on their domain. For example, the `tests/datacite/` directory contains tests for everything in the `src/datacite/` directory.
+
+DAG unit tests belong in the `tests/dags/` directory. Unlike DAG task logic, which is much more diverse, DAG logic is homogenous enough that we can organize all DAG unit tests in a single directory.
 
-See the [README](./README.md) for instructions on how to set up the development environment.
+You are welcome to write tests in any form which `pytest` supports, although it is recommended that you make use of fixtures to keep tests easy to maintain and organize unit tests into classes for ease of testing.
+
+The below directory structure demonstrates a typical way to keep things organized:
+```
+tests/
+├── mypackage/
+│   ├── __init__.py          # Package marker
+│   ├── conftest.py          # Pytest fixtures (auto-discovered)
+│   └── test_mypackage.py    # Test suite
+├── dags/
+│   ├── __init__.py          # Package marker
+│   └── test_mydag.py        # Test suite
+```
+
+#### Integration Testing
+
+Presently, integration testing means triggering your DAG in Airflow and manually inspecting the results. See the [README.md](README.md) on how to deploy and connect to Airflow.
+
+##### DAG Set Up
 
 Any edits to your DAG should automatically be picked up by the Airflow scheduler/webserver after a short time interval (see `scheduler.min_file_process_interval` in [airflow.cfg](config/airflow.cfg)). New DAGs are picked up by the scheduler/webserver according to a different interval (see `scheduler.dag_dir_list_interval`). You can force a "hard refresh" by restarting the containers:
 
 ```console
 docker compose restart
 ```
 
+##### DAG Testing
+
+Integration testing can be performed by triggering a DAG via the Airflow command-line or web UI. Note that for testing of the DAGs directly on Airflow locally via Dev Containers, it's best to leave the DAG **unpaused** when triggering the DAG with various updates, otherwise you might be triggering the DAG twice and/or triggering it in its original state that had its parameters set to production mode.
+
+> [!NOTE]
+> Some DAGs use runtime configuration in the form of Params or Connections and Secrets. It's not always well-documented in the DAG itself how the runtime configuration is set up, so if your DAG uses runtime configuration, yet it's not clear how these values are passed through to the DAG itself, it's generally better to test the DAG in GitHub Codespaces.
+
+Logs can be inspected with docker compose:
+```console
+# All logs
+docker compose logs -f
+
+# Logs for a specific service(s)
+docker compose ps --services
+docker compose logs -f airflow-webserver airflow-scheduler
+```
+
 If you edit `Dockerfile`, `docker-compose.yaml`, `requirements-*.txt`, or configuration files, or otherwise want to redo the build process, rebuild the containers:
 
 ```console
````
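
As a side note on the unit-testing layout introduced above (fixtures in `conftest.py`, tests grouped into classes): below is a minimal sketch of what such a test module might look like, using the hypothetical `mypackage` names from the example tree rather than any real package in this repo. The fixture is inlined for brevity; in practice it would live in `tests/mypackage/conftest.py` so pytest auto-discovers it.

```python
# Hypothetical tests/mypackage/test_mypackage.py; the fixture is inlined here
# for brevity but would normally live in tests/mypackage/conftest.py.
import pytest


@pytest.fixture
def sample_record() -> dict:
    """A small, reusable payload shared across tests."""
    return {"id": "10.1234/example", "title": "Example record"}


class TestSampleRecord:
    """Grouping related tests in a class keeps the suite easy to navigate."""

    def test_id_looks_like_a_doi(self, sample_record: dict) -> None:
        assert sample_record["id"].startswith("10.")

    def test_title_is_present(self, sample_record: dict) -> None:
        assert sample_record["title"] == "Example record"
```

The `python -m pytest tests/ -v --tb=short` command used by the workflows above would collect and run a module like this automatically.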
````diff
@@ -32,19 +130,17 @@ docker compose up --build --detach
 # docker compose up --no-cache --build --detach
 ```
 
-## Testing
+## Deployment Infrastructure
 
-Testing should be done via the Dev Containers setup online using GitHub Codespaces. Note that for testing of the DAGs directly on Airflow locally via Dev Containers, it's best to leave the DAG **unpaused** when triggering the DAG with various updates, otherwise you might be triggering the DAG twice and/or triggering it in its original state that had its parameters set to production mode.
+We have both dev and prod Airflow servers, although the dev server is not always running and there may not be feature parity between dev and prod (e.g., not all prod secrets have analogues in dev):
 
-Logs can be inspected with docker compose:
-```console
-# All logs
-docker compose logs -f
+* `airflow-dev`: Hosted in the `dnt-dev` AWS account.
+* `airflow-prod`: Hosted in the `dpe-prod` AWS account. Deployed using OpenTofu. Only accessible via [port forwarding](https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/#forward-a-local-port-to-a-port-on-the-pod).
+  * Deployed from the `main` branch in this repository.
 
-# Logs for a specific service(s)
-docker compose ps --services
-docker compose logs -f airflow-webserver airflow-scheduler
-```
+Please see [Connecting to AWS EKS](https://sagebionetworks.jira.com/wiki/spaces/DPE/pages/3389325317/Connecting+to+AWS+EKS+Kubernetes+K8s+cluster) on Confluence if you want to interface with the EKS/Kubernetes cluster. Otherwise, for local development you will likely only be interested in using AWS Secrets Manager as a backend for Airflow Secrets.
+
+There is a helper script in this repository for accessing this Airflow server.
 
 ## DAG Development Best Practices
 
````
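
For the "DAG Testing" step described in the first hunk, a hedged way to exercise a DAG from the command line against the local Docker Compose stack is shown below. The `airflow-scheduler` service name matches the services referenced elsewhere in this commit; the DAG id and `--conf` payload are hypothetical.

```console
# List DAGs, unpause the one under test, then trigger it with optional runtime config
docker compose exec airflow-scheduler airflow dags list
docker compose exec airflow-scheduler airflow dags unpause my_example_dag
docker compose exec airflow-scheduler airflow dags trigger my_example_dag --conf '{"dry_run": true}'
```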

README.md

Lines changed: 7 additions & 14 deletions

```diff
@@ -1,13 +1,13 @@
 # ORCA Recipes
 
-This repository contains Airflow recipes (DAGs) for data processing and engineering at Sage Bionetworks.
+This repository contains Airflow recipes (DAGs) for data processing and engineering at Sage Bionetworks. If you want to develop a workflow to process data, you've come to the right place.
 
-## Key Features
+## Example Workflows
 
-- **Challenge Automation**: Dynamic DAG factory for Synapse-hosted challenges using configurable YAML profiles
-- **Data Analytics**: DAGs for Synapse project analytics, trending data, and metrics collection
-- **Dataset Management**: Automated dataset creation, annotation, and metadata processing
-- **Integration Workflows**: Data pipelines connecting Synapse, Snowflake, and other platforms
+- **Challenge Automation** - Automatically evaluate challenge submissions by fetching entries from Synapse and orchestrating Nextflow workflows via Seqera Platform.
+- **Dataset Discovery** - Generate Croissant-format metadata for Synapse datasets and publish to public S3 for improved discoverability.
+- **Analytics Pipelines** - Sync Synapse Portal data to Snowflake and generate platform usage reports tracking downloads, users, and storage.
+- **Bioinformatics QC** - Launch and monitor data quality control workflows for genomics projects (GENIE, HTAN).
 
 ## Airflow Development
 
@@ -121,14 +121,7 @@ source venv/bin/activate
 
 ## Contributing
 
-For detailed contribution guidelines, including DAG development best practices and how to contribute challenge DAGs, see [CONTRIBUTING.md](CONTRIBUTING.md).
-
-## Repository Structure
-
-- `dags/` - Production Airflow DAGs and challenge configurations
-- `config/` - Airflow configuration files
-- `local/` - Project-specific scripts and utilities
-- `requirements-*.txt` - Python dependencies for different environments
+For detailed contribution guidelines, repository structure, and testing instructions, see [CONTRIBUTING.md](CONTRIBUTING.md).
 
 ## Releases
```

docker-compose.yaml

Lines changed: 3 additions & 0 deletions

```diff
@@ -76,11 +76,14 @@ x-airflow-common:
     AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-}
     AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-}
     AWS_SESSION_TOKEN: ${AWS_SESSION_TOKEN:-}
+    # Add src/ to PYTHONPATH so modules can be imported directly
+    PYTHONPATH: /opt/airflow/src:${PYTHONPATH:-}
   volumes:
     - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
     - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
     - ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins
     - ${AIRFLOW_PROJ_DIR:-.}/config/airflow.cfg:/opt/airflow/airflow.cfg #mounts airflow.cfg
+    - ${AIRFLOW_PROJ_DIR:-.}/src:/opt/airflow/src #mounts src/ for custom modules
   user: "${AIRFLOW_UID:-50000}:0"
   depends_on:
     &airflow-common-depends-on
```
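
The two added lines are what let DAG files import task-logic packages from `src/` without any path manipulation. A minimal sketch of a hypothetical DAG relying on this follows; the `fetch_doi_prefix`/`write_ndjson_gz` call signatures are assumptions for illustration only, and the real interface lives in `src/datacite/datacite.py`.

```python
# dags/datacite_export_example.py -- hypothetical sketch, not a DAG in this commit.
from __future__ import annotations

import pendulum
from airflow.decorators import dag, task

# Resolves because docker-compose.yaml now mounts ./src at /opt/airflow/src
# and adds that path to PYTHONPATH.
from datacite import fetch_doi_prefix, write_ndjson_gz


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1, tz="UTC"), catchup=False)
def datacite_export_example():
    @task
    def export_dois() -> None:
        # Assumed signatures, shown only to illustrate the import path;
        # see src/datacite/datacite.py for the actual interface.
        records = fetch_doi_prefix("10.7303")  # example DOI prefix
        write_ndjson_gz(records, "/tmp/dois.ndjson.gz")

    export_dois()


datacite_export_example()
```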

requirements-airflow.txt

Lines changed: 1 addition & 0 deletions

```diff
@@ -8,3 +8,4 @@ slack-sdk >=3.27
 pendulum~=3.0.0
 jsonata-python ~=0.5.3
 boto3 >=1.7.0,<2.0
+requests ~=2.31
```

requirements-dev.txt

Lines changed: 3 additions & 0 deletions

```diff
@@ -1,3 +1,6 @@
 fs-synapse >=2.0,<3.0
 s3fs ~=2023.5
 metaflow ~=2.9
+pytest ~=8.0
+pytest-mock ~=3.0
+requests ~=2.31
```

src/datacite/README.md

Lines changed: 7 additions & 0 deletions

New file:

```markdown
# DataCite

A Python client for fetching DOI (Digital Object Identifier) metadata from the DataCite REST API.

## Documentation

See the module docstring in `datacite.py` for detailed documentation and examples.
```
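
The module itself is not part of this diff excerpt, but for orientation, here is a minimal, hedged sketch of fetching a single DOI record from the public DataCite REST API with `requests` (which this commit adds to the requirement files). It is independent of `datacite.py`, whose actual interface is documented in its module docstring.

```python
# Hypothetical standalone example; not the implementation in src/datacite/datacite.py.
import requests

DATACITE_API = "https://api.datacite.org/dois"


def get_doi_attributes(doi: str) -> dict:
    """Fetch metadata for one DOI from the DataCite REST API (JSON:API response)."""
    response = requests.get(f"{DATACITE_API}/{doi}", timeout=30)
    response.raise_for_status()
    return response.json()["data"]["attributes"]


if __name__ == "__main__":
    # Substitute any DataCite-registered DOI here.
    attributes = get_doi_attributes("10.7303/syn12345678")
    print(attributes.get("titles"))
```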

src/datacite/__init__.py

Lines changed: 10 additions & 0 deletions

New file:

```python
"""DataCite utilities package."""
from .datacite import (
    fetch_doi_prefix,
    write_ndjson_gz,
)

__all__ = [
    "fetch_doi_prefix",
    "write_ndjson_gz",
]
```
