
Commit 4505e0e

Merge pull request #128 from Sage-Bionetworks-Workflows/dpe-1467-datacite
[DPE-1467] Add `datacite` module and test framework
2 parents 9b39522 + b2ab070 commit 4505e0e

File tree

15 files changed: +2254 additions, −33 deletions


.github/workflows/release.yml

Lines changed: 20 additions & 0 deletions

```diff
@@ -6,7 +6,27 @@ on:
   workflow_dispatch:
 
 jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.10'
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -r requirements-dev.txt
+
+      - name: Run tests
+        run: |
+          python -m pytest tests/ -v --tb=short
+
   ghcr-publish:
+    needs: test
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
```

.github/workflows/validate.yml

Lines changed: 29 additions & 0 deletions

New file:

```yaml
name: Validate

on: # Run on new commits and PR openings
  push:
    branches:
      - '**'
  pull_request:
    types: [opened, reopened]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements-dev.txt

      - name: Run tests
        run: |
          python -m pytest tests/ -v --tb=short
```

CONTRIBUTING.md

Lines changed: 115 additions & 19 deletions

````diff
@@ -1,27 +1,125 @@
 # Contribution Guidelines
 
-## Infrastructure
+## Development
 
-We have both dev and prod Airflow servers, although the dev server is not always running and there may not be feature parity between dev and prod (e.g., not all prod secrets have analogues in dev):
+### Environment
 
-* `airflow-dev`: Hosted in the `dnt-dev` AWS account.
-* `airflow-prod`: Hosted in the `dpe-prod` AWS account. Deployed using OpenTofu. Only accessible via [port forwarding](https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/#forward-a-local-port-to-a-port-on-the-pod).
-  * Deployed from the `main` branch in this repository.
+The development environment breaks down into two categories: Infrastructure and Code. This is because the repo contains both:
 
-Please see [Connecting to AWS EKS](https://sagebionetworks.jira.com/wiki/spaces/DPE/pages/3389325317/Connecting+to+AWS+EKS+Kubernetes+K8s+cluster) on Confluence if you want to interface with the EKS/Kubernetes cluster. Otherwise, for local development you will likely only be interested in using AWS Secrets Manager as a backend for Airflow Secrets.
+* Airflow DAG **code** (the workflows), which need appropriate Python environments.
+* Configuration files which construct the services that make up Airflow (the **infrastructure** which these workflows run upon).
 
-There is a helper script in this repository for accessing this Airflow server.
+#### Infrastructure
 
-## Development
+The Airflow infrastructure is containerized and orchestrated using Docker Compose for local development. See the [README](./README.md) for instructions on how to set up the development environment. The following files define and configure the Airflow environment:
+
+##### Core Infrastructure Files
+
+* `docker-compose.yaml` - Orchestrates the multi-container Airflow setup, including:
+  * Airflow webserver, scheduler, and workers
+  * PostgreSQL database (metadata storage)
+  * Redis (message broker for CeleryExecutor)
+  * Container networking, volumes, and health checks
+
+* `Dockerfile` - Builds the custom Airflow Docker image:
+  * All Python DAGs run within this environment
+
+* `config/airflow.cfg` - Airflow configuration file that controls:
+  * Scheduler behavior and intervals
+  * Executor settings (CeleryExecutor)
+  * Secrets backend configuration (AWS Secrets Manager)
+  * Logging, security, and other operational settings
+
+##### Development Environment Files
+
+* `.devcontainer/devcontainer.json` - VS Code Dev Container configuration for GitHub Codespaces and local development:
+  * Configures the development environment, defines VS Code extensions to install, and sets up port forwarding and environment variables.
+
+* `.env.example` - Template for environment variables used by Docker Compose:
+  * Note that this is not necessarily the preferred way to pass runtime configuration settings
+  * Can include Airflow connection strings, AWS credentials for secrets backend, etc.
+
+When making changes to infrastructure files (Dockerfile, docker-compose.yaml, config files), you'll need to rebuild the containers to see your changes take effect. (See code example in "Integration Testing" section).
+
+#### Code
+
+Python dependencies are managed in requirement files.
+
+Any python packages needed for DAG tasks or the DAGs themselves belongs in [requirements-airflow.txt](./requirements-airflow.txt).
+
+Any python packages needed for development, including running tests, belongs in [requirements-dev.txt](./requirements-dev.txt).
+
+### Structure
+
+We have structured this repo such that DAG _task_ logic ought to be separate from DAG logic. This makes testing of DAGs as a whole easier, since we can separately test task logic and DAG logic. This breaks down into three directories:
+
+- `src/` - This is where DAG task logic belongs. Code is organized as packages that can be imported by DAGs as needed.
+- `dags/` - This is where DAG logic belongs.
+- `tests/` - Unit tests for both the DAG task logic (packages in `src/`) and the DAGs themselves (`dags/`) belongs here. See the "Testing" Section below for more information.
+
+There is one additional directory where workflows can be found, although it is not part of the current framework for managing DAGs and their task logic.
+
+- `local/` - (DEPRECATED). Project-specific scripts and utilities.
+
+### Testing
+
+Testing breaks down into two categories: formal testing via unit tests and relatively informal testing via integration tests.
+
+#### Unit Testing
+
+Unit tests can be found in `tests/`. We use `pytest` as part of a Github actions workflow to automatically run tests when new commits are pushed to a branch. Tests can also be run locally, provided you are working in the appropriate development environment (See [README.md](./README.md) for instruction on how to set up the dev environment).
+
+```
+python -m pytest tests/ -v --tb=short
+```
+
+Because of the wide variety of use-cases which this repo supports, we further divide tests into subdirectories within `tests/` depending on their domain. For example, the `tests/datacite/` directory contains tests for everything in the `src/datacite/` directory.
+
+DAG unit tests belong in the `tests/dags/` directory. Unlike DAG task logic, which is much more diverse, DAG logic is homogenous enough that we can organize all DAG unit tests in a single directory.
 
-See the [README](./README.md) for instructions on how to set up the development environment.
+You are welcome to write tests in any form which `pytest` supports, although it is recommended that you make use of fixtures to keep tests easy to maintain and organize unit tests into classes for ease of testing.
+
+The below directory structure demonstrates a typical way to keep things organized:
+```
+tests/
+├── mypackage/
+│   ├── __init__.py          # Package marker
+│   ├── conftest.py          # Pytest fixtures (auto-discovered)
+│   └── test_mypackage.py    # Test suite
+├── dags/
+│   ├── __init__.py          # Package marker
+│   └── test_mydag.py        # Test suite
+```
+
+#### Integration Testing
+
+Presently, integration testing means triggering your DAG in Airflow and manually inspecting the results. See the [README.md](README.md) on how to deploy and connect to Airflow.
+
+##### DAG Set Up
 
 Any edits to your DAG should automatically be picked up by the Airflow scheduler/webserver after a short time interval (see `scheduler.min_file_process_interval` in [airflow.cfg](config/airflow.cfg)). New DAGs are picked up by the scheduler/webserver according to a different interval (see `scheduler.dag_dir_list_interval`). You can force a "hard refresh" by restarting the containers:
 
 ```console
 docker compose restart
 ```
 
+##### DAG Testing
+
+Integration testing can be performed by triggering a DAG via the Airflow command-line or web UI. Note that for testing of the DAGs directly on Airflow locally via Dev Containers, it's best to leave the DAG **unpaused** when triggering the DAG with various updates, otherwise you might be triggering the DAG twice and/or triggering it in its original state that had its parameters set to production mode.
+
+> [!NOTE]
+> Some DAGs use runtime configuration in the form of Params or Connections and Secrets. It's not always well-documented in the DAG itself how the runtime configuration is set up, so if your DAG uses runtime configuration, yet it's not clear how these values are passed through to the DAG itself, it's generally better to test the DAG in GitHub Codespaces.
+
+Logs can be inspected with docker compose:
+```console
+# All logs
+docker compose logs -f
+
+# Logs for a specific service(s)
+docker compose ps --services
+docker compose logs -f airflow-webserver airflow-scheduler
+```
+
 If you edit `Dockerfile`, `docker-compose.yaml`, `requirements-*.txt`, or configuration files, or otherwise want to redo the build process, rebuild the containers:
 
 ```console
````
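
As a side note on the unit-testing layout introduced above (fixtures in `conftest.py`, tests grouped into classes): below is a minimal sketch of what such a test module might look like, using the hypothetical `mypackage` names from the example tree rather than any real package in this repo. The fixture is inlined for brevity; in practice it would live in `tests/mypackage/conftest.py` so pytest auto-discovers it.

```python
# Hypothetical tests/mypackage/test_mypackage.py; the fixture is inlined here
# for brevity but would normally live in tests/mypackage/conftest.py.
import pytest


@pytest.fixture
def sample_record() -> dict:
    """A small, reusable payload shared across tests."""
    return {"id": "10.1234/example", "title": "Example record"}


class TestSampleRecord:
    """Grouping related tests in a class keeps the suite easy to navigate."""

    def test_id_looks_like_a_doi(self, sample_record: dict) -> None:
        assert sample_record["id"].startswith("10.")

    def test_title_is_present(self, sample_record: dict) -> None:
        assert sample_record["title"] == "Example record"
```

The `python -m pytest tests/ -v --tb=short` command used by the workflows above would collect and run a module like this automatically.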
````diff
@@ -32,19 +130,17 @@ docker compose up --build --detach
 # docker compose up --no-cache --build --detach
 ```
 
-## Testing
+## Deployment Infrastructure
 
-Testing should be done via the Dev Containers setup online using GitHub Codespaces. Note that for testing of the DAGs directly on Airflow locally via Dev Containers, it's best to leave the DAG **unpaused** when triggering the DAG with various updates, otherwise you might be triggering the DAG twice and/or triggering it in its original state that had its parameters set to production mode.
+We have both dev and prod Airflow servers, although the dev server is not always running and there may not be feature parity between dev and prod (e.g., not all prod secrets have analogues in dev):
 
-Logs can be inspected with docker compose:
-```console
-# All logs
-docker compose logs -f
+* `airflow-dev`: Hosted in the `dnt-dev` AWS account.
+* `airflow-prod`: Hosted in the `dpe-prod` AWS account. Deployed using OpenTofu. Only accessible via [port forwarding](https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/#forward-a-local-port-to-a-port-on-the-pod).
+  * Deployed from the `main` branch in this repository.
 
-# Logs for a specific service(s)
-docker compose ps --services
-docker compose logs -f airflow-webserver airflow-scheduler
-```
+Please see [Connecting to AWS EKS](https://sagebionetworks.jira.com/wiki/spaces/DPE/pages/3389325317/Connecting+to+AWS+EKS+Kubernetes+K8s+cluster) on Confluence if you want to interface with the EKS/Kubernetes cluster. Otherwise, for local development you will likely only be interested in using AWS Secrets Manager as a backend for Airflow Secrets.
+
+There is a helper script in this repository for accessing this Airflow server.
 
 ## DAG Development Best Practices
 
````
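
For the "DAG Testing" step described in the first hunk, a hedged way to exercise a DAG from the command line against the local Docker Compose stack is shown below. The `airflow-scheduler` service name matches the services referenced elsewhere in this commit; the DAG id and `--conf` payload are hypothetical.

```console
# List DAGs, unpause the one under test, then trigger it with optional runtime config
docker compose exec airflow-scheduler airflow dags list
docker compose exec airflow-scheduler airflow dags unpause my_example_dag
docker compose exec airflow-scheduler airflow dags trigger my_example_dag --conf '{"dry_run": true}'
```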

README.md

Lines changed: 7 additions & 14 deletions

```diff
@@ -1,13 +1,13 @@
 # ORCA Recipes
 
-This repository contains Airflow recipes (DAGs) for data processing and engineering at Sage Bionetworks.
+This repository contains Airflow recipes (DAGs) for data processing and engineering at Sage Bionetworks. If you want to develop a workflow to process data, you've come to the right place.
 
-## Key Features
+## Example Workflows
 
-- **Challenge Automation**: Dynamic DAG factory for Synapse-hosted challenges using configurable YAML profiles
-- **Data Analytics**: DAGs for Synapse project analytics, trending data, and metrics collection
-- **Dataset Management**: Automated dataset creation, annotation, and metadata processing
-- **Integration Workflows**: Data pipelines connecting Synapse, Snowflake, and other platforms
+- **Challenge Automation** - Automatically evaluate challenge submissions by fetching entries from Synapse and orchestrating Nextflow workflows via Seqera Platform.
+- **Dataset Discovery** - Generate Croissant-format metadata for Synapse datasets and publish to public S3 for improved discoverability.
+- **Analytics Pipelines** - Sync Synapse Portal data to Snowflake and generate platform usage reports tracking downloads, users, and storage.
+- **Bioinformatics QC** - Launch and monitor data quality control workflows for genomics projects (GENIE, HTAN).
 
 ## Airflow Development
 
@@ -121,14 +121,7 @@ source venv/bin/activate
 
 ## Contributing
 
-For detailed contribution guidelines, including DAG development best practices and how to contribute challenge DAGs, see [CONTRIBUTING.md](CONTRIBUTING.md).
-
-## Repository Structure
-
-- `dags/` - Production Airflow DAGs and challenge configurations
-- `config/` - Airflow configuration files
-- `local/` - Project-specific scripts and utilities
-- `requirements-*.txt` - Python dependencies for different environments
+For detailed contribution guidelines, repository structure, and testing instructions, see [CONTRIBUTING.md](CONTRIBUTING.md).
 
 ## Releases
```

docker-compose.yaml

Lines changed: 3 additions & 0 deletions

```diff
@@ -76,11 +76,14 @@ x-airflow-common:
     AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-}
     AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-}
     AWS_SESSION_TOKEN: ${AWS_SESSION_TOKEN:-}
+    # Add src/ to PYTHONPATH so modules can be imported directly
+    PYTHONPATH: /opt/airflow/src:${PYTHONPATH:-}
   volumes:
     - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
     - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
     - ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins
     - ${AIRFLOW_PROJ_DIR:-.}/config/airflow.cfg:/opt/airflow/airflow.cfg #mounts airflow.cfg
+    - ${AIRFLOW_PROJ_DIR:-.}/src:/opt/airflow/src #mounts src/ for custom modules
   user: "${AIRFLOW_UID:-50000}:0"
   depends_on:
     &airflow-common-depends-on
```
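
The two added lines are what let DAG files import task-logic packages from `src/` without any path manipulation. A minimal sketch of a hypothetical DAG relying on this follows; the `fetch_doi_prefix`/`write_ndjson_gz` call signatures are assumptions for illustration only, and the real interface lives in `src/datacite/datacite.py`.

```python
# dags/datacite_export_example.py -- hypothetical sketch, not a DAG in this commit.
from __future__ import annotations

import pendulum
from airflow.decorators import dag, task

# Resolves because docker-compose.yaml now mounts ./src at /opt/airflow/src
# and adds that path to PYTHONPATH.
from datacite import fetch_doi_prefix, write_ndjson_gz


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1, tz="UTC"), catchup=False)
def datacite_export_example():
    @task
    def export_dois() -> None:
        # Assumed signatures, shown only to illustrate the import path;
        # see src/datacite/datacite.py for the actual interface.
        records = fetch_doi_prefix("10.7303")  # example DOI prefix
        write_ndjson_gz(records, "/tmp/dois.ndjson.gz")

    export_dois()


datacite_export_example()
```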

requirements-airflow.txt

Lines changed: 1 addition & 0 deletions

```diff
@@ -8,3 +8,4 @@ slack-sdk >=3.27
 pendulum~=3.0.0
 jsonata-python ~=0.5.3
 boto3 >=1.7.0,<2.0
+requests ~=2.31
```

requirements-dev.txt

Lines changed: 3 additions & 0 deletions

```diff
@@ -1,3 +1,6 @@
 fs-synapse >=2.0,<3.0
 s3fs ~=2023.5
 metaflow ~=2.9
+pytest ~=8.0
+pytest-mock ~=3.0
+requests ~=2.31
```

src/datacite/README.md

Lines changed: 7 additions & 0 deletions

New file:

```markdown
# DataCite

A Python client for fetching DOI (Digital Object Identifier) metadata from the DataCite REST API.

## Documentation

See the module docstring in `datacite.py` for detailed documentation and examples.
```
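
The module itself is not part of this diff excerpt, but for orientation, here is a minimal, hedged sketch of fetching a single DOI record from the public DataCite REST API with `requests` (which this commit adds to the requirement files). It is independent of `datacite.py`, whose actual interface is documented in its module docstring.

```python
# Hypothetical standalone example; not the implementation in src/datacite/datacite.py.
import requests

DATACITE_API = "https://api.datacite.org/dois"


def get_doi_attributes(doi: str) -> dict:
    """Fetch metadata for one DOI from the DataCite REST API (JSON:API response)."""
    response = requests.get(f"{DATACITE_API}/{doi}", timeout=30)
    response.raise_for_status()
    return response.json()["data"]["attributes"]


if __name__ == "__main__":
    # Substitute any DataCite-registered DOI here.
    attributes = get_doi_attributes("10.7303/syn12345678")
    print(attributes.get("titles"))
```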

src/datacite/__init__.py

Lines changed: 10 additions & 0 deletions

New file:

```python
"""DataCite utilities package."""
from .datacite import (
    fetch_doi_prefix,
    write_ndjson_gz,
)

__all__ = [
    "fetch_doi_prefix",
    "write_ndjson_gz",
]
```
