### Environment
The development environment breaks down into two categories: Infrastructure and Code. This is because the repo contains both:
* Airflow DAG **code** (the workflows), which need appropriate Python environments.
* Configuration files which construct the services that make up Airflow (the **infrastructure** which these workflows run upon).
#### Infrastructure
The Airflow infrastructure is containerized and orchestrated using Docker Compose for local development. See the [README](./README.md) for instructions on how to set up the development environment. The following files define and configure the Airflow environment:
##### Core Infrastructure Files

* `docker-compose.yaml` - Orchestrates the multi-container Airflow setup, including:
  * Airflow webserver, scheduler, and workers
  * PostgreSQL database (metadata storage)
  * Redis (message broker for CeleryExecutor)
  * Container networking, volumes, and health checks

* `Dockerfile` - Builds the custom Airflow Docker image:
  * All Python DAGs run within this environment

* `config/airflow.cfg` - Airflow configuration file that controls:
  * Logging, security, and other operational settings
##### Development Environment Files

* `.devcontainer/devcontainer.json` - VS Code Dev Container configuration for GitHub Codespaces and local development:
  * Configures the development environment, defines VS Code extensions to install, and sets up port forwarding and environment variables.

* `.env.example` - Template for environment variables used by Docker Compose:
  * Note that this is not necessarily the preferred way to pass runtime configuration settings
  * Can include Airflow connection strings, AWS credentials for the secrets backend, etc.
When making changes to infrastructure files (Dockerfile, docker-compose.yaml, config files), you'll need to rebuild the containers for your changes to take effect (see the code example in the "Integration Testing" section).
#### Code
Python dependencies are managed in requirements files.

Any Python packages needed for DAG tasks or the DAGs themselves belong in [requirements-airflow.txt](./requirements-airflow.txt).

Any Python packages needed for development, including running tests, belong in [requirements-dev.txt](./requirements-dev.txt).
### Structure
We have structured this repo such that DAG _task_ logic ought to be separate from DAG logic. This makes testing of DAGs as a whole easier, since we can separately test task logic and DAG logic. This breaks down into three directories:

- `src/` - This is where DAG task logic belongs. Code is organized as packages that can be imported by DAGs as needed.
- `dags/` - This is where DAG logic belongs.
- `tests/` - Unit tests for both the DAG task logic (packages in `src/`) and the DAGs themselves (`dags/`) belong here. See the "Testing" section below for more information.

There is one additional directory where workflows can be found, although it is not part of the current framework for managing DAGs and their task logic.

- `local/` - (DEPRECATED) Project-specific scripts and utilities.
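
To make the split between `src/` and `dags/` concrete, here is a minimal sketch of a DAG that keeps its task logic in a `src/` package. The package, module, and DAG names (`example_package`, `say_hello`, `example_hello_dag`) are hypothetical and for illustration only:

```python
# dags/example_hello_dag.py -- a hypothetical DAG; names are illustrative only.
from datetime import datetime

from airflow.decorators import dag, task

# Task logic lives in src/ and is imported by the DAG
# (assumes a hypothetical src/example_package/hello.py that defines say_hello()).
from example_package.hello import say_hello


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def example_hello_dag():
    @task
    def hello_task() -> str:
        # Keep the task thin: delegate the real work to the src/ package
        return say_hello()

    hello_task()


example_hello_dag()
```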
### Testing
Testing breaks down into two categories: formal testing via unit tests and relatively informal testing via integration tests.
#### Unit Testing
Unit tests can be found in `tests/`. We use `pytest` as part of a GitHub Actions workflow to automatically run tests when new commits are pushed to a branch. Tests can also be run locally, provided you are working in the appropriate development environment (see [README.md](./README.md) for instructions on how to set up the dev environment).
```console
python -m pytest tests/ -v --tb=short
```
Because of the wide variety of use-cases which this repo supports, we further divide tests into subdirectories within `tests/` depending on their domain. For example, the `tests/datacite/` directory contains tests for everything in the `src/datacite/` directory.

DAG unit tests belong in the `tests/dags/` directory. Unlike DAG task logic, which is much more diverse, DAG logic is homogeneous enough that we can organize all DAG unit tests in a single directory.
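
For example, a common pattern for DAG-level tests is an import check that loads everything in `dags/` and fails on any import error; a minimal sketch (the file name is illustrative):

```python
# tests/dags/test_dag_integrity.py -- a minimal sketch of a DAG-level unit test.
from airflow.models import DagBag


def test_dags_import_without_errors():
    # Parse every DAG file in dags/ and fail if any of them raises on import
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import errors: {dag_bag.import_errors}"
```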
You are welcome to write tests in any form which `pytest` supports, although we recommend using fixtures to keep tests easy to maintain and grouping unit tests into classes to keep them organized.
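
A small sketch of that style, reusing the hypothetical `example_package` from the Structure section (module, fixture, and test names are illustrative):

```python
# tests/example_package/test_hello.py -- illustrative only; names are assumptions.
import pytest

from example_package.hello import say_hello


@pytest.fixture
def expected_greeting() -> str:
    # Shared test data lives in fixtures so it is easy to change in one place
    return "hello"


class TestSayHello:
    def test_returns_expected_greeting(self, expected_greeting):
        assert say_hello() == expected_greeting
```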
The below directory structure demonstrates a typical way to keep things organized:
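
(The file names below are illustrative; `example_package` is the hypothetical package from the earlier sketches.)

```
tests/
├── dags/
│   └── test_dag_integrity.py    # DAG-level tests (e.g., import checks)
├── datacite/
│   └── test_datacite.py         # tests for the src/datacite/ package
└── example_package/
    └── test_hello.py            # class TestSayHello, fixtures, etc.
```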
#### Integration Testing

Presently, integration testing means triggering your DAG in Airflow and manually inspecting the results. See the [README.md](README.md) for instructions on how to deploy and connect to Airflow.
##### DAG Set Up
Any edits to your DAG should automatically be picked up by the Airflow scheduler/webserver after a short time interval (see `scheduler.min_file_process_interval` in [airflow.cfg](config/airflow.cfg)). New DAGs are picked up by the scheduler/webserver according to a different interval (see `scheduler.dag_dir_list_interval`). You can force a "hard refresh" by restarting the containers:
```console
docker compose restart
```
##### DAG Testing
Integration testing can be performed by triggering a DAG via the Airflow command line or web UI. Note that when testing DAGs directly on Airflow locally via Dev Containers, it's best to leave the DAG **unpaused** while triggering it with successive updates; otherwise you might trigger the DAG twice and/or trigger it in its original state, with its parameters still set to production mode.
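
For example, assuming a hypothetical DAG ID of `example_hello_dag` and a compose service named `airflow-webserver` (adjust both to your setup), a run can be triggered from the command line:

```console
# Queue a manual run of the DAG
docker compose exec airflow-webserver airflow dags trigger example_hello_dag

# Or run the DAG in a single local process, without the scheduler, for a quick check
docker compose exec airflow-webserver airflow dags test example_hello_dag 2024-01-01
```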
> [!NOTE]
> Some DAGs use runtime configuration in the form of Params or Connections and Secrets. How that configuration is wired up is not always well documented in the DAG itself, so if it's unclear how these values are passed through to your DAG, it's generally better to test the DAG in GitHub Codespaces.
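
As a rough sketch of what runtime configuration can look like inside a DAG (the Param, Connection ID, and DAG name below are hypothetical, not taken from this repository):

```python
# dags/example_runtime_config_dag.py -- hypothetical; Param and Connection names are illustrative.
from datetime import datetime

from airflow.decorators import dag, task
from airflow.hooks.base import BaseHook
from airflow.operators.python import get_current_context


@dag(
    schedule=None,
    start_date=datetime(2024, 1, 1),
    catchup=False,
    params={"dry_run": True},  # Param that can be overridden when the DAG is triggered
)
def example_runtime_config_dag():
    @task
    def report() -> None:
        context = get_current_context()
        dry_run = context["params"]["dry_run"]
        # Connections/Secrets are resolved through the configured secrets backend
        conn = BaseHook.get_connection("example_conn_id")
        print(f"dry_run={dry_run}, host={conn.host}")

    report()


example_runtime_config_dag()
```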
If you edit `Dockerfile`, `docker-compose.yaml`, `requirements-*.txt`, or configuration files, or otherwise want to redo the build process, rebuild the containers:
```console
docker compose up --build --detach

# To rebuild the images from scratch, bypassing the build cache:
# docker compose build --no-cache && docker compose up --detach
```
## Deployment Infrastructure
We have both dev and prod Airflow servers, although the dev server is not always running and there may not be feature parity between dev and prod (e.g., not all prod secrets have analogues in dev):
* `airflow-dev`: Hosted in the `dnt-dev` AWS account.
* `airflow-prod`: Hosted in the `dpe-prod` AWS account. Deployed using OpenTofu. Only accessible via [port forwarding](https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/#forward-a-local-port-to-a-port-on-the-pod).
  * Deployed from the `main` branch in this repository.

Please see [Connecting to AWS EKS](https://sagebionetworks.jira.com/wiki/spaces/DPE/pages/3389325317/Connecting+to+AWS+EKS+Kubernetes+K8s+cluster) on Confluence if you want to interface with the EKS/Kubernetes cluster. Otherwise, for local development you will likely only be interested in using AWS Secrets Manager as a backend for Airflow Secrets.
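
For reference, Airflow is typically pointed at AWS Secrets Manager through the Amazon provider's secrets backend. A minimal sketch using environment variables (the prefixes are assumptions and must match how the secrets are actually named in Secrets Manager):

```console
# Configure the AWS Secrets Manager secrets backend (sketch; prefixes are assumptions)
export AIRFLOW__SECRETS__BACKEND=airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables"}'
```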
There is a helper script in this repository for accessing this Airflow server.
# ORCA Recipes
This repository contains Airflow recipes (DAGs) for data processing and engineering at Sage Bionetworks. If you want to develop a workflow to process data, you've come to the right place.
## Example Workflows
- **Challenge Automation** - Automatically evaluate challenge submissions by fetching entries from Synapse and orchestrating Nextflow workflows via Seqera Platform.
- **Dataset Discovery** - Generate Croissant-format metadata for Synapse datasets and publish to public S3 for improved discoverability.
- **Analytics Pipelines** - Sync Synapse Portal data to Snowflake and generate platform usage reports tracking downloads, users, and storage.
- **Bioinformatics QC** - Launch and monitor data quality control workflows for genomics projects (GENIE, HTAN).
## Airflow Development
## Contributing
For detailed contribution guidelines, repository structure, and testing instructions, see [CONTRIBUTING.md](CONTRIBUTING.md).