This repository contains Airflow recipes (DAGs) for data processing and engineering at Sage Bionetworks.
- Challenge Automation: Dynamic DAG factory for Synapse-hosted challenges using configurable YAML profiles
- Data Analytics: DAGs for Synapse project analytics, trending data, and metrics collection
- Dataset Management: Automated dataset creation, annotation, and metadata processing
- Integration Workflows: Data pipelines connecting Synapse, Snowflake, and other platforms
A complete Airflow deployment is made up of multiple services running in parallel, so the steps involved in setting up a dev environment are more complex than you may be used to. There are two steps involved in setting up Airflow for development:
- (Highly recommended) Develop within the provided dev container. A dev container is a containerized environment (e.g., a Docker container) that standardizes tools, libraries, and configs for consistent development across machines. This provides us with a consistent environment for the next step.
- Run docker compose to deploy the full suite of containerized services.
There are multiple ways to set up and interface with a dev container, depending on whether you want an IDE-agnostic approach, a VS Code workflow with the Dev Containers extension, or a cloud option like GitHub Codespaces. The cloud option is the most straightforward and saves us the hassle of configuring Airflow secrets, although because the infrastructure runs in the cloud, there is a limit on how long we can develop before we need to pay for the service.
- Note: The environment setup for the Dev Container is defined in `Dockerfile`. How we deploy the container locally is defined in `devcontainer.json`.
- Create a branch for your changes
- From the main repo page, click on `<> Code`
- Under `Codespaces`, click the 3 ellipses (`...`) and `New with options...`
- Choose your branch and a 4-core machine (2-core is sufficient for basic edits without `docker compose`)
Visual Studio Code provides an extension so that your IDE terminal and other development tools are run within a dev container. Follow the instructions here to set up the Dev Containers extension. Do not create a new dev container, but rather use the existing configuration by opening the Command Palette (CMD+Shift+p by default on Mac) → "Dev Containers: Reopen in Container."
With this option, you won't be able to use the pre-configured Airflow Secrets as you would in Codespaces. Alternatively, you can connect to Codespaces as a remote environment from within VS Code.
Ensure that your Docker installation is up to date (we use Docker Compose V2). It's recommended that you deploy from within the included dev container (previous section).
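To confirm that the Compose V2 plugin is available before deploying, you can check the version from your shell:

```bash
# Should print a version string like "Docker Compose version v2.x.x"
docker compose version
```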
We pass environment variables to our build via the .env file. We use AWS as our Airflow Secrets backend, although if you are deploying within Codespaces, there's no need to include AWS credentials in the .env file since a default IAM user has already been configured in this repository's secrets.
```bash
# Duplicate example `.env` file
# Add AWS credentials if you are *not* using Codespaces.
cp .env.example .env
```
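If you are deploying outside of Codespaces, add your AWS credentials to the new `.env` file. The exact variable names are defined in `.env.example`; the entries below are a hypothetical sketch, not the file's actual contents:

```bash
# Hypothetical entries -- check .env.example for the variable names the build actually expects
AWS_ACCESS_KEY_ID=<your-access-key-id>
AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
AWS_DEFAULT_REGION=us-east-1
```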
To build and deploy our Airflow services to background containers:

```bash
docker compose up --build --detach
```

Congrats! You have completed setup of the dev environment.
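When you are done developing, the services can be stopped and removed from the repository root; this is a standard Docker Compose command rather than anything specific to this repository:

```bash
# Stop and remove the containers started by `docker compose up`
docker compose down
```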
Airflow is made up of multiple components or services working together. The webserver exposes a browser-accessible port, but in a development environment we often want to interface with Airflow through its CLI.
We can see which services are currently running:
```bash
# list running docker containers
docker compose ps

# see stats
docker compose stats
```

We provide a convenience script, `airflow.sh`, to invoke the Airflow CLI within the same container environment as the webserver/scheduler:
```bash
# Start a shell inside one of the containers
./airflow.sh bash

# List our DAGs
./airflow.sh dags list
```
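The wrapper passes its arguments through to the Airflow CLI, so other subcommands should work the same way. As a sketch (the `<dag_id>` is a placeholder; substitute one of the IDs from `dags list`):

```bash
# Trigger a manual run of a DAG
./airflow.sh dags trigger <dag_id>

# Show any import errors preventing DAGs from loading
./airflow.sh dags list-import-errors
```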
It may be helpful at this point to verify that Airflow has access to the secrets backend.

```bash
# Start a Python interpreter inside the container
./airflow.sh python
```

```python
import boto3

# Connect to AWS Secrets Manager in the region our deployment uses
secretsmanager = boto3.client("secretsmanager", region_name="us-east-1")

# Secret name prefixes that Airflow reads; see secrets.backend_kwargs in airflow.cfg
secret_prefixes = ["airflow/connections/", "airflow/variables/"]

# Print the secrets whose names match the Airflow prefixes
all_secrets = secretsmanager.list_secrets()["SecretList"]
for secret in all_secrets:
    if any(secret["Name"].startswith(p) for p in secret_prefixes):
        print(secret["Name"])
```

We can see which ports our Airflow services expose under `PORTS`:

```bash
docker compose ps
```
Airflow's webserver listens on port 8080 by default. You can connect in your browser at http://localhost:8080. Both the username and password are "airflow".
If you encounter an nginx "bad gateway" error when navigating to the forwarded port, wait and refresh a couple of times; Airflow takes a few minutes to become available.
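If you prefer to check readiness from the terminal rather than refreshing the browser, Airflow 2's webserver exposes a health endpoint (assuming your deployment leaves it enabled and the port is forwarded to localhost):

```bash
# Returns JSON describing metadatabase and scheduler health once the webserver is up
curl -s http://localhost:8080/health
```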
This repository also contains recipes for specific projects that either don't need to be, or are not yet ready to be, deployed to Airflow. These recipes can be run locally from the `local/` directory. Each sub-directory contains recipes specific to a project, and those project folders have their own documentation for running the recipes.
For local development outside of Docker, we provide a convenience script to set up a Python virtual environment:
```bash
bash local/dev_setup.sh
source venv/bin/activate
```

For detailed contribution guidelines, including DAG development best practices and how to contribute challenge DAGs, see CONTRIBUTING.md.
- `dags/` - Production Airflow DAGs and challenge configurations
- `config/` - Airflow configuration files
- `local/` - Project-specific scripts and utilities
- `requirements-*.txt` - Python dependencies for different environments
To release a new version of the orca-recipes container to GHCR:
- Create a new GitHub Release in the repository
  - Go to the repository's "Releases" page
  - Click "Create a new release"
  - Create a new tag with the version number (e.g., `1.0.0`)
  - Add release notes
  - Click "Publish release"
The GitHub Actions workflow will automatically:
- Build the Docker image
- Tag it with the release version
- Push it to GHCR
The `latest` tag will automatically be updated to point to the latest release.
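As a usage sketch, the published image can then be pulled from GHCR; the image path below is an assumption based on the repository and organization names, so check the package page on GHCR for the exact path:

```bash
# Pull a specific release (replace the tag with the version you need);
# the ghcr.io path is assumed from the repo name, not confirmed by this README
docker pull ghcr.io/sage-bionetworks/orca-recipes:1.0.0

# Or pull whatever the `latest` tag currently points to
docker pull ghcr.io/sage-bionetworks/orca-recipes:latest
```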