Academic Observatory Workflows provides Apache Airflow workflows for fetching, processing and analysing data about academic institutions.
A telescope is a type of workflow used to ingest data from different data sources, and to run workflows that process and output data to other places. Workflows are built on top of Apache Airflow's DAGs.
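At a high level, a telescope follows a fetch, transform and load pattern. The sketch below is illustrative only; the function names and data are hypothetical and do not reflect the actual Telescope API or DAG task structure:

```python
# Illustrative sketch of the telescope pattern: fetch raw data from a source,
# transform it, then load it to a destination. Real telescopes are Airflow DAGs
# with many more steps (releases, Cloud Storage uploads, BigQuery loads).

def fetch(source_url: str) -> list[dict]:
    """Download raw records from an external data source."""
    # A real telescope downloads files to the Google Cloud download bucket.
    return [{"doi": "10.1234/example", "citations": "42"}]

def transform(records: list[dict]) -> list[dict]:
    """Clean raw records into the shape expected by downstream tables."""
    return [{"doi": r["doi"], "citations": int(r["citations"])} for r in records]

def load(records: list[dict]) -> int:
    """Write transformed records to their destination; returns rows written."""
    return len(records)

rows = load(transform(fetch("https://api.example.org/works")))
print(rows)  # 1
```

In the real workflows, each stage is a separate Airflow task so that failures can be retried independently.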
The workflows include: Crossref Events, Crossref Fundref, Crossref Metadata, Geonames, OpenAlex, Open Citations, ORCID, PubMed, ROR, Scopus, Unpaywall and Web of Science.
For detailed documentation about the Academic Observatory see the Read the Docs website https://academic-observatory-workflows.readthedocs.io
Install using pip. From the root directory:
```bash
pip install -e ./academic-observatory-workflows[tests] --constraint https://raw.githubusercontent.com/apache/airflow/constraints-2.11.2/constraints-3.10.txt
```

These instructions show how to deploy the workflows to Google Cloud and Astronomer.io.
You should have set up the following resources already:
- A Google Cloud Project.
- A Google Cloud Shell instance, which pre-installs gsutil, gcloud and kubectl.
- A GKE Autopilot Cluster.
- An Astronomer.io Airflow deployment, using Google Cloud.
- Installed the Astronomer.io CLI: https://www.astronomer.io/docs/astro/cli/install-cli
- Installed yq: https://github.com/mikefarah/yq (don't use sudo apt install yq, it installs the wrong tool)
The GKE Autopilot Cluster, the Astronomer.io deployment and the Google Cloud buckets (which you create with the script below) should all be in the same region. The Cloud Storage buckets should be single-region, not dual-region or multi-region, otherwise you will pay network costs for replication.
In a Google Cloud Shell, run the following script to set up your Google Cloud Project:
```bash
./bin/setup-gcloud-project.sh gcp-project-id gke-cluster-name gke-namespace gcp-download-bucket-name gcp-transform-bucket-name
```

The script outputs information that you need for subsequent steps:
- AO Astro Service Account: required to set up the 'Customer Managed Identity' in Astronomer.io.
- Kube Config Path: required to configure the gke_cluster Airflow Connection.
If you are using additional buckets, then you can enable GKE and/or Astro to access them with the following command:
```bash
./bin/setup-bucket-permissions.sh bucket-name service-account-email
```

The AO Astro Service Account needs to be attached to the Astronomer.io deployment as a "Customer Managed Identity".
Please follow these steps to set it up: https://www.astronomer.io/docs/astro/authorize-deployments-to-your-cloud/?tab=gcp#setup
Step 6 is not necessary.
The Airflow workflows are configured with a config file that is stored as an Airflow Variable. Copy
config-example.yaml to config-prod.yaml and customise the settings.
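For orientation, a config file has roughly the following shape. The keys and values below are hypothetical; treat config-example.yaml in the repository as the authoritative template:

```yaml
# Hypothetical example only: copy config-example.yaml for the real schema.
cloud_workspaces:
  - workspace:
      project_id: my-gcp-project-id
      download_bucket: my-download-bucket
      transform_bucket: my-transform-bucket
      data_location: us
```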
Then deploy your config with the following command:
```bash
./bin/deploy-config astro-deployment-id gcp-project-id config-prod.yaml
```

You will also need to create the following Airflow Connections, depending on which workflows you are using:
| Connection ID | Type | Login | Password | Host | Namespace | Kube config (JSON format) | Notes |
|---|---|---|---|---|---|---|---|
| aws_openalex | aws | required | required | | | | OpenAlex Telescope |
| aws_orcid | aws | required | required | | | | ORCID Telescope |
| crossref_metadata | http | | required | | | | Crossref Metadata Telescope |
| oa_dashboard_github_token | http | | required | | | | OA Dashboard Workflow |
| oa_dashboard_zenodo_token | http | | required | | | | OA Dashboard Workflow |
| scopus_key_1 | http | | required | | | | Scopus Telescope |
| unpaywall | http | | required | | | | Unpaywall Telescope |
| slack | slackwebhook | | required | required | | | Enables failure notifications to be sent to Slack |
| gke_cluster | kubernetes | | | | required | required | Enables communication with the GKE Autopilot Cluster. Required for Crossref Metadata, OpenAlex, PubMed, ORCID and Unpaywall. |
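Connections can be created in the Airflow UI, and Airflow also accepts connections supplied as `AIRFLOW_CONN_<CONN_ID>` environment variables in JSON form. The sketch below builds such a value for the gke_cluster connection; the namespace and kube config path are hypothetical placeholders, and the extra field names assume the cncf.kubernetes provider's conventions:

```python
import json

# Hypothetical values: substitute your GKE namespace and the Kube Config Path
# printed by setup-gcloud-project.sh.
conn = {
    "conn_type": "kubernetes",
    "extra": {
        "namespace": "my-gke-namespace",
        "kube_config_path": "/path/to/kube/config",
    },
}

# The JSON string below would be exported as AIRFLOW_CONN_GKE_CLUSTER.
env_value = json.dumps(conn)
print(env_value)
```

Check the Airflow and Kubernetes provider documentation for the exact extra fields supported by your Airflow version before relying on this shape.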
Kubernetes Pods can't access Airflow Connections, so workflows that need secrets inside a Pod must also have those secrets stored as Kubernetes Secrets. You can create them with the commands below.
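Inside a Pod, such a secret is typically consumed as an environment variable or a mounted file rather than an Airflow Connection. A minimal sketch, assuming the secret is exposed via a hypothetical API_KEY environment variable:

```python
import os

def get_api_key(env_var: str = "API_KEY") -> str:
    """Read an API key injected into the Pod from a Kubernetes Secret."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"{env_var} is not set; was the Kubernetes Secret attached?")
    return api_key

# Simulate the injected secret for demonstration purposes.
os.environ["API_KEY"] = "value"
print(get_api_key())  # value
```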
Create the Unpaywall API key secret:

```bash
kubectl create secret generic unpaywall \
  --from-literal=api-key=value \
  --namespace my-gke-namespace
```

Create the Crossref Metadata API secret:
```bash
kubectl create secret generic crossref-metadata \
  --from-literal=api-key=value \
  --namespace my-gke-namespace
```

To deploy the project to Astronomer.io:
```bash
./bin/deploy.sh gcp-project-id astro-deployment-id
```
