|
1 | 1 |
|
| 2 | +""" |
| 3 | +This DAG interacts with a few different services to accomplish the following: |
2 | 4 |
|
| 5 | +- (Synapse, Authenticated) Query the Synapse Data Catalog to retrieve all datasets listed on the data catalog homepage |
| 6 | +- (Synapse, Authenticated) For each dataset in the data catalog, retrieve metadata including name, description, contributors, and license |
| 7 | +- (Synapse, Authenticated) Generate minimal Schema.org JSON-LD metadata files for each dataset following the minimal Croissant format |
| 8 | +- (S3, Authenticated) For each dataset, upload the minimal JSON-LD file to the `synapse-croissant-metadata-minimal` public S3 bucket in the `org-sagebase-dpe-prod` AWS account |
| 9 | +- (Synapse, Authenticated) For each dataset, query the Synapse table to check if a link to the S3 object already exists |
| 10 | +- (Synapse, Authenticated) Store or update the S3 object URL in the Synapse table for each dataset |
| 11 | +
|
| 12 | +
|
| 13 | +This DAG addresses the issue where Google has difficulty indexing Croissant JSON embedded in portal pages. |
| 14 | +
|
| 15 | +The workflow for making datasets discoverable to Google: |
| 16 | +1. This DAG generates minimal JSON-LD files and uploads them to a publicly accessible S3 bucket |
| 17 | +2. This DAG stores the S3 URLs in a Synapse table (syn72041138) |
| 18 | +3. When a Synapse dataset webpage is opened in the portal, the Synapse Web Client queries this table |
| 19 | +4. If a Croissant file link exists for that dataset, the Synapse Web Client injects it into the HTML of the page |
| 20 | +5. The Google crawler reads the JSON-LD from the HTML and indexes the dataset for Google Dataset Search |
| 21 | +
|
| 22 | +See synapse-dataset-to-croissant.py for additional notes on pushing to S3. |
| 23 | +
|
| 24 | +DAG Parameters: |
| 25 | +- See the DAG parameters defined on the `@dag`-decorated function |
| 26 | +""" |
3 | 27 | from airflow.decorators import dag, task |
4 | 28 | from datetime import datetime |
5 | 29 | from synapseclient.models import query |
|
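The docstring above describes generating a minimal JSON-LD file per dataset, uploading it to S3, and then storing or updating its URL in a Synapse table. The DAG's actual implementation is not shown in this diff, so the following is only a rough sketch of those three steps as pure functions. Every function name, the field mapping, the `syn123.json` object key, and the Synapse page URL pattern are assumptions made for illustration; no Synapse or AWS calls are made.

```python
import json
from typing import Any, Optional


def build_minimal_jsonld(
    dataset_id: str,
    name: str,
    description: str,
    contributors: list[str],
    license_name: str,
) -> dict[str, Any]:
    """Build a minimal Schema.org Dataset document (hypothetical field mapping)."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "license": license_name,
        # Assumption: contributors are represented as Person entries.
        "creator": [{"@type": "Person", "name": c} for c in contributors],
        # Assumption: URL pattern of the dataset page on Synapse.
        "url": f"https://www.synapse.org/Synapse:{dataset_id}",
    }


def s3_object_url(bucket: str, key: str) -> str:
    """Return a virtual-hosted-style URL for an object in a public bucket."""
    return f"https://{bucket}.s3.amazonaws.com/{key}"


def upsert_action(existing_url: Optional[str], new_url: str) -> str:
    """Decide what to do with the table row that links a dataset to its file."""
    if existing_url is None:
        return "insert"  # no row yet for this dataset
    if existing_url != new_url:
        return "update"  # row exists but points somewhere else
    return "skip"  # row already holds the current URL


doc = build_minimal_jsonld(
    "syn123", "Demo", "A demo dataset", ["Ada"], "CC-BY-4.0"
)
payload = json.dumps(doc, indent=2)  # body that would be uploaded to S3
url = s3_object_url("synapse-croissant-metadata-minimal", "syn123.json")
action = upsert_action(None, url)
```

Keeping these steps free of network calls makes them easy to unit test; in a real DAG each would sit inside an `@task` function that performs the authenticated Synapse query and the S3 upload.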