Commit bfffe5f

add description
1 parent 8de96b0 commit bfffe5f

1 file changed: +24 -0 lines changed

dags/synapse_minimal_jsonld_dag.py

Lines changed: 24 additions & 0 deletions
@@ -1,5 +1,29 @@
 
+"""
+This DAG interacts with a few different services to accomplish the following:
 
+- (Synapse, Authenticated) Query the Synapse Data Catalog to retrieve all datasets listed on the data catalog homepage
+- (Synapse, Authenticated) For each dataset in the data catalog, retrieve metadata including name, description, contributors, and license
+- (Synapse, Authenticated) Generate minimal Schema.org JSON-LD metadata files for each dataset following the minimal Croissant format
+- (S3, Authenticated) For each dataset, upload the minimal JSON-LD file to the `synapse-croissant-metadata-minimal` public S3 bucket in the `org-sagebase-dpe-prod` AWS account
+- (Synapse, Authenticated) For each dataset, query the Synapse table to check if a link to the S3 object already exists
+- (Synapse, Authenticated) Store or update the S3 object URL in the Synapse table for each dataset
+
+
+This DAG addresses the issue where Google has difficulty indexing Croissant JSON embedded in portal pages.
+
+The workflow for making datasets discoverable to Google:
+1. This DAG generates minimal JSON-LD files and uploads them to a publicly accessible S3 bucket
+2. This DAG stores the S3 URLs in a Synapse table (syn72041138)
+3. When a Synapse dataset webpage is opened in the portal, the Synapse Web Client queries this table
+4. If a croissant file link exists for that dataset, the Synapse Web Client injects it into the HTML of the page
+5. Google crawler reads the JSON-LD from the HTML and indexes the dataset for Google Datasets search
+
+See synapse-dataset-to-croissant.py for additional note on the pushing to S3.
+
+DAG Parameters:
+- Review the DAG Parameters under the `@dag` decorated function
+"""
 from airflow.decorators import dag, task
 from datetime import datetime
 from synapseclient.models import query
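
The metadata-generation and URL steps described in the docstring could look roughly like the sketch below. It is not taken from the DAG itself: the helper names `build_minimal_jsonld` and `s3_object_url`, the choice of Schema.org properties (`creator`, `license`, `url`), the `<dataset id>.json` key layout, and the placeholder ID `syn00000000` are all assumptions for illustration. The exact fields required by the minimal Croissant format are not shown in this commit, so the document here is a plain Schema.org `Dataset`.

```python
from __future__ import annotations

import json
from typing import Any


def build_minimal_jsonld(
    name: str,
    description: str,
    contributors: list[str],
    license_url: str,
    dataset_url: str,
) -> dict[str, Any]:
    """Build a minimal Schema.org Dataset document as a plain dict.

    Hypothetical helper: the property set is an assumption, not the
    field list used by the DAG in this commit.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": dataset_url,
        "license": license_url,
        # Each contributor is represented as a Person node (assumption).
        "creator": [{"@type": "Person", "name": c} for c in contributors],
    }


def s3_object_url(bucket: str, key: str) -> str:
    """Return a virtual-hosted-style URL for an object in a public bucket.

    Hypothetical helper: assumes the global S3 endpoint is acceptable.
    """
    return f"https://{bucket}.s3.amazonaws.com/{key}"


# Usage with placeholder values (the dataset ID and metadata are made up).
dataset_id = "syn00000000"
doc = build_minimal_jsonld(
    name="Example dataset",
    description="Placeholder description for illustration only.",
    contributors=["A. Researcher"],
    license_url="https://creativecommons.org/licenses/by/4.0/",
    dataset_url=f"https://www.synapse.org/Synapse:{dataset_id}",
)
payload = json.dumps(doc, indent=2)  # the string that would be uploaded
link = s3_object_url("synapse-croissant-metadata-minimal", f"{dataset_id}.json")
```

Keeping the builder free of Synapse and S3 calls means it can be unit tested without credentials. The authenticated upload and the table update would then wrap `payload` and `link` in separate tasks.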
