-
Notifications
You must be signed in to change notification settings - Fork 1
[DPE-1521]: Create DAG to ingest minimal JSONLD for data catalog items #137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Also relates to: Sage-Bionetworks-Workflows/eks-stack#75 |
| def _rest_call_replacement( | ||
| self, | ||
| method, | ||
| uri, | ||
| data, | ||
| endpoint, | ||
| headers, | ||
| retryPolicy, | ||
| requests_session, | ||
| **kwargs, | ||
| ): | ||
| """ | ||
| See original _rest_call method in the Synapse client for more details. | ||
| """ | ||
| self.logger.debug(f"Sending {method} request to {uri}") | ||
| uri, headers = self._build_uri_and_headers( | ||
| uri, endpoint=endpoint, headers=headers | ||
| ) | ||
|
|
||
| retryPolicy = self._build_retry_policy(retryPolicy) | ||
| requests_session = requests_session or self._requests_session | ||
|
|
||
| auth = kwargs.pop("auth", self.credentials) | ||
| requests_method_fn = getattr(requests_session, method) | ||
| response = with_retry( | ||
| lambda: requests_method_fn( | ||
| uri, | ||
| data=data, | ||
| headers=headers, | ||
| auth=auth, | ||
| timeout=70, | ||
| **kwargs, | ||
| ), | ||
| verbose=self.debug, | ||
| **retryPolicy, | ||
| ) | ||
| self._handle_synapse_http_error(response) | ||
| return response |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the TODO comment I had where you grabbed this code from:
# TODO: Remove this on the next > 4.7.0 release of the Synapse Python Client
# This is a temporary hack to include the changes from: https://github.com/Sage-Bionetworks/synapsePythonClient/pull/1188
# The hack is used here because the current SYNPY client does not have an HTTP timeout
# for requests to Synapse. As a result and due to the significant number of HTTP calls that
# occur during the DAG, the DAG can stall and never return due to the requests library.
# https://requests.readthedocs.io/en/latest/user/advanced/#timeouts
If you also wanted to take on upgrading the SYNPY version (I can help with any deployments) then we can remove this code.
It is a little more involved to update a dependency version:
- Sage-Bionetworks-Workflows/py-orca@9605595 should be upgraded
- py-orca should be tagged and released
- should be updated
orca-recipes/requirements-airflow.txt
Line 5 in b8417b6
py-orca[all] == 1.5.2 - https://github.com/Sage-Bionetworks-Workflows/orca-recipes/releases should be tagged and released
- https://github.com/Sage-Bionetworks-Workflows/eks-stack/blob/983d35d8d8674da040ad017ce353acac46722542/modules/apache-airflow/templates/values.yaml#L71 needs to be incremented to the new image version and deployed in spacelift
- The application needs to be sync'd in ArgoCD
BryanFauble
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few changes I noted. Otherwise the logic is looking great. Please re-request after a few small changes.
Co-authored-by: BryanFauble <17128019+BryanFauble@users.noreply.github.com>
… from S3 and Synapse table
BryanFauble
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read over the new round of changes and added a small doc update, but the logic all looks right to me.
|



Problem:
Google is having difficulty indexing the Croissant JSON being injected in portal pages. To improve discoverability, we need to inject minimal JSON-LD metadata into Synapse pages for datasets listed in the data catalog
Related to: https://sagebionetworks.jira.com/browse/DPE-1521
See design doc: https://sagebionetworks.jira.com/wiki/spaces/DPE/pages/4456349721/DPE+-+1521+Create+minimal+JSONLD+for+data+catalog+items
Solution:
This pull request introduces a new Airflow DAG,
synapse_minimal_jsonld_dag.py, which automates the creation and storage of minimal Croissant JSON-LD metadata files for Synapse datasets. The DAG queries a Synapse data catalog, generates JSON-LD files for each dataset, uploads them to S3, and records their S3 links in a Synapse table. The implementation includes robust telemetry with OpenTelemetry for tracing and logging, and customizes Synapse client behavior for reliability.Test
Successful ingestion: syn72041138