Skip to content

Conversation

@linglp
Copy link
Contributor

@linglp linglp commented Dec 26, 2025

Problem:

Google is having difficulty indexing the Croissant JSON being injected in portal pages. To improve discoverability, we need to inject minimal JSON-LD metadata into Synapse pages for datasets listed in the data catalog

Related to: https://sagebionetworks.jira.com/browse/DPE-1521
See design doc: https://sagebionetworks.jira.com/wiki/spaces/DPE/pages/4456349721/DPE+-+1521+Create+minimal+JSONLD+for+data+catalog+items

Solution:

This pull request introduces a new Airflow DAG, synapse_minimal_jsonld_dag.py, which automates the creation and storage of minimal Croissant JSON-LD metadata files for Synapse datasets. The DAG queries a Synapse data catalog, generates JSON-LD files for each dataset, uploads them to S3, and records their S3 links in a Synapse table. The implementation includes robust telemetry with OpenTelemetry for tracing and logging, and customizes Synapse client behavior for reliability.

Test

Successful ingestion: syn72041138

@linglp linglp requested a review from a team as a code owner December 26, 2025 17:16
@linglp linglp changed the title [] [DPE-1521]: Create DAG to ingest minimal JSONLD for data catalog items Dec 26, 2025
@linglp
Copy link
Contributor Author

linglp commented Dec 26, 2025

Also relates to: Sage-Bionetworks-Workflows/eks-stack#75

Comment on lines 172 to 209
def _rest_call_replacement(
self,
method,
uri,
data,
endpoint,
headers,
retryPolicy,
requests_session,
**kwargs,
):
"""
See original _rest_call method in the Synapse client for more details.
"""
self.logger.debug(f"Sending {method} request to {uri}")
uri, headers = self._build_uri_and_headers(
uri, endpoint=endpoint, headers=headers
)

retryPolicy = self._build_retry_policy(retryPolicy)
requests_session = requests_session or self._requests_session

auth = kwargs.pop("auth", self.credentials)
requests_method_fn = getattr(requests_session, method)
response = with_retry(
lambda: requests_method_fn(
uri,
data=data,
headers=headers,
auth=auth,
timeout=70,
**kwargs,
),
verbose=self.debug,
**retryPolicy,
)
self._handle_synapse_http_error(response)
return response
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the TODO comment I had where you grabbed this code from:

# TODO: Remove this on the next > 4.7.0 release of the Synapse Python Client
# This is a temporary hack to include the changes from: https://github.com/Sage-Bionetworks/synapsePythonClient/pull/1188
# The hack is used here because the current SYNPY client does not have an HTTP timeout
# for requests to Synapse. As a result and due to the significant number of HTTP calls that
# occur during the DAG, the DAG can stall and never return due to the requests library.
# https://requests.readthedocs.io/en/latest/user/advanced/#timeouts

If you also wanted to take on upgrading the SYNPY version (I can help with any deployments) then we can remove this code.

It is a little more involved to update a dependency version:

  1. Sage-Bionetworks-Workflows/py-orca@9605595 should be upgraded
  2. py-orca should be tagged and released
  3. py-orca[all] == 1.5.2
    should be updated
  4. https://github.com/Sage-Bionetworks-Workflows/orca-recipes/releases should be tagged and released
  5. https://github.com/Sage-Bionetworks-Workflows/eks-stack/blob/983d35d8d8674da040ad017ce353acac46722542/modules/apache-airflow/templates/values.yaml#L71 needs to be incremented to the new image version and deployed in spacelift
  6. The application needs to be sync'd in ArgoCD

Copy link
Contributor

@BryanFauble BryanFauble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few changes I noted. Otherwise the logic is looking great. Please re-request after a few small changes.

Copy link
Contributor

@BryanFauble BryanFauble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read over the new round of changes and added a small doc update, but the logic all looks right to me.

@BryanFauble BryanFauble merged commit 8814576 into main Jan 2, 2026
2 checks passed
@sonarqubecloud
Copy link

sonarqubecloud bot commented Jan 2, 2026

@BryanFauble BryanFauble deleted the dpe-1521 branch January 2, 2026 17:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants