[HCA DCP] Add HCA projects to Google Datasets catalog #4806

@NoopDog

Description

Add HCA DCP projects to the Google Dataset Search catalog so they're discoverable from Google. This is done by embedding schema.org Dataset JSON-LD on project detail pages — Google's crawler picks it up.

Companion to galaxyproject/brc-analytics#1264 (same approach, different catalog).

Reference implementation

NCPI Dataset Catalog has already shipped this for studies. Mirror their pattern.

Google Dataset required + recommended fields

Per Google's Dataset structured data guidelines:

Required

  • name — descriptive title
  • description — 50–5000 characters

Recommended

  • identifier, url, sameAs
  • creator, funder, license
  • distribution (with contentUrl, encodingFormat)
  • keywords, variableMeasured, measurementTechnique
  • spatialCoverage, temporalCoverage
  • includedInDataCatalog, isAccessibleForFree, version, citation
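The required-field rules above can be expressed as a small check. This is a minimal sketch, not the NCPI or HCA implementation: the `SchemaDatasetSketch` type and `checkRequiredFields` function are hypothetical names, and only a handful of the recommended fields are modelled.

```typescript
// Hypothetical minimal shape; the real SchemaDataset type may carry more fields.
interface SchemaDatasetSketch {
  "@context": "https://schema.org";
  "@type": "Dataset";
  name: string;
  description: string;
  identifier?: string[];
  url?: string;
  sameAs?: string[];
  keywords?: string[];
  isAccessibleForFree?: boolean;
}

// Description length bounds taken from the required-field list above.
const DESCRIPTION_MIN = 50;
const DESCRIPTION_MAX = 5000;

// Returns a list of problems; an empty list means the required fields look OK.
function checkRequiredFields(dataset: SchemaDatasetSketch): string[] {
  const problems: string[] = [];
  if (dataset.name.trim().length === 0) {
    problems.push("name is empty");
  }
  const length = dataset.description.length;
  if (length < DESCRIPTION_MIN || length > DESCRIPTION_MAX) {
    problems.push(
      `description length ${length} is outside ${DESCRIPTION_MIN}-${DESCRIPTION_MAX}`
    );
  }
  return problems;
}
```

A check like this could run in unit tests against representative projects before the output is sent to Google's validators.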

Initial mapping — HCA project (ProjectResponse) → Dataset

Source entity at app/apis/azul/hca-dcp/common/entities.ts (ProjectResponse).

| schema.org field | Source / value |
| --- | --- |
| @context | "https://schema.org" |
| @type | "Dataset" |
| name | projectTitle (fall back to projectShortname) |
| description | projectDescription: strip HTML, truncate to 5000 chars (min 50) |
| identifier | [projectId, ...accessions] (HCA project UUID plus mapped INSDC/GEO/ArrayExpress accessions) |
| url | ${browserURL}/projects/${projectId} |
| sameAs | Accession URLs (e.g. GEO, ArrayExpress, INSDC) derived from accessions |
| includedInDataCatalog | { "@type": "DataCatalog", name: "Human Cell Atlas Data Coordination Platform", url: browserURL } |
| isAccessibleForFree | true |
| keywords | Union of genusSpecies, organ, organPart, disease, sampleEntityType, libraryConstructionApproach, etc. |
| creator | Map contributors to Person/Organization (name, affiliation, role) |
| funder | Map funding sources from projectResponse.funders if present |
| citation | Map publications to ScholarlyArticle (title, DOI/URL) |
| distribution | DataDownload[] from matrix files / contributed analyses with contentUrl + encodingFormat |
| variableMeasured | Optional PropertyValue[] derived from projectSummary (cell counts, donor counts, file counts, library construction approach) |
| license | TBD; confirm with team (HCA data use terms) |

The open questions (funder mapping, license, and which file distributions to expose) should be resolved before merge.
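A rough sketch of how part of this mapping might look in code. The `ProjectLike` input is a simplified, hypothetical stand-in (the real `ProjectResponse` is shaped differently), `buildProjectJsonLdSketch` and `stripHtml` are illustrative names, and only the fields without open questions are covered. How to handle descriptions shorter than 50 characters is left undecided here.

```typescript
// Hypothetical, simplified input; the real ProjectResponse in
// app/apis/azul/hca-dcp/common/entities.ts is shaped differently.
interface ProjectLike {
  projectId: string;
  projectTitle: string | null;
  projectShortname: string;
  projectDescription: string | null;
  accessions: string[];
  keywords: (string | null)[];
}

type JsonLd = Record<string, unknown>;

// Naive tag stripper for illustration; real code may prefer an HTML parser.
function stripHtml(html: string): string {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

function buildProjectJsonLdSketch(project: ProjectLike, browserURL: string): JsonLd {
  // Drop null/empty keyword values, then de-duplicate while keeping order.
  const keywords = project.keywords
    .filter((k): k is string => !!k)
    .filter((k, i, all) => all.indexOf(k) === i);

  const jsonLd: JsonLd = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    name: project.projectTitle || project.projectShortname,
    // Truncated to the 5000-character maximum; the 50-character minimum is not enforced here.
    description: stripHtml(project.projectDescription ?? "").slice(0, 5000),
    identifier: [project.projectId, ...project.accessions],
    url: `${browserURL}/projects/${project.projectId}`,
    includedInDataCatalog: {
      "@type": "DataCatalog",
      name: "Human Cell Atlas Data Coordination Platform",
      url: browserURL,
    },
    isAccessibleForFree: true,
  };

  // Conditional fields are omitted rather than emitted as null or empty.
  if (keywords.length > 0) {
    jsonLd.keywords = keywords;
  }
  return jsonLd;
}
```

Building the object incrementally keeps absent source values out of the output entirely, which is what the "conditional fields omitted when source is null" test in the implementation steps would assert.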

Implementation steps

  1. Add app/utils/schemaOrg.ts (or HCA-namespaced equivalent) with SchemaDataset types and buildProjectJsonLd(project, browserURL).
  2. Add a ProjectJsonLd component that renders the JSON-LD via next/head with the same HTML-escape helper as NCPI.
  3. Mount the component on the project detail page (pages/projects/[entityId]).
  4. Unit-test the builder: required fields present, description truncation, conditional fields omitted when source is null.
  5. Validate output against Google's Rich Results Test and Schema Markup Validator for representative projects (single-cell, spatial, multi-organ).
  6. Once shipped, request indexing via Google Search Console and confirm project pages start appearing in Google Dataset Search.
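For step 2, the escape helper could look roughly like the following. This is an assumed helper written for illustration, not a copy of the NCPI one, and `serializeJsonLdSketch` is a hypothetical name. It escapes `<` so that a literal `</script>` inside any field cannot close the script tag early.

```typescript
// Serialize a JSON-LD object for embedding in <script type="application/ld+json">.
// In JSON.stringify output, "<" can only occur inside string values, so replacing
// it with the JSON escape \u003c leaves the parsed data unchanged.
function serializeJsonLdSketch(data: object): string {
  return JSON.stringify(data).replace(/</g, "\\u003c");
}
```

In a React component the result would typically be passed to a script element through `dangerouslySetInnerHTML`, for example `<script type="application/ld+json" dangerouslySetInnerHTML={{ __html: serializeJsonLdSketch(jsonLd) }} />` inside `next/head`.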

Out of scope (follow-ups)

  • JSON-LD on samples/files detail pages.
  • Sitemap entries for project detail pages if not already complete.
