Add HCA DCP projects to the Google Dataset Search catalog so they're discoverable from Google. This is done by embedding schema.org Dataset JSON-LD on project detail pages — Google's crawler picks it up.
Companion to galaxyproject/brc-analytics#1264 (same approach, different catalog).
Reference implementation
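NCPI Dataset Catalog has already shipped this for studies. Mirror their pattern:
- app/utils/schemaOrg.ts: SchemaDataset interface and buildStudyJsonLd() factory.
- app/components/Detail/components/StudyJsonLd/studyJsonLd.tsx: wraps the JSON-LD in a <script type="application/ld+json"> inside next/head, with HTML escaping to prevent script injection.
- pages/[entityListType]/[...params].tsx: mounted on the detail route only.
- app/utils/schemaOrg.test.ts: covers required fields, truncation, and conditional fields.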
Google Dataset required + recommended fields
Per Google's Dataset structured data guidelines:
Required
- name: descriptive title
- description: 50–5000 characters
Recommended
- identifier, url, sameAs
- creator, funder, license
- distribution (with contentUrl, encodingFormat)
- keywords, variableMeasured, measurementTechnique
- spatialCoverage, temporalCoverage
- includedInDataCatalog, isAccessibleForFree, version, citation
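As a rough illustration of the field list above, the sketch below types a subset of the fields and checks the two required ones. The SchemaDataset shape here is a hypothetical minimal version (it is not NCPI's actual interface), and checkRequiredFields is an illustrative helper name, not something that exists in either codebase.

```typescript
// Hypothetical minimal shape covering the required fields and a few recommended ones.
interface SchemaDataset {
  "@context": "https://schema.org";
  "@type": "Dataset";
  name: string;
  description: string;
  identifier?: string[];
  url?: string;
  sameAs?: string[];
  keywords?: string[];
  isAccessibleForFree?: boolean;
  license?: string;
}

// Description length bounds from Google's Dataset guidelines.
const DESCRIPTION_MIN = 50;
const DESCRIPTION_MAX = 5000;

// Returns a list of problems; an empty list means the required fields look valid.
function checkRequiredFields(dataset: SchemaDataset): string[] {
  const problems: string[] = [];
  if (dataset.name.trim().length === 0) {
    problems.push("name is empty");
  }
  const length = dataset.description.length;
  if (length < DESCRIPTION_MIN || length > DESCRIPTION_MAX) {
    problems.push(
      `description length ${length} is outside ${DESCRIPTION_MIN}-${DESCRIPTION_MAX}`
    );
  }
  return problems;
}
```

A check like this could back the "required fields present" unit test; the recommended fields stay optional so they can be omitted when the source has no value.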
Initial mapping — HCA project (ProjectResponse) → Dataset
Source entity at app/apis/azul/hca-dcp/common/entities.ts (ProjectResponse).
| schema.org field | Source / value |
| --- | --- |
| @context | "https://schema.org" |
| @type | "Dataset" |
| name | projectTitle (fall back to projectShortname) |
| description | projectDescription: strip HTML, truncate to 5000 chars (min 50) |
| identifier | [projectId, ...accessions] (HCA project UUID plus mapped INSDC/GEO/ArrayExpress accessions) |
| url | ${browserURL}/projects/${projectId} |
| sameAs | Accession URLs (e.g. GEO, ArrayExpress, INSDC) derived from accessions |
| includedInDataCatalog | { "@type": "DataCatalog", name: "Human Cell Atlas Data Coordination Platform", url: browserURL } |
| isAccessibleForFree | true |
| keywords | Union of genusSpecies, organ, organPart, disease, sampleEntityType, libraryConstructionApproach, etc. |
| creator | Map contributors → Person/Organization (name, affiliation, role) |
| funder | Map funding sources from projectResponse.funders if present |
| citation | Map publications → ScholarlyArticle (title, DOI/URL) |
| distribution | DataDownload[] from matrix files / contributed analyses with contentUrl + encodingFormat |
| variableMeasured | Optional PropertyValue[] derived from projectSummary (cell counts, donor counts, file counts, library construction approach) |
| license | TBD: confirm with team (HCA data use terms) |
Open questions on funder, license, and which file distributions to expose should be resolved before merge.
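A minimal sketch of how part of the mapping might look in code, covering only the rows that are not open questions. Everything about the input shape is an assumption: ProjectLike is a simplified, hypothetical stand-in for ProjectResponse, the pre-flattened keywords array stands in for the union of facet values, and the accession namespaces and URL patterns would need to be confirmed against real data. Returning undefined when the description cannot reach 50 characters is one possible reading of "min 50", not a settled decision, and the regex-based tag stripping is deliberately naive.

```typescript
// Hypothetical, simplified stand-in for ProjectResponse; the real entity may differ.
interface ProjectLike {
  projectId: string;
  projectTitle?: string | null;
  projectShortname: string;
  projectDescription?: string | null;
  accessions?: { namespace: string; accession: string }[];
  keywords?: (string | null)[];
}

// Assumed accession namespaces and URL patterns; confirm before relying on them.
const ACCESSION_URL: Record<string, (id: string) => string> = {
  array_express: (id) => `https://www.ebi.ac.uk/biostudies/arrayexpress/studies/${id}`,
  geo_series: (id) => `https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=${id}`,
  insdc_project: (id) => `https://www.ebi.ac.uk/ena/browser/view/${id}`,
};

// Strip tags (naively), collapse whitespace, and enforce the 50-5000 character bounds.
function normalizeDescription(raw: string | null | undefined): string | undefined {
  const text = (raw ?? "").replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
  if (text.length < 50) return undefined; // too short to satisfy the minimum
  return text.length > 5000 ? text.slice(0, 5000) : text;
}

function buildProjectJsonLd(
  project: ProjectLike,
  browserURL: string
): Record<string, unknown> | undefined {
  const name = project.projectTitle?.trim() || project.projectShortname;
  const description = normalizeDescription(project.projectDescription);
  // Assumption: emit no JSON-LD at all when a required field cannot be produced.
  if (!name || !description) return undefined;

  const accessions = project.accessions ?? [];
  const sameAs: string[] = [];
  for (const { accession, namespace } of accessions) {
    const toUrl = ACCESSION_URL[namespace];
    if (toUrl) sameAs.push(toUrl(accession)); // skip namespaces without a known URL
  }

  // De-duplicate keywords and drop null or empty values.
  const keywords: string[] = [];
  for (const keyword of project.keywords ?? []) {
    if (keyword && keywords.indexOf(keyword) === -1) keywords.push(keyword);
  }

  return {
    "@context": "https://schema.org",
    "@type": "Dataset",
    name,
    description,
    identifier: [project.projectId, ...accessions.map((a) => a.accession)],
    url: `${browserURL}/projects/${project.projectId}`,
    includedInDataCatalog: {
      "@type": "DataCatalog",
      name: "Human Cell Atlas Data Coordination Platform",
      url: browserURL,
    },
    isAccessibleForFree: true,
    // Conditional fields are omitted rather than emitted as empty arrays.
    ...(sameAs.length > 0 ? { sameAs } : {}),
    ...(keywords.length > 0 ? { keywords } : {}),
  };
}
```

The creator, funder, citation, distribution, variableMeasured, and license rows are left out of the sketch; they would follow the same conditional-spread pattern once the source fields and open questions are settled.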
Implementation steps
- Add app/utils/schemaOrg.ts (or HCA-namespaced equivalent) with SchemaDataset types and buildProjectJsonLd(project, browserURL).
- Add a ProjectJsonLd component that renders the JSON-LD via next/head with the same HTML-escape helper as NCPI.
- Mount the component on the project detail page (pages/projects/[entityId]).
- Unit-test the builder: required fields present, description truncation, conditional fields omitted when source is null.
- Validate output against Google's Rich Results Test and Schema Markup Validator for representative projects (single-cell, spatial, multi-organ).
- Once shipped, request indexing via Google Search Console and confirm project pages start appearing in Google Dataset Search.
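For the rendering step, the escaping matters because the JSON-LD ends up inside an inline script element, where a project description containing "</script>" could otherwise close the element early. The sketch below shows one common way to do this; serializeJsonLd is a hypothetical name, and this is not necessarily what the NCPI helper does.

```typescript
// Serialize JSON-LD for embedding in an inline <script type="application/ld+json">.
// Characters that could end or confuse the script element are replaced with
// \uXXXX escapes, which JSON parsers decode back to the original characters.
function serializeJsonLd(jsonLd: unknown): string {
  return JSON.stringify(jsonLd)
    .replace(/</g, "\\u003c")
    .replace(/>/g, "\\u003e")
    .replace(/&/g, "\\u0026")
    .replace(/\u2028/g, "\\u2028")
    .replace(/\u2029/g, "\\u2029");
}
```

In a ProjectJsonLd component, the resulting string would typically be passed to the script element through React's dangerouslySetInnerHTML inside next/head, and a unit test can assert both that no raw "<" survives and that JSON.parse returns the original values.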
Out of scope (follow-ups)
- JSON-LD on samples/files detail pages.
- Sitemap entries for project detail pages if not already complete.