Add LungMAP projects to the Google Dataset Search catalog so they're discoverable from Google. This is done by embedding schema.org Dataset JSON-LD on project detail pages — Google's crawler picks it up.
Companion to galaxyproject/brc-analytics#1264, #4806 (HCA), and #4807 (AnVIL). LungMAP shares the HCA Azul backend, so the implementation should follow the HCA companion ticket closely with LungMAP-specific catalog naming.
Reference implementation
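NCPI Dataset Catalog has already shipped this for studies. Mirror their pattern:
- app/utils/schemaOrg.ts: SchemaDataset interface and buildStudyJsonLd() factory.
- app/components/Detail/components/StudyJsonLd/studyJsonLd.tsx: wraps the JSON-LD in a `<script type="application/ld+json">` inside next/head, with HTML escaping to prevent script injection.
- pages/[entityListType]/[...params].tsx: mounted on the detail route only.
- app/utils/schemaOrg.test.ts: covers required fields, truncation, and conditional fields.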
Google Dataset required + recommended fields
Per Google's Dataset structured data guidelines:
Required
- name: descriptive title
- description: 50–5000 characters
Recommended
- identifier, url, sameAs
- creator, funder, license
- distribution (with contentUrl, encodingFormat)
- keywords, variableMeasured, measurementTechnique
- spatialCoverage, temporalCoverage
- includedInDataCatalog, isAccessibleForFree, version, citation
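As a rough illustration of the required versus recommended split, the sketch below types the two required fields as mandatory and a few recommended ones as optional, then checks the description length bounds quoted above. The shape of this SchemaDataset is an assumption (the NCPI interface of the same name may differ), and findRequiredFieldProblems is a hypothetical helper, not part of any existing codebase.

```typescript
// Assumed shape: only a subset of the recommended fields is listed here.
// The real SchemaDataset interface in the NCPI repo may look different.
interface SchemaDataset {
  "@context": "https://schema.org";
  "@type": "Dataset";
  // Required by Google's guidelines.
  name: string;
  description: string;
  // Recommended, so optional in this sketch.
  identifier?: string[];
  url?: string;
  sameAs?: string[];
  keywords?: string[];
  isAccessibleForFree?: boolean;
  version?: string;
}

const MIN_DESCRIPTION_LENGTH = 50;
const MAX_DESCRIPTION_LENGTH = 5000;

// Hypothetical helper: returns a list of human-readable problems with the
// required fields, or an empty list when both look acceptable.
function findRequiredFieldProblems(dataset: SchemaDataset): string[] {
  const problems: string[] = [];
  if (dataset.name.trim().length === 0) {
    problems.push("name is empty");
  }
  const length = dataset.description.length;
  if (length < MIN_DESCRIPTION_LENGTH || length > MAX_DESCRIPTION_LENGTH) {
    problems.push(
      `description length ${length} is outside ${MIN_DESCRIPTION_LENGTH}-${MAX_DESCRIPTION_LENGTH}`
    );
  }
  return problems;
}
```

A check like this could back the "required fields present" unit test, though it does not replace validation with Google's own tools.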
Initial mapping — LungMAP project (ProjectResponse) → Dataset
Source entity at app/apis/azul/hca-dcp/common/entities.ts (ProjectResponse, shared with HCA via the lm2 catalog).
| schema.org field | Source / value |
| --- | --- |
| @context | "https://schema.org" |
| @type | "Dataset" |
| name | projectTitle (fall back to projectShortname) |
| description | projectDescription, with HTML stripped and truncated to 5000 chars (min 50) |
| identifier | [projectId, ...accessions] |
| url | ${browserURL}/projects/${projectId} |
| sameAs | Accession URLs (GEO, ArrayExpress, INSDC, etc.) derived from accessions |
| includedInDataCatalog | { "@type": "DataCatalog", name: "LungMAP Data Explorer", url: browserURL } |
| isAccessibleForFree | true |
| keywords | Union of genusSpecies, organ (focused on lung anatomy), organPart, disease, sampleEntityType, libraryConstructionApproach, developmentStage |
| creator | Map contributors → Person/Organization (name, affiliation, role) |
| funder | Likely { "@type": "Organization", name: "NHLBI LungMAP" } plus per-project funders if available |
| citation | Map publications → ScholarlyArticle (title, DOI/URL) |
| distribution | DataDownload[] from matrix files / contributed analyses with contentUrl + encodingFormat |
| variableMeasured | Optional PropertyValue[] derived from projectSummary (cell counts, donor counts, file counts, library construction approach, development stage) |
| license | TBD: confirm with team (LungMAP data use terms) |
Open questions for funder / license should be resolved before merge.
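A minimal sketch of part of the mapping, to make the table concrete. It covers only name, description, identifier, url, includedInDataCatalog, isAccessibleForFree, and keywords. ProjectInput is a simplified, hypothetical stand-in for ProjectResponse (the real entity is shaped differently), the regex-based tag stripping is deliberately naive, and the sketch leaves descriptions shorter than 50 characters unchanged because the handling for that case is not decided here.

```typescript
// Hypothetical, flattened stand-in for ProjectResponse; field names follow
// the mapping table, but the real Azul entity is nested differently.
interface ProjectInput {
  projectId: string;
  projectTitle?: string | null;
  projectShortname: string;
  projectDescription?: string | null;
  accessions?: string[];
  genusSpecies?: string[];
  organ?: string[];
  disease?: string[];
}

// Assumed catalog parameter: just the display name of the data catalog.
interface CatalogInfo {
  name: string;
}

const MAX_DESCRIPTION_CHARS = 5000;

// Naive HTML stripping: drops anything that looks like a tag, collapses
// whitespace, then truncates. Short descriptions are returned unchanged.
function normalizeDescription(html: string): string {
  const text = html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
  return text.slice(0, MAX_DESCRIPTION_CHARS);
}

function buildProjectJsonLd(
  project: ProjectInput,
  browserURL: string,
  catalog: CatalogInfo
): Record<string, unknown> {
  // Union of keyword sources, de-duplicated while preserving order.
  const allKeywords = [
    ...(project.genusSpecies ?? []),
    ...(project.organ ?? []),
    ...(project.disease ?? []),
  ];
  const keywords = allKeywords.filter(
    (keyword, index) => allKeywords.indexOf(keyword) === index
  );

  const jsonLd: Record<string, unknown> = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    name: project.projectTitle || project.projectShortname,
    description: normalizeDescription(project.projectDescription ?? ""),
    identifier: [project.projectId, ...(project.accessions ?? [])],
    url: `${browserURL}/projects/${project.projectId}`,
    includedInDataCatalog: {
      "@type": "DataCatalog",
      name: catalog.name,
      url: browserURL,
    },
    isAccessibleForFree: true,
  };

  // Conditional field: omitted entirely when there is nothing to report.
  if (keywords.length > 0) {
    jsonLd.keywords = keywords;
  }
  return jsonLd;
}
```

Passing the catalog name as a parameter is what would let the HCA and LungMAP pages share one builder; the remaining fields (creator, funder, citation, distribution, variableMeasured, license) would follow the same conditional pattern once their sources are settled.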
Implementation steps
- Add a LungMAP-aware buildProjectJsonLd(project, browserURL, catalog) (likely shared with the HCA implementation, parameterized on catalog name/URL).
- Add a ProjectJsonLd component that renders the JSON-LD via next/head with the same HTML-escape helper as NCPI.
- Mount the component on the LungMAP project detail page (pages/projects/[entityId] under the LungMAP site config).
- Unit-test the builder: required fields present, description truncation, conditional fields omitted when source is null, LungMAP catalog name asserted.
- Validate output against Google's Rich Results Test and Schema Markup Validator for representative LungMAP projects (mouse vs. human, single-cell vs. spatial).
- Once shipped, request indexing via Google Search Console and confirm project pages start appearing in Google Dataset Search.
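The HTML-escape step mentioned for the ProjectJsonLd component could look roughly like the sketch below. serializeJsonLd is a hypothetical name, and the NCPI helper may be implemented differently. The idea is that project text containing a closing script tag must not be able to terminate the JSON-LD script element early, so the characters <, >, and & are replaced with their JSON unicode escapes, which leaves the payload valid JSON.

```typescript
// Hypothetical helper: serializes a JSON-LD object so it is safe to place
// inside a <script type="application/ld+json"> element.
// The characters <, >, and & can only occur inside JSON strings, so swapping
// them for \uXXXX escapes keeps the JSON equivalent while removing any
// literal "</script>" sequence from the output.
function serializeJsonLd(data: object): string {
  return JSON.stringify(data)
    .replace(/</g, "\\u003c")
    .replace(/>/g, "\\u003e")
    .replace(/&/g, "\\u0026");
}

// Example: a description that tries to break out of the script element.
const exampleJsonLd = {
  "@context": "https://schema.org",
  "@type": "Dataset",
  name: "Example project",
  description: "</script><script>alert(1)</script> A & B",
};

const serialized = serializeJsonLd(exampleJsonLd);
```

The component would then render the serialized string as the content of the script element; parsing it back with JSON.parse yields the original values, so crawlers see the same data.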
Out of scope (follow-ups)
- JSON-LD on samples/files detail pages.
- Sitemap entries for project detail pages if not already complete.