[HCA DCP] Add HCA projects to Google Datasets catalog

Add HCA DCP projects to the [Google Dataset Search](https://datasetsearch.research.google.com/) catalog so they're discoverable from Google. This is done by embedding [schema.org Dataset](https://schema.org/Dataset) JSON-LD on project detail pages — Google's crawler picks it up.

Companion to galaxyproject/brc-analytics#1264 (same approach, different catalog).

## Reference implementation

NCPI Dataset Catalog has already shipped this for studies. Mirror their pattern:

- Builder + types: [`app/utils/schemaOrg.ts`](https://github.com/NIH-NCPI/ncpi-dataset-catalog/blob/main/app/utils/schemaOrg.ts) — `SchemaDataset` interface and `buildStudyJsonLd()` factory.
- Render component: [`app/components/Detail/components/StudyJsonLd/studyJsonLd.tsx`](https://github.com/NIH-NCPI/ncpi-dataset-catalog/blob/main/app/components/Detail/components/StudyJsonLd/studyJsonLd.tsx) — wraps the JSON-LD in a `<script type="application/ld+json">` inside `next/head`, with HTML escaping to prevent script injection.
- Page integration: [`pages/[entityListType]/[...params].tsx`](https://github.com/NIH-NCPI/ncpi-dataset-catalog/blob/main/pages/%5BentityListType%5D/%5B...params%5D.tsx) — mounted on the detail route only.
- Tests: [`app/utils/schemaOrg.test.ts`](https://github.com/NIH-NCPI/ncpi-dataset-catalog/blob/main/app/utils/schemaOrg.test.ts) — covers required fields, truncation, and conditional fields.
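
The escaping mentioned for the render component can be sketched as follows. This is a minimal illustration, not the NCPI helper itself: the name `serializeJsonLd` and the exact set of escaped characters are assumptions. The idea is that `<`, `>` and `&` are replaced with JSON unicode escapes, so a string such as `</script>` in a project title cannot close the surrounding script element, while JSON parsers still read the original value.

```typescript
// Hypothetical helper name; the NCPI implementation may escape a different set of characters.
const JSON_LD_ESCAPES: Record<string, string> = {
  "<": "\\u003c",
  ">": "\\u003e",
  "&": "\\u0026",
};

// Serialize a JSON-LD object so it is safe to place inside
// <script type="application/ld+json">. The escapes are valid JSON,
// so consumers parsing the script content get the original characters back.
function serializeJsonLd(data: unknown): string {
  return JSON.stringify(data).replace(/[<>&]/g, (ch) => JSON_LD_ESCAPES[ch] ?? ch);
}
```

In a React component the resulting string would typically be set as the script element's inner HTML, since React would otherwise escape it a second time.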

## Google Dataset required + recommended fields

Per [Google's Dataset structured data guidelines](https://developers.google.com/search/docs/appearance/structured-data/dataset):

**Required**
- `name` — descriptive title
- `description` — 50–5000 characters

**Recommended**
- `identifier`, `url`, `sameAs`
- `creator`, `funder`, `license`
- `distribution` (with `contentUrl`, `encodingFormat`)
- `keywords`, `variableMeasured`, `measurementTechnique`
- `spatialCoverage`, `temporalCoverage`
- `includedInDataCatalog`, `isAccessibleForFree`, `version`, `citation`
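
The two required fields are simple enough to check before any markup is emitted. The sketch below assumes a hypothetical `checkRequiredFields` helper (the name and return shape are illustrative, not taken from the NCPI code) and encodes only the `name` presence and 50–5000 character `description` rules listed above.

```typescript
// Hypothetical pre-flight check for the fields Google marks as required.
interface RequiredDatasetFields {
  name?: string;
  description?: string;
}

const MIN_DESCRIPTION_LENGTH = 50;
const MAX_DESCRIPTION_LENGTH = 5000;

// Returns a list of human-readable problems; an empty list means both
// required fields satisfy the documented constraints.
function checkRequiredFields(dataset: RequiredDatasetFields): string[] {
  const problems: string[] = [];
  if (typeof dataset.name !== "string" || dataset.name.trim() === "") {
    problems.push("name is missing or empty");
  }
  const length = typeof dataset.description === "string" ? dataset.description.length : 0;
  if (length < MIN_DESCRIPTION_LENGTH || length > MAX_DESCRIPTION_LENGTH) {
    problems.push(`description length ${length} is outside 50-5000`);
  }
  return problems;
}
```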

## Initial mapping — HCA project (`ProjectResponse`) → `Dataset`

Source entity at `app/apis/azul/hca-dcp/common/entities.ts` (`ProjectResponse`).

| schema.org field | Source / value |
| --- | --- |
| `@context` | `"https://schema.org"` |
| `@type` | `"Dataset"` |
| `name` | `projectTitle` (fall back to `projectShortname`) |
| `description` | `projectDescription`: strip HTML, truncate to 5000 chars (Google requires at least 50) |
| `identifier` | `[projectId, ...accessions]` (HCA project UUID plus mapped INSDC/GEO/ArrayExpress accessions) |
| `url` | `${browserURL}/projects/${projectId}` |
| `sameAs` | Accession URLs (e.g. GEO, ArrayExpress, INSDC) derived from `accessions` |
| `includedInDataCatalog` | `{ "@type": "DataCatalog", name: "Human Cell Atlas Data Coordination Platform", url: browserURL }` |
| `isAccessibleForFree` | `true` |
| `keywords` | Union of `genusSpecies`, `organ`, `organPart`, `disease`, `sampleEntityType`, `libraryConstructionApproach`, etc. |
| `creator` | Map `contributors` → `Person`/`Organization` (name, affiliation, role) |
| `funder` | Map funding sources from `projectResponse.funders` if present |
| `citation` | Map `publications` → `ScholarlyArticle` (title, DOI/URL) |
| `distribution` | `DataDownload[]` from matrix files / contributed analyses with `contentUrl` + `encodingFormat` |
| `variableMeasured` | Optional `PropertyValue[]` derived from `projectSummary` (cell counts, donor counts, file counts, library construction approach) |
| `license` | TBD — confirm with team (HCA data use terms) |

Open questions to resolve before merge: the source for `funder`, the value for `license`, and which file distributions to expose.
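
A rough sketch of the unambiguous rows of the mapping, to make the intended output concrete. The `ProjectLike` input type is a hypothetical, flattened stand-in; the real `ProjectResponse` is shaped differently and would need an adapter. The regex-based `stripHtml` is a placeholder for illustration, not a sanitizer, and the sketch does not handle descriptions shorter than 50 characters.

```typescript
// Hypothetical, flattened view of a project; NOT the real ProjectResponse shape.
interface ProjectLike {
  projectId: string;
  projectTitle?: string;
  projectShortname: string;
  projectDescription?: string;
  accessions?: string[];
  keywords?: string[];
}

const MAX_DESCRIPTION = 5000;

// Crude tag stripper for illustration only: replaces tags with spaces,
// then collapses whitespace.
function stripHtml(html: string): string {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

function buildProjectJsonLd(
  project: ProjectLike,
  browserURL: string
): Record<string, unknown> {
  const description = stripHtml(project.projectDescription ?? "").slice(0, MAX_DESCRIPTION);
  // De-duplicate keywords while keeping first-seen order.
  const keywords = (project.keywords ?? []).filter(
    (keyword, index, all) => all.indexOf(keyword) === index
  );
  const jsonLd: Record<string, unknown> = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    name: project.projectTitle || project.projectShortname,
    description,
    identifier: [project.projectId, ...(project.accessions ?? [])],
    url: `${browserURL}/projects/${project.projectId}`,
    includedInDataCatalog: {
      "@type": "DataCatalog",
      name: "Human Cell Atlas Data Coordination Platform",
      url: browserURL,
    },
    isAccessibleForFree: true,
  };
  // Conditional field: only emitted when there is something to say.
  if (keywords.length > 0) {
    jsonLd.keywords = keywords;
  }
  return jsonLd;
}
```

Rows that depend on the open questions (`funder`, `license`, `distribution`) are deliberately left out of the sketch.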

## Implementation steps

1. Add `app/utils/schemaOrg.ts` (or HCA-namespaced equivalent) with `SchemaDataset` types and `buildProjectJsonLd(project, browserURL)`.
2. Add a `ProjectJsonLd` component that renders the JSON-LD via `next/head` with the same HTML-escape helper as NCPI.
3. Mount the component on the project detail page (`pages/projects/[entityId]`).
4. Unit-test the builder: required fields present, description truncation, conditional fields omitted when source is null.
5. Validate output against [Google's Rich Results Test](https://search.google.com/test/rich-results) and [Schema Markup Validator](https://validator.schema.org/) for representative projects (single-cell, spatial, multi-organ).
6. Once shipped, request indexing via Google Search Console and confirm project pages start appearing in Google Dataset Search.
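
For step 4, the "conditional fields omitted when source is null" behaviour is easier to test if the builder routes its output through a single pruning helper. A minimal sketch, assuming a hypothetical `omitEmpty` function (not part of the NCPI code) that drops `null`, `undefined`, empty strings and empty arrays while keeping meaningful falsy values such as `false`:

```typescript
// Hypothetical helper: removes entries whose value carries no information.
// `false` and `0` are kept because they are valid schema.org values.
function omitEmpty(input: Record<string, unknown>): Record<string, unknown> {
  const output: Record<string, unknown> = {};
  for (const key of Object.keys(input)) {
    const value = input[key];
    const isEmpty =
      value === null ||
      value === undefined ||
      value === "" ||
      (Array.isArray(value) && value.length === 0);
    if (!isEmpty) {
      output[key] = value;
    }
  }
  return output;
}
```

A unit test can then assert on the keys of the pruned object rather than checking each optional field individually.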

## Out of scope (follow-ups)

- JSON-LD on samples/files detail pages.
- Sitemap entries for project detail pages if not already complete.

[HCA DCP] Add HCA projects to Google Datasets catalog #4806
