Add Schema.org Dataset JSON-LD and sitemap to catalog.ourworldindata.org

## Summary

Expose structured metadata for every catalog dataset as [Schema.org `Dataset`](https://schema.org/Dataset) JSON-LD, and serve a sitemap so Google Dataset Search (and other crawlers) can discover them passively.

## Motivation

OWID's catalog has rich per-indicator metadata (titles, descriptions, units, origins, licenses, topic tags) but none of it is machine-readable to external crawlers. Adding Schema.org markup would:

- Make every catalog dataset eligible for indexing by **Google Dataset Search**, the main structured-data discovery surface for datasets.
- Potentially improve how OWID data appears in **regular Google results** (rich snippets).
- Provide a foundation for **Croissant** (an ML-focused dataset standard) later, since Croissant builds on Schema.org `Dataset`.
- Make datasets more visible to **LLM training crawlers** that parse structured data.

## Proposed approach

### 1. JSON-LD generator in ETL

Write a `meta_to_schema_org(meta_yml, dvc, catalog_path) -> dict` function that maps existing metadata to Schema.org fields:

| Schema.org field | Source |
|---|---|
| `name`, `description` | origin `title`, `description` from `.dvc` |
| `creator` | origin `producer` |
| `citation` | origin `citation_full` |
| `license` | origin `license.url` |
| `datePublished`, `version` | origin `date_published` + folder version |
| `identifier` | catalog path |
| `keywords` | `presentation.topic_tags` from `.meta.yml` |
| `variableMeasured` | per-variable `title`, `description_short`, `unit` from `.meta.yml` |
| `distribution` | CSV/zip download URLs |
| `temporalCoverage` | computable from data |
| `spatialCoverage` | computable from data or declared |

At publish time, emit `<catalog_path>/dataset.jsonld` alongside the data in R2.
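A minimal sketch of the mapping, to make the table above concrete. It assumes both inputs are already-parsed YAML dicts, that the origin lives under `dvc["meta"]["origin"]`, that variables live under `meta_yml["tables"][<table>]["variables"]`, and that the version is the second-to-last segment of the catalog path. None of this is confirmed here, so adjust it to the real ETL structures. `distribution`, `temporalCoverage` and `spatialCoverage` are left out because they depend on the published files and the data.

```python
from typing import Any


def _clean(doc: dict[str, Any]) -> dict[str, Any]:
    """Drop keys whose value is missing or empty."""
    return {k: v for k, v in doc.items() if v not in (None, "", [], {})}


def meta_to_schema_org(
    meta_yml: dict[str, Any], dvc: dict[str, Any], catalog_path: str
) -> dict[str, Any]:
    """Map parsed metadata to a Schema.org Dataset dict (illustrative sketch)."""
    # Assumed layout: the origin block sits under meta.origin in the .dvc file.
    origin = (dvc.get("meta") or {}).get("origin") or {}

    # Assumed layout: <...>/<namespace>/<version>/<short_name>.
    parts = catalog_path.strip("/").split("/")
    version = parts[-2] if len(parts) >= 2 else None

    variables: list[dict[str, Any]] = []
    keywords: list[str] = []
    for table in (meta_yml.get("tables") or {}).values():
        for var_name, var in ((table or {}).get("variables") or {}).items():
            var = var or {}
            variables.append(
                _clean(
                    {
                        "@type": "PropertyValue",
                        # Fall back to the short name when no title is set.
                        "name": var.get("title") or var_name,
                        "description": var.get("description_short"),
                        "unitText": var.get("unit"),
                    }
                )
            )
            # Collect topic tags across variables, preserving first-seen order.
            for tag in (var.get("presentation") or {}).get("topic_tags") or []:
                if tag not in keywords:
                    keywords.append(tag)

    producer = origin.get("producer")
    published = origin.get("date_published")
    return _clean(
        {
            "@context": "https://schema.org/",
            "@type": "Dataset",
            "name": origin.get("title"),
            "description": origin.get("description"),
            # Assumption: a producer string cannot tell Person from
            # Organization, so this sketch defaults to Organization.
            "creator": {"@type": "Organization", "name": producer} if producer else None,
            "citation": origin.get("citation_full"),
            "license": (origin.get("license") or {}).get("url"),
            "datePublished": str(published) if published else None,
            "version": version,
            "identifier": catalog_path.strip("/"),
            "keywords": keywords,
            "variableMeasured": variables,
        }
    )
```

Empty fields are dropped rather than emitted as `null`, so a dataset with sparse metadata still yields a valid, if thinner, document.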

### 2. Worker routes (in [owid/cloudflare-workers](https://github.com/owid/cloudflare-workers))

| Route | Response |
|---|---|
| `GET /sitemap.xml` | Auto-generated from R2 listing, one `<url>` per catalog dataset |
| `GET /robots.txt` | Points at sitemap |
| `GET /<catalog_path>/` | Minimal HTML shell embedding the JSON-LD (human-readable fallback + crawler target) |
| `GET /<catalog_path>/dataset.jsonld` | Raw JSON-LD with `Content-Type: application/ld+json` |
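The sitemap and HTML-shell responses could be built roughly as follows. This is a sketch in Python to match the ETL side. The actual Worker would be written in JavaScript or TypeScript, and the function names and the `BASE_URL` constant are illustrative, not existing code.

```python
import html
import json
import xml.etree.ElementTree as ET

BASE_URL = "https://catalog.ourworldindata.org"
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(catalog_paths: list[str]) -> str:
    """Return sitemap XML with one <url> entry per catalog dataset."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path in sorted(catalog_paths):
        url = ET.SubElement(urlset, "url")
        # Point at the HTML shell route, which embeds the JSON-LD.
        ET.SubElement(url, "loc").text = f"{BASE_URL}/{path.strip('/')}/"
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def render_html_shell(jsonld: dict) -> str:
    """Return a minimal HTML page embedding the dataset's JSON-LD."""
    # Escape "<" so a stray "</script>" in the metadata cannot end the tag.
    payload = json.dumps(jsonld, ensure_ascii=False).replace("<", "\\u003c")
    title = html.escape(str(jsonld.get("name", "Dataset")))
    return (
        "<!doctype html>\n"
        f'<html><head><meta charset="utf-8"><title>{title}</title>\n'
        f'<script type="application/ld+json">{payload}</script>\n'
        f"</head><body><h1>{title}</h1></body></html>\n"
    )
```

The sitemap protocol caps a single file at 50,000 URLs, so a large catalog may eventually need a sitemap index that points at several files.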

### 3. Search Console submission

- Submit sitemap in Google Search Console and Bing Webmaster Tools.
- Validate a sample of pages with [Rich Results Test](https://search.google.com/test/rich-results).

## Example output

For `biodiversity/2025-04-07/cherry_blossom`:

```json
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Cherry Blossom Full Bloom Dates in Kyoto, Japan",
  "description": "A historical time series of peak cherry blossom bloom data from Kyoto, Japan...",
  "identifier": "biodiversity/2025-04-07/cherry_blossom",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "creator": { "@type": "Person", "name": "Yasuyuki Aono" },
  "publisher": { "@type": "Organization", "name": "Our World in Data" },
  "keywords": ["Biodiversity", "Climate Change"],
  "datePublished": "2025-04-07",
  "temporalCoverage": "0812/2025",
  "variableMeasured": [
    {
      "@type": "PropertyValue",
      "name": "full_flowering_date",
      "description": "Day of the year with peak cherry blossom of Prunus jamasakura in Kyoto, Japan.",
      "unitText": "day of the year"
    },
    {
      "@type": "PropertyValue",
      "name": "average_20_years",
      "description": "Twenty-year moving average of the peak-bloom day-of-year. Only computed for 20-year windows with at least five years of data.",
      "unitText": "day of the year"
    }
  ],
  "distribution": [
    {
      "@type": "DataDownload",
      "encodingFormat": "text/csv",
      "contentUrl": "https://catalog.ourworldindata.org/garden/biodiversity/2025-04-07/cherry_blossom/cherry_blossom.csv"
    }
  ]
}
```

## Validation

- Use Google [Rich Results Test](https://search.google.com/test/rich-results) on sample pages.
- Use [Schema Markup Validator](https://validator.schema.org/) for structural checks.
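- Google's Dataset structured-data guidelines list `name` and `description` as required; `license` is recommended rather than required, but we should treat it as mandatory for our own output.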
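
Before relying on the external validators, a cheap local check at publish time could catch documents that lack the fields this proposal wants on every dataset. A minimal sketch; the function name and the `EXPECTED_FIELDS` tuple are illustrative assumptions, not existing code:

```python
from typing import Any

# Fields this proposal expects on every emitted document (assumed policy).
EXPECTED_FIELDS = ("name", "description", "license")


def find_problems(
    doc: dict[str, Any], expected: tuple[str, ...] = EXPECTED_FIELDS
) -> list[str]:
    """Return human-readable problems; an empty list means the check passed."""
    problems = []
    if doc.get("@type") != "Dataset":
        problems.append("@type is not 'Dataset'")
    for field in expected:
        value = doc.get(field)
        # Treat absent values and blank strings as missing.
        if value is None or (isinstance(value, str) and not value.strip()):
            problems.append(f"missing or empty field: {field}")
    return problems
```

This only checks for presence. It does not replace the Rich Results Test or the Schema Markup Validator, which also check types and nesting.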

## Open questions

- Should the HTML shell at `/<catalog_path>/` be a full landing page or minimal? (Minimal is fine for crawlers; a richer page could replace the current raw file listing.)
- Should `temporalCoverage` and `spatialCoverage` be precomputed at ETL publish time or derived lazily by the Worker?
- Do we want `sameAs` links to grapher URLs for charted indicators?
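
If `temporalCoverage` is precomputed at publish time, the calculation itself is small. The sketch below assumes the dataset exposes its years as plain integers and formats them as an ISO 8601 interval, as in the example output. Years outside 0 to 9999 (for instance BCE years) are deliberately not handled and return `None`, since their encoding would need a separate decision.

```python
from typing import Iterable, Optional


def temporal_coverage(years: Iterable[int]) -> Optional[str]:
    """Format the min/max year as an ISO 8601 interval such as '0812/2025'."""
    ys = [int(y) for y in years]
    if not ys:
        return None
    start, end = min(ys), max(ys)
    # Out of scope for this sketch: years needing the expanded ISO 8601 form.
    if start < 0 or end > 9999:
        return None
    if start == end:
        return f"{start:04d}"
    return f"{start:04d}/{end:04d}"
```

For example, `temporal_coverage([2025, 812, 1900])` returns `"0812/2025"`.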
