Add Schema.org Dataset JSON-LD and sitemap to catalog.ourworldindata.org #5923

@Marigold

Summary

Expose structured metadata for every catalog dataset as Schema.org Dataset JSON-LD, and serve a sitemap so Google Dataset Search (and other crawlers) can discover them passively.

Motivation

OWID's catalog has rich per-indicator metadata (titles, descriptions, units, origins, licenses, topic tags) but none of it is machine-readable to external crawlers. Adding Schema.org markup would:

  • Get every cataloged dataset into Google Dataset Search — the main structured-data discovery surface for datasets.
  • Improve how OWID data appears in regular Google results (rich snippets).
  • Provide a foundation for Croissant (ML-focused dataset standard) later — Croissant is a superset of Schema.org Dataset.
  • Make datasets more visible to LLM training crawlers that parse structured data.

Proposed approach

1. JSON-LD generator in ETL

Write a meta_to_schema_org(meta_yml, dvc, catalog_path) -> dict function that maps existing metadata to Schema.org fields:

| Schema.org field | Source |
|---|---|
| name, description | origin title, description from .dvc |
| creator | origin producer |
| citation | origin citation_full |
| license | origin license.url |
| datePublished, version | origin date_published + folder version |
| identifier | catalog path |
| keywords | presentation.topic_tags from .meta.yml |
| variableMeasured | per-variable title, description_short, unit from .meta.yml |
| distribution | CSV/zip download URLs |
| temporalCoverage | computable from data |
| spatialCoverage | computable from data or declared |

At publish time, emit <catalog_path>/dataset.jsonld alongside the data in R2.
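
A minimal sketch of what the mapping could look like, to make the field table concrete. It assumes `meta_yml` and `dvc` are already-parsed dicts; the key layout used here (`dvc["meta"]["origin"]`, a flat `meta_yml["variables"]` mapping, a top-level `presentation.topic_tags`) is a simplification for illustration and may not match the real files. The `CATALOG_BASE` URL pattern is copied from the example output in this issue, `_drop_empty` is a hypothetical helper, and defaulting `creator` to `Organization` is an assumption (the cherry blossom example uses `Person`).

```python
from typing import Any, Dict

# Assumed URL pattern, taken from the example output in this issue.
CATALOG_BASE = "https://catalog.ourworldindata.org/garden"


def _drop_empty(d: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: drop keys whose value is missing or empty."""
    return {k: v for k, v in d.items() if v not in (None, "", [], {})}


def meta_to_schema_org(
    meta_yml: Dict[str, Any], dvc: Dict[str, Any], catalog_path: str
) -> Dict[str, Any]:
    # Assumed shape: origin metadata lives under meta.origin in the .dvc file.
    origin = dvc.get("meta", {}).get("origin", {})
    path = catalog_path.strip("/")
    parts = path.split("/")
    short_name = parts[-1]
    # Assumed layout: <namespace>/<version>/<short_name>.
    version = parts[-2] if len(parts) >= 2 else None

    producer = origin.get("producer")
    published = origin.get("date_published")

    variables = [
        _drop_empty({
            "@type": "PropertyValue",
            "name": var.get("title", var_name),  # fall back to the short name
            "description": var.get("description_short"),
            "unitText": var.get("unit"),
        })
        for var_name, var in meta_yml.get("variables", {}).items()
    ]

    return _drop_empty({
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": origin.get("title"),
        "description": origin.get("description"),
        "identifier": path,
        "version": version,
        # Assumption: Person vs. Organization is not distinguished here.
        "creator": {"@type": "Organization", "name": producer} if producer else None,
        "publisher": {"@type": "Organization", "name": "Our World in Data"},
        "citation": origin.get("citation_full"),
        "license": (origin.get("license") or {}).get("url"),
        "datePublished": str(published) if published else None,
        "keywords": (meta_yml.get("presentation") or {}).get("topic_tags"),
        "variableMeasured": variables,
        "distribution": [{
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": f"{CATALOG_BASE}/{path}/{short_name}.csv",
        }],
    })
```

Dropping empty keys keeps the output valid when an origin lacks, say, `citation_full`, rather than emitting `null` values that validators may flag. `temporalCoverage` and `spatialCoverage` are left out here because they depend on the data rather than the metadata.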

2. Worker routes (in owid/cloudflare-workers)

| Route | Response |
|---|---|
| GET /sitemap.xml | Auto-generated from R2 listing, one <url> per catalog dataset |
| GET /robots.txt | Points at sitemap |
| GET /<catalog_path>/ | Minimal HTML shell embedding the JSON-LD (human-readable fallback + crawler target) |
| GET /<catalog_path>/dataset.jsonld | Raw JSON-LD with Content-Type: application/ld+json |
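
To illustrate the first two routes, here is a rough sketch of the sitemap and robots.txt bodies. It is written in Python for consistency with the ETL side; the actual Worker would presumably be JavaScript/TypeScript. The function names, the `base_url` default, and the assumption that the R2 listing has already been reduced to a list of catalog paths are all illustrative.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(
    catalog_paths: Iterable[str],
    base_url: str = "https://catalog.ourworldindata.org",
) -> str:
    """Return a sitemap XML string with one <url> per catalog dataset."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    # De-duplicate and sort so the output is stable between runs.
    for path in sorted(set(catalog_paths)):
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        # Points at the HTML shell route, GET /<catalog_path>/.
        loc.text = f"{base_url}/{path.strip('/')}/"
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def build_robots_txt(base_url: str = "https://catalog.ourworldindata.org") -> str:
    """Return a robots.txt body that allows crawling and points at the sitemap."""
    return f"User-agent: *\nAllow: /\nSitemap: {base_url}/sitemap.xml\n"
```

One thing to keep in mind: the sitemap protocol caps a single file at 50,000 URLs, so if the catalog grows past that, the Worker would need to serve a sitemap index with several child sitemaps.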

3. Search Console submission

  • Submit sitemap in Google Search Console and Bing Webmaster Tools.
  • Validate a sample of pages with Rich Results Test.

Example output

For biodiversity/2025-04-07/cherry_blossom:

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Cherry Blossom Full Bloom Dates in Kyoto, Japan",
  "description": "A historical time series of peak cherry blossom bloom data from Kyoto, Japan...",
  "identifier": "biodiversity/2025-04-07/cherry_blossom",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "creator": { "@type": "Person", "name": "Yasuyuki Aono" },
  "publisher": { "@type": "Organization", "name": "Our World in Data" },
  "keywords": ["Biodiversity", "Climate Change"],
  "datePublished": "2025-04-07",
  "temporalCoverage": "0812/2025",
  "variableMeasured": [
    {
      "@type": "PropertyValue",
      "name": "full_flowering_date",
      "description": "Day of the year with peak cherry blossom of Prunus jamasakura in Kyoto, Japan.",
      "unitText": "day of the year"
    },
    {
      "@type": "PropertyValue",
      "name": "average_20_years",
      "description": "Twenty-year moving average of the peak-bloom day-of-year. Only computed for 20-year windows with at least five years of data.",
      "unitText": "day of the year"
    }
  ],
  "distribution": [
    {
      "@type": "DataDownload",
      "encodingFormat": "text/csv",
      "contentUrl": "https://catalog.ourworldindata.org/garden/biodiversity/2025-04-07/cherry_blossom/cherry_blossom.csv"
    }
  ]
}
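
For the GET /<catalog_path>/ route, the shell only needs to wrap a document like the one above in a script tag of type application/ld+json. A possible sketch, again in Python for illustration only (`render_html_shell` is a hypothetical name, and the real Worker may template this differently):

```python
import html
import json
from typing import Any, Dict


def render_html_shell(jsonld: Dict[str, Any]) -> str:
    """Render a minimal HTML page that embeds the given JSON-LD document."""
    # Escape "<" inside the JSON so a value containing "</script>" cannot
    # terminate the script element early; JSON parsers decode \u003c as "<".
    payload = json.dumps(jsonld, ensure_ascii=False, indent=2).replace("<", "\\u003c")
    title = html.escape(str(jsonld.get("name", "")))
    description = html.escape(str(jsonld.get("description", "")))
    return (
        "<!doctype html>\n"
        '<html lang="en">\n'
        "<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{title}</title>\n"
        f'<script type="application/ld+json">\n{payload}\n</script>\n'
        "</head>\n"
        "<body>\n"
        f"<h1>{title}</h1>\n"
        f"<p>{description}</p>\n"
        "</body>\n"
        "</html>\n"
    )
```

Escaping matters here because descriptions come from upstream metadata and are not guaranteed to be free of markup.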

Validation

Open questions

  • Should the HTML shell at /<catalog_path>/ be a full landing page or minimal? (Minimal is fine for crawlers; a richer page could replace the current raw file listing.)
  • Should temporalCoverage and spatialCoverage be precomputed at ETL publish time or derived lazily by the Worker?
  • Do we want sameAs links to grapher URLs for charted indicators?
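
On the temporalCoverage question: if it is precomputed at publish time, the computation itself is small. A sketch, assuming the ETL can hand over the list of integer years present in the table (`temporal_coverage` is a hypothetical helper, not an existing ETL function):

```python
from typing import Iterable, Optional


def temporal_coverage(years: Iterable[int]) -> Optional[str]:
    """Return an ISO 8601 year or year interval such as "0812/2025".

    Returns None when there are no years, or when any year is outside
    1..9999; BCE dates need a different encoding and are not handled here.
    """
    values = list(years)
    if not values:
        return None
    start, end = min(values), max(values)
    if start < 1 or end > 9999:
        return None
    if start == end:
        return f"{start:04d}"
    # Zero-pad to four digits so early years stay valid ISO 8601.
    return f"{start:04d}/{end:04d}"
```

With the cherry blossom years this yields the same "0812/2025" string as in the example output. Doing this in the Worker instead would mean reading the data file on request, which is one argument for precomputing.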
