Summary
Expose structured metadata for every catalog dataset as Schema.org Dataset JSON-LD, and serve a sitemap so Google Dataset Search (and other crawlers) can discover them passively.
Motivation
OWID's catalog has rich per-indicator metadata (titles, descriptions, units, origins, licenses, topic tags) but none of it is machine-readable to external crawlers. Adding Schema.org markup would:
- Get every cataloged dataset into Google Dataset Search — the main structured-data discovery surface for datasets.
- Improve how OWID data appears in regular Google results (rich snippets).
- Provide a foundation for Croissant (ML-focused dataset standard) later — Croissant is a superset of Schema.org Dataset.
- Make datasets more visible to LLM training crawlers that parse structured data.
Proposed approach
1. JSON-LD generator in ETL
Write a meta_to_schema_org(meta_yml, dvc, catalog_path) -> dict function that maps existing metadata to Schema.org fields:
| Schema.org field | Source |
| --- | --- |
| name, description | origin title, description from .dvc |
| creator | origin producer |
| citation | origin citation_full |
| license | origin license.url |
| datePublished, version | origin date_published + folder version |
| identifier | catalog path |
| keywords | presentation.topic_tags from .meta.yml |
| variableMeasured | per-variable title, description_short, unit from .meta.yml |
| distribution | CSV/zip download URLs |
| temporalCoverage | computable from data |
| spatialCoverage | computable from data or declared |
At publish time, emit <catalog_path>/dataset.jsonld alongside the data in R2.
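The field mapping could look roughly like the sketch below. It assumes `meta_yml` and `dvc` are already-parsed dicts, that the origin sits under `dvc["meta"]["origin"]`, that topic tags sit under `meta_yml["presentation"]["topic_tags"]`, and that variables are keyed by short name under `meta_yml["variables"]`. Those key paths and the `_drop_empty` helper are illustrative assumptions, not the actual ETL layout. `distribution` and the coverage fields are left out because they depend on download URLs and data that this sketch does not see.

```python
from typing import Any


def _drop_empty(d: dict[str, Any]) -> dict[str, Any]:
    """Remove keys whose value is missing or empty (hypothetical helper)."""
    return {k: v for k, v in d.items() if v not in (None, "", [], {})}


def meta_to_schema_org(meta_yml: dict[str, Any], dvc: dict[str, Any], catalog_path: str) -> dict[str, Any]:
    """Map parsed metadata to a Schema.org Dataset dict (sketch, assumed key layout)."""
    origin = (dvc.get("meta") or {}).get("origin") or {}
    # Assumed path shape: <namespace>/<version>/<short_name>
    parts = catalog_path.split("/")
    producer = origin.get("producer")
    published = origin.get("date_published")

    variables = [
        _drop_empty(
            {
                "@type": "PropertyValue",
                # Fall back to the short name when no title is declared.
                "name": var.get("title", short_name),
                "description": var.get("description_short"),
                "unitText": var.get("unit"),
            }
        )
        for short_name, var in (meta_yml.get("variables") or {}).items()
    ]

    return _drop_empty(
        {
            "@context": "https://schema.org/",
            "@type": "Dataset",
            "name": origin.get("title"),
            "description": origin.get("description"),
            # Person vs. Organization cannot be inferred from the producer string;
            # Organization is only a placeholder default in this sketch.
            "creator": {"@type": "Organization", "name": producer} if producer else None,
            "citation": origin.get("citation_full"),
            "license": (origin.get("license") or {}).get("url"),
            # YAML loaders may return a date object, so normalise to a string.
            "datePublished": str(published) if published is not None else None,
            "version": parts[1] if len(parts) > 1 else None,
            "identifier": catalog_path,
            "keywords": (meta_yml.get("presentation") or {}).get("topic_tags") or [],
            "variableMeasured": variables,
        }
    )
```

Serialising the result with `json.dumps(doc, indent=2)` would give the content of `dataset.jsonld`; the upload to R2 is outside the scope of this sketch.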
2. Worker routes (in owid/cloudflare-workers)
| Route | Response |
| --- | --- |
| GET /sitemap.xml | Auto-generated from R2 listing, one <url> per catalog dataset |
| GET /robots.txt | Points at sitemap |
| GET /<catalog_path>/ | Minimal HTML shell embedding the JSON-LD (human-readable fallback + crawler target) |
| GET /<catalog_path>/dataset.jsonld | Raw JSON-LD with Content-Type: application/ld+json |
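The response bodies for the first three routes are simple enough to sketch. The Worker itself would presumably be written in JavaScript or TypeScript, so the Python below only illustrates the logic. The function names are hypothetical, and the base URL (host taken from the example `contentUrl`) and the `<base_url>/<catalog_path>/` URL layout are assumptions.

```python
import html
import json
from xml.sax.saxutils import escape

# Host taken from the example contentUrl; the URL layout is an assumption.
BASE_URL = "https://catalog.ourworldindata.org"


def build_sitemap(catalog_paths: list[str], base_url: str = BASE_URL) -> str:
    """Return sitemap XML with one <url> entry per catalog dataset."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>']
    lines.append('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">')
    for path in sorted(catalog_paths):
        loc = escape(f"{base_url}/{path}/")  # XML-escape &, <, >
        lines.append(f"  <url><loc>{loc}</loc></url>")
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


def robots_txt(base_url: str = BASE_URL) -> str:
    """Return a robots.txt body that points crawlers at the sitemap."""
    return f"User-agent: *\nAllow: /\nSitemap: {base_url}/sitemap.xml\n"


def html_shell(doc: dict) -> str:
    """Return a minimal HTML page embedding the JSON-LD document."""
    # Escape "<" so a "</script>" inside a description cannot close the
    # script element early; JSON parsers decode \u003c back to "<".
    payload = json.dumps(doc, ensure_ascii=False).replace("<", "\\u003c")
    title = html.escape(str(doc.get("name", "")))
    return (
        "<!doctype html>\n"
        f"<html><head><title>{title}</title>\n"
        f'<script type="application/ld+json">{payload}</script>\n'
        f"</head><body><h1>{title}</h1></body></html>\n"
    )
```

Escaping `<` inside the embedded JSON matters because dataset descriptions are free text and end up inside a `<script>` element.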
3. Search Console submission
- Submit sitemap in Google Search Console and Bing Webmaster Tools.
- Validate a sample of pages with Rich Results Test.
Example output
For biodiversity/2025-04-07/cherry_blossom:
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Cherry Blossom Full Bloom Dates in Kyoto, Japan",
  "description": "A historical time series of peak cherry blossom bloom data from Kyoto, Japan...",
  "identifier": "biodiversity/2025-04-07/cherry_blossom",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "creator": { "@type": "Person", "name": "Yasuyuki Aono" },
  "publisher": { "@type": "Organization", "name": "Our World in Data" },
  "keywords": ["Biodiversity", "Climate Change"],
  "datePublished": "2025-04-07",
  "temporalCoverage": "0812/2025",
  "variableMeasured": [
    {
      "@type": "PropertyValue",
      "name": "full_flowering_date",
      "description": "Day of the year with peak cherry blossom of Prunus jamasakura in Kyoto, Japan.",
      "unitText": "day of the year"
    },
    {
      "@type": "PropertyValue",
      "name": "average_20_years",
      "description": "Twenty-year moving average of the peak-bloom day-of-year. Only computed for 20-year windows with at least five years of data.",
      "unitText": "day of the year"
    }
  ],
  "distribution": [
    {
      "@type": "DataDownload",
      "encodingFormat": "text/csv",
      "contentUrl": "https://catalog.ourworldindata.org/garden/biodiversity/2025-04-07/cherry_blossom/cherry_blossom.csv"
    }
  ]
}
Validation
- name, description, license.
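A small pre-publish check could complement the Rich Results Test by catching empty fields before anything is uploaded. This is a minimal sketch, assuming the fields worth checking are `name`, `description`, and `license`; the function name and the field list are hypothetical.

```python
# Assumed list of fields that should never be empty in a published document.
CHECKED_FIELDS = ("name", "description", "license")


def missing_fields(doc: dict) -> list[str]:
    """Return the checked fields that are absent or blank in a JSON-LD dict."""
    return [field for field in CHECKED_FIELDS if not str(doc.get(field) or "").strip()]
```

A publish step could refuse to emit `dataset.jsonld`, or just log a warning, whenever the returned list is non-empty.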
Open questions
- Should the HTML shell at /<catalog_path>/ be a full landing page or minimal? (Minimal is fine for crawlers; a richer page could replace the current raw file listing.)
- Should temporalCoverage and spatialCoverage be precomputed at ETL publish time or derived lazily by the Worker?
- Do we want sameAs links to grapher URLs for charted indicators?
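If coverage is precomputed at publish time, deriving `temporalCoverage` from the data is cheap. The sketch below assumes the time dimension is a collection of integer years between 1 and 9999; the function name is hypothetical, and dates, BCE years, and open-ended intervals are not handled.

```python
from typing import Iterable, Optional


def temporal_coverage(years: Iterable[int]) -> Optional[str]:
    """Format the span of observed years as an ISO 8601 interval string."""
    observed = list(years)
    if not observed:
        return None
    first, last = min(observed), max(observed)
    if first < 1 or last > 9999:
        raise ValueError("sketch only handles years 1 to 9999")
    # Zero-pad to four digits, e.g. year 812 becomes "0812".
    if first == last:
        return f"{first:04d}"
    return f"{first:04d}/{last:04d}"
```

For a series running from the year 812 to 2025, this yields the same `"0812/2025"` string used in the example output.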