Background
OWID ETL is gradually replacing the legacy meta.source block in snapshot
DVCs with the modern meta.origin block. The migration has been running
in waves driven by the migrate-source-to-origin skill (see
.claude/skills/migrate-source-to-origin/SKILL.md), with the skill itself
being tightened over time as we learned where it produces poor output.
Work in progress
Two open PRs cover the bulk of the remaining migrations:
Migrate Source → Origin for dag/migrated.yml snapshots #5978 (data-migrate-defra-air): covers the snapshots referenced
by dag/migrated.yml (the legacy-OWID-datasets bucket). Includes
re-migrations of files where earlier skill versions had paraphrased
descriptions, fabricated content for empty source descriptions, left
HTML in origin fields, or used Various sources where producers were
in fact named.
data-migrate-complex-sourceorigin: targets the trickier compilations
and multi-product database slices that the skill flagged as needing
closer attention.
Manual review is required
The migrations are AI-generated and the skill, while better than it
was, is not perfect. Before either PR is merged, every changed
DVC needs a human pass to check:
Producer: matches the actual data publisher (institution or
authors), not OWID's internal label, and not generic Various sources
when ≤3 producers are named in the legacy.
Title vs title_snapshot: STEP 1 of the skill ("does the data
product coincide with this snapshot?") is a judgment call; the agent
often picks coincide when the legacy implies a multi-product
database slice (and vice versa).
description is verbatim from the legacy, not paraphrased,
fabricated, or rewritten.
URLs preserved: legacy source.url prose blobs must be
extracted into url_main/url_download, not silently dropped or
left as raw HTML in description.
citation_full: producer's preferred citation, period at end,
no curly quotes, no &.
date_accessed: not the migration tool's "today's date"
default; it should match the snapshot's intent.
What's left after these PRs
Once #5978 and #6027 land, the remaining files with legacy meta.source
blocks (mostly active-but-untouched snapshots outside dag/migrated.yml)
will need their own pass, likely incremental and file by file, as
snapshots get touched for unrelated reasons.
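Some of the checks listed under "Manual review is required" are mechanical enough to pre-screen before the human pass. The following is a minimal sketch of such a pre-screen, not part of the skill or of ETL: lint_origin is a hypothetical helper, it assumes the origin block has already been loaded into a plain dict keyed by the field names used above, and the sample values are invented for illustration. It cannot judge whether the producer is the right publisher or whether title and title_snapshot were chosen correctly; those remain human calls. The description comparison is only a hint, since a legitimate migration may differ from the legacy text where URLs or HTML were extracted.

```python
import re
from datetime import date

CURLY_QUOTES = "\u2018\u2019\u201c\u201d"
HTML_TAG = re.compile(r"<[a-zA-Z/][^>]*>")


def lint_origin(origin, legacy_description=None, today=None):
    """Return warnings for one origin block (hypothetical helper, assumed dict input)."""
    today = today or date.today()
    warnings = []

    # Generic producer label: a reviewer should check the legacy for named producers.
    producer = origin.get("producer") or ""
    if producer.strip().lower() == "various sources":
        warnings.append("producer: generic 'Various sources'")

    # Raw HTML left in the description.
    description = origin.get("description") or ""
    if HTML_TAG.search(description):
        warnings.append("description: contains raw HTML")

    # Verbatim check, ignoring whitespace differences only.
    if legacy_description is not None:
        if " ".join(description.split()) != " ".join(legacy_description.split()):
            warnings.append("description: differs from the legacy text")

    # citation_full formatting: final period, no curly quotes, no ampersand.
    citation = origin.get("citation_full") or ""
    if citation and not citation.rstrip().endswith("."):
        warnings.append("citation_full: missing final period")
    if any(ch in citation for ch in CURLY_QUOTES):
        warnings.append("citation_full: contains curly quotes")
    if "&" in citation:
        warnings.append("citation_full: contains '&'")

    # A date_accessed equal to the run date may be the tool's default.
    if str(origin.get("date_accessed") or "") == today.isoformat():
        warnings.append("date_accessed: equals today's date")

    return warnings


# Invented sample values, chosen to trip several checks at once.
example = {
    "producer": "Various sources",
    "description": "Annual <b>emissions</b> data.",
    "citation_full": "Smith & Jones (2020) \u201cAir quality\u201d",
    "date_accessed": "2024-01-15",
}
for warning in lint_origin(
    example,
    legacy_description="Annual emissions data.",
    today=date(2024, 1, 15),
):
    print(warning)
```

An empty list means only that none of these mechanical checks fired; the file still needs the human pass described above.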