Finish source→origin migration for legacy snapshots #6045

@Marigold

Background

OWID ETL is gradually replacing the legacy meta.source block in snapshot
DVCs with the modern meta.origin block. The migration has been running
in waves driven by the migrate-source-to-origin skill (see
.claude/skills/migrate-source-to-origin/SKILL.md), with the skill itself
being tightened over time as we learned where it produces poor output.

Work in progress

Two open PRs cover the bulk of the remaining migrations:

  • Migrate Source → Origin for dag/migrated.yml snapshots #5978
    (data-migrate-defra-air): covers the snapshots referenced
    by dag/migrated.yml (the legacy-OWID-datasets bucket). Includes
    re-migrations of files where earlier skill versions had paraphrased
    descriptions, fabricated content for empty source descriptions, left
    HTML in origin fields, or used Various sources where producers were
    in fact named.
  • 📊 Migrate excess_mortality snapshots to meta.origin #6027
    (data-migrate-complex-sourceorigin): targets the
    trickier compilations and multi-product database slices that the
    skill flagged as needing closer attention.

Manual review is required

The migrations are AI-generated and the skill, while better than it
was, is not perfect. Before either PR is merged, every changed
DVC needs a human pass to check:

  • Producer: matches the actual data publisher (institution or
    authors), not OWID's internal label, and not the generic Various
    sources when ≤3 producers are named in the legacy source block.
  • Title vs title_snapshot: STEP 1 of the skill ("does the data
    product coincide with this snapshot?") is a judgment call; the agent
    often picks coincide when the legacy implies a multi-product
    database slice (and vice versa).
  • description is verbatim from the legacy, not paraphrased,
    fabricated, or rewritten.
  • URLs preserved: legacy source.url prose blobs must be
    extracted into url_main/url_download, not silently dropped or
    left as raw HTML in description.
  • citation_full: producer's preferred citation, period at end,
    no curly quotes, no &.
  • date_accessed: not the migration tool's "today's date"
    default — should match the snapshot's intent.
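
Some of the checks above are mechanical enough to pre-screen before the human pass. The following is a minimal sketch, not part of the skill or the ETL codebase: it assumes the meta.origin block has already been parsed into a plain dict, and the function name lint_origin and its parameters are hypothetical. It cannot judge whether the producer or title choice is right; it only flags patterns a reviewer should look at.

```python
import re
from datetime import date
from typing import List, Optional

# Rough HTML tag detector; good enough to flag leftovers such as <a href=...>.
HTML_TAG = re.compile(r"<[a-zA-Z/][^>]*>")
CURLY_QUOTES = "\u2018\u2019\u201c\u201d"


def lint_origin(
    origin: dict,
    legacy_description: str = "",
    today: Optional[date] = None,
) -> List[str]:
    """Return warnings for one parsed meta.origin mapping (hypothetical helper)."""
    today = today or date.today()
    warnings = []

    producer = str(origin.get("producer", "")).strip()
    if not producer:
        warnings.append("producer: missing")
    elif producer.lower() == "various sources":
        warnings.append("producer: generic 'Various sources', check the legacy block")

    description = str(origin.get("description", ""))
    if HTML_TAG.search(description):
        warnings.append("description: contains HTML")
    # Only compared when the caller supplies the legacy text.
    if legacy_description and description.strip() != legacy_description.strip():
        warnings.append("description: differs from the legacy text")

    if not (origin.get("url_main") or origin.get("url_download")):
        warnings.append("urls: neither url_main nor url_download is set")

    citation = str(origin.get("citation_full", "")).strip()
    if citation:
        if not citation.endswith("."):
            warnings.append("citation_full: no period at end")
        if any(q in citation for q in CURLY_QUOTES):
            warnings.append("citation_full: contains curly quotes")
        if "&" in citation:
            warnings.append("citation_full: contains '&'")

    # A date equal to the day the migration ran is suspicious, not proof of error.
    if str(origin.get("date_accessed", "")) == today.isoformat():
        warnings.append("date_accessed: equals today's date")

    return warnings
```

An empty list means none of these patterns were found, not that the file is correct; the Title vs title_snapshot decision in particular still needs a person.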

What's left after these PRs

Once #5978 and #6027 land, the remaining files with legacy
meta.source blocks (mostly active-but-untouched snapshots outside
dag/migrated.yml) will need their own pass — likely incremental,
file-by-file as snapshots get touched for unrelated reasons.
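
To size that remaining pass, one could list the DVC files that still carry a legacy block. The sketch below is an assumption-laden illustration, not an existing tool: it assumes each snapshot DVC has a top-level meta: mapping with source: as a direct child, uses a line-based scan instead of a YAML parser, and the function names and the root directory passed in are hypothetical.

```python
from pathlib import Path
from typing import List, Optional


def has_legacy_source(dvc_text: str) -> bool:
    """True if a top-level `meta:` mapping has a direct `source:` child."""
    in_meta = False
    child_indent: Optional[int] = None
    for line in dvc_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        indent = len(line) - len(line.lstrip(" "))
        if indent == 0:
            # A new top-level key starts; track whether it is `meta:`.
            in_meta = stripped.startswith("meta:")
            child_indent = None
            continue
        if in_meta:
            if child_indent is None:
                child_indent = indent  # indentation of meta's direct children
            if indent == child_indent and stripped.startswith("source:"):
                return True
    return False


def find_legacy_snapshots(root: Path) -> List[Path]:
    """List .dvc files under `root` that still contain a meta.source block."""
    return sorted(
        p for p in root.rglob("*.dvc")
        if has_legacy_source(p.read_text(encoding="utf-8"))
    )
```

Calling find_legacy_snapshots on the repository's snapshot directory would give a starting list; a nested source: key deeper inside meta.origin is deliberately not counted.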
