Finish active Source to Origin metadata migration #5979

@Marigold

Context

Source metadata is deprecated in favor of Origin. Origins are richer, live at indicator level, and are uploaded to Grapher via origins / origins_variables; legacy Sources use variables.sourceId and sources.

PR #5978 migrates all active snapshot DVC legacy source: / source_name: metadata referenced from dag/migrated.yml to origin: and simplifies those backported snapshot scripts so they no longer use SnapshotMeta, snap_config, or fill_from_backport_snapshot.

This issue tracks the remaining active-DAG Source → Origin migration work after #5978. Scope should stay on active DAG files first; repo-wide inactive snapshots are much larger and lower priority.

Current active-DAG status after #5978

  • Active legacy snapshot DVC files outside migrated snapshots: 73
  • Active metadata files with plural sources: keys: 35 files (72 keys)
  • Active step code files with Source / .sources-style handling: 39
  • dag/migrated.yml legacy snapshot DVC files remaining: 0
  • Repo-wide snapshot DVC files with legacy source: / source_name: (mostly inactive): 6,519

Remaining active legacy snapshot DVC files

health.yml (19)

  • snapshots/fasttrack/2023-04-30/paratz.csv.dvc
  • snapshots/fasttrack/2023-05-31/cholera.csv.dvc
  • snapshots/fasttrack/2024-06-17/guinea_worm.csv.dvc
  • snapshots/health/2023-04-18/wgm_mental_health.zip.dvc
  • snapshots/health/2023-04-25/wgm_2018.xlsx.dvc
  • snapshots/health/2023-05-04/global_wellbeing.xlsx.dvc
  • snapshots/health/2023-08-22/unaids_deaths_averted_art.xlsx.dvc
  • snapshots/oecd/2018-03-11/road_deaths_and_injuries.feather.dvc
  • snapshots/oecd/2023-05-01/health_pharma_market.csv.dvc
  • snapshots/postnatal_care/2022-09-19/postnatal_care.csv.dvc
  • snapshots/unicef/2023-06-16/diarrhea.xlsx.dvc
  • snapshots/who/2022-09-01/autopsy.csv.dvc
  • snapshots/who/2023-04-03/flu_elderly.xlsx.dvc
  • snapshots/who/2023-04-03/flu_vaccine_policy.xlsx.dvc
  • snapshots/who/2023-06-29/guinea_worm.csv.dvc
  • snapshots/who/2023-07-14/standard_age_distribution.csv.dvc
  • snapshots/who/2025-08-01/guinea_worm.csv.dvc
  • snapshots/who/latest/fluid.csv.dvc
  • snapshots/who/latest/flunet.csv.dvc

environment.yml (10)

  • snapshots/unep/2023-03-17/consumption_controlled_substances.bromochloromethane.xlsx.dvc
  • snapshots/unep/2023-03-17/consumption_controlled_substances.carbon_tetrachloride.xlsx.dvc
  • snapshots/unep/2023-03-17/consumption_controlled_substances.chlorofluorocarbons.xlsx.dvc
  • snapshots/unep/2023-03-17/consumption_controlled_substances.halons.xlsx.dvc
  • snapshots/unep/2023-03-17/consumption_controlled_substances.hydrobromofluorocarbons.xlsx.dvc
  • snapshots/unep/2023-03-17/consumption_controlled_substances.hydrochlorofluorocarbons.xlsx.dvc
  • snapshots/unep/2023-03-17/consumption_controlled_substances.hydrofluorocarbons.xlsx.dvc
  • snapshots/unep/2023-03-17/consumption_controlled_substances.methyl_bromide.xlsx.dvc
  • snapshots/unep/2023-03-17/consumption_controlled_substances.methyl_chloroform.xlsx.dvc
  • snapshots/unep/2023-03-17/consumption_controlled_substances.other_fully_halogenated.xlsx.dvc

main.yml (10)

  • snapshots/fasttrack/2023-01-03/long_term_homicide_rates_in_europe.csv.dvc
  • snapshots/papers/2023-06-07/commodity_prices.xlsx.dvc
  • snapshots/research_development/2023-05-24/us_patents.htm.dvc
  • snapshots/technology/2023-03-08/microprocessor_trend.dat.dvc
  • snapshots/technology/2023-03-16/hcctad.txt.dvc
  • snapshots/un/2023-08-16/un_sdg.feather.dvc
  • snapshots/un/2023-08-16/un_sdg_dimension.json.dvc
  • snapshots/un/2023-08-16/un_sdg_unit.csv.dvc
  • snapshots/wb/2021-07-01/wb_income.xlsx.dvc
  • snapshots/wvs/2023-06-25/longitudinal_wvs.csv.dvc

fasttrack.yml (9)

  • snapshots/fasttrack/2023-06-19/world_population_comparison.csv.dvc
  • snapshots/fasttrack/2023-08-07/pain_hours_days_hen_systems.csv.dvc
  • snapshots/fasttrack/2023-10-05/great_pacific_garbage_lebreton.csv.dvc
  • snapshots/fasttrack/latest/baxter_2013_gbd_adult_coverage.csv.dvc
  • snapshots/fasttrack/latest/democracy_freedom_house.csv.dvc
  • snapshots/fasttrack/latest/global_maternal_offspring_loss.csv.dvc
  • snapshots/fasttrack/latest/treatment_gap_anxiety_disorders_world_mental_health_surveys.csv.dvc
  • snapshots/fasttrack/latest/under_five_mortality_lmics.csv.dvc
  • snapshots/fasttrack/latest/whm_treatment_gap_anxiety_disorders.csv.dvc

war.yml (7)

  • snapshots/war/2023-01-09/bouthoul_carrere_1978.csv.dvc
  • snapshots/war/2023-01-09/clodfelter_2017.csv.dvc
  • snapshots/war/2023-01-09/dunnigan_martel_1987.csv.dvc
  • snapshots/war/2023-01-09/eckhardt_1991.csv.dvc
  • snapshots/war/2023-01-09/kaye_1985.csv.dvc
  • snapshots/war/2023-01-09/sorokin_1937.csv.dvc
  • snapshots/war/2023-01-09/sutton_1971.csv.dvc

education.yml (5)

  • snapshots/education/2023-08-09/numeracy.xlsx.dvc
  • snapshots/education/2023-08-09/numeracy_gender.xlsx.dvc
  • snapshots/education/2023-08-09/years_of_education.xlsx.dvc
  • snapshots/education/2023-08-09/years_of_education_gender.xlsx.dvc
  • snapshots/education/2023-08-09/years_of_education_gini.xlsx.dvc

covid.yml (4)

  • snapshots/excess_mortality/latest/hmd_stmf.csv.dvc
  • snapshots/excess_mortality/latest/wmd.csv.dvc
  • snapshots/excess_mortality/latest/xm_karlinsky_kobak.csv.dvc
  • snapshots/excess_mortality/latest/xm_karlinsky_kobak_ages.csv.dvc

demography.yml (4)

  • snapshots/fasttrack/2023-06-19/world_population_comparison.csv.dvc
  • snapshots/hyde/2017/baseline.zip.dvc
  • snapshots/hyde/2017/general_files.zip.dvc
  • snapshots/un/2022-07-11/un_wpp.zip.dvc

artificial_intelligence.yml (2)

  • snapshots/artificial_intelligence/2023-07-07/semiconductors_cset.csv.dvc
  • snapshots/world_risk_poll/2023-06-26/wrp_2021.zip.dvc

emissions.yml (2)

  • snapshots/andrew/2019-12-03/co2_mitigation_curves_1p5celsius.csv.dvc
  • snapshots/andrew/2019-12-03/co2_mitigation_curves_2celsius.csv.dvc

agriculture.yml (1)

  • snapshots/usda_ers/2023-06-07/food_expenditure_since_2017.xlsx.dvc

biodiversity.yml (1)

  • snapshots/biodiversity/2021-01-01/habitat_loss.feather.dvc

Remaining active metadata files with sources:

health.yml (8)

  • etl/steps/data/garden/postnatal_care/2022-09-19/postnatal_care.meta.yml (1 sources: keys)
  • etl/steps/data/garden/who/2023-06-01/cholera.meta.yml (1 sources: keys)
  • etl/steps/data/garden/who/2023-06-29/guinea_worm_certification.meta.yml (1 sources: keys)
  • etl/steps/data/garden/who/2023-07-13/autopsy.meta.yml (1 sources: keys)
  • etl/steps/data/grapher/postnatal_care/2022-09-19/postnatal_care.meta.yml (1 sources: keys)
  • etl/steps/data/grapher/who/2023-07-13/autopsy.meta.yml (1 sources: keys)
  • etl/steps/data/meadow/postnatal_care/2022-09-19/postnatal_care.meta.yml (1 sources: keys)
  • etl/steps/data/meadow/unicef/2023-06-16/diarrhea.meta.yml (1 sources: keys)

migrated.yml (7)

  • etl/steps/data/garden/clio_infra/2017-09-09/clio_infra__biological_standards_of_living__baten_and_blum__2015.meta.yml (2 sources: keys)
  • etl/steps/data/garden/clio_infra/2017-09-09/clio_infra__human_capital.meta.yml (6 sources: keys)
  • etl/steps/data/garden/waste/2018-02-15/waste_production_and_management.meta.yml (6 sources: keys)
  • etl/steps/data/garden/worldbank_wdi/2017-11-14/world_bank_se4all_database__energy_efficiency.meta.yml (16 sources: keys)
  • etl/steps/data/grapher/biodiversity/2022/living_planet_index.meta.yml (2 sources: keys)
  • etl/steps/data/grapher/gapminder/2019-05-25/fertility_rate.meta.yml (1 sources: keys)
  • etl/steps/data/grapher/iucn/2022-12-08/threatened_and_evaluated_species.meta.yml (2 sources: keys)

war.yml (7)

  • etl/steps/data/garden/war/2023-01-18/bouthoul_carrere_1978.meta.yml (1 sources: keys)
  • etl/steps/data/garden/war/2023-01-18/clodfelter_2017.meta.yml (1 sources: keys)
  • etl/steps/data/garden/war/2023-01-18/dunnigan_martel_1987.meta.yml (1 sources: keys)
  • etl/steps/data/garden/war/2023-01-18/eckhardt_1991.meta.yml (1 sources: keys)
  • etl/steps/data/garden/war/2023-01-18/kaye_1985.meta.yml (1 sources: keys)
  • etl/steps/data/garden/war/2023-01-18/sorokin_1937.meta.yml (1 sources: keys)
  • etl/steps/data/garden/war/2023-01-18/sutton_1971.meta.yml (1 sources: keys)

main.yml (6)

  • etl/steps/data/garden/gapminder/2023-03-31/population.meta.yml (1 sources: keys)
  • etl/steps/data/garden/homicide/2024-07-30/homicide_long_run_omm.meta.yml (1 sources: keys)
  • etl/steps/data/garden/un/2024-09-11/igme.meta.yml (1 sources: keys)
  • etl/steps/data/garden/wvs/2023-06-25/longitudinal_wvs.meta.yml (1 sources: keys)
  • etl/steps/data/grapher/un/2023-08-16/un_sdg.meta.yml (2 sources: keys)
  • etl/steps/data/grapher/worldbank_wdi/2024-05-20/wdi.meta.yml (9 sources: keys)

artificial_intelligence.yml (2)

  • etl/steps/data/garden/artificial_intelligence/2023-06-26/ai_wrp_2021.meta.yml (1 sources: keys)
  • etl/steps/data/garden/artificial_intelligence/2023-06-26/ai_wrp_2021_grouped.meta.yml (1 sources: keys)

demography.yml (2)

  • etl/steps/data/garden/demography/2023-06-27/world_population_comparison.meta.yml (1 sources: keys)
  • etl/steps/data/garden/gapminder/2023-03-31/population.meta.yml (1 sources: keys)

education.yml (1)

  • etl/steps/data/garden/education/2023-08-09/clio_infra_education.meta.yml (1 sources: keys)

emissions.yml (1)

  • etl/steps/data/garden/andrew/2019-12-03/co2_mitigation_curves.meta.yml (1 sources: keys)

environment.yml (1)

  • etl/steps/data/garden/unep/2023-03-17/consumption_controlled_substances.meta.yml (1 sources: keys)

wizard.yml (1)

  • etl/steps/data/garden/dummy/2023-10-12/dummy_monster.meta.yml (1 sources: keys)

Active code files that still touch Source/sources

These need separate review because some are compatibility paths/backport code rather than live dataset metadata.

war.yml (12)

  • etl/steps/data/garden/war/2023-09-21/cow.py
  • etl/steps/data/grapher/war/2023-09-21/brecke.py
  • etl/steps/data/grapher/war/2023-09-21/cow.py
  • etl/steps/data/grapher/war/2023-09-21/cow_mid.py
  • etl/steps/data/grapher/war/2023-09-21/mars.py
  • etl/steps/data/grapher/war/2023-09-21/mie.py
  • etl/steps/data/grapher/war/2023-09-21/prio_v31.py
  • etl/steps/data/grapher/war/2025-06-13/ucdp.py
  • etl/steps/data/grapher/war/latest/ucdp_preview.py
  • etl/steps/data/meadow/war/2023-01-10/kaye_1985.py
  • etl/steps/data/meadow/war/2023-01-10/sorokin_1937.py
  • etl/steps/data/meadow/war/2023-01-10/sutton_1971.py

health.yml (8)

  • etl/steps/data/garden/health/2023-04-18/wgm_mental_health.py
  • etl/steps/data/garden/maternal_mortality/2024-07-08/maternal_mortality.py
  • etl/steps/data/garden/oecd/2023-05-01/health_pharma_market.py
  • etl/steps/data/garden/owid/latest/covid.py
  • etl/steps/data/meadow/gapminder/2024-07-08/maternal_mortality.py
  • etl/steps/data/meadow/health/2026-01-19/unaids.py
  • etl/steps/data/meadow/who/2024-01-03/gho.py
  • etl/steps/data/meadow/who/latest/fluid.py

poverty_inequality.yml (3)

  • etl/steps/data/external/owid_grapher/latest/int_dollar_conversions.py
  • etl/steps/data/meadow/cedlas/2025-04-01/sedlac.py
  • etl/steps/data/meadow/igh/2024-07-05/better_data_homelessness.py

chartbook.yml (2)

  • etl/steps/data/meadow/cedlas/2024-07-31/sedlac_poverty_2016.py
  • etl/steps/data/meadow/cedlas/2024-07-31/sedlac_poverty_2018.py

demography.yml (2)

  • etl/steps/data/grapher/un/2022-07-11/un_wpp.py
  • etl/steps/data/open_numbers/open_numbers/latest/gapminder__systema_globalis.py

faostat.yml (2)

  • etl/steps/data/meadow/faostat/2025-03-17/faostat_metadata.py
  • etl/steps/data/meadow/faostat/2026-02-25/faostat_metadata.py

open_numbers.yml (2)

  • etl/steps/data/open_numbers/open_numbers/latest/gapminder__systema_globalis.py
  • etl/steps/data/open_numbers/open_numbers/latest/open_numbers__world_development_indicators.py

agriculture.yml (1)

  • etl/steps/data/meadow/agriculture/2024-05-23/harris_et_al_2015.py

biodiversity.yml (1)

  • etl/steps/data/meadow/biodiversity/2026-04-16/cherry_blossom.py

covid.yml (1)

  • etl/steps/data/garden/excess_mortality/latest/excess_mortality/__init__.py

emissions.yml (1)

  • etl/steps/data/meadow/emissions/2025-11-26/electricity_emission_factors.py

energy.yml (1)

  • etl/steps/data/meadow/uk_beis/2023-12-12/uk_historical_electricity.py

equality.yml (1)

  • etl/steps/data/garden/wb/2025-09-08/gender_statistics.py

growth.yml (1)

  • etl/steps/data/garden/maternal_mortality/2024-07-08/maternal_mortality.py

main.yml (1)

  • etl/steps/data/grapher/un/2023-08-16/un_sdg.py

migration.yml (1)

  • etl/steps/data/meadow/unicef/2026-01-07/child_migration.py

minerals.yml (1)

  • etl/steps/data/garden/usgs/2025-12-15/mineral_commodity_summaries.py

Suggested next steps

  1. Finish dag/migrated.yml metadata YAML: convert the 7 remaining .meta.yml files with sources:. These are not snapshot DVCs; examples include Clio Infra, Waste, World Bank SE4ALL, and a few grapher files with empty sources: [] next to existing origins:.
  2. Convert the 73 remaining active snapshot DVC files outside dag/migrated.yml using the pattern from #5978 (Migrate Source → Origin for dag/migrated.yml snapshots). Use the source metadata once as migration input, then hardcode origin: in the DVC file and simplify scripts where they use backported config metadata.
  3. Convert active .meta.yml variable-level sources: to variable-level origins:. For empty sources: [] with existing origins:, just remove the empty sources:.
  4. Review active code files that still touch Source / .sources and separate true live metadata usage from compatibility/backport helpers.
  5. Add a guardrail to prevent new legacy source: / sources: in active DAG files once the migration is complete.
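
A minimal sketch of the guardrail from step 5, assuming it runs as a plain check over file contents (for example from CI or a pre-commit hook). The function name find_legacy_keys is hypothetical, and the two patterns are copied from the rg searches in this issue rather than from any existing ETL helper:

```python
import re

# Assumed patterns, mirroring the rg searches listed in this issue.
LEGACY_DVC = re.compile(r"^\s+(source|source_name)\s*:")
LEGACY_META = re.compile(r"^\s+sources\s*:")


def find_legacy_keys(path: str, text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for legacy Source keys in one file.

    Hypothetical helper: snapshot .dvc files are checked for `source:` /
    `source_name:`, every other file for a plural `sources:` key.
    """
    pattern = LEGACY_DVC if path.endswith(".dvc") else LEGACY_META
    return [
        (number, line.rstrip())
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.match(line)
    ]
```

A caller would read each active DAG file, pass its path and text to find_legacy_keys, and fail the check if any list comes back non-empty. Note that an empty sources: [] is also reported, which matches step 3 (remove it when origins: already exists).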

Notes from #5978

  • Do not infer date_published as simply the largest year in the metadata. Codex caught this: data/projection ranges like 1970–2050 or 2030-50 can otherwise become publication dates.
  • Safer inference used in #5978: prefer explicit publication_date / publication_year; otherwise look at the citation-like source.published_by and source.name, skipping obvious range/projection years; use the snapshot version only as a fallback, and flag that value clearly for review.
  • Useful Source → Origin mapping:
    • source.name → origin.title
    • source.published_by → origin.producer
    • source.description → origin.description
    • source.url → origin.url_main
    • source.source_data_url → origin.url_download
    • source.date_accessed → origin.date_accessed
    • source.publication_date / publication_year → origin.date_published
  • Schema requires date_published and citation_full for snapshot origins.
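
The mapping and the date inference described in these notes can be sketched as below. This is an illustration on plain dicts, not the code from #5978: the names FIELD_MAP, infer_date_published and source_to_origin are hypothetical, the year regexes are a crude assumption about what an "obvious range" looks like, and picking the last remaining year in citation-like text is a heuristic of this sketch.

```python
import re
from typing import Optional

# A four-digit year (1800-2099) that is not part of a longer number.
YEAR = re.compile(r"(?<!\d)(1[89]\d{2}|20\d{2})(?!\d)")
# Ranges such as "1970–2050", "1970-2050" or "2030-50" (assumed notion of a range).
RANGE = re.compile(r"(?<!\d)(1[89]\d{2}|20\d{2})\s*[-–]\s*\d{2,4}(?!\d)")

# Source field -> Origin field, as listed in the notes above.
FIELD_MAP = {
    "name": "title",
    "published_by": "producer",
    "description": "description",
    "url": "url_main",
    "source_data_url": "url_download",
    "date_accessed": "date_accessed",
}


def infer_date_published(source: dict, snapshot_version: Optional[str] = None) -> Optional[str]:
    """Prefer explicit fields, then citation-like text, then the snapshot version."""
    for key in ("publication_date", "publication_year"):
        if source.get(key):
            return str(source[key])
    for key in ("published_by", "name"):
        # Drop year ranges first so data/projection spans are never picked up.
        text = RANGE.sub(" ", str(source.get(key) or ""))
        years = YEAR.findall(text)
        if years:
            return years[-1]
    if snapshot_version:
        match = YEAR.match(snapshot_version)
        if match:
            return match.group(1)  # fallback only; should be reviewed by hand
    return None


def source_to_origin(source: dict, snapshot_version: Optional[str] = None) -> dict:
    """Build an origin dict from a legacy source dict (citation_full is not filled)."""
    origin = {new: source[old] for old, new in FIELD_MAP.items() if source.get(old)}
    date_published = infer_date_published(source, snapshot_version)
    if date_published:
        origin["date_published"] = date_published
    return origin
```

The sketch deliberately leaves citation_full unset, since the schema requires it but nothing in the legacy Source maps onto it directly; it would still need to be written by hand.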

Regenerate this audit

Use a small script over active dag/*.yml files to collect active DAG references. For broad repo searches:

rg -n '^\s+(source|source_name)\s*:' snapshots --glob '*.dvc'
rg -n '^\s+sources\s*:' etl/steps snapshots dag --glob '*.yml' --glob '*.dvc' --glob '!dag/archive/**'

For dag/migrated.yml specifically, #5978 adds scripts/migrate_migrated_sources_to_origins.py; after #5978 it should print only the TSV header, confirming no migrated snapshot DVC legacy sources remain.
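
The "small script over active dag/*.yml files" could start from something like the sketch below. It assumes that DAG files reference snapshots as snapshot://<namespace>/<version>/<file> URIs and that each URI corresponds to snapshots/<namespace>/<version>/<file>.dvc, which matches the paths listed in this issue; the function name snapshot_dvc_paths is hypothetical. It works on text with a regex instead of parsing YAML, so it ignores include: handling and private snapshots.

```python
import re

# Assumed URI form: snapshot://<namespace>/<version>/<file>
SNAPSHOT_URI = re.compile(r"snapshot://([^\s\"']+)")


def snapshot_dvc_paths(dag_text: str) -> set[str]:
    """Collect the snapshot DVC paths referenced by one DAG file's text."""
    paths = set()
    for line in dag_text.splitlines():
        code = line.split("#", 1)[0]  # ignore commented-out dependencies
        for match in SNAPSHOT_URI.finditer(code):
            uri_path = match.group(1).rstrip(":")
            paths.add(f"snapshots/{uri_path}.dvc")
    return paths
```

Taking the union of the results over every active dag/*.yml file (excluding dag/archive/) gives the set of active snapshot DVC files, which can then be filtered down to those that still contain source: or source_name:.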
