Timeline TSV Schema Inconsistency Between COVID and RSV Pipelines #51

@gordonkoehn

Description

Problem: The COVID (SARS-CoV-2) and RSV automation pipelines use different column names and location formatting in their timeline.tsv files, causing downstream integration issues and unnecessary maintenance burden.

Note: Influenza does not have a timeline.tsv because it does not run the Variant Abundance calculation with Lollipop, so this output is not generated.

Column Name Differences

| Field             | COVID Pipeline | RSV Pipeline   |
|-------------------|----------------|----------------|
| Sample identifier | sample         | submissionId   |
| Primer protocol   | proto          | primerProtocol |
| Reference genome  | (missing)      | reference      |
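
As a rough illustration of the differences listed above, a consumer could translate one header into the other with a small lookup. This is only a sketch: the direction of the mapping (RSV names to COVID names) and the names `RSV_TO_COVID` and `rename_header` are assumptions for the example, not a decision about which schema is canonical.

```python
# Hypothetical mapping from RSV column names to their COVID equivalents.
# Columns present in only one pipeline (e.g. "reference") pass through unchanged.
RSV_TO_COVID = {
    "submissionId": "sample",
    "primerProtocol": "proto",
}


def rename_header(header, mapping=RSV_TO_COVID):
    """Return a new header list with mapped names replaced."""
    return [mapping.get(name, name) for name in header]


rsv_header = (
    "submissionId\tbatch\treads\treference\tprimerProtocol"
    "\tlocation_code\tdate\tlocation"
).split("\t")

print(rename_header(rsv_header))
```

The `reference` column has no COVID counterpart, so a mapping like this papers over the naming difference but does not make the two schemas identical.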

Example Files

COVID: /cluster/project/pangolin/processes/sars_cov_2/lollipop/variants/timeline.tsv

sample	batch	reads	proto	location_code	date	location
A1_05_2025_11_05	20251128_2511665243	250	v532_pooled	5	2025-11-05	Lugano (TI)
A2_15_2025_11_06	20251128_2511665243	250	v532_pooled	15	2025-11-06	Basel (BS)

RSV: /cluster/project/pangolin/processes/rsv/RSVA/working/timeline.tsv

submissionId	batch	reads	reference	primerProtocol	location_code	date	location
A1_05_2025_11_05	20251128_2511665243	250	v532_pooled	Eawag-2024-v532_pooled	05	2025-11-05	Lugano
A2_15_2025_11_06	20251128_2511665243	250	v532_pooled	Eawag-2024-v532_pooled	15	2025-11-06	Basel
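
The two excerpts above can be compared programmatically. The sketch below parses the first data row of each with the standard `csv` module and reports which columns exist in only one file. The sample rows also appear to differ in `location_code` padding (`5` vs `05`), which the last lines check; whether that holds for the full files is an assumption based on these two rows only. The helper name `read_rows` is hypothetical.

```python
import csv
import io

# First data row of each excerpt, copied from the examples above.
COVID_TSV = (
    "sample\tbatch\treads\tproto\tlocation_code\tdate\tlocation\n"
    "A1_05_2025_11_05\t20251128_2511665243\t250\tv532_pooled\t5\t2025-11-05\tLugano (TI)\n"
)
RSV_TSV = (
    "submissionId\tbatch\treads\treference\tprimerProtocol\tlocation_code\tdate\tlocation\n"
    "A1_05_2025_11_05\t20251128_2511665243\t250\tv532_pooled\tEawag-2024-v532_pooled\t05\t2025-11-05\tLugano\n"
)


def read_rows(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


covid_rows = read_rows(COVID_TSV)
rsv_rows = read_rows(RSV_TSV)

covid_cols = set(covid_rows[0])
rsv_cols = set(rsv_rows[0])
only_covid = sorted(covid_cols - rsv_cols)
only_rsv = sorted(rsv_cols - covid_cols)
print("only in COVID:", only_covid)
print("only in RSV:", only_rsv)

# The raw strings differ ("5" vs "05") even though the numeric code is the same.
same_as_text = covid_rows[0]["location_code"] == rsv_rows[0]["location_code"]
same_as_int = int(covid_rows[0]["location_code"]) == int(rsv_rows[0]["location_code"])
print(same_as_text, same_as_int)
```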

Location Formatting Differences

COVID: Includes canton codes with UTF-8 preservation

  • Lugano (TI)
  • Basel (BS)
  • Zürich (ZH) (preserves umlaut)

RSV: City names only, strips special characters

  • Lugano
  • Basel
  • Zurich (umlaut removed: ü → u)
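
Until the formats are aligned, a consumer that needs to match locations across both files could derive a comparison key from either form. The following is a minimal sketch, assuming the canton suffix is always a two-letter code in parentheses at the end of the name and that dropping diacritics is acceptable for matching only (not for display). The names `location_key` and `_CANTON_SUFFIX` are hypothetical.

```python
import re
import unicodedata

# Assumption: the canton code is two uppercase letters in parentheses at the end.
_CANTON_SUFFIX = re.compile(r"\s*\([A-Z]{2}\)$")


def location_key(name):
    """Build a case- and accent-insensitive key for matching location names."""
    base = _CANTON_SUFFIX.sub("", name.strip())
    # NFKD splits "ü" into "u" plus a combining diaeresis, which is then dropped.
    decomposed = unicodedata.normalize("NFKD", base)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()


print(location_key("Zürich (ZH)"), location_key("Zurich"))
```

A key like this is a workaround for joins; it discards the canton information, so two places with the same city name in different cantons would collide.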

Impact

  1. Data integration: Tools consuming both pipelines must handle two different schemas
  2. Location matching: The Zürich → Zurich transformation breaks joins on location names
  3. Maintenance burden: Schema changes require updates across multiple codebases
  4. Error-prone: Easy to accidentally use wrong column names across pipelines

Recommendation

Align schemas between pipelines. Choose one format as canonical and update the other to match.
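
Once a canonical format is chosen, a small header check could catch drift in either pipeline. The sketch below is illustrative only: `CANONICAL_COLUMNS` uses the COVID header as a placeholder, which is an assumption and not a decision made in this issue, and `check_header` is a hypothetical helper.

```python
# Placeholder canonical schema (assumption: the COVID header, for illustration).
CANONICAL_COLUMNS = (
    "sample", "batch", "reads", "proto", "location_code", "date", "location",
)


def check_header(header_line, expected=CANONICAL_COLUMNS):
    """Return (missing, unexpected) column names for a TSV header line."""
    actual = header_line.rstrip("\n").split("\t")
    missing = [col for col in expected if col not in actual]
    unexpected = [col for col in actual if col not in expected]
    return missing, unexpected


covid_header = "sample\tbatch\treads\tproto\tlocation_code\tdate\tlocation\n"
rsv_header = (
    "submissionId\tbatch\treads\treference\tprimerProtocol"
    "\tlocation_code\tdate\tlocation\n"
)

print(check_header(covid_header))
print(check_header(rsv_header))
```

A check of this kind only compares column names; it would not detect value-level differences such as the location formatting described above.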
