Coordinated ingestion and internal consistency across related data #1049

@d-callan

Description

BRC Analytics currently ingests related data through multiple scripts and upstream sources (e.g. organism and assembly metadata from NCBI; SRA-derived summaries from ENA), and these scripts may run at different times.

As ingestion expands (e.g. adding SRA metadata), this raises concerns about internal consistency: related entities may reflect different upstream states even when sourced from the same resource.

Core concerns

  • Organisms/assemblies and derived summaries (e.g. SRA run counts) are generated by different scripts.
  • Different scripts can be run at different times, or against different resources (ENA and NCBI take time to sync).
  • It is not yet clear how ingesting SRA metadata will work, but it does seem clear that the ingested data will relate to data produced by these existing scripts.
  • Relationships between all of these entities are implicit and at risk of internal inconsistency
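
To make the concern concrete, here is a minimal sketch of what an explicit consistency check between two ingested datasets could look like. Everything in it is an assumption for illustration: `IngestManifest`, `check_consistency`, the 24-hour tolerance, and the placeholder identifiers are hypothetical and do not reflect how the existing scripts work. The idea is only that each script records when and from where it pulled its data, so that skew and dangling references can be detected rather than left implicit.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class IngestManifest:
    """Hypothetical record that each ingestion script would write alongside its output."""
    dataset: str             # e.g. "assemblies" or "sra_run_counts"
    source: str              # e.g. "NCBI" or "ENA"
    retrieved_at: datetime   # when the upstream resource was queried
    keys: frozenset          # entity identifiers covered by this dataset


def check_consistency(primary, derived, max_skew=timedelta(hours=24)):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []

    # Flag datasets that were pulled too far apart in time (tolerance is an assumption).
    skew = abs(primary.retrieved_at - derived.retrieved_at)
    if skew > max_skew:
        problems.append(
            f"{derived.dataset} and {primary.dataset} were retrieved {skew} apart"
        )

    # Flag derived records that point at entities missing from the primary dataset.
    orphans = derived.keys - primary.keys
    if orphans:
        problems.append(
            f"{derived.dataset} references unknown entities: {sorted(orphans)}"
        )
    return problems


# Placeholder identifiers, not real accessions.
assemblies = IngestManifest(
    dataset="assemblies",
    source="NCBI",
    retrieved_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
    keys=frozenset({"ASM_A", "ASM_B"}),
)
run_counts = IngestManifest(
    dataset="sra_run_counts",
    source="ENA",
    retrieved_at=datetime(2024, 1, 3, tzinfo=timezone.utc),
    keys=frozenset({"ASM_A", "ASM_C"}),
)

problems = check_consistency(assemblies, run_counts)
for problem in problems:
    print(problem)
```

A check like this could run at the end of a coordinated ingestion step and fail the build, or simply log a warning, depending on how strict we want to be.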

Impact/Urgency

This is another one that isn't really 'critical' yet, but it is probably worth some thought before we make it any worse. It seems to me we're at an inflection point with the SRA metadata discussions. In the immediate term (pre-SRA-metadata), I think our worst case is the stepper UI claiming the wrong number of sequences to browse from ENA.
