External identifier drift, data provenance, and API snapshot clarity #1047

@d-callan

Description

We are increasingly relying on and exposing NCBI-derived data via JSON and a REST API, while tightly coupling records to external identifiers that can change over time.

We have already seen concrete examples of this causing confusion and maintenance burden (e.g. NCBI taxon ID and naming changes, superseded assemblies; see #499). As we ingest more data and expose it programmatically, this problem will grow.

Core problem

  • External identifiers are not stable
    • NCBI taxon IDs, assembly versions, and SRA metadata can change upstream.
  • Our data is effectively a rolling snapshot
    • Each ingest overwrites prior JSON outputs.
    • API responses do not clearly indicate which upstream NCBI state they reflect.
  • Identifiers are exposed as if they were truly stable
    • Users may assume reproducibility or one-to-one correspondence with NCBI.
  • Caching amplifies the issue
    • API responses are cached (e.g. via Redis) using external identifiers as keys.
    • Upstream changes + cached responses can lead to stale or inconsistent results.
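To make the caching point concrete, here is a minimal sketch of one possible mitigation: scoping cache keys to the ingest snapshot they came from, so a new ingest can never be answered from entries cached under an older one. Everything here is hypothetical and for illustration only (the `Snapshot` class, the `cache_key` helper, the key layout, and the example dates); it is not how our cache is currently keyed.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record of one ingest of upstream NCBI data."""
    snapshot_id: str    # label for the ingest; this format is an assumption
    retrieved_on: date  # date the upstream data was fetched


def cache_key(snapshot: Snapshot, kind: str, external_id: str) -> str:
    """Build a cache key that is scoped to a single snapshot."""
    return f"{snapshot.snapshot_id}:{kind}:{external_id}"


# Two made-up ingests, purely for illustration.
old = Snapshot("2024-01-15", date(2024, 1, 15))
new = Snapshot("2024-06-01", date(2024, 6, 1))

# The same taxon ID yields different keys under different snapshots,
# so entries cached before an ingest are not reused after it.
key_old = cache_key(old, "taxon", "9606")
key_new = cache_key(new, "taxon", "9606")
```

The trade-off is that every ingest starts with a cold cache, which may or may not be acceptable depending on how often we ingest.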

Impact / Urgency

Do I think this is 'critical' yet? Not quite, but I do think we should get ahead of it before we make it worse. Right now I'm pretty sure our worst case is something like this: an NCBI taxon ID is cached in Redis as one we support, and then a change at NCBI means we have swapped it out for a different taxon ID and no longer really support the original. Ideally, what I think I'm looking for is explicit, dated data snapshots that we can be transparent about with users. Possibly for now we just disable caching until we have a chance to discuss?
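As a rough sketch of what "explicit data snapshots with dates" could look like in an API response, the payload could be wrapped in an envelope that states which upstream state it reflects. The function name `with_provenance`, the field names, and the example values are all assumptions for discussion, not an existing part of our API.

```python
import json
from datetime import date


def with_provenance(payload: dict, snapshot_id: str, retrieved_on: date) -> dict:
    """Wrap a payload with hypothetical snapshot metadata."""
    return {
        "data": payload,
        "provenance": {
            "source": "NCBI",
            "snapshot_id": snapshot_id,                # assumed ingest label
            "retrieved_on": retrieved_on.isoformat(),  # date of the upstream fetch
        },
    }


# Made-up example values, for illustration only.
response = with_provenance({"taxon_id": 9606}, "2024-06-01", date(2024, 6, 1))
body = json.dumps(response, sort_keys=True)
```

With something like this, users could at least see that a response reflects NCBI as of a given date instead of assuming it matches NCBI today.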
