External identifier drift, data provenance, and API snapshot clarity #1047

## Description

We increasingly rely on and expose NCBI-derived data via JSON and a REST API, while tightly coupling our records to external identifiers that change over time.

We have already seen concrete examples of this causing confusion and maintenance burden (e.g. NCBI taxon ID and naming changes, superseded assemblies; see #499). As we ingest more data and expose it programmatically, this problem will grow.

## Core problem

- External identifiers are not stable
    - NCBI taxon IDs, assembly versions, and SRA metadata can change upstream.
- Our data is effectively a rolling snapshot
    - Each ingest overwrites prior JSON outputs.
    - API responses do not clearly indicate which upstream NCBI state they reflect.
- Identifiers are exposed as if they were truly stable
    - Users may assume reproducibility or one-to-one correspondence with NCBI.
- Caching amplifies the issue
    - API responses are cached (e.g. via Redis) using external identifiers as keys.
    - Upstream changes combined with cached responses can lead to stale or inconsistent results.
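
One way to keep cached responses from outliving the data they were built from is to include a snapshot identifier in every cache key. The sketch below is illustrative only: `SnapshotCache`, `cache_key`, the `snap:` prefix, and the date-style snapshot labels are hypothetical names, and a plain dict stands in for Redis so the example runs on its own.

```python
from typing import Any, Optional


def cache_key(snapshot_id: str, resource: str, external_id: str) -> str:
    """Build a cache key scoped to one ingest snapshot (hypothetical scheme)."""
    return f"snap:{snapshot_id}:{resource}:{external_id}"


class SnapshotCache:
    """Minimal cache wrapper; a dict stands in for the real cache backend."""

    def __init__(self, store: Optional[dict] = None) -> None:
        self._store: dict = {} if store is None else store

    def set(self, snapshot_id: str, resource: str, external_id: str, value: Any) -> None:
        # The entry is only visible to lookups that name the same snapshot.
        self._store[cache_key(snapshot_id, resource, external_id)] = value

    def get(self, snapshot_id: str, resource: str, external_id: str) -> Optional[Any]:
        # A lookup under a newer snapshot misses and falls through to fresh data.
        return self._store.get(cache_key(snapshot_id, resource, external_id))
```

With keys built this way, a new ingest would not need an explicit cache flush: entries written under the previous snapshot simply stop matching, at the cost of keeping old entries around until they expire.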

## Impact / Urgency

Do I think this is critical yet? Not quite, but I do think we should get ahead of it before we make it worse. As far as I can tell, our worst case right now looks something like this: an NCBI taxon ID is cached in Redis as one we support, a change at NCBI then leads us to swap it out for a different taxon ID, and the cache keeps advertising an ID we no longer really support. Ideally, what I'm looking for is explicit, dated data snapshots that we can be transparent about with users. **Possibly for now we just disable caching until we have a chance to discuss?**
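
As a starting point for discussion, explicit dated snapshots could look roughly like the sketch below. It assumes each ingest writes into its own date-named directory rather than overwriting the previous JSON, and that every output carries a small provenance block. The function name `write_snapshot`, the directory layout, and the `provenance` fields are all hypothetical, not something we have today.

```python
import json
from datetime import date
from pathlib import Path


def write_snapshot(records: list, out_root: Path, ingest_date: date,
                   source: str = "NCBI") -> Path:
    """Write one ingest to its own dated directory and return the file path."""
    snapshot_dir = out_root / ingest_date.isoformat()
    # exist_ok=False makes a second write for the same date fail loudly
    # (FileExistsError) instead of silently replacing an earlier snapshot.
    snapshot_dir.mkdir(parents=True, exist_ok=False)

    payload = {
        # Hypothetical provenance block that the API could return verbatim.
        "provenance": {
            "source": source,
            "snapshot_date": ingest_date.isoformat(),
        },
        "records": records,
    }
    path = snapshot_dir / "records.json"
    path.write_text(json.dumps(payload, indent=2))
    return path
```

Returning the same `provenance` block in API responses would let users see which upstream state a result reflects, and keeping older directories around would make it possible to reproduce a past answer.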
