Description
We increasingly rely on NCBI-derived data and expose it via JSON and a REST API, while tightly attaching records to external identifiers that 'change' over time.
We have already seen concrete examples of this causing confusion and maintenance burden (e.g. NCBI taxon ID and naming changes, superseded assemblies; see #499). As we ingest more data and expose it programmatically, this problem will grow.
Core problem
- External identifiers are not stable
  - NCBI taxon IDs, assembly versions, and SRA metadata can change upstream.
- Our data is effectively a rolling snapshot
  - Each ingest overwrites prior JSON outputs.
  - API responses do not clearly indicate which upstream NCBI state they reflect.
- Identifiers are exposed as if they were truly stable
  - Users may assume reproducibility or one-to-one correspondence with NCBI.
- Caching amplifies the issue
  - API responses are cached (e.g. via Redis) using external identifiers as keys.
  - Upstream changes + cached responses can lead to stale or inconsistent results.
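To make the caching point concrete, here is a rough sketch of how a key built only from an external identifier keeps serving an old entry after a re-ingest, while a key that also carries a snapshot label does not. Everything in it is an assumption for illustration: the `cache_key` and `lookup` helpers, the key format, the snapshot labels, and the made-up taxon IDs. A plain dict stands in for Redis, and this is not how our API currently builds keys.

```python
from typing import Optional

# Hypothetical helper: the key format and snapshot label are assumptions.
def cache_key(taxon_id: int, snapshot: Optional[str] = None) -> str:
    """Build a cache key, optionally scoped to a data snapshot."""
    base = f"taxon:{taxon_id}"
    return f"{base}@{snapshot}" if snapshot else base

def lookup(cache: dict, taxon_id: int, snapshot: Optional[str] = None):
    """Return the cached entry for this key, or None on a miss."""
    return cache.get(cache_key(taxon_id, snapshot))

cache: dict = {}             # plain dict standing in for Redis
OLD_ID, NEW_ID = 1111, 2222  # made-up taxon IDs, not real NCBI records

# First ingest (labelled "2024-01-01" here): OLD_ID is a taxon we support.
cache[cache_key(OLD_ID)] = {"taxon_id": OLD_ID, "supported": True}
cache[cache_key(OLD_ID, "2024-01-01")] = {"taxon_id": OLD_ID, "supported": True}

# Second ingest ("2024-06-01"): upstream replaced OLD_ID with NEW_ID.
cache[cache_key(NEW_ID, "2024-06-01")] = {"taxon_id": NEW_ID, "supported": True}

# The unscoped key still answers for OLD_ID, although we no longer support it.
stale_hit = lookup(cache, OLD_ID)

# The snapshot-scoped key misses for the current snapshot, as it should.
scoped_miss = lookup(cache, OLD_ID, "2024-06-01")
```

Scoping keys by snapshot is only one option. Flushing the cache on every ingest, or turning caching off for now, would avoid the same stale hit.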
Impact / Urgency
Do I think this is 'critical' yet? Not quite, but I do think we should get ahead of it before we make it worse. Right now I'm pretty sure our worst case is something like this: an NCBI taxon ID is cached in Redis as one we support, then a change at NCBI means we have swapped it out for a different taxon ID and no longer really support the old one. Ideally, what I'm looking for is explicit, dated data snapshots that we can be transparent about with users. Possibly for now we just disable caching until we have a chance to discuss?
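As a starting point for that discussion, a dated snapshot could travel with each API response roughly like this. This is a minimal sketch under assumptions: the `Snapshot` class, the `wrap_response` helper, the field names (`snapshot`, `ncbi_retrieved_on`), and the example record are all hypothetical and not part of the current API.

```python
import json
from dataclasses import dataclass
from datetime import date

# Hypothetical description of one ingest; field names are assumptions.
@dataclass(frozen=True)
class Snapshot:
    snapshot_id: str         # label we assign to an ingest
    ncbi_retrieved_on: date  # date the upstream NCBI data was fetched

def wrap_response(record: dict, snapshot: Snapshot) -> dict:
    """Attach snapshot provenance to a record before it is returned."""
    return {
        "data": record,
        "snapshot": {
            "id": snapshot.snapshot_id,
            "ncbi_retrieved_on": snapshot.ncbi_retrieved_on.isoformat(),
        },
    }

snap = Snapshot(snapshot_id="2024-06-01", ncbi_retrieved_on=date(2024, 6, 1))
# Made-up record for illustration only.
response = wrap_response({"taxon_id": 2222, "name": "Example species"}, snap)
print(json.dumps(response, indent=2))
```

With something like this in place, a user could at least see which upstream NCBI state a response reflects, even if older snapshots are not kept around.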