Cache: Only update when our data changes #225

joeflack4 · 2025-05-30T19:58:12Z

resolves #219

Updates:

Now only triggers an update of cached data for a given MIM if any of the cached data we care about (UMLS & Orphanet mappings, and PubMed refs) has changed. Previously, many rows would be updated with only the date_fetched column changed.
Bug fix: Pipe-delimited values sorting
Bug fix: KeyError occurring when new cache data, but it is missing UMLS or Orphanet mappings. No idea why this KeyError never occurred before.
Updated cache files w/ new data
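
A minimal sketch of the "only update when our data changes" rule described above. It is not the PR's `upsert_cache_df` implementation: it uses plain dicts instead of DataFrames, and the function name `upsert_cache`, the `tracked` parameter, and the column names `umls_ids` and `orphanet_ids` are assumptions for illustration. Only `date_fetched` is taken from the PR text.

```python
from typing import Dict, Iterable

Row = Dict[str, str]


def upsert_cache(
    cache: Dict[str, Row],
    fetched: Dict[str, Row],
    tracked: Iterable[str],
) -> Dict[str, Row]:
    """Return a copy of `cache` updated only with new or meaningfully changed rows."""
    tracked = tuple(tracked)
    updated = dict(cache)
    for mim, new_row in fetched.items():
        old_row = cache.get(mim)
        if old_row is None:
            # MIM not previously cached: always insert.
            updated[mim] = new_row
        elif any(old_row.get(f, "") != new_row.get(f, "") for f in tracked):
            # At least one tracked field differs: replace the whole row.
            updated[mim] = new_row
        # Otherwise keep the old row, including its old date_fetched.
    return updated


# Hypothetical column names; values taken from rows shown in this PR.
cache = {
    "300824": {"umls_ids": "C3151780", "orphanet_ids": "", "date_fetched": "2025-03-21"},
    "300825": {"umls_ids": "C1419292", "orphanet_ids": "", "date_fetched": "2025-03-21"},
}
fetched = {
    "300824": {"umls_ids": "C3151780", "orphanet_ids": "", "date_fetched": "2025-05-30"},
    "300825": {"umls_ids": "C1419292|C5974893", "orphanet_ids": "", "date_fetched": "2025-05-30"},
}
result = upsert_cache(cache, fetched, tracked=("umls_ids", "orphanet_ids"))
```

In this sketch, 300824 keeps its old `date_fetched` because none of its tracked fields changed, while 300825 is replaced because its UMLS mappings differ.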

- Bug fix: Only update when MIMs are fetched that contain data changes that we care about (or if there are new rows).
- Bug fix: Pipe-delimited values sorting

- Updated data files

- Bug fix: Added pipe-delimited sorting bug fix mentioned but not implemented in recent commit.
- Bug fix: KeyError occurring when new cache data, but it is missing UMLS or Orphanet mappings. No idea why this KeyError never occurred before.
- Update / Bug fix: Pubmed refs data cache file: Values are now sorted
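
As an illustration of the KeyError fix mentioned above, here is one common way to guard against missing mappings. The shape of the parsed entry (a dict with optional `umls` and `orphanet` keys) and the function name `extract_mappings` are hypothetical; the actual structure used by `get_mapped_ids` is not shown in this PR.

```python
from typing import Dict, List, Tuple


def extract_mappings(entry: Dict[str, List[str]]) -> Tuple[str, str]:
    """Return (umls, orphanet) as sorted pipe-delimited strings, tolerating missing keys."""
    # dict.get with a default avoids the KeyError that entry["umls"] would raise
    # when a newly fetched entry has no mappings of that kind.
    umls = entry.get("umls", [])
    orphanet = entry.get("orphanet", [])
    return "|".join(sorted(umls)), "|".join(sorted(orphanet))
```

An entry with no mappings at all then yields two empty strings, which matches how rows without mappings appear in data/mappings.tsv.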

joeflack4 · 2025-05-30T19:58:41Z

@twhetzel I'm only assigning the PR to you as a reminder to merge it later, in case we don't merge it this weekend.

Copilot

Pull Request Overview

This PR refactors the cache update logic to trigger updates only when data changes, fixes bugs related to pipe-delimited value sorting and KeyErrors when mappings are missing, and updates cached data files.

Only update cache rows when key data fields differ from existing cache rows.
Fix the pipe-delimited sorting for UMLS and Orphanet mappings and guard against missing mapping values.
Refresh metadata by updating cache-last-updated.txt and related cache files.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

File	Description
omim2obo/parsers/omim_txt_parser.py	Added a new upsert_cache_df function and modified cache update logic; added todo for renaming date_fetched.
omim2obo/parsers/omim_entry_parser.py	Updated get_pubs and get_mapped_ids to return sorted results ensuring consistent outputs.
data/mappings.tsv	Updated cache file rows with new values and adjusted flags to reflect correct data.
data/cache-last-updated.txt	Updated cache last updated date.

Comments suppressed due to low confidence (1)

data/mappings.tsv:3159

Verify that changing the phenotype flag for MIM 169500 from 'True' to 'False' is intentional, as this adjustment could affect downstream processing.

169500	False	2025-05-30	C1868512	99027

omim2obo/parsers/omim_txt_parser.py

joeflack4 · 2025-05-30T20:06:58Z

data/pubmed-refs.tsv

Large diff in this file is due to lack of sorting in |-delimited value columns. That's fixed now though; won't happen again in future updates. Sorting is now deterministic.
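
A minimal sketch of deterministic sorting for a |-delimited cell, for illustration only. The helper name `sort_pipe_delimited` is an assumption, and so is the choice to drop empty and duplicate parts; the sort is lexicographic on strings, which is deterministic but not numeric.

```python
def sort_pipe_delimited(value: str, sep: str = "|") -> str:
    """Return the delimited values sorted, with blanks and duplicates removed."""
    parts = {p.strip() for p in value.split(sep) if p.strip()}
    return sep.join(sorted(parts))
```

Because the output depends only on the set of values, two fetches that return the same values in a different order produce identical cells, so they no longer show up as diffs.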

joeflack4 · 2025-05-30T20:14:36Z

data/mappings.tsv

 621179	True	2025-04-22		
 621180	True	2025-04-22		
 621181	False	2025-04-22		
+621182	True	2025-05-30		


Just some data observations.

New MIMs not previously in cache. These may be new MIMs entirely.

Most of these new MIMs appear at the bottom of the file, because it is sorted by MIM number and new MIMs are usually created with numbers above the previous maximum.

However, some newly added MIMs here aren't at the top of the range. Perhaps some MIM number ranges are intentionally reserved.

+301147	False	2025-05-30		
+301148	True	2025-05-30		
+301149	False	2025-05-30		
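
The observation above can be checked mechanically. The sketch below separates newly seen MIMs into those numbered below the previous maximum and those above it; the function name `split_new_mims` and the integer representation of MIM numbers are assumptions for illustration.

```python
from typing import Iterable, List, Tuple


def split_new_mims(
    cached: Iterable[int], fetched: Iterable[int]
) -> Tuple[List[int], List[int]]:
    """Return (below_prev_max, above_prev_max) for MIMs in `fetched` but not in `cached`."""
    cached_set = set(cached)
    new_mims = sorted(set(fetched) - cached_set)
    if not cached_set:
        # Nothing cached yet: every MIM counts as above the (nonexistent) maximum.
        return [], new_mims
    prev_max = max(cached_set)
    below = [m for m in new_mims if m < prev_max]
    above = [m for m in new_mims if m > prev_max]
    return below, above
```

MIMs in the first list are the ones that do not fit the "new numbers go above the old maximum" pattern.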

joeflack4 · 2025-05-30T20:16:55Z

data/mappings.tsv

 138920	False	2025-03-21	C1841836	
 138930	False	2025-03-21	C1841835	2097
-138945	False	2025-03-23	C0282513|C0338451|C1415311|C3539123|C4016134	
+138945	False	2025-05-30	C1415311|C1843792|C3539123|C5975642|C5975643	


Observation: Same number of mappings, w/ additions and deletions.

Some mappings have been deleted (e.g. C0282513), while others added (e.g. C1843792). Some have been maintained (e.g. C1415311).
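
For reference, the additions and deletions can be computed with set operations on the two cell values. This is a small sketch, not code from the PR; the helper name `diff_pipe_delimited` is hypothetical, and the inputs are the old and new UMLS cells for MIM 138945 shown above.

```python
from typing import Dict, List


def diff_pipe_delimited(old: str, new: str, sep: str = "|") -> Dict[str, List[str]]:
    """Compare two delimited cells and report added, removed, and kept values."""
    old_set = {v for v in old.split(sep) if v}
    new_set = {v for v in new.split(sep) if v}
    return {
        "added": sorted(new_set - old_set),
        "removed": sorted(old_set - new_set),
        "kept": sorted(old_set & new_set),
    }


changes = diff_pipe_delimited(
    "C0282513|C0338451|C1415311|C3539123|C4016134",
    "C1415311|C1843792|C3539123|C5975642|C5975643",
)
# changes["added"]   -> ['C1843792', 'C5975642', 'C5975643']
# changes["removed"] -> ['C0282513', 'C0338451', 'C4016134']
# changes["kept"]    -> ['C1415311', 'C3539123']
```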

joeflack4 · 2025-05-30T20:18:03Z

data/mappings.tsv

 169300	False	2025-03-20	C2051831	
 169400	True	2025-03-24	C0030779	
-169500	True	2025-03-19	C1868512	99027
+169500	False	2025-05-30	C1868512	99027


Observation: Phenotype status changed.

@twhetzel Just tagging you because you may find this interesting. I would expect such changes to be rare.

joeflack4 · 2025-05-30T20:18:43Z

data/mappings.tsv

 300823	False	2025-03-24	C0026705|C0342841|C0342842|C1415882	
 300824	False	2025-03-21	C3151780	
-300825	False	2025-03-21	C1419292	
+300825	False	2025-05-30	C1419292|C5974893	


Observation: Case of new mappings added, while previous mapping(s) existed.

joeflack4 · 2025-05-30T20:21:22Z

data/mappings.tsv

 621076	False	2025-03-21		
 621077	False	2025-03-23		
-621078	True	2025-03-19		
+621078	True	2025-05-30	C5975603	476093


Observation: Case of new mappings added where none previously existed.

I am seeing these clustered towards the bottom, right before the batch of completely freshly added MIMs.

This indicates that the workflow for adding new MIMs is something like:

When new MIMs are created, they usually do not have any mappings or pubmed refs (the latter certainly makes sense).

Soon after, mappings will be added.

joeflack4 added 3 commits May 30, 2025 14:55

Update: Caching: Only update rows with new data

54dd9d6

- Bug fix: Only update when MIMs are fetched that contain data changes that we care about (or if there are new rows).
- Bug fix: Pipe-delimited values sorting

Update: Caching: Only update rows with new data

9c109f9

- Updated data files

joeflack4 requested review from Copilot and twhetzel May 30, 2025 19:58

joeflack4 assigned joeflack4 and twhetzel and unassigned joeflack4 May 30, 2025

Copilot AI reviewed May 30, 2025

omim2obo/parsers/omim_txt_parser.py

joeflack4 added the bug, data quality, and omim labels May 30, 2025

joeflack4 linked an issue May 30, 2025 that may be closed by this pull request

Cache: Only update when 'our' data changes #219

Open

joeflack4 mentioned this pull request May 30, 2025

Updates: PubMed refs, Orphanet & UMLS mappings #222

Closed

joeflack4 commented May 30, 2025

joeflack4 mentioned this pull request Jun 1, 2025

Updates: PubMed refs, Orphanet & UMLS mappings #226

Closed

Base automatically changed from develop to main June 30, 2025 18:34

3 participants
