-
Notifications
You must be signed in to change notification settings - Fork 4
Cache: Only update when our data changes #225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Bug fix: Only update when MIMs are fetched that contain data changes that we care about (or if there are new rows). - Bug fix: Pipe-delimted values sorting
- Updated data files
- Bug fix: Added pipe-delimited sorting bug fix mentioned but not implemented in recent commit. - Bug fix: KeyError occurring when new cache data, but it is missing UMLS or Orphanet mappings. No idea why this KeyError never occurred before. - Update / Bug fix: Pubmed refs data cache file: Values are now sorted
|
@twhetzel Only assigning the PR to you in case we don't merge this this weekend, in order to remind you to merge it later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the cache update logic to trigger updates only when data changes, fixes bugs related to pipe-delimited value sorting and KeyErrors when mappings are missing, and updates cached data files.
- Only update cache rows when key data fields differ from existing cache rows.
- Fix the pipe-delimited sorting for UMLS and Orphanet mappings and guard against missing mapping values.
- Refresh metadata by updating cache-last-updated.txt and related cache files.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| omim2obo/parsers/omim_txt_parser.py | Added a new upsert_cache_df function and modified cache update logic; added todo for renaming date_fetched. |
| omim2obo/parsers/omim_entry_parser.py | Updated get_pubs and get_mapped_ids to return sorted results ensuring consistent outputs. |
| data/mappings.tsv | Updated cache file rows with new values and adjusted flags to reflect correct data. |
| data/cache-last-updated.txt | Updated cache last updated date. |
Comments suppressed due to low confidence (1)
data/mappings.tsv:3159
- Verify that changing the phenotype flag for MIM 169500 from 'True' to 'False' is intentional, as this adjustment could affect downstream processing.
169500 False 2025-05-30 C1868512 99027
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Large diff in this file is due to lack of sorting in |-delimited value columns. That's fixed now though; won't happen again in future updates. Sorting is now deterministic.
| 621179 True 2025-04-22 | ||
| 621180 True 2025-04-22 | ||
| 621181 False 2025-04-22 | ||
| 621182 True 2025-05-30 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some data observations.
New MIMs not previously in cache. These may be new MIMs entirely.
I'm seeing most of these new MIMs appear at the bottom of the file because it is sorted by MIM# and because new MIMs are created with numbers above the previous max MIM #.
However, there are some newly added MIMs here that aren't at the top of the range. Perhaps some MIM# ranges are intentionally reserved.
+301147 False 2025-05-30
+301148 True 2025-05-30
+301149 False 2025-05-30| 138920 False 2025-03-21 C1841836 | ||
| 138930 False 2025-03-21 C1841835 2097 | ||
| 138945 False 2025-03-23 C0282513|C0338451|C1415311|C3539123|C4016134 | ||
| 138945 False 2025-05-30 C1415311|C1843792|C3539123|C5975642|C5975643 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Observation: Same number of mappings, w/ additions and deletions.
Some mappings have been deleted (e.g. C0282513), while others added (e.g. C1843792). Some have been maintained (e.g. C1415311).
| 169300 False 2025-03-20 C2051831 | ||
| 169400 True 2025-03-24 C0030779 | ||
| 169500 True 2025-03-19 C1868512 99027 | ||
| 169500 False 2025-05-30 C1868512 99027 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Observation: Phenotype status changed.
@twhetzel Just tagging you because you may find this interesting. I would expect such changes to be rare.
| 300823 False 2025-03-24 C0026705|C0342841|C0342842|C1415882 | ||
| 300824 False 2025-03-21 C3151780 | ||
| 300825 False 2025-03-21 C1419292 | ||
| 300825 False 2025-05-30 C1419292|C5974893 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Observation: Case of new mappings added, while previous mapping(s) existed.
| 621076 False 2025-03-21 | ||
| 621077 False 2025-03-23 | ||
| 621078 True 2025-03-19 | ||
| 621078 True 2025-05-30 C5975603 476093 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Observation: Case of new mappings added where none previously existed.
I am seeing these clustered towards the bottom, right before the batch of completely freshly added MIMs.
This indicates that the workflow for adding new MIMs is something like:
- When new MIMs are created, they usually do not have any mappings or pubmed refs (the latter certainly makes sense).
- Soon after, mappings will be added.
resolves #219
Updates:
date_fetchedcolumn changed.