Skip to content

Conversation

@joeflack4
Copy link
Contributor

@joeflack4 joeflack4 commented May 30, 2025

resolves #219

Updates:

  • Now only triggers an update of cached data for a given MIM if any of the data we care about (UMLS & Orphanet mappings, and Pubmed refs) and cached is changed. Previously, many rows would be updated with only the date_fetched column changed.
  • Bug fix: Pipe-delimited values sorting
  • Bug fix: KeyError occurring when new cache data, but it is missing UMLS or Orphanet mappings. No idea why this KeyError never occurred before.
  • Updated cache files w/ new data

joeflack4 added 3 commits May 30, 2025 14:55
- Bug fix: Only update when MIMs are fetched that contain data changes that we care about (or if there are new rows).
- Bug fix: Pipe-delimted values sorting
- Bug fix: Added pipe-delimited sorting bug fix mentioned but not implemented in recent commit.
- Bug fix: KeyError occurring when new cache data, but it is missing UMLS or Orphanet mappings. No idea why this KeyError never occurred before.
- Update / Bug fix: Pubmed refs data cache file: Values are now sorted
@joeflack4 joeflack4 requested review from Copilot and twhetzel May 30, 2025 19:58
@joeflack4 joeflack4 assigned joeflack4 and twhetzel and unassigned joeflack4 May 30, 2025
@joeflack4
Copy link
Contributor Author

joeflack4 commented May 30, 2025

@twhetzel Only assigning the PR to you in case we don't merge this this weekend, in order to remind you to merge it later.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the cache update logic to trigger updates only when data changes, fixes bugs related to pipe-delimited value sorting and KeyErrors when mappings are missing, and updates cached data files.

  • Only update cache rows when key data fields differ from existing cache rows.
  • Fix the pipe-delimited sorting for UMLS and Orphanet mappings and guard against missing mapping values.
  • Refresh metadata by updating cache-last-updated.txt and related cache files.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

File Description
omim2obo/parsers/omim_txt_parser.py Added a new upsert_cache_df function and modified cache update logic; added todo for renaming date_fetched.
omim2obo/parsers/omim_entry_parser.py Updated get_pubs and get_mapped_ids to return sorted results ensuring consistent outputs.
data/mappings.tsv Updated cache file rows with new values and adjusted flags to reflect correct data.
data/cache-last-updated.txt Updated cache last updated date.
Comments suppressed due to low confidence (1)

data/mappings.tsv:3159

  • Verify that changing the phenotype flag for MIM 169500 from 'True' to 'False' is intentional, as this adjustment could affect downstream processing.
169500	False	2025-05-30	C1868512	99027

@joeflack4 joeflack4 added bug Something isn't working data quality omim labels May 30, 2025
@joeflack4 joeflack4 linked an issue May 30, 2025 that may be closed by this pull request
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large diff in this file is due to lack of sorting in |-delimited value columns. That's fixed now though; won't happen again in future updates. Sorting is now deterministic.

621179 True 2025-04-22
621180 True 2025-04-22
621181 False 2025-04-22
621182 True 2025-05-30
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some data observations.

New MIMs not previously in cache. These may be new MIMs entirely.

I'm seeing most of these new MIMs appear at the bottom of the file because it is sorted by MIM# and because new MIMs are created with numbers above the previous max MIM #.

However, there are some newly added MIMs here that aren't at the top of the range. Perhaps some MIM# ranges are intentionally reserved.

+301147 False   2025-05-30
+301148 True    2025-05-30
+301149 False   2025-05-30

138920 False 2025-03-21 C1841836
138930 False 2025-03-21 C1841835 2097
138945 False 2025-03-23 C0282513|C0338451|C1415311|C3539123|C4016134
138945 False 2025-05-30 C1415311|C1843792|C3539123|C5975642|C5975643
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: Same number of mappings, w/ additions and deletions.

Some mappings have been deleted (e.g. C0282513), while others added (e.g. C1843792). Some have been maintained (e.g. C1415311).

169300 False 2025-03-20 C2051831
169400 True 2025-03-24 C0030779
169500 True 2025-03-19 C1868512 99027
169500 False 2025-05-30 C1868512 99027
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: Phenotype status changed.

@twhetzel Just tagging you because you may find this interesting. I would expect such changes to be rare.

300823 False 2025-03-24 C0026705|C0342841|C0342842|C1415882
300824 False 2025-03-21 C3151780
300825 False 2025-03-21 C1419292
300825 False 2025-05-30 C1419292|C5974893
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: Case of new mappings added, while previous mapping(s) existed.

621076 False 2025-03-21
621077 False 2025-03-23
621078 True 2025-03-19
621078 True 2025-05-30 C5975603 476093
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: Case of new mappings added where none previously existed.

I am seeing these clustered towards the bottom, right before the batch of completely freshly added MIMs.

This indicates that the workflow for adding new MIMs is something like:

  1. When new MIMs are created, they usually do not have any mappings or pubmed refs (the latter certainly makes sense).
  2. Soon after, mappings will be added.

Base automatically changed from develop to main June 30, 2025 18:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working data quality omim

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cache: Only update when 'our' data changes

3 participants