Commit 95583c8

New training data generation functions (#351)
## Description

This PR adds all of the data generation functions we used to create new, prod-representative synthetic data. It also fixes a bug in the augmentation code (which at this point is largely deprecated and we should consider deleting?) and changes some regex and normalization parsing.

## Related Issues

Closes #305

## Additional Notes

There is a LOT here, but I think it makes sense, and it is easiest to review in a certain order. A lot of the functions build on each other and add knowledge cumulatively, so going bottom up will help the later orchestrator functions make sense. I would approach things in this order:

- Small changes to existing code (augmentation, test_augmentation, conftest, normalize, regex_patterns, and the one loinc data file I had to change because an unrecognized character got written into it, which I needed to fix for future searches)
- schemas.LoincStruct (just a class that holds some values, but a validated one)
- data_curation.loinc_utils and its test file (highly modular, can be grokked without specialized domain knowledge, and everything later builds on it)
- data_curation.post_process and its test file (again very modular; each function is small in scope and does something pretty intuitive)
- data_curation.data_emulation and its test file (this is the big daddy where the magic happens, but it all relies on the utils and post processing). The file has functions organized alphabetically, but my recommendation is:
  - start with all the functions that have leading underscores, because those are smaller-scale helpers;
  - then hit the functions that look like `get_XXX_variations`, which are the ones that actually change the data;
  - then move to the functions that start with `build_`, because those are more about applying direct formulas than varying small things;
  - finally, the orchestrator function `create_[blah blah]` just calls everything else in logical sequence.
1 parent 078e405 commit 95583c8

20 files changed

+2442867 -26 lines changed

data/snoinc_extracts/loinc_component_abbrv_syn_20260223.json

Lines changed: 2 additions & 2 deletions
@@ -357355,14 +357355,14 @@
 "abbrv": [],
 "synonyms": []
 },
-"Ebolavirus Bundibugyo+Reston+Sudan+Ta\ufffd Forest+Zaire RNA": {
+"Ebolavirus Bundibugyo+Reston+Sudan+Taï Forest+Zaire RNA": {
 "code": "LP437071-6",
 "abbrv": [
 "BDBV+EBOV+RESTV+SUDV+TAFV RNA"
 ],
 "synonyms": []
 },
-"Ebolavirus Bundibugyo+Reston+Sudan+Ta\ufffd Forest+Zaire": {
+"Ebolavirus Bundibugyo+Reston+Sudan+Taï Forest+Zaire": {
 "code": "LP437072-4",
 "abbrv": [
 "BDBV+EBOV+RESTV+SUDV+TAFV"
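
The two changed keys above each carried a U+FFFD replacement character where "ï" belonged. As a rough sketch of how other occurrences could be located in an extract with the same shape, the helper below walks parsed JSON and reports every key or string value containing U+FFFD. The function name `find_replacement_chars` and the trimmed sample are hypothetical and are not part of this commit.

```python
import json

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, written when a decoder hits bytes it cannot map


def find_replacement_chars(obj, path=""):
    """Yield (path, text) for each dict key or string value containing U+FFFD.

    Hypothetical helper for illustration only; not a function from this PR.
    """
    if isinstance(obj, dict):
        for key, value in obj.items():
            child = f"{path}/{key}"
            if REPLACEMENT_CHAR in key:
                yield child, key
            yield from find_replacement_chars(value, child)
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            yield from find_replacement_chars(item, f"{path}[{i}]")
    elif isinstance(obj, str) and REPLACEMENT_CHAR in obj:
        yield path, obj


# Trimmed sample mirroring the structure in the diff: one damaged key, one repaired key.
sample = json.loads(
    '{"Ebolavirus Bundibugyo+Reston+Sudan+Ta\\ufffd Forest+Zaire RNA":'
    ' {"code": "LP437071-6", "abbrv": ["BDBV+EBOV+RESTV+SUDV+TAFV RNA"], "synonyms": []},'
    ' "Ebolavirus Bundibugyo+Reston+Sudan+Taï Forest+Zaire":'
    ' {"code": "LP437072-4", "abbrv": ["BDBV+EBOV+RESTV+SUDV+TAFV"], "synonyms": []}}'
)

hits = list(find_replacement_chars(sample))
for where, text in hits:
    print(f"U+FFFD found at {where!r}")
```

For the real extract, the same function could be applied to the result of `json.load` on the file opened with `encoding="utf-8"`; an empty result would suggest no further damaged entries remain.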
