Commit 95583c8
New training data generation functions (#351)
## Description
This PR adds all of the data generation functions we used to create new,
prod-representative synthetic data. It also fixes a bug in the
augmentation code (which is largely deprecated at this point; should we
consider deleting it?) and changes some regex and normalization
parsing.
## Related Issues
Closes #305
## Additional Notes
There is a LOT here, but I think it makes sense, and it is easiest to
review things in a certain order. A lot of the functions build on
each other and add knowledge cumulatively, so going bottom-up will help
the later orchestrator functions make sense. I would approach things in
this order:
- Small changes to existing code (augmentation, test_augmentation,
conftest, normalize, regex_patterns, and the one LOINC data file I had
to change because an unrecognized character had been written into it,
which I needed to fix so that future searches work)
- schemas.LoincStruct (just a class that holds some values, but a
validated one)
- data_curation.loinc_utils and its test file (highly modular, can be
grokked without specialized domain knowledge, all later things build on
it)
- data_curation.post_process and its test file (again, very modular;
what each function does is pretty intuitive and small in scope)
- data_curation.data_emulation and its test file (this is the big daddy
where the magic happens, but it all relies on the utils and post
processing). The file has functions organized alphabetically, but my
recommendation is to read them in this order:
  - first, all the functions that start with leading underscores, because
    those are smaller-scale helpers;
  - then the functions that look like `get_XXX_variations`, which are the
    functions that actually change the data;
  - then the functions that start with `build_`, because those are more
    about applying direct formulas than varying small things;
  - finally, the orchestrator function `create_[blah blah]`, which just
    calls everything else in logical sequence.

1 parent 078e405 commit 95583c8
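
To make the `schemas.LoincStruct` item in the review order above concrete ("just a class that holds some values, but a validated one"), here is a minimal sketch of what such a class could look like. Everything in it is an assumption for illustration: the class name `LoincStructSketch`, the fields `code` and `display_name`, and the code pattern are hypothetical, and the real class in this PR may hold different values and may use a different validation approach.

```python
import re
from dataclasses import dataclass

# Assumed pattern for illustration: digits, a hyphen, then one check digit
# (for example "2345-7"). The real schema may validate differently.
_LOINC_CODE = re.compile(r"^\d{1,7}-\d$")


@dataclass(frozen=True)
class LoincStructSketch:
    """Hypothetical stand-in for schemas.LoincStruct; field names are illustrative."""

    code: str
    display_name: str

    def __post_init__(self) -> None:
        # Reject malformed values at construction time so that downstream
        # helpers can rely on every instance being well-formed.
        if not _LOINC_CODE.match(self.code):
            raise ValueError(f"malformed LOINC code: {self.code!r}")
        if not self.display_name.strip():
            raise ValueError("display_name must not be empty")
```

Validating in `__post_init__` keeps the class a plain value holder while guaranteeing that anything built on top of it only ever sees checked values, which fits the bottom-up review order suggested above.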
File tree
20 files changed: +2442867 -26 lines changed
- data
- snoinc_extracts
- training_files
- packages
- data-curation
- src/data_curation
- schemas
- tests
- utils
- src/utils
- tests
Lines changed: 2 additions & 2 deletions
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 357355 | 357355 | |
| 357356 | 357356 | |
| 357357 | 357357 | |
| 357358 | | - |
| | 357358 | + |
| 357359 | 357359 | |
| 357360 | 357360 | |
| 357361 | 357361 | |
| 357362 | 357362 | |
| 357363 | 357363 | |
| 357364 | 357364 | |
| 357365 | | - |
| | 357365 | + |
| 357366 | 357366 | |
| 357367 | 357367 | |
| 357368 | 357368 | |