Commit 5433a4a

Generate positive pairs (#87)
## Description
This PR adds a function that generates positive pairs from the given LOINC data and writes them to a file for use in model training.

## Related Issues
Closes #61

## Additional Notes
I couldn't test this on the LOINC files, since we haven't created them yet, but I did test it locally on dummy files and directories. Because the function produces file output and will be run as a one-off, I didn't see much point in unit tests beyond confirming that it works. This PR is therefore **not blocked** by other dependencies; we'll just need to wire up the file paths once the LOINC files exist.
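
The dummy-file check mentioned in the notes could be set up roughly as follows. This is only a sketch: `write_dummy_inputs` is a hypothetical helper that is not part of the commit, the rows are made-up placeholders rather than real LOINC data, and the `_lcn.txt` / `_sn.txt` / `_dn.txt` suffixes and the `code: example | example` line layout are read off the code in this commit's diff.

```python
import tempfile
from pathlib import Path

# Suffixes follow the naming convention used in generation.py.
VARIANT_SUFFIXES = ("lcn.txt", "sn.txt", "dn.txt")


def write_dummy_inputs(directory: str, prefix: str = "dummy") -> str:
    """Write three small variant files and return their shared prefix path.

    Hypothetical helper for local smoke testing; not part of the commit.
    """
    # Placeholder rows in the "code: example | example" layout.
    rows = [
        "1234-5: first example | second example",
        "",  # blank line, which the commit's helper skips
        "6789-0: only example",
    ]
    prefix_path = str(Path(directory) / prefix)
    for suffix in VARIANT_SUFFIXES:
        Path(f"{prefix_path}_{suffix}").write_text("\n".join(rows) + "\n")
    return prefix_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        handle = write_dummy_inputs(tmp)
        # List the files that were created for the prefix.
        print(sorted(p.name for p in Path(tmp).iterdir()))
```

The returned prefix can then be passed as `file_handle` to `generate_positive_pairs`, with `num_examples=-1` and an `out_file` in the same temporary directory, to confirm that one `code|example` line is written per non-blank input row.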
1 parent 25a61e3 commit 5433a4a

1 file changed: +69 -0 lines changed

data_curation/generation.py

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+import random
+from io import TextIOWrapper
+
+BASE_FILE_PATH = ""
+OUT_FILE_PATH = ""
+
+
+def generate_positive_pairs(file_handle: str, num_examples: int, out_file: str):
+    """
+    Given the location of one or more files of LOINC codes and some corresponding
+    augmented examples for those codes, this function compiles a list of
+    positive pairs that can be read for model training. A positive pair is a
+    tuple of the form (original_loinc_code, augmented_example_of_code).
+
+    :param file_handle: Either the path to a specific file of LOINC codes and
+    examples, or the prefix path for multiple data files across name variants.
+    :param num_examples: The number of positive pairs to generate. If -1, one
+    positive pair will be created for every element in the pool spanned by
+    the files accessible via the handle parameter.
+    :param out_file: The destination at which to write the positive pair file.
+    :returns: None
+    """
+
+    handle_parts = file_handle.split(".")
+    data_pool = []
+    pairs = []
+
+    # Given handle is a prefix to multiple files, so we'll use naming
+    # conventions to open the three appropriate ones
+    if len(handle_parts) == 0 or handle_parts[-1] != "txt":
+        for variant in ["lcn.txt", "sn.txt", "dn.txt"]:
+            with open(file_handle + "_" + variant, "r") as fp:
+                _append_to_data_pool(fp, data_pool)
+
+    # Handle is actually a file, can just open that
+    else:
+        with open(file_handle, "r") as fp:
+            _append_to_data_pool(fp, data_pool)
+
+    # Pre-specified number of examples to generate
+    # If num_examples is -1, that's "generate all" mode, where we
+    # produce one positive pair per code in the data pool, but we
+    # can achieve that just by not truncating the pool list
+    if num_examples != -1:
+        random.shuffle(data_pool)
+        data_pool = data_pool[:num_examples]
+
+    for element in data_pool:
+        pool_parts = element.split(":")
+        base_code = pool_parts[0].strip()
+        augmented_examples = pool_parts[1].strip().split("|")
+
+        # Randomly choose one of the augmented examples to pair
+        chosen_ex = random.choice(augmented_examples)
+        pairs.append((base_code, chosen_ex.strip()))
+
+    # Now we just write the created examples to the output file
+    with open(out_file, "w") as fp:
+        for pair in pairs:
+            fp.write(pair[0] + "|" + pair[1] + "\n")
+
+
+def _append_to_data_pool(fp: TextIOWrapper, data_pool):
+    """
+    Simple helper method to append non-blank lines to a list.
+    """
+    for line in fp:
+        if line.strip() != "":
+            data_pool.append(line.strip())
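
For readers who want to exercise the pairing logic without touching the filesystem, here is a minimal in-memory sketch of the same steps (filter blank lines, optionally shuffle and truncate, pick one augmented example per code). It is not the committed implementation: `parse_pool_line` and `make_pairs` are hypothetical names, and it differs from the commit in two labeled ways. It uses `split(":", 1)` so that colons inside an example are preserved, and it takes an optional `seed` so that runs can be repeated.

```python
import random
from typing import Iterable, List, Optional, Tuple


def parse_pool_line(line: str) -> Tuple[str, List[str]]:
    """Split 'code: ex1 | ex2' into the code and its stripped examples."""
    # Variation from the commit: maxsplit=1 keeps later colons in the examples.
    # A line with no colon at all raises ValueError here.
    code, examples = line.split(":", 1)
    return code.strip(), [ex.strip() for ex in examples.split("|")]


def make_pairs(
    lines: Iterable[str], num_examples: int = -1, seed: Optional[int] = None
) -> List[Tuple[str, str]]:
    """Build (code, example) pairs from raw lines, mirroring the commit's steps."""
    # Variation from the commit: a private, optionally seeded RNG.
    rng = random.Random(seed)
    pool = [ln.strip() for ln in lines if ln.strip()]

    # -1 means "generate all", so the pool is left untruncated and in order.
    if num_examples != -1:
        rng.shuffle(pool)
        pool = pool[:num_examples]

    pairs = []
    for entry in pool:
        code, examples = parse_pool_line(entry)
        # One randomly chosen augmented example per code.
        pairs.append((code, rng.choice(examples)))
    return pairs


if __name__ == "__main__":
    # Placeholder rows, not real LOINC data.
    demo = ["1234-5: first | second", "", "6789-0: with:colon | plain"]
    for code, example in make_pairs(demo, seed=0):
        print(code + "|" + example)
```

Keeping the parsing and sampling separate from file I/O is one way to make the behavior checkable with a couple of assertions, even if the file-writing wrapper itself stays a one-off script.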
