Commit 13fd5cd

[DOC] Fix confusing 'reverse complement' terminology in dataset and tutorial documentation (#538)
#### Reference Issues/PRs

Follow-up to PR #478, resolves #335

#### What does this implement/fix? Explain your changes.

This PR fixes the confusing "reverse complement" terminology that was accidentally copy-pasted into the documentation across the repository. As discussed in #478, the `augment_reverse` implementation does exactly what was intended by the original paper (a simple string reversal). The language in these files has been synced to reflect "reversed sequences" to maintain technical accuracy and avoid biological confusion:

- `pyaptamer/datasets/dataclasses/_api.py` (APIDataset docstring and `_prepare_data`)
- `examples/aptatrans_tutorial.ipynb` (Pipeline explanation cells)
- `pyaptamer/utils/_augment.py` (Return block docstring)

#### What should a reviewer concentrate their feedback on?

N/A - simple documentation/wording sync.

#### Did you add any tests for the change?

No (documentation only).

#### Any other comments?

None.

#### PR checklist

- [x] The PR title starts with either [ENH], [MNT], [DOC], or [BUG].
- [ ] Added/modified tests
- [x] Used pre-commit hooks when committing to ensure that code is compliant with hooks. Install hooks with `pre-commit install`.
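For reviewers less familiar with the biology, the distinction the description draws can be illustrated with a minimal sketch. This is not code from the repository: `reverse` and `reverse_complement` are hypothetical helper names, and the RNA pairing table (A-U, C-G) is an assumption made for illustration only.

```python
# Hypothetical helpers for illustration; not part of pyaptamer.

# Assumed RNA base pairing: A<->U, C<->G.
_RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse(seq: str) -> str:
    """Plain string reversal, the operation the docs now describe."""
    return seq[::-1]


def reverse_complement(seq: str) -> str:
    """Reverse the sequence, then swap each base for its pairing partner."""
    return seq[::-1].translate(_RNA_COMPLEMENT)


seq, ss = "ACG", "SHM"
print(reverse(seq), reverse(ss))   # GCA MHS
print(reverse_complement(seq))     # CGU
```

Plain reversal is also well defined for the secondary-structure string (`"SHM"` -> `"MHS"`), whereas a complement has no meaning for structure labels, which is one more reason "reversed sequences" is the more accurate wording.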
1 parent 6cb2046 commit 13fd5cd

2 files changed

Lines changed: 9 additions & 10 deletions

examples/aptatrans_tutorial.ipynb

Lines changed: 3 additions & 3 deletions
@@ -113,7 +113,7 @@
 "### Load (RNA) aptamer data for pretraining\n",
 "For pretraining the aptamer encoder, we use $79,890$ RNA aptamer sequences from the *bpRNA-1m* dataset from *bpRNA* [[2](#ref-2)].\n",
 "\n",
-"The sequences are augmented by adding their reverse complements. Then, they are masked to a numerical format suitable for the encoder and stored in PyTorch dataloaders."
+"The sequences are augmented by adding their reversed sequences. Then, they are masked to a numerical format suitable for the encoder and stored in PyTorch dataloaders."
 ]
 },
 {
@@ -142,7 +142,7 @@
 " random_state=RAMDOM_STATE,\n",
 ")\n",
 "\n",
-"# (3.) augment training data by adding reverse complements\n",
+"# (3.) augment training data by adding reversed sequences\n",
 "# e.g., (seq=\"ACG\", ss=\"SHM\") -> (seq=\"GCA\", ss=\"MHS\")\n",
 "x_apta_train, y_apta_train = augment_reverse(x_apta_train, y_apta_train)\n",
 "\n",
@@ -182,7 +182,7 @@
 "### Load protein data for pretraining\n",
 "For pretraining the protein encoder, we use $166,136$ protein sequences from the Protein Data Bank (PDB) [[3](#ref-3)].\n",
 "\n",
-"In this case, the sequences are not augmented by adding the reverse complements. However, protein words with below average frequency are filtered out. Then, similarly to above, sequences are transformed to a numerical representation suitable for the encoder and stored in PyTorch dataloaders."
+"In this case, the sequences are not augmented by adding the reversed sequences. However, protein words with below average frequency are filtered out. Then, similarly to above, sequences are transformed to a numerical representation suitable for the encoder and stored in PyTorch dataloaders."
 ]
 },
 {

pyaptamer/datasets/dataclasses/_api.py

Lines changed: 6 additions & 7 deletions
@@ -30,9 +30,8 @@ class APIDataset(Dataset):
         sequences.
     split : str, optional, default="train"
         If "train", the dataset will augment aptamer sequences by adding their
-        reverse complements. If "test", the dataset will not augment the aptamer
+        reversed sequences. If "test", the dataset will not augment the aptamer
         sequences.
-        complements.
     """
 
     def __init__(
@@ -67,8 +66,8 @@ def _prepare_data(
         split: str,
     ) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
         """
-        Prepare the data by augmenting aptamer sequences with their reverse complements
-        and transforming them to vector numericla representations.
+        Prepare the data by augmenting aptamer sequences with their reversed sequences
+        and transforming them to vector numerical representations.
 
         Parameters
         ----------
@@ -78,9 +77,9 @@ def _prepare_data(
             Protein sequences.
         y : np.ndarray
             Laabels for the interactions.
-        split : bool
-            If True, the dataset will augment aptamer sequences by adding their reverse
-            complements.
+        split : str
+            If "train", the dataset will augment aptamer sequences by adding
+            their reversed sequences.
         """
         if split == "train":
             x_apta = augment_reverse(x_apta)[0]
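
The call shapes that appear in this diff (`augment_reverse(x_apta)[0]` and `augment_reverse(x_apta_train, y_apta_train)`) suggest a function that accepts one or more arrays of strings and returns a tuple. The sketch below is one way such a helper could behave, assuming each output array is the original followed by its element-wise reversed copies. It is not the repository's implementation, and `augment_reverse_sketch` is a hypothetical name.

```python
import numpy as np


def augment_reverse_sketch(*arrays: np.ndarray) -> tuple[np.ndarray, ...]:
    """Hypothetical stand-in: append the reversed copy of every string.

    Assumption: each input is a 1-D array of strings, and the result keeps
    the originals first, followed by their reversals in the same order.
    """
    augmented = []
    for arr in arrays:
        # Reverse each string; lengths are unchanged, so the dtype still fits.
        reversed_items = np.array([s[::-1] for s in arr], dtype=arr.dtype)
        augmented.append(np.concatenate([arr, reversed_items]))
    return tuple(augmented)


x = np.array(["ACG", "GGU"])
ss = np.array(["SHM", "HHS"])
x_aug, ss_aug = augment_reverse_sketch(x, ss)
print(x_aug.tolist())   # ['ACG', 'GGU', 'GCA', 'UGG']
print(ss_aug.tolist())  # ['SHM', 'HHS', 'MHS', 'SHH']
```

Under this assumption, augmentation doubles the number of rows, so any arrays that are not passed through the helper (for example labels) would have to be extended separately to stay aligned.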
