[DOC] Fix confusing 'reverse complement' terminology in dataset and tutorial documentation (#538)

ChandanKT-git · web-flow · commit 13fd5cd5ccfe · 2026-04-27T16:46:13.000+02:00
#### Reference Issues/PRs

Follow-up to PR #478
resolves #335

#### What does this implement/fix? Explain your changes.

This PR fixes the confusing "reverse complement" terminology that was accidentally copy-pasted into the documentation across the repository.

As discussed in #478, the `augment_reverse` implementation does exactly what was intended by the original paper (a simple string reversal). The language in these files has been synced to say "reversed sequences" to maintain technical accuracy and avoid biological confusion:

- `pyaptamer/datasets/dataclasses/_api.py` (APIDataset docstring and `_prepare_data`)
- `examples/aptatrans_tutorial.ipynb` (Pipeline explanation cells)
- `pyaptamer/utils/_augment.py` (Return block docstring)

#### What should a reviewer concentrate their feedback on?

N/A - simple documentation/wording sync.

#### Did you add any tests for the change?

No (documentation only).

#### Any other comments?

None.

#### PR checklist

- [x] The PR title starts with either [ENH], [MNT], [DOC], or [BUG].
- [ ] Added/modified tests
- [x] Used pre-commit hooks when committing to ensure that code is compliant with hooks. Install hooks with `pre-commit install`.
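
#### Illustration: reversal vs. reverse complement

For reviewers less familiar with the distinction, here is a minimal sketch of why the two terms are not interchangeable. The helpers `reverse_sequence` and `reverse_complement` are hypothetical names written for this illustration only; they are not part of pyaptamer and do not reproduce the `augment_reverse` implementation. The RNA base pairing (A-U, C-G) is standard Watson-Crick pairing.

```python
# Hypothetical helpers for illustration only; not pyaptamer code.

# Watson-Crick pairing for RNA: A<->U, C<->G.
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_sequence(seq: str) -> str:
    """Plain string reversal, which is what the docs now describe."""
    return seq[::-1]


def reverse_complement(seq: str) -> str:
    """Reverse the sequence, then swap each base for its pairing partner."""
    return seq[::-1].translate(RNA_COMPLEMENT)


seq = "ACG"
ss = "SHM"  # secondary-structure string, as in the tutorial comment

print(reverse_sequence(seq))    # GCA
print(reverse_sequence(ss))     # MHS
print(reverse_complement(seq))  # CGU
```

The tutorial's own example, `(seq="ACG", ss="SHM") -> (seq="GCA", ss="MHS")`, matches the plain reversal and not the reverse complement, which is why the wording is changed rather than the code.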
diff --git a/examples/aptatrans_tutorial.ipynb b/examples/aptatrans_tutorial.ipynb
@@ -113,7 +113,7 @@
     "### Load (RNA) aptamer data for pretraining\n",
     "For pretraining the aptamer encoder, we use $79,890$ RNA aptamer sequences from the *bpRNA-1m* dataset from *bpRNA* [[2](#ref-2)].\n",
     "\n",
-    "The sequences are augmented by adding their reverse complements. Then, they are masked to a numerical format suitable for the encoder and stored in PyTorch dataloaders."
+    "The sequences are augmented by adding their reversed sequences. Then, they are masked to a numerical format suitable for the encoder and stored in PyTorch dataloaders."
    ]
   },
   {
@@ -142,7 +142,7 @@
     "    random_state=RAMDOM_STATE,\n",
     ")\n",
     "\n",
-    "# (3.) augment training data by adding reverse complements\n",
+    "# (3.) augment training data by adding reversed sequences\n",
     "# e.g., (seq=\"ACG\", ss=\"SHM\") -> (seq=\"GCA\", ss=\"MHS\")\n",
     "x_apta_train, y_apta_train = augment_reverse(x_apta_train, y_apta_train)\n",
     "\n",
@@ -182,7 +182,7 @@
     "### Load protein data for pretraining\n",
     "For pretraining the protein encoder, we use $166,136$ protein sequences from the Protein Data Bank (PDB) [[3](#ref-3)].\n",
     "\n",
-    "In this case, the sequences are not augmented by adding the reverse complements. However, protein words with below average frequency are filtered out. Then, similarly to above, sequences are transformed to a numerical representation suitable for the encoder and stored in PyTorch dataloaders."
+    "In this case, the sequences are not augmented by adding the reversed sequences. However, protein words with below average frequency are filtered out. Then, similarly to above, sequences are transformed to a numerical representation suitable for the encoder and stored in PyTorch dataloaders."
    ]
   },
   {
diff --git a/pyaptamer/datasets/dataclasses/_api.py b/pyaptamer/datasets/dataclasses/_api.py
@@ -30,9 +30,8 @@ class APIDataset(Dataset):
         sequences.
     split : str, optional, default="train"
         If "train", the dataset will augment aptamer sequences by adding their
-        reverse complements. If "test", the dataset will not augment the aptamer
+        reversed sequences. If "test", the dataset will not augment the aptamer
         sequences.
-        complements.
     """
 
     def __init__(
@@ -67,8 +66,8 @@ def _prepare_data(
         split: str,
     ) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
         """
-        Prepare the data by augmenting aptamer sequences with their reverse complements
-        and transforming them to vector numericla representations.
+        Prepare the data by augmenting aptamer sequences with their reversed sequences
+        and transforming them to vector numerical representations.
 
         Parameters
         ----------
@@ -78,9 +77,9 @@ def _prepare_data(
             Protein sequences.
         y : np.ndarray
             Laabels for the interactions.
-        split : bool
-            If True, the dataset will augment aptamer sequences by adding their reverse
-            complements.
+        split : str
+            If "train", the dataset will augment aptamer sequences by adding
+            their reversed sequences.
         """
         if split == "train":
             x_apta = augment_reverse(x_apta)[0]