Commit eafb468

[GEN-2407] Add to genie cli to skip all validation rules requiring internal database access for local validator (#628)
* add new parameter skip_database_checks to handle all internal database validation checks
* update local validation instruction, remove duplication and add to docstring
* refactor cna validation
* add numbers
* lint
* update docstring
* lint
1 parent 91ce5d9 commit eafb468

File tree

12 files changed

+322
-122
lines changed


README.md

Lines changed: 1 addition & 54 deletions
Original file line number | Diff line number | Diff line change
@@ -13,9 +13,6 @@
1313
- [Documentation](#documentation)
1414
- [Dependencies](#dependencies)
1515
- [File Validator](#file-validator)
16-
- [Setting up your environment](#setting-up-your-environment)
17-
- [Running the validator](#running-the-validator)
18-
- [Example commands](#example-commands)
1916
- [Contributing](#contributing)
2017
- [Sage Bionetworks Only](#sage-bionetworks-only)
2118
- [Running locally](#running-locally)
@@ -57,57 +54,7 @@ This package contains both R, Python and cli tools. These are tools or packages
5754

5855
## File Validator
5956

60-
One of the features of the `aacrgenie` package is that it provides a local validation tool that GENIE data contributors can install and use to validate their files locally prior to uploading to Synapse.
61-
62-
63-
### Setting up your environment
64-
65-
These instructions will install all the necessary components for you to run the validator locally on all of your files, including the Synapse client.
66-
67-
1. Create a virtual environment using the package manager of your choice (e.g., `conda`, `pipenv`, `pip`)
68-
69-
Example of creating a simple python environment
70-
71-
```
72-
python3 -m venv <env_name>
73-
source <env_name>/bin/activate
74-
```
75-
76-
2. Install the genie package
77-
78-
```
79-
pip install aacrgenie
80-
```
81-
82-
3. Verify the installation
83-
84-
```
85-
genie -v
86-
```
87-
88-
4. Set up authentication with Synapse through the [local .synapseConfig](https://python-docs.synapse.org/tutorials/authentication/#use-synapseconfig) or using an [environment variable](https://python-docs.synapse.org/tutorials/authentication/#use-environment-variable)
89-
90-
### Running the validator
91-
92-
Get help of all available commands
93-
94-
```
95-
genie validate -h
96-
```
97-
98-
### Example commands
99-
100-
Running validator on clinical file
101-
102-
```
103-
genie validate data_clinical_supp_SAGE.txt SAGE
104-
```
105-
106-
Running validator on cna file. **Note** that the flag `--nosymbol-check` is **REQUIRED** when running the validator for cna files because you would need access to an internal bed database table without it. For DEVELOPERS this is not required.
107-
108-
```
109-
genie validate data_cna_SAGE.txt SAGE --nosymbol-check
110-
```
57+
Please see the [local file validation tutorial](/docs/tutorials/local_file_validation.md) for more information on this and how to use it.
11158

11259
## Contributing
11360

docs/tutorials/local_file_validation.md

Lines changed: 47 additions & 1 deletion
@@ -2,12 +2,39 @@
22

33
One of the features of the `aacrgenie` package is that it provides a local validation tool that GENIE data contributors can install and use to validate their files locally prior to uploading to Synapse.
44

5+
## Setting up your environment
6+
7+
These instructions will install all the necessary components for you to run the validator locally on all of your files, including the Synapse client.
8+
9+
1. Create a virtual environment using the package manager of your choice (e.g., `conda`, `pipenv`, `pip`)
10+
11+
Example of creating a simple python environment
12+
13+
```bash
14+
python3 -m venv <env_name>
15+
source <env_name>/bin/activate
516
```
17+
18+
2. Install the genie package
19+
20+
```bash
621
pip install aacrgenie
22+
```
23+
24+
3. Verify the installation
25+
26+
```bash
727
genie -v
828
```
929

10-
This will install all the necessary components for you to run the validator locally on all of your files, including the Synapse client. Please view the help to see how to run to validator.
30+
4. Set up authentication with Synapse through the [local .synapseConfig](https://python-docs.synapse.org/tutorials/authentication/#use-synapseconfig) or using an [environment variable](https://python-docs.synapse.org/tutorials/authentication/#use-environment-variable)
31+
32+
33+
This will install all the necessary components for you to run the validator locally on all of your files, including the Synapse client.
34+
35+
## Running the validator
36+
37+
Please view the help to see how to run the validator.
1138

1239
```
1340
genie validate -h
@@ -18,3 +45,22 @@ Validate a file
1845
```
1946
genie validate data_clinical_supp_SAGE.txt SAGE
2047
```
48+
49+
### Special Consideration
50+
51+
The flag `--skip-database-checks` is **REQUIRED** when running the validator on cna and assay information files. Without it, the validator needs access to the internal bed and clinical database tables, respectively, and you will hit a Synapse `READ access` error.
52+
53+
54+
#### Examples
55+
56+
Running validator on cna file.
57+
58+
```
59+
genie validate data_cna_SAGE.txt SAGE --skip-database-checks
60+
```
61+
62+
Running validator on assay_information file.
63+
64+
```
65+
genie validate assay_information.yaml SAGE --skip-database-checks
66+
```

genie/__main__.py

Lines changed: 2 additions & 3 deletions
@@ -95,10 +95,9 @@ def build_parser():
9595
)
9696

9797
parser_validate.add_argument(
98-
"--nosymbol-check",
98+
"--skip-database-checks",
9999
action="store_true",
100-
help="Ignores specific post-processing validation criteria related to HUGO symbols "
101-
"in the structural variant and cna files.",
100+
help="Ignores validation checks that require internal database access",
102101
)
103102

104103
# TODO: remove this default when private genie project is ready
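
The renamed flag can be exercised in isolation. The following is a minimal sketch, not the project's actual `build_parser`: the helper name `build_demo_parser`, the subparser layout, and the positional argument names `filepath` and `center` are assumptions for illustration. Only the `--skip-database-checks` option, its `store_true` action, and its help text come from the diff above. It shows that argparse exposes the flag as `args.skip_database_checks`, defaulting to `False` when omitted.

```python
import argparse


def build_demo_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the real parser; only the flag is from the diff."""
    parser = argparse.ArgumentParser(prog="genie")
    subparsers = parser.add_subparsers(dest="command")
    parser_validate = subparsers.add_parser("validate")
    # Assumed positional arguments, mirroring `genie validate <file> <center>`
    parser_validate.add_argument("filepath")
    parser_validate.add_argument("center")
    parser_validate.add_argument(
        "--skip-database-checks",
        action="store_true",
        help="Ignores validation checks that require internal database access",
    )
    return parser


parser = build_demo_parser()
# Dashes in the option name become underscores on the parsed namespace
with_flag = parser.parse_args(
    ["validate", "data_cna_SAGE.txt", "SAGE", "--skip-database-checks"]
)
without_flag = parser.parse_args(["validate", "data_clinical_supp_SAGE.txt", "SAGE"])
```

Because the attribute name is derived from the option string, any caller that still reads `args.nosymbol_check` would fail with an `AttributeError`, which is why the rename has to be applied everywhere the namespace is consumed.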

genie/input_to_database.py

Lines changed: 1 addition & 1 deletion
@@ -349,7 +349,7 @@ def validatefile(
349349
# validator.entitylist = [syn.get(entity) for entity in entities]
350350

351351
valid_cls, message = validator.validate_single_file(
352-
oncotree_link=genie_config["oncotreeLink"], nosymbol_check=False
352+
oncotree_link=genie_config["oncotreeLink"], skip_database_checks=False
353353
)
354354

355355
logger.info("VALIDATION COMPLETE")

genie/validate.py

Lines changed: 2 additions & 2 deletions
@@ -132,7 +132,7 @@ def validate_single_file(self, **kwargs):
132132
class GenieValidationHelper(ValidationHelper):
133133
"""A validator helper class for AACR Project Genie."""
134134

135-
_validate_kwargs = ["nosymbol_check"]
135+
_validate_kwargs = ["skip_database_checks"]
136136

137137

138138
# TODO: Currently only checks if a user has READ permissions
@@ -249,7 +249,7 @@ def _perform_validate(syn, args):
249249
genie_config=genie_config,
250250
)
251251
mykwargs = dict(
252-
nosymbol_check=args.nosymbol_check,
252+
skip_database_checks=args.skip_database_checks,
253253
project_id=args.project_id,
254254
)
255255
valid, message = validator.validate_single_file(**mykwargs)
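
The diff shows the keyword name being renamed in `_validate_kwargs` here and in `_validation_kwargs` on the file-format classes, but not how those lists are consumed. One plausible reading, stated as an assumption rather than the project's confirmed behaviour, is that a base-class method forwards only the declared keyword arguments to `_validate`. The sketch below illustrates that pattern; `DemoFileFormat`, `DemoCna`, and the `project_id` value are hypothetical.

```python
from typing import Any, Dict, List, Tuple


class DemoFileFormat:
    """Hypothetical base class; not the project's FileTypeFormat."""

    # Names of the keyword arguments that _validate expects
    _validation_kwargs: List[str] = []

    def validate(self, data: Any, **kwargs: Any) -> Tuple[str, str]:
        # Forward only the declared keyword arguments; fail loudly if one is missing
        selected: Dict[str, Any] = {}
        for name in self._validation_kwargs:
            if name not in kwargs:
                raise ValueError(f"{name} must be passed to validate()")
            selected[name] = kwargs[name]
        return self._validate(data, **selected)

    def _validate(self, data: Any, **kwargs: Any) -> Tuple[str, str]:
        raise NotImplementedError


class DemoCna(DemoFileFormat):
    _validation_kwargs = ["skip_database_checks"]

    def _validate(self, data: Any, skip_database_checks: bool) -> Tuple[str, str]:
        # Placeholder for the real checks: returns (errors, warnings)
        error = "" if skip_database_checks else "database check would run here\n"
        return error, ""


# Extra keyword arguments such as project_id are ignored by this file format
errors, warnings = DemoCna().validate(
    data=None, skip_database_checks=True, project_id="syn000"
)
```

Under this assumption, a caller that still passed `nosymbol_check` would hit the `ValueError`, so the declaration, the caller, and the `_validate` signature must all use the same name.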

genie_registry/assay.py

Lines changed: 57 additions & 22 deletions
@@ -1,6 +1,7 @@
11
"""Assay information class"""
22

33
import os
4+
from typing import Tuple
45

56
import pandas as pd
67
import yaml
@@ -15,7 +16,7 @@ class Assayinfo(FileTypeFormat):
1516

1617
_process_kwargs = ["newPath", "databaseSynId"]
1718

18-
# _validation_kwargs = ["project_id"]
19+
_validation_kwargs = ["skip_database_checks"] # project_id
1920

2021
def _validateFilename(self, filepath_list):
2122
"""Validate assay information filename"""
@@ -127,15 +128,19 @@ def _get_dataframe(self, filepath_list):
127128
all_panel_info = pd.concat([all_panel_info, assay_finaldf])
128129
return all_panel_info
129130

130-
def _validate(self, assay_info_df):
131+
def _validate(
132+
self, assay_info_df: pd.DataFrame, skip_database_checks: bool
133+
) -> Tuple[str, str]:
131134
"""
132135
Validates the values of assay information file
133136
134137
Args:
135-
assay_info_df: assay information dataframe
138+
assay_info_df (pd.DataFrame): input assay information dataframe
139+
skip_database_checks (bool): Whether to skip certain validation checks
140+
since they require access to the internal database tables
136141
137142
Returns:
138-
tuple: error and warning
143+
Tuple[str, str]: complete error and warning messages
139144
"""
140145

141146
total_error = ""
@@ -153,25 +158,9 @@ def _validate(self, assay_info_df):
153158
"SEQ_ASSAY_IDs start with your center abbreviation.\n"
154159
)
155160

156-
uniq_seq_df = extract.get_syntabledf(
157-
self.syn,
158-
f"select distinct(SEQ_ASSAY_ID) as seq from {self.genie_config['sample']} "
159-
f"where CENTER = '{self.center}'",
161+
total_error += self.validate_all_seq_assay_ids_exist_in_clinical_database(
162+
all_seq_assays=all_seq_assays, skip_database_checks=skip_database_checks
160163
)
161-
# These are all the SEQ_ASSAY_IDs that are in the clinical database
162-
# but not in the assay_information file
163-
missing_seqs = uniq_seq_df["seq"][
164-
~uniq_seq_df["seq"]
165-
.replace({"_": "-"}, regex=True)
166-
.str.upper()
167-
.isin(all_seq_assays)
168-
]
169-
missing_seqs_str = ", ".join(missing_seqs)
170-
if missing_seqs.to_list():
171-
total_error += (
172-
"Assay_information.yaml: You are missing SEQ_ASSAY_IDs: "
173-
f"{missing_seqs_str}\n"
174-
)
175164

176165
else:
177166
total_error += "Assay_information.yaml: Must have SEQ_ASSAY_ID column.\n"
@@ -390,3 +379,49 @@ def _validate(self, assay_info_df):
390379
total_error += error
391380

392381
return total_error, warning
382+
383+
def validate_all_seq_assay_ids_exist_in_clinical_database(
384+
self, all_seq_assays: dict, skip_database_checks: bool
385+
) -> str:
386+
"""Validates that all SEQ_ASSAY_IDs in the clinical sample database
387+
for that center exist in the assay information file for that center.
388+
389+
**Conditions**
390+
391+
| Condition | Result |
392+
|---|:---:|
393+
| Assay information file has more SEQ_ASSAY_IDs than in clinical database | ✅ PASS |
394+
| Assay information file has the same SEQ_ASSAY_IDs as in clinical database | ✅ PASS |
395+
| Assay information file has fewer SEQ_ASSAY_IDs than in clinical database | ❌ FAIL |
396+
397+
Args:
398+
all_seq_assays (dict): list of all the SEQ_ASSAY_IDs in
399+
the assay information file
400+
skip_database_checks (bool): Whether to skip this validation check
401+
since it requires access to the internal clinical sample database
402+
403+
Returns:
404+
str: error message
405+
"""
406+
error = ""
407+
if not skip_database_checks:
408+
uniq_seq_df = extract.get_syntabledf(
409+
self.syn,
410+
f"select distinct(SEQ_ASSAY_ID) as seq from {self.genie_config['sample']} "
411+
f"where CENTER = '{self.center}'",
412+
)
413+
# These are all the SEQ_ASSAY_IDs that are in the clinical database
414+
# but not in the assay_information file
415+
missing_seqs = uniq_seq_df["seq"][
416+
~uniq_seq_df["seq"]
417+
.replace({"_": "-"}, regex=True)
418+
.str.upper()
419+
.isin(all_seq_assays)
420+
]
421+
missing_seqs_str = ", ".join(missing_seqs)
422+
if missing_seqs.to_list():
423+
error += (
424+
"Assay_information.yaml: You are missing SEQ_ASSAY_IDs: "
425+
f"{missing_seqs_str}\n"
426+
)
427+
return error
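
The comparison performed by `validate_all_seq_assay_ids_exist_in_clinical_database` can be tried without Synapse access. This is a dependency-free sketch of the same logic, assuming the database IDs are supplied as a plain list instead of being queried with `extract.get_syntabledf`; the function name `find_missing_seq_assay_ids` and the sample IDs are hypothetical.

```python
from typing import Iterable, List


def find_missing_seq_assay_ids(
    database_ids: Iterable[str],
    assay_file_ids: Iterable[str],
    skip_database_checks: bool,
) -> str:
    """Return an error message listing database IDs absent from the assay file."""
    if skip_database_checks:
        # No database access available, so the comparison is skipped entirely
        return ""
    known = set(assay_file_ids)
    # Normalize as the diff does: underscores become dashes, then upper-case
    missing: List[str] = [
        seq for seq in database_ids if seq.replace("_", "-").upper() not in known
    ]
    if not missing:
        return ""
    return (
        "Assay_information.yaml: You are missing SEQ_ASSAY_IDs: "
        f"{', '.join(missing)}\n"
    )


# Illustrative IDs only; real values come from the clinical sample table
database_ids = ["SAGE-1", "sage_2", "SAGE-3"]
assay_file_ids = ["SAGE-1", "SAGE-2"]
message = find_missing_seq_assay_ids(database_ids, assay_file_ids, False)
skipped = find_missing_seq_assay_ids(database_ids, assay_file_ids, True)
```

Note the direction of the check: extra IDs in the assay information file pass, while an ID that exists in the database but not in the file is reported.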

genie_registry/cna.py

Lines changed: 58 additions & 25 deletions
@@ -1,6 +1,6 @@
11
import logging
22
import os
3-
from typing import Union
3+
from typing import Tuple, Union
44

55
import pandas as pd
66
import synapseclient
@@ -114,7 +114,7 @@ class cna(FileTypeFormat):
114114

115115
_process_kwargs = ["newPath"]
116116

117-
_validation_kwargs = ["nosymbol_check"]
117+
_validation_kwargs = ["skip_database_checks"]
118118

119119
# VALIDATE FILENAME
120120
def _validateFilename(self, filePath):
@@ -175,7 +175,18 @@ def process_steps(self, cnaDf, newPath):
175175
self.syn.store(synapseclient.File(newPath, parent=centerMafSynId))
176176
return newPath
177177

178-
def _validate(self, cnvDF, nosymbol_check):
178+
def _validate(self, cnvDF: pd.DataFrame, skip_database_checks: bool) -> Tuple:
179+
"""
180+
Validates the values of the input cna file
181+
182+
Args:
183+
cnvDF (pd.DataFrame): input CNA file
184+
skip_database_checks (bool): Whether to skip this validation check
185+
since it requires access to the internal bed database
186+
187+
Returns:
188+
Tuple: complete error and warning messages
189+
"""
179190
total_error = ""
180191
warning = ""
181192
cnvDF.columns = [col.upper() for col in cnvDF.columns]
@@ -220,27 +231,49 @@ def _validate(self, cnvDF, nosymbol_check):
220231
)
221232
else:
222233
cnvDF["HUGO_SYMBOL"] = keepSymbols
223-
if haveColumn and not nosymbol_check:
224-
bedSynId = self.genie_config["bed"]
225-
bed = self.syn.tableQuery(
226-
f"select Hugo_Symbol, ID from {bedSynId} "
227-
f"where CENTER = '{self.center}'"
228-
)
229-
bedDf = bed.asDataFrame()
230-
cnvDF["remapped"] = cnvDF["HUGO_SYMBOL"].apply(
231-
lambda x: validateSymbol(x, bedDf)
234+
if haveColumn:
235+
total_error += self.validate_no_dup_symbols_after_remapping(
236+
cnvDF=cnvDF, skip_database_checks=skip_database_checks
232237
)
233-
cnvDF = cnvDF[~cnvDF["remapped"].isnull()]
234-
235-
# Do not allow any duplicated genes after symbols
236-
# have been remapped
237-
if sum(cnvDF["remapped"].duplicated()) > 0:
238-
duplicated = cnvDF["remapped"].duplicated(keep=False)
239-
total_error += (
240-
"Your CNA file has duplicated Hugo_Symbols "
241-
"(After remapping of genes): {} -> {}.\n".format(
242-
",".join(cnvDF["HUGO_SYMBOL"][duplicated]),
243-
",".join(cnvDF["remapped"][duplicated]),
244-
)
245-
)
246238
return (total_error, warning)
239+
240+
def validate_no_dup_symbols_after_remapping(
241+
self, cnvDF: pd.DataFrame, skip_database_checks: bool
242+
) -> str:
243+
"""Validates that there are no duplicated Hugo_Symbol values
244+
after remapping the previous Hugo_Symbol column using the
245+
bed database table. See validateSymbol for more details
246+
on the remapping method.
247+
248+
Args:
249+
skip_database_checks (bool): Whether to skip this validation check
250+
since it requires access to the internal bed database
251+
252+
Returns:
253+
str: error message
254+
"""
255+
error = ""
256+
if not skip_database_checks:
257+
bedSynId = self.genie_config["bed"]
258+
bed = self.syn.tableQuery(
259+
f"select Hugo_Symbol, ID from {bedSynId} "
260+
f"where CENTER = '{self.center}'"
261+
)
262+
bedDf = bed.asDataFrame()
263+
cnvDF["remapped"] = cnvDF["HUGO_SYMBOL"].apply(
264+
lambda x: validateSymbol(x, bedDf)
265+
)
266+
cnvDF = cnvDF[~cnvDF["remapped"].isnull()]
267+
268+
# Do not allow any duplicated genes after symbols
269+
# have been remapped
270+
if sum(cnvDF["remapped"].duplicated()) > 0:
271+
duplicated = cnvDF["remapped"].duplicated(keep=False)
272+
error += (
273+
"Your CNA file has duplicated Hugo_Symbols "
274+
"(After remapping of genes): {} -> {}.\n".format(
275+
",".join(cnvDF["HUGO_SYMBOL"][duplicated]),
276+
",".join(cnvDF["remapped"][duplicated]),
277+
)
278+
)
279+
return error
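
The duplicate detection in `validate_no_dup_symbols_after_remapping` can likewise be illustrated without the bed table. The sketch below is an approximation under stated assumptions: a plain dictionary `remap` stands in for `validateSymbol` and the bed database lookup (whose real behaviour is not shown in this diff), and the gene names are placeholders. It reproduces the two visible steps: dropping symbols that remap to nothing, then reporting every row whose remapped symbol occurs more than once, as `duplicated(keep=False)` does.

```python
from collections import Counter
from typing import Dict, List, Optional


def find_duplicates_after_remapping(
    symbols: List[str],
    remap: Dict[str, Optional[str]],
    skip_database_checks: bool,
) -> str:
    """Return an error message if remapping produces duplicated symbols."""
    if skip_database_checks:
        # The remapping needs the bed table, so nothing is checked
        return ""
    # Pair each symbol with its remapped value; unmapped symbols keep their name
    pairs = [(symbol, remap.get(symbol, symbol)) for symbol in symbols]
    # Drop symbols that remap to nothing, as the diff drops null values
    pairs = [(symbol, new) for symbol, new in pairs if new is not None]
    counts = Counter(new for _, new in pairs)
    # Keep every row whose remapped symbol occurs more than once
    dups = [(symbol, new) for symbol, new in pairs if counts[new] > 1]
    if not dups:
        return ""
    return (
        "Your CNA file has duplicated Hugo_Symbols "
        "(After remapping of genes): {} -> {}.\n".format(
            ",".join(symbol for symbol, _ in dups),
            ",".join(new for _, new in dups),
        )
    )


# Placeholder gene names; GENE_OLD is assumed to remap onto GENE_NEW
symbols = ["GENE_OLD", "GENE_NEW", "TP53"]
remap = {"GENE_OLD": "GENE_NEW"}
message = find_duplicates_after_remapping(symbols, remap, False)
skipped = find_duplicates_after_remapping(symbols, remap, True)
```

With the flag set, the file passes this step even though two rows would collapse onto the same symbol, which is the trade-off local contributors accept when they cannot read the internal table.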
