A transcriptome can describe the total state of a tumor at a snapshot
in time. In this repository, we use cancer transcriptomes from The Cancer
- Genome Atlas Pan Cancer dataset to interrogate gene expression states induced
- by deleterious mutations and copy number alterations.
-
- We have previously described the ability of a machine learning classifier to
- detect an NF1 inactivation signature using Glioblastoma data
- ([Way _et al._ 2016](http://doi.org/10.1186/s12864-017-3519-7)). We applied an
- ensemble of logistic regression classifiers to the problem, but the solutions were
- unstable and overfit. To address these issues, we posited that we could leverage
- data from diverse tissue-types to build a pancancer NF1 classifier. We also
- hypothesized that a RAS classifier would be able to detect tumors with NF1
- inactivation since NF1 directly inhibits RAS activity and there are many more
- examples of samples with RAS mutations.
+ Genome Atlas Pan Cancer consortium to interrogate gene expression states
+ induced by deleterious mutations and copy number alterations.

The code in this repository is flexible and can build a Pan-Cancer classifier
for any combination of genes and cancer types using gene expression, mutation,
- and copy number data. Currently, we build classifiers to detect NF1/RAS
- aberration and TP53 inactivation.
+ and copy number data. In this repository, we provide examples for building
+ classifiers to detect aberration in _TP53_ and _NF1_/RAS signalling.
+
+ We have previously described the ability of a machine learning classifier to
+ detect an _NF1_ inactivation signature using Glioblastoma data
+ ([Way _et al._ 2016](http://doi.org/10.1186/s12864-017-3519-7)). We applied an
+ ensemble of logistic regression classifiers to the problem, but the solutions
+ were unstable and overfit. To address these issues, we posited that we could
+ leverage data from diverse cancer types to build a pancancer _NF1_ classifier.
+ We also hypothesized that a RAS classifier would be able to detect tumors with
+ _NF1_ inactivation since _NF1_ directly inhibits RAS activity and there are
+ many more examples of samples with RAS mutations.

## Controlled Access Data

@@ -38,26 +38,17 @@ Eventually, all of the controlled access data used in this pipeline will be
made public. **We will update this database when the data is officially
released.**

- ## Cancer Genes
-
- Note that in order to use the copy number integration feature, an additional
- file must be downloaded. The file is `Supplementary Table S2` of
- [Vogelstein _et al._ 2013]("http://doi.org/10.1126/science.1235122").
-
- Processed data is located here: `data/vogelstein_cancergenes.tsv`
-
## Usage

### Initialization

- The pipeline must first be initialized before use. Initialization will
- download and process data and setup computational environment.
+ The pipeline must be initialized before use. Initialization will download and
+ process data and setup computational environment.

- To initialize enter the following in the command line:
+ To initialize, enter the following in the command line:

```sh
# Login to synapse to download controlled-access data
- # Note, publicly available Xena data is also available for download
synapse login

# Create and activate conda environment
@@ -70,37 +61,38 @@ source activate pancancer-classifier

### Example Scripts

- We provide two distinct example pipelines for predicting TP53 and RAS/NF1
+ We provide two distinct example pipelines for predicting _TP53_ and _NF1_/RAS
loss of function.

- 1. TP53 loss of function (see [tp53_analysis.sh](tp53_analysis.sh))
- 2. RAS/NF1 loss of function (see [ras_nf1_analysis.sh](ras_nf1_analysis.sh))
+ 1. _TP53_ loss of function (see [tp53_analysis.sh](tp53_analysis.sh))
+ 2. _NF1_/RAS loss of function (see [ras_nf1_analysis.sh](ras_nf1_analysis.sh))

### Customization

- For custom analyses, use the `pancancer_classifier.py` script with command line
- arguments.
+ For custom analyses, use the
+ [scripts/pancancer_classifier.py](scripts/pancancer_classifier.py) script with
+ command line arguments.

```
- python pancancer_classifier.py ...
+ python scripts/pancancer_classifier.py ...
```

- | Flag | Abbreviation | Required/Default | Description |
- | ---- | :----------: | :--------------: | ----------- |
- | `genes` | `-g` | REQUIRED | Build a classifier for the input gene symbols |
- | `tissues` | `-t` | `Auto` | The tissues to use in building the classifier |
- | `folds` | `-f` | `5` | Number of cross validation folds |
- | `drop` | `-d` | `False` | Decision to drop input genes from expression matrix |
- | `copy_number` | `-u` | `False` | Integrate copy number data to gene event |
- | `filter_count` | `-c` | `15` | Default options to filter tissues if none are specified |
- | `filter_prop` | `-p` | `0.05` | Default options to filter tissues if none are specified |
- | `num_features` | `-n` | `8000` | Number of MAD genes used to build classifier |
- | `alphas` | `-a` | `0.01,0.1,0.15,0.2,0.5,0.8` | The alpha grid to search over in parameter sweep |
- | `l1_ratios` | `-l` | `0,0.1,0.15,0.18,0.2,0.3` | The l1 ratio grid to search over in parameter sweep |
- | `alt_genes` | `-b` | `None` | Alternative genes to test classifier performance |
- | `alt_tissues` | `-s` | `Auto` | Alternative tissues to test classifier performance |
- | `alt_tissue_count` | `-i` | `15` | Filtering used for alternative tissue classification |
- | `alt_filter_prop` | `-r` | `0.05` | Filtering used for alternative tissue classification |
- | `alt_folder` | `-o` | `Auto` | Location to save all classifier figures |
- | `xena` | `-x` | `False` | If present, use publicly available data for building classifier |
+ | Flag | Required/Default | Description |
+ | ---- | :--------------: | ----------- |
+ | `--genes` | Required | Build a classifier for the input gene symbols |
+ | `--diseases` | `Auto` | The disease types to use in building the classifier |
+ | `--folds` | `5` | Number of cross validation folds |
+ | `--drop` | `False` | Decision to drop input genes from expression matrix |
+ | `--copy_number` | `False` | Integrate copy number data to gene event |
+ | `--filter_count` | `15` | Default options to filter diseases if none are specified |
+ | `--filter_prop` | `0.05` | Default options to filter diseases if none are specified |
+ | `--num_features` | `8000` | Number of MAD genes used to build classifier |
+ | `--alphas` | `0.1,0.15,0.2,0.5,0.8,1` | The alpha grid to search over in parameter sweep |
+ | `--l1_ratios` | `0,0.1,0.15,0.18,0.2,0.3` | The l1 ratio grid to search over in parameter sweep |
+ | `--alt_genes` | `None` | Alternative genes to test classifier performance |
+ | `--alt_diseases` | `Auto` | Alternative diseases to test classifier performance |
+ | `--alt_filter_count` | `15` | Filtering used for alternative disease classification |
+ | `--alt_filter_prop` | `0.05` | Filtering used for alternative disease classification |
+ | `--alt_folder` | `Auto` | Location to save all classifier figures |
+ | `--remove_hyper` | `False` | Decision to remove hyper mutated tumors |

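As a rough illustration of how the flags in the updated table might be combined, the sketch below assembles one invocation and prints it as a dry run instead of executing it. It is not taken from the repository's example scripts. The disease codes (`BLCA`, `COAD`) are TCGA study abbreviations picked only for illustration. The comma-separated value format is assumed from the style of the default `--alphas` grid, and `--drop` and `--copy_number` are assumed to be valueless switches because their defaults are `False`.

```sh
#!/bin/sh
# Illustrative values only (assumptions, not taken from the repository):
genes='TP53'            # gene symbol used in the README's own examples
diseases='BLCA,COAD'    # assumed comma-separated list of TCGA disease codes

# Assemble the command using only flags listed in the table above.
# --drop and --copy_number are assumed to be boolean switches.
cmd="python scripts/pancancer_classifier.py --genes $genes --diseases $diseases --folds 5 --drop --copy_number"

# Dry run: print the command rather than executing it.
echo "$cmd"
```

Once the environment from the initialization step is active, the printed command could be run directly. Any flag left out falls back to the default listed in the table.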