Commit c051c95

Publication organization (#23)
* update README for release
* update analysis script for only classifier building
* add bash script for building TP53 figures
* remove reference, no longer applicable
* change tissue to cancer in readme
* update readme
* update readme
* update readme
* update readme
* update readme
1 parent f010dfd commit c051c95

4 files changed: +90 -72 lines changed

README.md

+42-50
@@ -6,23 +6,23 @@

 A transcriptome can describe the total state of a tumor at a snapshot
 in time. In this repository, we use cancer transcriptomes from The Cancer
-Genome Atlas Pan Cancer dataset to interrogate gene expression states induced
-by deleterious mutations and copy number alterations.
-
-We have previously described the ability of a machine learning classifier to
-detect an NF1 inactivation signature using Glioblastoma data
-([Way _et al._ 2016](http://doi.org/10.1186/s12864-017-3519-7)). We applied an
-ensemble of logistic regression classifiers to the problem, but the solutions were
-unstable and overfit. To address these issues, we posited that we could leverage
-data from diverse tissue-types to build a pancancer NF1 classifier. We also
-hypothesized that a RAS classifier would be able to detect tumors with NF1
-inactivation since NF1 directly inhibits RAS activity and there are many more
-examples of samples with RAS mutations.
+Genome Atlas Pan Cancer consortium to interrogate gene expression states
+induced by deleterious mutations and copy number alterations.

 The code in this repository is flexible and can build a Pan-Cancer classifier
 for any combination of genes and cancer types using gene expression, mutation,
-and copy number data. Currently, we build classifiers to detect NF1/RAS
-aberration and TP53 inactivation.
+and copy number data. In this repository, we provide examples for building
+classifiers to detect aberration in _TP53_ and _NF1_/RAS signalling.
+
+We have previously described the ability of a machine learning classifier to
+detect an _NF1_ inactivation signature using Glioblastoma data
+([Way _et al._ 2016](http://doi.org/10.1186/s12864-017-3519-7)). We applied an
+ensemble of logistic regression classifiers to the problem, but the solutions
+were unstable and overfit. To address these issues, we posited that we could
+leverage data from diverse cancer types to build a pancancer _NF1_ classifier.
+We also hypothesized that a RAS classifier would be able to detect tumors with
+_NF1_ inactivation since _NF1_ directly inhibits RAS activity and there are
+many more examples of samples with RAS mutations.

 ## Controlled Access Data

@@ -38,26 +38,17 @@ Eventually, all of the controlled access data used in this pipeline will be
 made public. **We will update this database when the data is officially
 released.**

-## Cancer Genes
-
-Note that in order to use the copy number integration feature, an additional
-file must be downloaded. The file is `Supplementary Table S2` of
-[Vogelstein _et al._ 2013]("http://doi.org/10.1126/science.1235122").
-
-Processed data is located here: `data/vogelstein_cancergenes.tsv`
-
 ## Usage

 ### Initialization

-The pipeline must first be initialized before use. Initialization will
-download and process data and setup computational environment.
+The pipeline must be initialized before use. Initialization will download and
+process data and setup computational environment.

-To initialize enter the following in the command line:
+To initialize, enter the following in the command line:

 ```sh
 # Login to synapse to download controlled-access data
-# Note, publicly available Xena data is also available for download
 synapse login

 # Create and activate conda environment
@@ -70,37 +61,38 @@ source activate pancancer-classifier

 ### Example Scripts

-We provide two distinct example pipelines for predicting TP53 and RAS/NF1
+We provide two distinct example pipelines for predicting _TP53_ and _NF1_/RAS
 loss of function.

-1. TP53 loss of function (see [tp53_analysis.sh](tp53_analysis.sh))
-2. RAS/NF1 loss of function (see [ras_nf1_analysis.sh](ras_nf1_analysis.sh))
+1. _TP53_ loss of function (see [tp53_analysis.sh](tp53_analysis.sh))
+2. _NF1_/RAS loss of function (see [ras_nf1_analysis.sh](ras_nf1_analysis.sh))

 ### Customization

-For custom analyses, use the `pancancer_classifier.py` script with command line
-arguments.
+For custom analyses, use the
+[scripts/pancancer_classifier.py](scripts/pancancer_classifier.py) script with
+command line arguments.

 ```
-python pancancer_classifier.py ...
+python scripts/pancancer_classifier.py ...
 ```

-| Flag | Abbreviation | Required/Default | Description |
-| ---- | :----------: | :------: | ----------- |
-| `genes` | `-g` | REQUIRED | Build a classifier for the input gene symbols |
-| `tissues` | `-t` | `Auto` | The tissues to use in building the classifier |
-| `folds` | `-f` | `5` | Number of cross validation folds |
-| `drop` | `-d` | `False` | Decision to drop input genes from expression matrix |
-| `copy_number` | `-u` | `False` | Integrate copy number data to gene event |
-| `filter_count` | `-c` | `15` | Default options to filter tissues if none are specified |
-| `filter_prop` | `-p` | `0.05` | Default options to filter tissues if none are specified |
-| `num_features` | `-n` | `8000` | Number of MAD genes used to build classifier |
-| `alphas` | `-a` | `0.01,0.1,0.15,0.2,0.5,0.8` | The alpha grid to search over in parameter sweep |
-| `l1_ratios` | `-l` | `0,0.1,0.15,0.18,0.2,0.3` | The l1 ratio grid to search over in parameter sweep |
-| `alt_genes` | `-b` | `None` | Alternative genes to test classifier performance |
-| `alt_tissues` | `-s` | `Auto` | Alternative tissues to test classifier performance |
-| `alt_tissue_count` | `-i` | `15` | Filtering used for alternative tissue classification |
-| `alt_filter_prop` | `-r` | `0.05` | Filtering used for alternative tissue classification |
-| `alt_folder` | `-o` | `Auto` | Location to save all classifier figures |
-| `xena` | `-x` | `False` | If present, use publicly available data for building classifier |
+| Flag | Required/Default | Description |
+| ---- | :--------------: | ----------- |
+| `--genes` | Required | Build a classifier for the input gene symbols |
+| `--diseases` | `Auto` | The disease types to use in building the classifier |
+| `--folds` | `5` | Number of cross validation folds |
+| `--drop` | `False` | Decision to drop input genes from expression matrix |
+| `--copy_number` | `False` | Integrate copy number data to gene event |
+| `--filter_count` | `15` | Default options to filter diseases if none are specified |
+| `--filter_prop` | `0.05` | Default options to filter diseases if none are specified |
+| `--num_features` | `8000` | Number of MAD genes used to build classifier |
+| `--alphas` | `0.1,0.15,0.2,0.5,0.8,1` | The alpha grid to search over in parameter sweep |
+| `--l1_ratios` | `0,0.1,0.15,0.18,0.2,0.3` | The l1 ratio grid to search over in parameter sweep |
+| `--alt_genes` | `None` | Alternative genes to test classifier performance |
+| `--alt_diseases` | `Auto` | Alternative diseases to test classifier performance |
+| `--alt_filter_count` | `15` | Filtering used for alternative disease classification |
+| `--alt_filter_prop` | `0.05` | Filtering used for alternative disease classification |
+| `--alt_folder` | `Auto` | Location to save all classifier figures |
+| `--remove_hyper` | `False` | Decision to remove hyper mutated tumors |
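
As a rough illustration of how the flags in the updated table fit together, the sketch below builds a stand-in `argparse` parser for a subset of them. It is not the code in `scripts/pancancer_classifier.py`: the helper names `build_parser` and `parse_grid` are hypothetical, and treating `--drop`, `--copy_number`, and `--remove_hyper` as on/off switches and the grids as comma-separated strings is an assumption based on the defaults in the table and the invocations in `tp53_analysis.sh`.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring part of the README flag table.

    This is an illustrative stand-in, not the parser used by the real script.
    """
    parser = argparse.ArgumentParser(description="Sketch of the documented flags")
    parser.add_argument("--genes", required=True,
                        help="comma-separated gene symbols to classify")
    parser.add_argument("--diseases", default="Auto",
                        help="comma-separated disease types, or Auto")
    parser.add_argument("--folds", type=int, default=5,
                        help="number of cross validation folds")
    # Assumption: boolean options are plain switches that default to False.
    parser.add_argument("--drop", action="store_true",
                        help="drop the input genes from the expression matrix")
    parser.add_argument("--copy_number", action="store_true",
                        help="integrate copy number data into the gene event")
    parser.add_argument("--remove_hyper", action="store_true",
                        help="remove hyper mutated tumors")
    parser.add_argument("--num_features", type=int, default=8000,
                        help="number of MAD genes used to build the classifier")
    parser.add_argument("--alphas", default="0.1,0.15,0.2,0.5,0.8,1",
                        help="alpha grid for the parameter sweep")
    parser.add_argument("--l1_ratios", default="0,0.1,0.15,0.18,0.2,0.3",
                        help="l1 ratio grid for the parameter sweep")
    return parser


def parse_grid(text):
    """Turn a comma-separated string such as '0.1,0.5' into a list of floats."""
    return [float(value) for value in text.split(",")]


# Example: only --genes is required; everything else falls back to its default.
args = build_parser().parse_args(["--genes", "TP53", "--drop", "--copy_number"])
alphas = parse_grid(args.alphas)
l1_ratios = parse_grid(args.l1_ratios)
```

With this stand-in, parsing `--genes TP53 --drop --copy_number` leaves `folds` at 5 and `remove_hyper` at `False`, matching the defaults documented in the table.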

scripts/pancancer_classifier.py

-1
@@ -1,6 +1,5 @@
 """
 Gregory Way 2017
-Heavily modified from https://github.com/cognoma/machine-learning/
 PanCancer Classifier
 scripts/pancancer_classifier.py

scripts/tp53_ddr_figures.sh

+29
@@ -0,0 +1,29 @@
+#!/bin/bash
+
+# Pipeline to reproduce figures for the DNA Damage Repair manuscript
+#
+# Usage: bash scripts/tp53_ddr_figures.sh
+#
+# Output: summarizes the results of the TP53 classifier and outputs
+# several figures and tables
+
+tp53_dir='classifiers/TP53'
+
+# 1. Apply PanCan classifier to all samples and output scores for each sample
+python scripts/apply_weights.py --classifier $tp53_dir --copy_number
+
+# 2. Summarize and visualize performance of classifiers
+python scripts/visualize_decisions.py --scores $tp53_dir --custom 'TP53_loss'
+python scripts/map_mutation_class.py --scores $tp53_dir --genes 'TP53'
+Rscript --vanilla scripts/ddr_summary_figures.R
+Rscript --vanilla scripts/compare_within_models.R \
+        --within_dir $tp53_dir'/within_disease' --pancan_summary $tp53_dir
+
+# 3. Perform Snaptron analysis
+# NOTE: Snaptron setup must be performed first. See `pancancer/scripts/snaptron/`
+bash dna_damage_repair_tp53exon.sh
+
+# 4. Perform copy number burden analysis
+python scripts/copy_burden_merge.py --classifier_folder $tp53_dir
+Rscript --vanilla scripts/copy_burden_figures.R
+

tp53_analysis.sh

+19-21
@@ -4,9 +4,8 @@
 #
 # Usage: bash tp53_analysis.sh
 #
-# Output: will run all specified classifiers which will output performance plots
-# and summarize how a machine learning classifier can detect aberrant
-# TP53 activity RNAseq, copy number, and gene expression.
+# Output: Will train a pan cancer model to detect TP53 aberration. Will also
+# train a unique classifier within each specific cancer type

 # Set Constants
 tp53_diseases='BLCA,BRCA,CESC,COAD,ESCA,GBM,HNSC,KICH,LGG,LIHC,LUAD,LUSC,'\
@@ -15,24 +14,23 @@ alphas='0.1,0.13,0.15,0.18,0.2,0.3,0.4,0.6,0.7'
 l1_mixing='0.1,0.125,0.15,0.2,0.25,0.3,0.35'
 tp53_dir='classifiers/TP53'

-# 1. PanCancer TP53 classification
-python scripts/pancancer_classifier.py --genes 'TP53' --diseases $tp53_diseases \
-        --drop --copy_number --remove_hyper --alt_folder $tp53_dir \
-        --alphas $alphas --l1_ratios $l1_mixing
+# Pan Cancer TP53 classification
+python scripts/pancancer_classifier.py \
+        --genes 'TP53' \
+        --diseases $tp53_diseases \
+        --drop \
+        --copy_number \
+        --remove_hyper \
+        --alt_folder $tp53_dir \
+        --alphas $alphas \
+        --l1_ratios $l1_mixing

-# 2. Within disease type TP53 classification
-python scripts/within_tissue_analysis.py --genes 'TP53' \
-        --diseases $tp53_diseases --remove_hyper \
+# Within Disease type TP53 classification
+python scripts/within_tissue_analysis.py \
+        --genes 'TP53' \
+        --diseases $tp53_diseases \
+        --remove_hyper \
         --alt_folder $tp53_dir'/within_disease' \
-        --alphas $alphas --l1_ratios $l1_mixing
-
-# 3. Apply PanCan classifier to all samples and output scores for each sample
-python scripts/apply_weights.py --classifier $tp53_dir --copy_number
-
-# 4. Summarize and visualize performance of classifiers
-python scripts/visualize_decisions.py --scores $tp53_dir --custom 'TP53_loss'
-python scripts/map_mutation_class.py --scores $tp53_dir --genes 'TP53'
-Rscript --vanilla scripts/ddr_summary_figures.R
-Rscript --vanilla scripts/compare_within_models.R \
-        --within_dir $tp53_dir'/within_disease' --pancan_summary $tp53_dir
+        --alphas $alphas \
+        --l1_ratios $l1_mixing
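
To give a sense of the size of the parameter sweep that the `alphas` and `l1_mixing` constants imply, the sketch below enumerates every pairing of the two grids. It assumes the sweep is an exhaustive grid evaluated with the README default of 5 cross validation folds (the invocation above does not pass `--folds`); how the real scripts run the search is not shown in this diff, and `parse_grid` is a hypothetical helper.

```python
from itertools import product

# Grids copied from the constants in tp53_analysis.sh
alphas = "0.1,0.13,0.15,0.18,0.2,0.3,0.4,0.6,0.7"
l1_mixing = "0.1,0.125,0.15,0.2,0.25,0.3,0.35"
folds = 5  # assumption: README default for --folds, not overridden in the script


def parse_grid(text):
    """Turn a comma-separated string into a list of floats (hypothetical helper)."""
    return [float(value) for value in text.split(",")]


# Every (alpha, l1_ratio) pair an exhaustive grid search would visit
grid = list(product(parse_grid(alphas), parse_grid(l1_mixing)))

# One model fit per pair per fold, under the exhaustive-search assumption
n_fits = len(grid) * folds
print(f"{len(grid)} parameter pairs, {n_fits} fits")
```

Under these assumptions, the 9 alpha values and 7 l1 ratios give 63 parameter pairs per classifier, so widening either grid increases the cost of both the pan cancer model and each within-disease model.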
