This repository was archived by the owner on Oct 7, 2021. It is now read-only.

Commit 9badff5

Merge pull request icgc-argo-workflows#14 from ICGC-ARGO-Structural-Variation-CN-WG/facets@0.3.0
Facets@0.3.0 [release]
2 parents 7783922 + 4b8165d commit 9badff5

16 files changed

Lines changed: 707 additions & 0 deletions

facets/.dockerignore

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
1+
.gitignore
2+
.nextflow*
3+
tests
4+
work
5+
outdir

facets/Dockerfile

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
1+
FROM continuumio/miniconda3:4.9.2
2+
3+
# filled by wfpm
4+
LABEL org.opencontainers.image.source https://github.com/icgc-argo-structural-variation-cn-wg/icgc-argo-sv-copy-number
5+
6+
# add ps (required by nextflow)
7+
RUN apt-get --allow-releaseinfo-change update && \
8+
apt-get install -y procps && \
9+
apt-get clean && \
10+
rm -rf /var/lib/apt/lists/*
11+
12+
# install facets and dependencies
13+
RUN /opt/conda/bin/conda install --yes -c conda-forge r-base=4.0.3 r-optparse r-rcolorbrewer r-plyr r-dplyr r-tidyr r-stringr r-magrittr r-foreach
14+
RUN /opt/conda/bin/conda install --yes -c bioconda r-facets=0.6.1 snp-pileup=0.6.1
15+
16+
# Add main wrapper:
17+
RUN mkdir -p /tools
18+
ENV PATH="/tools:${PATH}"
19+
COPY facetsRun.R /tools/
20+
21+
ENTRYPOINT ["/usr/bin/env"]
22+
23+
CMD ["/bin/bash"]

facets/README.md

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
1+
# FACETS
2+
3+
FACETS (Fraction and Allele specific Copy number Estimate from Tumor/normal Sequencing) infers allele-specific DNA copy number and clonal heterogeneity from high-throughput sequencing data, including whole-genome, whole-exome, and some targeted cancer gene panels. The method implements a bivariate genome segmentation, followed by allele-specific copy number calls. Tumor purity, ploidy, and cellular fractions are estimated and reported in the output. The tool simplifies large-scale application by providing comprehensive output and integrated visualization.
4+
5+
Read more: [https://github.com/mskcc/facets/](https://github.com/mskcc/facets/)
6+
7+
## Usage
8+
9+
The typical command for running the pipeline is as follows:
10+
11+
```
12+
nextflow run wes-postproc/modules/facets --input input.txt -profile cluster,singularity
13+
```
14+
15+
Mandatory arguments:
16+
```
17+
--input Tab delimited file (no header), with paths to following files:
18+
tumor_ID normal_ID tumor.bam normal.bam target.dbsnp
19+
```
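
As a rough illustration of the input format described above, the following shell sketch writes a two-sample sheet and checks that every row has exactly five non-empty, tab-separated fields. The sample IDs and file paths are invented placeholders, and the check is not part of the pipeline.

```shell
set -eu

# Hypothetical sample sheet: IDs and paths are placeholders, not real files.
workdir="$(mktemp -d)"
input="$workdir/input.txt"

# printf reuses the format for each group of five arguments, giving one row per sample.
printf '%s\t%s\t%s\t%s\t%s\n' \
  T1 N1 /data/T1.bam /data/N1.bam /ref/target.dbsnp \
  T2 N2 /data/T2.bam /data/N2.bam /ref/target.dbsnp > "$input"

# Count rows that do not have exactly five non-empty tab-separated fields.
bad_rows="$(awk -F'\t' '
  NF != 5 { bad++; next }
  { for (i = 1; i <= NF; i++) if ($i == "") { bad++; next } }
  END { print bad + 0 }
' "$input")"

n_rows="$(wc -l < "$input" | tr -d ' ')"
echo "rows=$n_rows bad_rows=$bad_rows"

rm -rf "$workdir"
```

A non-zero `bad_rows` would point to a row with missing columns or spaces used in place of tabs.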
20+
21+
Optional arguments:
22+
```
23+
--snp_pileup Full path to the folder containing the snp_pileup files (you might want to use this when re-running facets)
24+
--summaryPrefix Prefix for the summary files [all.geneCN]
25+
--q (snp-pileup) Sets the minimum threshold for mapping quality [1]
26+
--Q (snp-pileup) Sets the minimum threshold for base quality [13]
27+
--r (snp-pileup) Comma separated list of minimum read counts for a position to be output [25,0]
28+
--d (snp-pileup) Sets the maximum depth [1000]
29+
--genome Genome build (b37, GRCh37, hg19, mm9, mm10, GRCm38, hg38). [hg38]
30+
--seed Seed for reproducibility [1234]
31+
--snp_nbhd Window size [250]
32+
--minNDepth Minimum depth in normal to keep the position [25]
33+
--maxNDepth Maximum depth in normal to keep the position [1000]
34+
--pre_cval Pre-processing critical value [cval1 - 50]
35+
--cval1 Critical value for estimating diploid log Ratio [200]
36+
--cval2 Starting critical value for segmentation (increases by 25 until success) [cval1 - 50]
37+
--max_cval Maximum critical value for segmentation (increases by 25 until success) [5000]
38+
--min_nhet Minimum number of heterozygous SNPs in a segment used for the bivariate t-statistic during segment clustering [25]
39+
--unmatched Is it unmatched? [FALSE]
40+
--minGC Min GC of position [0]
41+
--maxGC Max GC of position [1]
42+
```
43+
44+
## Output
45+
```
46+
./facets_out/snp_pileup .................. pileup files for every sample.
47+
{tumor_id}__{normal_id}__q{params.q}_Q{params.Q}_d{params.maxNDepth}_r{params.r}.bc.gz
48+
49+
50+
51+
./facets_out/cval1{params.cval1} .......... FACETS results for every sample.
52+
{tumor_id}__{normal_id}.cncf.pdf ...... genome-wide profile. Figures:
53+
log-ratio: logR with chromosomes alternating in blue and gray. The green line indicates the median logR in the sample. The purple line indicates the logR of the diploid state.
54+
log-odds-ratio: Segment means are plotted as red lines.
55+
copy number (em): plots the total (black) and minor (red) copy number for each segment.
56+
cf-em: shows the associated cellular fraction (cf). Dark blue indicates high cf. Light blue indicates low cf. Beige indicates a normal segment (total=2,minor=1).
57+
{tumor_id}__{normal_id}.cncf.txt ...... FACETS result table. The columns are:
58+
chrom: the chromosome to which the segment belongs.
seg: the segment number.
59+
num.mark: the number of SNPs in the segment.
60+
nhet: the number of SNPs that are deemed heterozygous.
61+
cnlr.median: the median log-ratio of the segment.
62+
mafR: the log-odds-ratio summary for the segment (close to zero means the alleles are in balance).
63+
segclust: the segment cluster to which segment belongs.
64+
cnlr.median.clust: the median log-ratio of the segment cluster.
65+
mafR.clust: the log-odds-ratio summary for the segment cluster.
66+
cf.em: the cellular fraction of the segment.
67+
tcn.em: the total copy number of the segment.
68+
lcn.em: the minor copy number of the segment.
69+
{tumor_id}__{normal_id}.logR.pdf ...... genome-wide profile log-ratio only.
70+
{tumor_id}__{normal_id}.out ........... result summary file and log
71+
{tumor_id}__{normal_id}.Rdata ......... FACETS R session
72+
73+
```
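
To make the column descriptions concrete, here is a small shell sketch that looks up columns by header name and prints the segments whose minor copy number (`lcn.em`) is 0. The table contents are invented for illustration and use only a subset of the columns listed above. Real output may contain further columns and `NA` values, which is why the lookup goes by name and skips `NA`.

```shell
set -eu

# Invented example table; the file name follows the {tumor_id}__{normal_id} pattern.
workdir="$(mktemp -d)"
cncf="$workdir/T1__N1.cncf.txt"
printf 'chrom\tseg\tnum.mark\tnhet\tcnlr.median\tmafR\tcf.em\ttcn.em\tlcn.em\n' > "$cncf"
printf '1\t1\t500\t120\t0.01\t0.02\t1\t2\t1\n' >> "$cncf"
printf '1\t2\t300\t60\t-0.45\t1.10\t0.8\t1\t0\n' >> "$cncf"
printf '2\t3\t80\t10\t0.30\t0.05\t0.6\t3\tNA\n' >> "$cncf"

# Map header names to column indexes, then report segments with lcn.em == 0.
loh="$(awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
  $(col["lcn.em"]) != "NA" && $(col["lcn.em"]) + 0 == 0 {
    print $(col["chrom"]) ":" $(col["seg"]) " tcn=" $(col["tcn.em"])
  }
' "$cncf")"
echo "$loh"

rm -rf "$workdir"
```

Looking columns up by header name keeps the filter working if the column order differs from the list above.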
74+
75+
## Fetching the singularity container
76+
```
77+
bash scripts/fetch_image.sh
78+
```
79+
80+
## Fetching resource files
81+
```
82+
bash scripts/fetch_resources.sh
83+
```

facets/facetsRun.R

Lines changed: 224 additions & 0 deletions
@@ -0,0 +1,224 @@
1+
#!/usr/bin/env Rscript
2+
3+
# run the facets library
4+
5+
# Version changelog:
6+
# v2:
7+
# Sourcing runFacets_myplot.R from the same folder of this script, wherever that might be.
8+
# v2.1:
9+
# Added '--tumorName' and '--normalName' options to account for different naming schemes.
10+
# Account for the possibility that '--cval2' and '--pre_cval' are passed with a string 'NULL'
11+
# v3:
12+
# set seed
13+
# use a default pre_cval
14+
# use only one cval (remove cval2; cval1 -> cval)
15+
# increase cval by 50 if hyperfragmented (save as additional result files).
16+
# add max_segs to define hyperfragmentation.
17+
# v3.icgc-argo:
18+
# remove normalName
19+
# no cval increase steps
20+
# omit runFacets_myplot.R and plotting only logR.
21+
22+
suppressPackageStartupMessages(library("optparse"));
23+
suppressPackageStartupMessages(library("RColorBrewer"));
24+
suppressPackageStartupMessages(library("plyr"));
25+
suppressPackageStartupMessages(library("dplyr"));
26+
suppressPackageStartupMessages(library("tidyr"));
27+
suppressPackageStartupMessages(library("stringr"));
28+
suppressPackageStartupMessages(library("magrittr"));
29+
suppressPackageStartupMessages(library("facets"));
30+
suppressPackageStartupMessages(library("foreach"));
31+
32+
33+
34+
35+
if (!interactive()) {
36+
options(warn = -1, error = quote({ traceback(); q('no', status = 1) }))
37+
}
38+
39+
optList <- list(
40+
make_option("--seed", default = 1234, type = 'integer', help = "seed for reproducibility"),
41+
make_option("--snp_nbhd", default = 250, type = 'integer', help = "window size"),
42+
make_option("--minNDepth", default = 5, type = 'integer', help = "minimum depth in normal to keep the position"),
43+
make_option("--maxNDepth", default= 500, type= 'integer', help = "maximum depth in normal to keep the position"),
44+
make_option("--pre_cval", default = 80, type = 'integer', help = "pre-processing critical value"),
45+
make_option("--cval", default = NULL, type = 'integer', help = "critical value for estimating diploid log Ratio"),
46+
make_option("--max_cval", default = 5000, type = 'integer', help = "maximum critical value for segmentation (accepted but not used by this version, which has no cval increase steps)"),
47+
make_option("--min_nhet", default = 25, type = 'integer', help = "minimum number of heterozygote snps in a segment used for bivariate t-statistic during clustering of segment"),
48+
make_option("--genome", default = 'hg38', type = 'character', help = "genome of counts file"),
49+
make_option("--unmatched", default=FALSE, type=NULL, help="is it unmatched?"),
50+
make_option("--minGC", default = 0, type = NULL, help = "min GC of position"),
51+
make_option("--maxGC", default = 1, type = NULL, help = "max GC of position"),
52+
make_option("--max_segs", default = 3000, type = 'integer', help = "max number of segments to avoid hyperfragmentation"),
53+
make_option("--outPrefix", default = NULL, help = "output prefix"),
54+
make_option("--tumorName", default = NULL, help = "tumorName")
55+
)
56+
57+
parser <- OptionParser(usage = "%prog [options] [tumor-normal base counts file]", option_list = optList);
58+
59+
arguments <- parse_args(parser, positional_arguments = T);
60+
opt <- arguments$options;
61+
62+
if (length(arguments$args) < 1) {
63+
cat("Need base counts file\n")
64+
print_help(parser);
65+
stop();
66+
} else if (is.null(opt$outPrefix)) {
67+
cat("Need output prefix\n")
68+
print_help(parser);
69+
stop();
70+
} else if (is.null(opt$tumorName)) {
71+
cat("Need tumorName\n")
72+
print_help(parser);
73+
stop();
74+
} else {
75+
baseCountFile <- arguments$args[1];
76+
}
77+
78+
# Print input file and the options
79+
cat("\nInput file:\n",baseCountFile,"\n")
80+
cat("\nOptions:\n")
81+
for(i in 1:length(opt))
82+
{
83+
cat("",names(opt[i]), "=", head(opt[[i]],1),"\n")
84+
}
85+
cat("\n")
86+
87+
switch(opt$genome,
88+
b37={gbuild="hg19"},
89+
b37_hbv_hcv={gbuild="hg19"},
90+
GRCh37={gbuild="hg19"},
91+
hg19={gbuild="hg19"},
92+
hg19_ionref={gbuild="hg19"},
93+
mm9={gbuild="mm9"},
94+
mm10={gbuild="mm10"},
95+
GRCm38={gbuild="mm10"},
96+
hg38={gbuild="hg38"},
97+
{ stop(paste("Invalid Genome",opt$genome)) })
98+
99+
buildData=installed.packages()["facets",]
100+
cat("#Module Info\n")
101+
for(fi in c("Package","LibPath","Version","Built")){
102+
cat("#",paste(fi,":",sep=""),buildData[fi],"\n")
103+
}
104+
version=buildData["Version"]
105+
cat("\n")
106+
107+
rcmat <- readSnpMatrix(gzfile(baseCountFile))
108+
chromLevels=unique(rcmat[,1])
109+
print(chromLevels)
110+
if (gbuild %in% c("hg19", "hg18", "hg38")) { chromLevels=intersect(chromLevels, c(1:22,"X"))
111+
} else { chromLevels=intersect(chromLevels, c(1:19,"X"))}
112+
print(chromLevels)
113+
114+
if(is.null(opt$cval)) { stop("cval cannot be NULL")}
115+
116+
set.seed(opt$seed)
117+
118+
if (opt$minGC == 0 & opt$maxGC == 1) {
119+
preOut=preProcSample(rcmat, snp.nbhd = opt$snp_nbhd, ndepth = opt$minNDepth, cval = opt$pre_cval,
120+
gbuild=gbuild, ndepthmax=opt$maxNDepth, unmatched=opt$unmatched)
121+
} else {
122+
if (gbuild %in% c("hg19", "hg18", "hg38"))
123+
nX <- 23
124+
if (gbuild %in% c("mm9", "mm10"))
125+
nX <- 20
126+
pmat <- facets:::procSnps(rcmat, ndepth=opt$minNDepth, het.thresh = 0.25, snp.nbhd = opt$snp_nbhd,
127+
gbuild=gbuild, unmatched=opt$unmatched, ndepthmax=opt$maxNDepth)
128+
dmat <- facets:::counts2logROR(pmat[pmat$rCountT > 0, ], gbuild, unmatched=opt$unmatched)
129+
dmat$keep[which(dmat$gcpct>=opt$maxGC | dmat$gcpct<=opt$minGC)] <- 0
130+
dmat <- dmat[dmat$keep == 1,]
131+
tmp1 <- facets:::segsnps(dmat, opt$pre_cval, hetscale=F)
132+
pmat$keep <- 0
133+
pmat$keep[which(paste(pmat$chrom, pmat$maploc, sep="_") %in% paste(dmat$chrom, dmat$maploc, sep="_"))] <- 1
134+
135+
tmp2 <- list(pmat = pmat, gbuild=gbuild, nX=nX)
136+
preOut <- c(tmp2,tmp1)
137+
}
138+
139+
formatSegmentOutput <- function(out,sampID) {
140+
seg=list()
141+
seg$ID=rep(sampID,nrow(out$out))
142+
seg$chrom=out$out$chr
143+
seg$loc.start=rep(NA,length(seg$ID))
144+
seg$loc.end=seg$loc.start
145+
seg$num.mark=out$out$num.mark
146+
seg$seg.mean=out$out$cnlr.median
147+
for(i in 1:nrow(out$out)) {
148+
lims=range(out$jointseg$maploc[(out$jointseg$chrom==out$out$chr[i] & out$jointseg$seg==out$out$seg[i])],na.rm=T)
149+
seg$loc.start[i]=lims[1]
150+
seg$loc.end[i]=lims[2]
151+
}
152+
as.data.frame(seg)
153+
}
154+
155+
out <- preOut %>% procSample(cval = opt$cval, min.nhet = opt$min_nhet)
156+
157+
cat ("Completed preProc and proc\n")
158+
cat ("procSample FLAG is", out$FLAG, "\n")
159+
160+
# save all objects except pileup
161+
save(file = str_c(opt$outPrefix, ".Rdata"), list = ls()[!grepl("^rcmat", ls())], compress=T)
162+
163+
# Run emcncf, don't break if error:
164+
print(str_c("attempting to run emcncf() with cval = ", opt$cval))
165+
fit <- tryCatch({
166+
out %>% emcncf
167+
}, error = function(e) {
168+
print(paste("Error:", e))
169+
return(NULL)
170+
})
171+
if (!is.null(fit)) {
172+
cat ("emcncf was successful with cval", opt$cval, "\n")
173+
174+
# make a table viewable in IGV
175+
out$IGV = formatSegmentOutput(out, opt$tumorName)
176+
177+
# plot facets results
178+
if(sum(out$out$num.mark)<=10000) { height=4; width=7} else { height=6; width=9}
179+
pdf(file = str_c(opt$outPrefix, ".cncf.pdf"), height = height, width = width)
180+
plotSample(out, fit)
181+
dev.off()
182+
183+
# save cncf table
184+
write.table(fit$cncf, str_c(opt$outPrefix, ".cncf.txt"), row.names = F, quote = F, sep = '\t')
185+
186+
# save results and metrics
187+
ff = str_c(opt$outPrefix, ".out")
188+
cat("# Version =", version, "\n", file = ff, append = T)
189+
cat("# Input =", basename(baseCountFile), "\n", file = ff, append = T)
190+
cat("# tumor =", opt$tumorName, "\n", file = ff, append = T)
191+
cat("# snp.nbhd =", opt$snp_nbhd, "\n", file = ff, append = T)
192+
cat("# cval =", opt$cval, "\n", file = ff, append = T)
193+
cat("# min.nhet =", opt$min_nhet, "\n", file = ff, append = T)
194+
cat("# genome =", opt$genome, "\n", file = ff, append = T)
195+
cat("# Purity =", fit$purity, "\n", file = ff, append = T)
196+
cat("# Ploidy =", fit$ploidy, "\n", file = ff, append = T)
197+
cat("# dipLogR =", fit$dipLogR, "\n", file = ff, append = T)
198+
cat("# dipt =", fit$dipt, "\n", file = ff, append = T)
199+
cat("# loglik =", fit$loglik, "\n", file = ff, append = T)
200+
201+
} else {
202+
cat ("emcncf failed with cval", opt$cval, "\n")
203+
fit <- NULL
204+
ff = str_c(opt$outPrefix, ".out")
205+
cat("# Version =", version, "\n", file = ff, append = T)
206+
cat("# Input =", basename(baseCountFile), "\n", file = ff, append = T)
207+
cat("# tumor =", opt$tumorName, "\n", file = ff, append = T)
208+
cat("# snp.nbhd =", opt$snp_nbhd, "\n", file = ff, append = T)
209+
cat("# cval =", opt$cval, "\n", file = ff, append = T)
210+
cat("# min.nhet =", opt$min_nhet, "\n", file = ff, append = T)
211+
cat("# genome =", opt$genome, "\n", file = ff, append = T)
212+
cat("# Purity =", "failed", "\n", file = ff, append = T)
213+
cat("# Ploidy =", "failed", "\n", file = ff, append = T)
214+
cat("# dipLogR =", "failed", "\n", file = ff, append = T)
215+
cat("# dipt =", "failed", "\n", file = ff, append = T)
216+
cat("# loglik =", "failed", "\n", file = ff, append = T)
217+
}
218+
219+
# save all objects except pileup
220+
save(file = str_c(opt$outPrefix, ".Rdata"), list = ls()[!grepl("^rcmat", ls())], compress=T)
221+
222+
223+
warnings()
224+
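
The `.out` summary written by the script is a list of `# key = value` lines, so it can be read back with standard text tools. A minimal sketch, assuming a file shaped like the script's output: the values below are invented, the trailing space mirrors how `cat()` joins its arguments, and `get_field` is a hypothetical helper rather than part of this module.

```shell
set -eu

# Invented values, shaped like the lines facetsRun.R appends to <outPrefix>.out.
workdir="$(mktemp -d)"
summary="$workdir/T1.out"
printf '# Version = 0.6.1 \n# cval = 200 \n# Purity = 0.62 \n# Ploidy = 2.4 \n' > "$summary"

# Hypothetical helper: print the value of the first "# <key> = <value>" line.
get_field() {
  awk -v key="$1" '
    $1 == "#" && $2 == key && $3 == "=" { print $4; exit }
  ' "$2"
}

purity="$(get_field Purity "$summary")"
ploidy="$(get_field Ploidy "$summary")"
echo "purity=$purity ploidy=$ploidy"

rm -rf "$workdir"
```

When `emcncf` fails, the script writes the literal string `failed` for these fields, so a caller should check for that before treating the value as a number. Because the script appends to the file, a rerun with the same prefix adds a second set of lines. The sketch returns the first match.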
