- Quick start
- Running cancer pre-processing pipeline
- Running the PDX preprocessing pipeline
- Running QC report generation
- Running concordance and contamination checking
- Additional steps for projects with contemporary normal
- Running the somatic variant calling pipeline
- Running sample and project report generation
- Delivering results
- Appendix
Run a complete project:
# Exome align
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome \
pipeline pre-process \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome/compbio/metadata/tumor_normal_pairs.txt \
--library Exome \
--account compbio \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome_testing \
--trim True \
--interval-list SureSelect_V6plusCOSMIC.target.GRCh38_full_analysis_set_plus_decoy_hla
# Exome calling and report writing
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome \
pipeline run \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome/compbio/metadata/tumor_normal_pairs.txt \
--library Exome \
--account compbio \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome_testing \
--interval-list SureSelect_V6plusCOSMIC.target.GRCh38_full_analysis_set_plus_decoy_hla
# WGS alignment
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
pipeline pre-process \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS/compbio/metadata/tumor_normal_pairs.txt \
--library WGS \
--account compbio \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome_testing
# WGS calling and report writing
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
pipeline run \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS/compbio/metadata/tumor_normal_pairs.txt \
--library WGS \
--account compbio \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome_testing
Almost all projects go through the pre-processing pipeline in ER before bioinformatics even hears about them, so it will be very rare for someone in bioinformatics to have to run this pipeline or to ask ER to run it. However, if you do have to run the cancer pre-processing pipeline, you need the sample FASTQ files.
It’s easiest to copy or link the original FASTQ files into the <PROJECT_DIR>/Sample_<SAMPLE_NAME>/compbio/fastq directories. Software engineering prefers that links not be used, so if you want to run pre-processing through ER, copy the files instead of linking them.
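Staging the files can be scripted. The sketch below is only an illustration and is not part of kancero: the function name `stage_fastqs` is hypothetical, and it copies files rather than linking them so that the result is also usable for runs through ER.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def stage_fastqs(project_dir: str, sample_name: str, fastqs: Iterable[str]) -> List[Path]:
    """Copy FASTQ files into <PROJECT_DIR>/Sample_<SAMPLE_NAME>/compbio/fastq.

    Hypothetical helper for illustration; files are copied, never linked.
    """
    dest = Path(project_dir) / f"Sample_{sample_name}" / "compbio" / "fastq"
    dest.mkdir(parents=True, exist_ok=True)
    staged = []
    for fastq in fastqs:
        target = dest / Path(fastq).name
        if not target.exists():  # do not overwrite a file that is already staged
            shutil.copy2(fastq, target)
        staged.append(target)
    return staged
```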
# kancero preprocessing menu for v6 pipeline
usage: kancero6 pipeline pre-process [-h] [-p] [-s] [-g3] [--TN_file TN_FILE]
[--genome {Human_GRCh37,Human_GRCh37_external,Human_GRCh38_external,Human_GRCh37_decoy,Human_GRCh38_full_analysis_set_plus_decoy_hla,Human_GRCh38_full_analysis_set_plus_decoy_hla_faculty,Mouse_GRCm38_mm10,Rat_Rnor_6_0}]
[--library {WGS,Exome}]
[--interval-list INTERVAL_LIST]
[--out_dir OUT_DIR] [--substr SUBSTR]
[--header {True,False}]
[--trim {True,False}] [--pdx PDX]
[--account account]
optional arguments:
-h, --help show this help message and exit
-p, --port Create a log to using when porting. Creates an md file
to edit for HTML or a gitlab page. Also writes a CWL
draft of commands [default=False]
-s, --spark Run GATK4 with Spark where possible
-g3, --gatk3 Run GATK3.5 with where possible
--TN_file TN_FILE
--genome {Human_GRCh37,Human_GRCh37_external,Human_GRCh38_external,Human_GRCh37_decoy,Human_GRCh38_full_analysis_set_plus_decoy_hla,Human_GRCh38_full_analysis_set_plus_decoy_hla_faculty,Mouse_GRCm38_mm10,Rat_Rnor_6_0}
Genome key to use for pipeline. If not supplied
kancero will check the record of any previous kancero
runs in your out directory.
--library {WGS,Exome}
Sequence library type. If not supplied kancero will
check the record of any previous kancero runs in your
out directory.
--interval-list INTERVAL_LIST
File basename for interval list. If not supplied the
default (the SureSelect interval list for your genome)
will be used
--out_dir OUT_DIR Project directory for output files. The default is the
input project directory
--substr SUBSTR Substring indicating read one vs two. Must use 1 or 2
once to indicate read pair. Must also exist in the
last section of the filename after spliting the
filename at underscores.[default=.R1]
--header {True,False}
Use the header rather than filename for lane and
flowcell information [default="False"]
--trim {True,False} Trim adapters from FASTQ files "True" by default for
Exomes from Exome
--pdx PDX Remove mouse reads from FASTQ. Use for patient derived
xenographs (PDX) projects. [default=False]
--account account
Sets the --account flag for sbatch commands (e.g. compbio, dllab).
[default=compbio]
The manual pipeline runs bwa-mem, Novosort, GATK4 BQSR, fixmate, flagstat and the other QC scripts.
The script expects FASTQ files in the <PROJECT_DIR>/Sample_<SAMPLE_NAME>/compbio/fastq or <PROJECT_DIR>/Sample_<SAMPLE_NAME>/fastq directory. It also expects FASTQ filenames to follow this naming convention:
<SAMPLE_NAME>_<INDEX>_<FLOWCELL>_<LANE>_*.R?.fastq.gz
Example:
CTG-0435-D_AGCACCTC-_BC7JV4ANXX_L005_001.R1.fastq.gz or CTG-0435-D_AGCACCTC-_BC7JV4ANXX_L005_001.filtered.R1.fastq.gz
If your files do not follow this structure, you can use the --header flag to attempt to gather the needed information from the read header lines (e.g. when reprocessing TCGA data). Use the --substr flag to indicate a different read suffix. See the pipeline pre-process help menu for more information.
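To check filenames against the convention before submitting a run, something like the following may help. It is a sketch under assumptions, not kancero's own parser: the function and class names are hypothetical, and it assumes the default `.R1`/`.R2` read suffix and that the four underscore-separated fields are counted from the right, so that sample names may themselves contain underscores.

```python
import re
from typing import NamedTuple, Optional


class FastqName(NamedTuple):
    sample: str
    index: str
    flowcell: str
    lane: str
    read: str


# Last filename section must end in .R1.fastq.gz or .R2.fastq.gz
_TAIL = re.compile(r"\.R(?P<read>[12])\.fastq\.gz$")


def parse_fastq_name(filename: str) -> Optional[FastqName]:
    """Split <SAMPLE_NAME>_<INDEX>_<FLOWCELL>_<LANE>_*.R?.fastq.gz into fields.

    Returns None when the name does not fit the convention.
    """
    parts = filename.rsplit("_", 4)
    if len(parts) != 5:
        return None
    sample, index, flowcell, lane, tail = parts
    match = _TAIL.search(tail)
    if match is None or not sample:
        return None
    return FastqName(sample, index, flowcell, lane, match.group("read"))
```

Files for which this returns None are candidates for the --header or --substr options described above.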
# Exome
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome \
pipeline pre-process \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome/compbio/metadata/tumor_normal_pairs.txt \
--library Exome \
--account compbio \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome_testing \
--trim True \
--interval-list SureSelect_V6plusCOSMIC.target.GRCh38_full_analysis_set_plus_decoy_hla
# WGS
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
pipeline pre-process \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS/compbio/metadata/tumor_normal_pairs.txt \
--library WGS \
--account compbio \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome_testing
The pipeline first filters reads by aligning them with bwa-aln to a joint human/mouse reference. It then extracts from the BAM all read pairs that either do not map or have at least one mate that maps to human. The filtered reads are then aligned with bwa-mem and processed with Novosort, GATK4 BQSR, fixmate, flagstat and the other QC scripts. The pipeline does not run: mark duplicates, bqsr and clipping. Filtered reads have mmu_filtered added to the FASTQ filename.
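The pair-selection rule stated above can be sketched as follows. This illustrates the rule as described, not kancero's implementation: the function name, the representation of each mate by its contig name (None when unmapped), and the set of human contig names are all assumptions.

```python
from typing import Optional, Set


def keep_pair(contig1: Optional[str], contig2: Optional[str], human_contigs: Set[str]) -> bool:
    """Return True when a read pair should be kept after mouse filtering.

    contig1/contig2 are the contigs the two mates aligned to on the joint
    human/mouse reference, or None for an unmapped mate (assumed encoding).
    """
    mates = (contig1, contig2)
    # Pairs in which neither mate maps are kept.
    if all(contig is None for contig in mates):
        return True
    # Pairs with at least one mate on a human contig are kept.
    # A pair with one unmapped mate and one mouse-mapped mate is dropped here;
    # the text does not say how that case is handled.
    return any(contig in human_contigs for contig in mates if contig is not None)
```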
# Exome
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome \
pipeline pre-process \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome/compbio/metadata/tumor_normal_pairs.txt \
--library Exome \
--account compbio \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome_testing \
--trim True \
--interval-list SureSelect_V6plusCOSMIC.target.GRCh38_full_analysis_set_plus_decoy_hla \
--pdx True
# WGS
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
pipeline pre-process \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS/compbio/metadata/tumor_normal_pairs.txt \
--library WGS \
--account compbio \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome_testing \
--pdx True
This runs automatically after pre-processing. You only need it for projects that do not work with outpost, and you only need to run it yourself if pre-processing was incomplete or you are starting with an external BAM. Run (can be WGS or Exome):
usage: kancero6 pipeline create-qc-reports [-h] [--TN_file TN_FILE]
[--library {WGS,Exome}]
[--genome {Human_GRCh37,Human_GRCh37_decoy,Human_GRCh37_external,Human_GRCh38_external,Human_GRCh38_full_analysis_set_plus_decoy_hla,Human_GRCh38_full_analysis_set_plus_decoy_hla_faculty,Mouse_GRCm38_mm10,Rat_Rnor_6_0}]
--project_name PROJECT_NAME
[--account account]
[--out_dir OUT_DIR]
[--no-autocorrelation {True,False}]
optional arguments:
-h, --help show this help message and exit
--TN_file TN_FILE
--library {WGS,Exome}
Sequence library type. If not supplied kancero will
check the record of any previous kancero runs in your
out directory.
--genome {Human_GRCh37,Human_GRCh37_decoy,Human_GRCh37_external,Human_GRCh38_external,Human_GRCh38_full_analysis_set_plus_decoy_hla,Human_GRCh38_full_analysis_set_plus_decoy_hla_faculty,Mouse_GRCm38_mm10,Rat_Rnor_6_0}
Genome key to use for pipeline. If not supplied
kancero will check the record of any previous kancero
runs in your out directory.
--project_name PROJECT_NAME
Project name for report text
--account account
Sets the --account flag for sbatch commands (e.g. compbio, dllab).
[default=compbio]
--out_dir OUT_DIR Project directory for output files. The default is the
input project directory
--no-autocorrelation {True,False}
Project has no autocorrelation file (e.g. mouse runs)
(default='False')
# Example:
kancero/kancero6 \
--project /gpfs/commons/projects/MY_PROJECT \
pipeline create-qc-reports \
--TN_file /gpfs/commons/projects/MY_PROJECT/tumor_normal_pairs.txt \
--project_name MY_PROJECT \
--account compbio \
--out_dir /gpfs/commons/projects/MY_PROJECT \
--library WGS \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla
Results will be generated in /gpfs/commons/projects/MY_PROJECT/compbio/Reports
Note: You will usually view QC plots at http://outpost.nygenome.org/
- ConPair Summary Report after EastRiver run
- Running ConPair through kancero
- Running ConPair for Cancer Alliance samples
- Running kancero for DNA/RNA concordance or sample-sample concordance between 2 different DNA samples
ConPair is run in EastRiver. For a tumor-normal pair, the concordance file is currently <PROJECT_DIR>/Sample_<tumor>/qc/<tumor>--<normal>.concordance.homoz.conpair-20160318.01.txt and the contamination file is <PROJECT_DIR>/Sample_<tumor>/qc/<tumor>--<normal>.contamination.conpair-20160318.01.txt
To generate concordance summary, run:
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
conpair summary \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS/metadata/tumor_normal_pairs.txt \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla
The concordance summary PDF will be saved in the PROJECT_DIR/compbio/Summary directory and will be named Concordance_<PROJECT_NAME>.pdf
Requires a T/N metadata file
usage: kancero6 conpair project [-h] [--out_dir OUT_DIR]
[--genome {Human_GRCh37,Human_GRCh37_decoy,Human_GRCh37_external,Human_GRCh38_external,Human_GRCh38_full_analysis_set_plus_decoy_hla,Human_GRCh38_full_analysis_set_plus_decoy_hla_faculty}]
[--TN_file TN_FILE]
[--account account]
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
conpair project \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS/compbio/metadata/tumor_normal_pairs.txt \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla
For a tumor-normal pair, the concordance file is currently <PROJECT_DIR>/Sample_<tumor>/compbio/qc/<tumor>--<normal>.concordance.homoz.conpair-v1.0.txt and the contamination file is
<PROJECT_DIR>/Sample_<tumor>/compbio/qc/<tumor>--<normal>.contamination.conpair-v1.0.txt
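When checking many pairs it can be handy to build these two paths programmatically. The helper below is a hypothetical convenience, not part of kancero; it simply fills in the path templates quoted above, and the version suffix is a parameter because the text says these names are only the current ones.

```python
from pathlib import Path
from typing import Tuple


def conpair_outputs(project_dir: str, tumor: str, normal: str,
                    version: str = "conpair-v1.0") -> Tuple[Path, Path]:
    """Return (concordance_file, contamination_file) for one tumor-normal pair.

    Paths follow the kancero ConPair layout quoted in the text:
    <PROJECT_DIR>/Sample_<tumor>/compbio/qc/<tumor>--<normal>.<kind>.<version>.txt
    """
    qc_dir = Path(project_dir) / f"Sample_{tumor}" / "compbio" / "qc"
    pair = f"{tumor}--{normal}"
    concordance = qc_dir / f"{pair}.concordance.homoz.{version}.txt"
    contamination = qc_dir / f"{pair}.contamination.{version}.txt"
    return concordance, contamination
```

Calling `conpair_outputs(project_dir, tumor, normal)` for each line of the T/N file and testing `Path.is_file()` on the results gives a quick check that ConPair finished for every pair.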
To generate concordance summary, run:
usage: kancero6 conpair summary [-h] [--account account]
[--out_dir OUT_DIR]
[--genome {Human_GRCh37,Human_GRCh37_decoy,Human_GRCh37_external,Human_GRCh38_external,Human_GRCh38_full_analysis_set_plus_decoy_hla,Human_GRCh38_full_analysis_set_plus_decoy_hla_faculty}]
[--TN_file TN_FILE]
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
conpair summary \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS/metadata/tumor_normal_pairs.txt \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla
The concordance summary PDF will be saved in the PROJECT_DIR/compbio/Summary directory and will be named Concordance_<PROJECT_NAME>.pdf
Running kancero for DNA/RNA concordance or sample-sample concordance between 2 different DNA samples
usage: kancero6 conpair sample [-h]
[--genome {Human_GRCh37,Human_GRCh37_decoy,Human_GRCh37_external,Human_GRCh38_external,Human_GRCh38_full_analysis_set_plus_decoy_hla,Human_GRCh38_full_analysis_set_plus_decoy_hla_faculty}]
--sample SAMPLE --sample2 SAMPLE2
--sample_library {WGS,Exome,RNA}
--sample2_library {WGS,Exome,RNA}
[--out_dir OUT_DIR] [--rna_genome RNA_GENOME]
[--account account]
[--concordance_only]
Use --other_projects to specify the additional project directories to search for BAM files.
kancero/kancero6 \
--other_projects /data/analysis/Project_WGS_test_kancero \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
conpair sample \
--genome Human_GRCh37 \
--sample CA-0073T-D-W \
--sample_library WGS \
--sample2 CA-0073T-R \
--sample2_library RNA \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS
By default this runs both sample-sample2 concordance and sample-sample2 contamination.
You must specify exactly two samples:
--sample will be treated as 'tumor' and --sample2 will be treated as 'normal'.
kancero/kancero6 \
--other_projects /data/analysis/Project_WGS_test_kancero \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
conpair sample \
--genome Human_GRCh37 \
--sample CA-0073T-D-W \
--sample_library WGS \
--sample2 CA-0073N-D-W \
--sample2_library WGS \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS
If the project has tumor-only samples and no contemporary normal was sequenced along with them, create a directory for the contemporary normal, then link the BAM files and copy the QC files from the appropriate contemporary normal sample in /gpfs/commons/datasets/old-nygc-resources/Somatic_Pipelines/contemporary_normals
For example, for a human WGS project where the tumor was sequenced on HiSeqX v3 PCR-free, do the following:
mkdir -p <PROJECT_DIR>/Sample_NA12878/analysis
ln -s /gpfs/commons/datasets/old-nygc-resources/Somatic_Pipelines/contemporary_normals/wgs/Xten/v3/pcr_free/Sample_NA12878/analysis/NA12878.final.ba* <PROJECT_DIR>/Sample_NA12878/analysis/.
mkdir <PROJECT_DIR>/Sample_NA12878/qc
cp /gpfs/commons/datasets/old-nygc-resources/Somatic_Pipelines/contemporary_normals/wgs/Xten/v3/pcr_free/Sample_NA12878/qc/*.* <PROJECT_DIR>/Sample_NA12878/qc/.
Requires a T/N metadata file, for example: PROJECT_DIR/compbio/metadata/tumor_normal_pairs.txt
Note: Most pipelines are run on EastRiver (ER). However, ER does not generate sample reports or the project summary. Those have to be generated using kancero (see next section).
Use pipeline run to run the entire pipeline (i.e. variant calling, merging, annotation and report generation).
The new pipeline includes an --out_dir flag and can redirect output to a different location.
By default, the out dir is set to the project dir.
usage: kancero6 pipeline run [-h]
[--genome {Human_GRCh37,Human_GRCh37_decoy,Human_GRCh37_external,Human_GRCh38_external,Human_GRCh38_full_analysis_set_plus_decoy_hla,Human_GRCh38_full_analysis_set_plus_decoy_hla_faculty,Mouse_GRCm38_mm10}]
[--TN_file TN_FILE] [--library {WGS,Exome}]
[--interval-list INTERVAL_LIST] [-p]
[--account account] [--out_dir OUT_DIR]
[--run_steps [{strelka2,lancet,lancet_post,manta,facets,excavator2,lancet_wgs,lancet_post_wgs,svaba,mutect2,mutect2_post,mantis,lumpy,svtyper,lumpy_post,svtyper_filter,bicseq2,optitype,kourami,haplotypecaller,haplotypecaller_post,baf,annotate_hap} [{strelka2,lancet,lancet_post,manta,facets,excavator2,lancet_wgs,lancet_post_wgs,svaba,mutect2,mutect2_post,mantis,lumpy,svtyper,lumpy_post,svtyper_filter,bicseq2,optitype,kourami,haplotypecaller,haplotypecaller_post,baf,annotate_hap} ...]]]
[--post_run_steps [{prep,merge_callers,merge_chroms,annotate,deconstructsig,annotate_sv_cnv,deliver} [{prep,merge_callers,merge_chroms,annotate,deconstructsig,annotate_sv_cnv,deliver} ...]]]
[--report_run_steps [{create_reports,create_project_level} [{create_reports,create_project_level} ...]]]
[--meta META] [--severe]
optional arguments:
-h, --help show this help message and exit
--genome {Human_GRCh37,Human_GRCh37_decoy,Human_GRCh37_external,Human_GRCh38_external,Human_GRCh38_full_analysis_set_plus_decoy_hla,Human_GRCh38_full_analysis_set_plus_decoy_hla_faculty,Mouse_GRCm38_mm10}
Genome key to use for pipeline. If not supplied
kancero will check the record of any previous kancero
runs in your out directory.
--TN_file TN_FILE
--library {WGS,Exome}
Sequence library type. If not supplied kancero will
check the record of any previous kancero runs in your
out directory.
--interval-list INTERVAL_LIST
File basename for interval list. If not supplied the
default (the SureSelect interval list for your genome)
will be used
-p, --port Create a log to using when porting. Creates an md file
to edit for HTML or a gitlab page. Also writes a CWL
draft of commands [default=False]
--account account
Sets the --account flag for sbatch commands (e.g. compbio, dllab).
[default=compbio]
--out_dir OUT_DIR Project directory for output files. The default is the
input project directory
--run_steps [{strelka2,lancet,lancet_post,manta,facets,excavator2,lancet_wgs,lancet_post_wgs,svaba,mutect2,mutect2_post,mantis,lumpy,svtyper,lumpy_post,svtyper_filter,bicseq2,optitype,kourami,haplotypecaller,haplotypecaller_post,baf,annotate_hap} [{strelka2,lancet,lancet_post,manta,facets,excavator2,lancet_wgs,lancet_post_wgs,svaba,mutect2,mutect2_post,mantis,lumpy,svtyper,lumpy_post,svtyper_filter,bicseq2,optitype,kourami,haplotypecaller,haplotypecaller_post,baf,annotate_hap} ...]]
Callers to run. Leave blank space beside flag to skip
all steps. Default is to run all steps except
lancet_wgs and lancet_post_wgs.
--post_run_steps [{prep,merge_callers,merge_chroms,annotate,deconstructsig,annotate_sv_cnv,deliver} [{prep,merge_callers,merge_chroms,annotate,deconstructsig,annotate_sv_cnv,deliver} ...]]
Final steps to run.Leave blank space beside flag to
skip all steps. Default is to run all steps.
--report_run_steps [{create_reports,create_project_level} [{create_reports,create_project_level} ...]]
Report steps to run.Leave blank space beside flag to
skip all steps. Default is to run all steps.
--meta META CSV file with a header line.Header line should start
with "pair_name" and go on to list any other groups
that can be used to classify the pairs.
#Exome
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome \
pipeline run \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome/compbio/metadata/tumor_normal_pairs.txt \
--account compbio \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome \
--library Exome \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--interval-list SureSelect_V6plusCOSMIC.target.GRCh38_full_analysis_set_plus_decoy_hla
# WGS
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
pipeline run \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS/compbio/metadata/tumor_normal_pairs.txt \
--library WGS \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS
Requires a tumor normal pairs file.
usage: kancero6 pipeline create-reports [-h]
[--genome {Human_GRCh37,Human_GRCh37_decoy,Human_GRCh37_external,Human_GRCh38_external,Human_GRCh38_full_analysis_set_plus_decoy_hla,Human_GRCh38_full_analysis_set_plus_decoy_hla_faculty,Mouse_GRCm38_mm10}]
[--TN_file TN_FILE]
[--library {WGS,Exome}] --project_name
PROJECT_NAME
[--account account]
[--out_dir OUT_DIR] [--severe]
[--meta META]
[--report_run_steps [{create_reports,create_project_level} [{create_reports,create_project_level} ...]]]
# WGS
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
pipeline create-reports \
--project_name somatic_GRCh38_WGS \
--TN_file /data/analysis/NYGC/Project_preprocess_somatic_GRCh38/metadata/tumor_normal_pairs.txt \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--library WGS \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS
# Exome
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome \
pipeline create-reports \
--project_name somatic_GRCh38_Exome \
--TN_file /data/analysis/NYGC/Project_preprocess_somatic_GRCh38/metadata/tumor_normal_pairs.txt \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--library Exome \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome
Requires a T/N metadata file, for example: PROJECT_DIR/compbio/metadata/tumor_normal_pairs.txt
This is currently a manual process.
This file contains information about each pair in a project. It is currently used for figures in reports. It is a CSV file and the first line is the header (the header is required). The first column should always be pair_name, and its content should be TUMOR--NORMAL. The content of the remaining columns can be integers, strings or floats. Leave cells with no information blank. An example is shown below and can also be found at /gpfs/commons/groups/nygcfaculty/kancero/data/examples/subject_meta.txt. This file can be used with the --meta flag.
pair_name,patient,platform,depth
COLO-829--COLO-829BL,COLO-829,hiseq,high
COLO-829_80--COLO-829BL_40,COLO-829,hiseq,downsample
COLO-829-NovaSeq--COLO-829BL-NovaSeq,COLO-829,novaseq,high
COLO-829-NovaSeq_80--COLO-829BL-NovaSeq_40,COLO-829,novaseq,downsample
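A file in this format can be sanity-checked before it is passed to --meta. The following is a minimal sketch, not kancero's own validation: the function name is hypothetical, and the only rules it enforces are the ones stated above (a header whose first column is pair_name, and pair names of the form TUMOR--NORMAL).

```python
import csv
import io
from typing import Dict, List


def read_pair_meta(text: str) -> List[Dict[str, str]]:
    """Parse pair metadata CSV text and check the documented constraints."""
    reader = csv.DictReader(io.StringIO(text))
    fields = reader.fieldnames or []
    if not fields or fields[0] != "pair_name":
        raise ValueError("first column of the header must be pair_name")
    rows = []
    # Data starts on line 2 because line 1 is the required header.
    for line_no, row in enumerate(reader, start=2):
        if "--" not in (row["pair_name"] or ""):
            raise ValueError(f"line {line_no}: pair_name must be TUMOR--NORMAL")
        rows.append(dict(row))
    return rows
```

Blank cells come back as empty strings, which matches the instruction to leave cells with no information blank.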
Make sure $project/compbio/metadata/tumor_normal_pairs.txt contains the correct pairs.
If we have three tumor--normal pairs, T1--N1, T2--N2 and T3--N3, the file should look like this (the header line is optional but must start with a # if included):
#Tumor normal
T1 N1
T2 N2
T3 N3
Lines starting with ‘#’ are ignored, so additional pairs that aren’t ready yet can stay in the file as long as they are commented out with ‘#’.
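The format is simple enough to check with a few lines of code before launching a run. This sketch is not how kancero reads the file; the function name is hypothetical, and it assumes the two columns are separated by whitespace (spaces or tabs) and that blank lines may be skipped.

```python
from typing import List, Tuple


def read_tn_pairs(text: str) -> List[Tuple[str, str]]:
    """Return (tumor, normal) tuples from tumor_normal_pairs.txt content."""
    pairs = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Header and commented-out pairs start with '#'; skip them.
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split()
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 'TUMOR NORMAL', got {stripped!r}")
        pairs.append((fields[0], fields[1]))
    return pairs
```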
Before publishing (“pushing”) a new version of kancero to the master branch we may ask you to test our updates using our “staging” branch. Please provide any suggestions or feedback on these changes so that we can make improvements before updating the master branch that we all use.
# To get a local copy of the staging branch:
git checkout master # start on the master branch
git pull # pull any updates
git fetch origin # the next two commands allow you to get a list of available branches
git branch -v -a
git checkout -b staging origin/staging # checkout a copy of the staging branch
# Now you can run the newest version of the pipeline scripts
git checkout master # returns you to the master branch copy of scripts
git pull # gets you an updated copy of the master branch once we have pushed the new update out
# To return to your local staging branch:
git checkout staging
git pull
# Now you can run the newest version of the pipeline scripts
git checkout master # returns you to the master branch copy of script
git pull # gets you an updated copy of the master branch once we have pushed the new update out
# If you are making a change in the kancero code or documentation please follow the steps outlined in the “Kancero dev workflow” doc https://docs.google.com/document/d/1qDdChcboXlF6PIA70qODTbvXMuhbndESUBP7dlWlOL8/edit
Kancero can be used to plot run time, wait time, core hours, exit status and memory usage. The required input is a file named <project>/compbio/logs/<anything>_job_ids.txt. This file is automatically created by new kancero steps. You can also create a file of your own to use. Each line of the file should contain a job id and the number of component parts represented by that job id. The number of component parts is typically the number of samples processed under that job id, because you may want to know how many core hours are used per sample. The job_ids.txt file should be comma-separated (CSV).
Usage runs automatically after kancero steps, so after any run you should check its output to confirm that all exit status values are 0.
If you want to run usage manually on any file you have put in a logs directory, do the following:
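If you write a job_ids.txt file by hand, a quick parse can catch formatting mistakes. The sketch below is an illustration only: the function names are hypothetical, and it assumes exactly two comma-separated columns (job id, number of parts) with no header line, which the text does not state explicitly.

```python
import csv
import io
from typing import Dict


def read_job_ids(text: str) -> Dict[str, int]:
    """Map each job id to the number of component parts it represents."""
    jobs = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        job_id, parts = row[0].strip(), int(row[1])
        if parts < 1:
            raise ValueError(f"job {job_id}: number of parts must be at least 1")
        jobs[job_id] = parts
    return jobs


def core_hours_per_part(core_hours: float, parts: int) -> float:
    """Spread a job's core hours evenly over its parts (e.g. samples)."""
    return core_hours / parts
```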
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
pipeline usage
The output plots and tables will have the same prefix as the input job_ids.txt file.
Kancero can now read input from multiple directories. It first searches for the input BAM or other file in the --project directory, then in the additional directories given by the --other_projects flag. Steps search the EastRiver output locations and then the compbio output locations. If the file still cannot be found, kancero posts a warning and begins recursively searching all given project directories for the filename. Note: because FASTQ filenames are not known in advance, the recursive search cannot be used for pre-processing. For pre-processing, all raw FASTQ files (e.g. without trimmed or mmu_filtered) in the EastRiver (/PROJECT_DIR/Sample_SAMPLE/fastq) AND the compbio (/PROJECT_DIR/Sample_SAMPLE/compbio/fastq) directories AND in matching sample directories in additional project directories (/ALT_PROJECT_DIR/Sample_SAMPLE/fastq) will be used.
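The search order described above can be pictured with the sketch below. It is not kancero's code: the function name is hypothetical, the compbio subdirectory `compbio/analysis` is an assumption, and so is the choice to try both locations inside one project before moving to the next project.

```python
import warnings
from pathlib import Path
from typing import Optional, Sequence


def find_input(filename: str, sample: str, project: str,
               other_projects: Sequence[str] = (),
               subdirs: Sequence[str] = ("analysis", "compbio/analysis")) -> Optional[Path]:
    """Locate an input file, trying --project first and then --other_projects."""
    roots = [Path(project)] + [Path(p) for p in other_projects]
    # Pass 1: expected locations, EastRiver-style first, then compbio (assumed).
    for root in roots:
        for subdir in subdirs:
            candidate = root / f"Sample_{sample}" / subdir / filename
            if candidate.is_file():
                return candidate
    # Pass 2: warn, then search every project directory recursively.
    warnings.warn(f"{filename} not in expected locations; searching recursively")
    for root in roots:
        if not root.is_dir():
            continue
        for hit in sorted(root.rglob(filename)):
            if hit.is_file():
                return hit
    return None
```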
Below is an example of a command with multiple directories:
kancero/kancero6 \
--other_projects /data/analysis/CLIA/Project_current_B01_SOM_WGS \
/data/analysis/CLIA/Project_current_B02_SOM_WGS \
/data/analysis/CLIA/Project_current_B03_SOM_WGS \
/data/analysis/DarnellR/Project_current_B04_WGS \
--project /gpfs/internal/analysis/CLIA/Project_current_B06_SOM_WGS \
pipeline run \
--TN_file /gpfs/commons/home/jshelton/CLIV_10800_B01_SOM_WGS_tumor_normal_pairs.txt \
--genome Human_GRCh37 \
--library WGS \
--out_dir /gpfs/internal/analysis/CLIA/Project_current_B06_SOM_WGS
To get a list of genomes:
kancero/kancero6 \
pipeline genomes
When processing outside FASTQ files (e.g. from TCGA), you can get kancero to read run information from several kinds of Illumina headers rather than from the filename if needed. If this process fails to find the needed information in the headers, kancero will prompt you and instruct you on how to rewrite the headers.
kancero/kancero6 \
--project /gpfs/test/Project_GRCh38_decoy_v6 \
pipeline pre-process \
--library WGS \
--TN_file /gpfs/test/Project_GRCh38_decoy_v6/compbio/metadata/tumor_normal_pairs.txt \
--header True
All new features of the somatic preprocessing and calling pipeline support the new directory structure. Alignment steps read FASTQs from both the EastRiver and compbio FASTQ directories. Steps that start from BAMs read the first existing final.bam, looking in EastRiver and then in the compbio dirs. All new pipeline steps will also check other directories for you when you use the --other_projects flag (see Use on partly delivered directories).
Below is a list of new features:
- DONE: pre-process v2
- DONE: somatic calling v6
- DONE: pipeline resource usage summary
Some older kancero features for the old preprocessing and v5B pipeline will not be updated for the new directory structure. They can still be run by making your own directory in /gpfs/commons/groups/compbio/projects/Project_PROJECT/.
Below is a list of older features:
- DONE: Slide deck script
- DONE: Conpair
- DONE: Report generation for v5B projects
- DO NOT UPDATE: pre-process v1
- DO NOT UPDATE: somatic calling v5B
- DO NOT UPDATE: Summary generation for v5B projects
- DO NOT UPDATE: QC generation for v5B projects
Additional files or directories can still be created by the user in any compbio subdirectory. Kancero often requires a T/N metadata file; use the --TN_file flag to point to a file located anywhere.