sampleMANUAL

Quick start
Running cancer pre-processing pipeline
- Running the PDX preprocessing pipeline
Running QC report generation
Running concordance and contamination checking
Additional steps for projects with contemporary normal
Running the somatic variant calling pipeline
Running sample and project report generation
Delivering results
Appendix

1.Quick start

Run a complete project:

# Exome align
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome  \
pipeline pre-process \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome/compbio/metadata/tumor_normal_pairs.txt \
--library Exome \
--account compbio \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome_testing \
--trim True \
--interval-list SureSelect_V6plusCOSMIC.target.GRCh38_full_analysis_set_plus_decoy_hla
# Exome calling and report writing
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome  \
pipeline run \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome/compbio/metadata/tumor_normal_pairs.txt \
--library Exome \
--account compbio \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome_testing \
--interval-list SureSelect_V6plusCOSMIC.target.GRCh38_full_analysis_set_plus_decoy_hla

# WGS alignment
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS  \
pipeline pre-process \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS/compbio/metadata/tumor_normal_pairs.txt \
--library WGS \
--account compbio \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS_testing
# WGS calling and report writing
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS  \
pipeline run \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS/compbio/metadata/tumor_normal_pairs.txt \
--library WGS \
--account compbio \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS_testing
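
The two-stage flow above (pre-process, then run) can also be assembled programmatically. The following is a minimal sketch that only builds argument lists from flags already shown in this section. The helper name `build_commands` and the relative `kancero/kancero6` path are assumptions made for illustration, and nothing is executed.

```python
def build_commands(project, tn_file, library, genome, out_dir,
                   account="compbio", interval_list=None):
    """Return [pre-process command, run command] as argument lists.

    Hypothetical helper: it mirrors the quick-start commands and is not
    part of kancero itself.
    """
    base = ["kancero/kancero6", "--project", project]
    shared = ["--genome", genome, "--TN_file", tn_file,
              "--library", library, "--account", account,
              "--out_dir", out_dir]
    pre = base + ["pipeline", "pre-process"] + shared
    run = base + ["pipeline", "run"] + shared
    if library == "Exome":
        # The Exome examples trim adapters and pass an interval list.
        pre += ["--trim", "True"]
        if interval_list:
            pre += ["--interval-list", interval_list]
            run += ["--interval-list", interval_list]
    return [pre, run]


for command in build_commands(
        project="/path/to/myproject_WGS",
        tn_file="/path/to/myproject_WGS/compbio/metadata/tumor_normal_pairs.txt",
        library="WGS",
        genome="Human_GRCh38_full_analysis_set_plus_decoy_hla",
        out_dir="/path/to/myproject_WGS_testing"):
    print(" ".join(command))
```

Each list could then be passed to `subprocess.run(command, check=True)`, waiting for pre-processing to finish before starting the run stage.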

2.Cancer pre-processing pipeline

Almost all projects go through the pre-processing pipeline in EastRiver (ER) before we even hear about them, so someone in bioinformatics will rarely need to run this pipeline or ask ER to run it. If you do have to run the cancer pre-processing pipeline, you need the sample FASTQ files.

It is easiest to copy or link the original FASTQ files to the <PROJECT_DIR>/Sample_<SAMPLE_NAME>/compbio/fastq directories. Software engineering prefers not to use links, so if you want to run pre-processing through ER, copy the files instead of linking them.

# kancero preprocessing menu for v6 pipeline
usage: kancero6 pipeline pre-process [-h] [-p] [-s] [-g3] [--TN_file TN_FILE]
        [--genome {Human_GRCh37,Human_GRCh37_external,Human_GRCh38_external,Human_GRCh37_decoy,Human_GRCh38_full_analysis_set_plus_decoy_hla,Human_GRCh38_full_analysis_set_plus_decoy_hla_faculty,Mouse_GRCm38_mm10,Rat_Rnor_6_0}]
        [--library {WGS,Exome}]
        [--interval-list INTERVAL_LIST]
        [--out_dir OUT_DIR] [--substr SUBSTR]
        [--header {True,False}]
        [--trim {True,False}] [--pdx PDX]
        [--account account]

optional arguments:
  -h, --help            show this help message and exit
  -p, --port            Create a log to using when porting. Creates an md file
                        to edit for HTML or a gitlab page. Also writes a CWL
                        draft of commands [default=False]
  -s, --spark           Run GATK4 with Spark where possible
  -g3, --gatk3          Run GATK3.5 with where possible
  --TN_file TN_FILE
  --genome {Human_GRCh37,Human_GRCh37_external,Human_GRCh38_external,Human_GRCh37_decoy,Human_GRCh38_full_analysis_set_plus_decoy_hla,Human_GRCh38_full_analysis_set_plus_decoy_hla_faculty,Mouse_GRCm38_mm10,Rat_Rnor_6_0}
                        Genome key to use for pipeline. If not supplied
                        kancero will check the record of any previous kancero
                        runs in your out directory.
  --library {WGS,Exome}
                        Sequence library type. If not supplied kancero will
                        check the record of any previous kancero runs in your
                        out directory.
  --interval-list INTERVAL_LIST
                        File basename for interval list. If not supplied the
                        default (the SureSelect interval list for your genome)
                        will be used
  --out_dir OUT_DIR     Project directory for output files. The default is the
                        input project directory
  --substr SUBSTR       Substring indicating read one vs two. Must use 1 or 2
                        once to indicate read pair. Must also exist in the
                        last section of the filename after spliting the
                        filename at underscores.[default=.R1]
  --header {True,False}
                        Use the header rather than filename for lane and
                        flowcell information [default="False"]
  --trim {True,False}   Trim adapters from FASTQ files "True" by default for
                        Exomes from Exome
  --pdx PDX             Remove mouse reads from FASTQ. Use for patient derived
                        xenographs (PDX) projects. [default=False]
  --account account
                        Sets the --account flag for sbatch commands (e.g. compbio, dllab).
                        [default=compbio]

Examples

Kancero for v6 (WGS and Exome)
Kancero for v6 PDX preprocessing pipeline

Kancero for v6 (WGS and Exome)

The manual pipeline runs bwa-mem, Novosort, GATK4 BQSR and fixmate, flagstat, and other QC scripts.

The script expects FASTQ files in the <PROJECT_DIR>/Sample_<SAMPLE_NAME>/compbio/fastq or <PROJECT_DIR>/Sample_<SAMPLE_NAME>/fastq directory. It also expects the FASTQ filenames to follow this naming convention:

<SAMPLE_NAME>_<INDEX>_<FLOWCELL>_<LANE>_*.R?.fastq.gz

Example:

CTG-0435-D_AGCACCTC-_BC7JV4ANXX_L005_001.R1.fastq.gz or CTG-0435-D_AGCACCTC-_BC7JV4ANXX_L005_001.filtered.R1.fastq.gz

If your files do not follow this structure, you can use the --header flag to attempt to gather the needed information from the read header lines (e.g. when reprocessing TCGA data). Use the --substr flag to indicate a different read suffix. See the pipeline pre-process help menu for more information.
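
To check whether a set of FASTQ files already follows the naming convention before launching pre-processing, a small pattern match is enough. The sketch below is illustrative only: the regular expression and the function name `parse_fastq_name` are assumptions based on the example filenames above (a lane field such as L005 and a .R1 or .R2 read tag), not kancero's own parsing code.

```python
import re

# Assumption: the lane looks like "L005" and the read tag is ".R1" or ".R2",
# as in the example filenames. Other layouts need a different pattern.
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<index>[^_]+)_(?P<flowcell>[^_]+)_(?P<lane>L\d+)_"
    r"[^_]*\.R(?P<read>[12])\.fastq\.gz$"
)


def parse_fastq_name(filename):
    """Return the naming-convention fields as a dict, or None if no match."""
    match = FASTQ_NAME.match(filename)
    return match.groupdict() if match else None


for name in ("CTG-0435-D_AGCACCTC-_BC7JV4ANXX_L005_001.R1.fastq.gz",
             "CTG-0435-D_AGCACCTC-_BC7JV4ANXX_L005_001.filtered.R1.fastq.gz",
             "sample.fastq.gz"):
    print(name, "->", parse_fastq_name(name))
```

Files for which the function returns None are candidates for the --header or --substr options.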

# Exome
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome  \
pipeline pre-process \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome/compbio/metadata/tumor_normal_pairs.txt \
--library Exome \
--account compbio \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome_testing \
--trim True \
--interval-list SureSelect_V6plusCOSMIC.target.GRCh38_full_analysis_set_plus_decoy_hla

# WGS
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS  \
pipeline pre-process \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS/compbio/metadata/tumor_normal_pairs.txt \
--library WGS \
--account compbio \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS_testing

Kancero PDX preprocessing pipeline for v6

The pipeline first filters reads by aligning them with bwa-aln to a joint human/mouse reference. It then extracts all read pairs from the BAM that either do not map or for which at least one of the mates maps to human. Next, it aligns the filtered reads with bwa-mem and runs Novosort, GATK4 BQSR and fixmate, flagstat, and other QC scripts. The pipeline does not run: mark duplicates, bqsr and clipping. Filtered reads have mmu_filtered added to the FASTQ filename.
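
The read-pair selection rule described above can be written as a small predicate. This is a sketch of the rule as stated in the text, not the pipeline's implementation. The labels "human", "mouse" and None (unmapped) are assumptions made for illustration, and "do not map" is read here as both mates being unmapped.

```python
def keep_pair(mate1, mate2):
    """Decide whether a read pair survives the mouse filter.

    Each argument is "human", "mouse", or None (unmapped). This labeling
    is hypothetical; the real pipeline works on alignments to a joint
    human/mouse reference.
    """
    if mate1 is None and mate2 is None:
        return True  # the pair does not map to the joint reference
    return "human" in (mate1, mate2)  # at least one mate maps to human


for pair in [("human", "human"), ("human", "mouse"),
             ("mouse", "mouse"), (None, None)]:
    print(pair, "keep" if keep_pair(*pair) else "drop")
```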

# Exome
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome  \
pipeline pre-process \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome/compbio/metadata/tumor_normal_pairs.txt \
--library Exome \
--account compbio \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome_testing \
--trim True \
--interval-list SureSelect_V6plusCOSMIC.target.GRCh38_full_analysis_set_plus_decoy_hla \
--pdx True

# WGS
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS  \
pipeline pre-process \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS/compbio/metadata/tumor_normal_pairs.txt \
--library WGS \
--account compbio \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS_testing \
--pdx True

3.Running QC report generation

QC report generation runs automatically after pre-processing. You only need this step for projects that do not work with outpost, and you only need to run it yourself if pre-processing was incomplete or you are starting with an external BAM. Run (can be WGS or Exome):

usage: kancero6 pipeline create-qc-reports [-h] [--TN_file TN_FILE]
                                           [--library {WGS,Exome}]
                                           [--genome {Human_GRCh37,Human_GRCh37_decoy,Human_GRCh37_external,Human_GRCh38_external,Human_GRCh38_full_analysis_set_plus_decoy_hla,Human_GRCh38_full_analysis_set_plus_decoy_hla_faculty,Mouse_GRCm38_mm10,Rat_Rnor_6_0}]
                                           --project_name PROJECT_NAME
                                           [--account account]
                                           [--out_dir OUT_DIR]
                                           [--no-autocorrelation {True,False}]

optional arguments:
  -h, --help            show this help message and exit
  --TN_file TN_FILE
  --library {WGS,Exome}
                        Sequence library type. If not supplied kancero will
                        check the record of any previous kancero runs in your
                        out directory.
  --genome {Human_GRCh37,Human_GRCh37_decoy,Human_GRCh37_external,Human_GRCh38_external,Human_GRCh38_full_analysis_set_plus_decoy_hla,Human_GRCh38_full_analysis_set_plus_decoy_hla_faculty,Mouse_GRCm38_mm10,Rat_Rnor_6_0}
                        Genome key to use for pipeline. If not supplied
                        kancero will check the record of any previous kancero
                        runs in your out directory.
  --project_name PROJECT_NAME
                        Project name for report text
  --account account
                        Sets the --account flag for sbatch commands (e.g. compbio, dllab).
                        [default=compbio]
  --out_dir OUT_DIR     Project directory for output files. The default is the
                        input project directory
  --no-autocorrelation {True,False}
                        Project has no autocorrelation file (e.g. mouse runs)
                        (default='False')

# Example:

kancero/kancero6 \
--project /gpfs/commons/projects/MY_PROJECT \
pipeline create-qc-reports \
--TN_file /gpfs/commons/projects/MY_PROJECT/tumor_normal_pairs.txt \
--project_name MY_PROJECT \
--account compbio \
--out_dir /gpfs/commons/projects/MY_PROJECT \
--library WGS \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla

Results will be generated in /gpfs/commons/projects/MY_PROJECT/compbio/Reports

*Note: You will usually view QC plots in http://outpost.nygenome.org/

4.Concordance and contamination checking

ConPair Summary Report after EastRiver run
Running ConPair through kancero
Running ConPair for Cancer Alliance samples
Running kancero for DNA/RNA concordance or sample-sample concordance between 2 different DNA samples

ConPair summary after EastRiver

ConPair runs in EastRiver. For a tumor-normal pair, the concordance file is currently <PROJECT_DIR>/Sample_<tumor>/qc/<tumor>--<normal>.concordance.homoz.conpair-20160318.01.txt and the contamination file is <PROJECT_DIR>/Sample_<tumor>/qc/<tumor>--<normal>.contamination.conpair-20160318.01.txt

To generate concordance summary, run:

kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
conpair summary \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS/metadata/tumor_normal_pairs.txt \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla

The concordance summary PDF will be saved in the <PROJECT_DIR>/compbio/Summary directory as Concordance_<PROJECT_NAME>.pdf

Running ConPair through kancero

Requires a T/N metadata file

usage: kancero6 conpair project [-h] [--out_dir OUT_DIR]
                                [--genome {Human_GRCh37,Human_GRCh37_decoy,Human_GRCh37_external,Human_GRCh38_external,Human_GRCh38_full_analysis_set_plus_decoy_hla,Human_GRCh38_full_analysis_set_plus_decoy_hla_faculty}]
                                [--TN_file TN_FILE]
                                [--account account]

kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
conpair project \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS/compbio/metadata/tumor_normal_pairs.txt \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla

For a tumor-normal pair, the concordance file is currently <PROJECT_DIR>/Sample_<tumor>/compbio/qc/<tumor>--<normal>.concordance.homoz.conpair-v1.0.txt and the contamination file is <PROJECT_DIR>/Sample_<tumor>/compbio/qc/<tumor>--<normal>.contamination.conpair-v1.0.txt
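
Before generating the summary, it can help to confirm that every pair has both output files. The sketch below builds the expected paths from the pattern given above. The function names and the `version` parameter are assumptions for illustration, and the version suffix may differ in other releases.

```python
from pathlib import Path


def conpair_outputs(project_dir, tumor, normal, version="v1.0"):
    """Build the expected ConPair output paths for one tumor-normal pair."""
    qc_dir = Path(project_dir) / f"Sample_{tumor}" / "compbio" / "qc"
    pair = f"{tumor}--{normal}"
    return {
        "concordance": qc_dir / f"{pair}.concordance.homoz.conpair-{version}.txt",
        "contamination": qc_dir / f"{pair}.contamination.conpair-{version}.txt",
    }


def missing_outputs(project_dir, pairs):
    """List the expected ConPair files that do not exist yet."""
    return [path
            for tumor, normal in pairs
            for path in conpair_outputs(project_dir, tumor, normal).values()
            if not path.exists()]


for path in missing_outputs("/path/to/myproject_WGS", [("T1", "N1")]):
    print("missing:", path)
```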

To generate concordance summary, run:

usage: kancero6 conpair summary [-h] [--account account]
                                [--out_dir OUT_DIR]
                                [--genome {Human_GRCh37,Human_GRCh37_decoy,Human_GRCh37_external,Human_GRCh38_external,Human_GRCh38_full_analysis_set_plus_decoy_hla,Human_GRCh38_full_analysis_set_plus_decoy_hla_faculty}]
                                [--TN_file TN_FILE]

kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
conpair summary \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS/metadata/tumor_normal_pairs.txt \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla

The concordance summary PDF will be saved in the <PROJECT_DIR>/compbio/Summary directory as Concordance_<PROJECT_NAME>.pdf

Running kancero for DNA/RNA concordance or sample-sample concordance between 2 different DNA samples

usage: kancero6 conpair sample [-h]
                               [--genome {Human_GRCh37,Human_GRCh37_decoy,Human_GRCh37_external,Human_GRCh38_external,Human_GRCh38_full_analysis_set_plus_decoy_hla,Human_GRCh38_full_analysis_set_plus_decoy_hla_faculty}]
                               --sample SAMPLE --sample2 SAMPLE2
                               --sample_library {WGS,Exome,RNA}
                               --sample2_library {WGS,Exome,RNA}
                               [--out_dir OUT_DIR] [--rna_genome RNA_GENOME]
                               [--account account]
                               [--concordance_only]

Use --other_projects to specify the additional project directories to search for BAM files.

kancero/kancero6 \
--other_projects /data/analysis/Project_WGS_test_kancero \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
conpair sample \
--genome Human_GRCh37 \
--sample CA-0073T-D-W \
--sample_library WGS \
--sample2 CA-0073T-R \
--sample2_library RNA \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS

By default, this runs both sample-sample2 concordance and sample-sample2 contamination.

You must specify exactly two samples:

--sample will be treated as 'tumor' and --sample2 will be treated as 'normal'.

kancero/kancero6 \
--other_projects /data/analysis/Project_WGS_test_kancero \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
conpair sample \
--genome Human_GRCh37 \
--sample CA-0073T-D-W \
--sample_library WGS \
--sample2 CA-0073N-D-W \
--sample2_library WGS \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS

5.Additional steps for projects with contemporary normal

If the project has tumor-only samples and a contemporary normal has not been sequenced along with them, create a directory for the contemporary normal, then link the BAM file and copy the QC files from the appropriate contemporary normal sample in /gpfs/commons/datasets/old-nygc-resources/Somatic_Pipelines/contemporary_normals

For example, if it is a human WGS project and the tumor was sequenced on HiSeqX v3 PCR-free, do the following:

mkdir -p <PROJECT_DIR>/Sample_NA12878/analysis
ln -s /gpfs/commons/datasets/old-nygc-resources/Somatic_Pipelines/contemporary_normals/wgs/Xten/v3/pcr_free/Sample_NA12878/analysis/NA12878.final.ba* <PROJECT_DIR>/Sample_NA12878/analysis/.
mkdir <PROJECT_DIR>/Sample_NA12878/qc
cp /gpfs/commons/datasets/old-nygc-resources/Somatic_Pipelines/contemporary_normals/wgs/Xten/v3/pcr_free/Sample_NA12878/qc/*.* <PROJECT_DIR>/Sample_NA12878/qc/.

6.Running the somatic calling pipeline

Requires a T/N metadata file, for example: PROJECT_DIR/compbio/metadata/tumor_normal_pairs.txt

*Note: Most pipelines are run on EastRiver (ER). However, ER does not generate sample reports or the project summary. Those have to be generated using kancero (see next section).

The commands below run the entire pipeline (i.e., variant calling, merging, annotation, and report generation). The new pipeline includes an --out_dir flag that can redirect output to a different location. By default, the output directory is the project directory.

usage: kancero6 pipeline run [-h]
                             [--genome {Human_GRCh37,Human_GRCh37_decoy,Human_GRCh37_external,Human_GRCh38_external,Human_GRCh38_full_analysis_set_plus_decoy_hla,Human_GRCh38_full_analysis_set_plus_decoy_hla_faculty,Mouse_GRCm38_mm10}]
                             [--TN_file TN_FILE] [--library {WGS,Exome}]
                             [--interval-list INTERVAL_LIST] [-p]
                             [--account account] [--out_dir OUT_DIR]
                             [--run_steps [{strelka2,lancet,lancet_post,manta,facets,excavator2,lancet_wgs,lancet_post_wgs,svaba,mutect2,mutect2_post,mantis,lumpy,svtyper,lumpy_post,svtyper_filter,bicseq2,optitype,kourami,haplotypecaller,haplotypecaller_post,baf,annotate_hap} [{strelka2,lancet,lancet_post,manta,facets,excavator2,lancet_wgs,lancet_post_wgs,svaba,mutect2,mutect2_post,mantis,lumpy,svtyper,lumpy_post,svtyper_filter,bicseq2,optitype,kourami,haplotypecaller,haplotypecaller_post,baf,annotate_hap} ...]]]
                             [--post_run_steps [{prep,merge_callers,merge_chroms,annotate,deconstructsig,annotate_sv_cnv,deliver} [{prep,merge_callers,merge_chroms,annotate,deconstructsig,annotate_sv_cnv,deliver} ...]]]
                             [--report_run_steps [{create_reports,create_project_level} [{create_reports,create_project_level} ...]]]
                             [--meta META] [--severe]

optional arguments:
  -h, --help            show this help message and exit
  --genome {Human_GRCh37,Human_GRCh37_decoy,Human_GRCh37_external,Human_GRCh38_external,Human_GRCh38_full_analysis_set_plus_decoy_hla,Human_GRCh38_full_analysis_set_plus_decoy_hla_faculty,Mouse_GRCm38_mm10}
                        Genome key to use for pipeline. If not supplied
                        kancero will check the record of any previous kancero
                        runs in your out directory.
  --TN_file TN_FILE
  --library {WGS,Exome}
                        Sequence library type. If not supplied kancero will
                        check the record of any previous kancero runs in your
                        out directory.
  --interval-list INTERVAL_LIST
                        File basename for interval list. If not supplied the
                        default (the SureSelect interval list for your genome)
                        will be used
  -p, --port            Create a log to using when porting. Creates an md file
                        to edit for HTML or a gitlab page. Also writes a CWL
                        draft of commands [default=False]
  --account account
                        Sets the --account flag for sbatch commands (e.g. compbio, dllab).
                        [default=compbio]
  --out_dir OUT_DIR     Project directory for output files. The default is the
                        input project directory
  --run_steps [{strelka2,lancet,lancet_post,manta,facets,excavator2,lancet_wgs,lancet_post_wgs,svaba,mutect2,mutect2_post,mantis,lumpy,svtyper,lumpy_post,svtyper_filter,bicseq2,optitype,kourami,haplotypecaller,haplotypecaller_post,baf,annotate_hap} [{strelka2,lancet,lancet_post,manta,facets,excavator2,lancet_wgs,lancet_post_wgs,svaba,mutect2,mutect2_post,mantis,lumpy,svtyper,lumpy_post,svtyper_filter,bicseq2,optitype,kourami,haplotypecaller,haplotypecaller_post,baf,annotate_hap} ...]]
                        Callers to run. Leave blank space beside flag to skip
                        all steps. Default is to run all steps except
                        lancet_wgs and lancet_post_wgs.
  --post_run_steps [{prep,merge_callers,merge_chroms,annotate,deconstructsig,annotate_sv_cnv,deliver} [{prep,merge_callers,merge_chroms,annotate,deconstructsig,annotate_sv_cnv,deliver} ...]]
                        Final steps to run.Leave blank space beside flag to
                        skip all steps. Default is to run all steps.
  --report_run_steps [{create_reports,create_project_level} [{create_reports,create_project_level} ...]]
                        Report steps to run.Leave blank space beside flag to
                        skip all steps. Default is to run all steps.
  --meta META           CSV file with a header line.Header line should start
                        with "pair_name" and go on to list any other groups
                        that can be used to classify the pairs.

# Exome
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome \
pipeline run \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome/compbio/metadata/tumor_normal_pairs.txt \
--account compbio \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome \
--library Exome \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--interval-list SureSelect_V6plusCOSMIC.target.GRCh38_full_analysis_set_plus_decoy_hla

# WGS
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
pipeline run \
--TN_file /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS/compbio/metadata/tumor_normal_pairs.txt \
--library WGS \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS
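
The help text for --run_steps, --post_run_steps and --report_run_steps says that a bare flag skips all steps while omitting the flag runs all of them. The sketch below reproduces that behavior with Python's argparse using `nargs="*"`. How kancero6 actually defines these flags is an assumption inferred from the usage text, not taken from its source.

```python
import argparse

POST_STEPS = ["prep", "merge_callers", "merge_chroms", "annotate",
              "deconstructsig", "annotate_sv_cnv", "deliver"]

# Hypothetical stand-in parser; only one of the three step flags is modeled.
parser = argparse.ArgumentParser(prog="sketch")
parser.add_argument("--post_run_steps", nargs="*", choices=POST_STEPS,
                    default=POST_STEPS)

# Flag omitted: the default (all steps) applies.
print(parser.parse_args([]).post_run_steps)
# Bare flag with nothing after it: an empty list, so every step is skipped.
print(parser.parse_args(["--post_run_steps"]).post_run_steps)
# Flag followed by step names: only those steps run.
print(parser.parse_args(["--post_run_steps", "annotate", "deliver"]).post_run_steps)
```

On the command line, this corresponds to passing for example `--post_run_steps annotate deliver` to rerun only annotation and delivery.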

7.Running sample and project report generation

Requires a tumor normal pairs file.

usage: kancero6 pipeline create-reports [-h]
                                        [--genome {Human_GRCh37,Human_GRCh37_decoy,Human_GRCh37_external,Human_GRCh38_external,Human_GRCh38_full_analysis_set_plus_decoy_hla,Human_GRCh38_full_analysis_set_plus_decoy_hla_faculty,Mouse_GRCm38_mm10}]
                                        [--TN_file TN_FILE]
                                        [--library {WGS,Exome}] --project_name
                                        PROJECT_NAME
                                        [--account account]
                                        [--out_dir OUT_DIR] [--severe]
                                        [--meta META]
                                        [--report_run_steps [{create_reports,create_project_level} [{create_reports,create_project_level} ...]]]

# WGS
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS \
pipeline create-reports \
--project_name somatic_GRCh38_WGS \
--TN_file /data/analysis/NYGC/Project_preprocess_somatic_GRCh38/metadata/tumor_normal_pairs.txt \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--library WGS \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS

# Exome
kancero/kancero6 \
--project /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome \
pipeline create-reports \
--project_name somatic_GRCh38_Exome \
--TN_file /data/analysis/NYGC/Project_preprocess_somatic_GRCh38/metadata/tumor_normal_pairs.txt \
--genome Human_GRCh38_full_analysis_set_plus_decoy_hla \
--library Exome \
--out_dir /nfs/sweng/validation/data/analysis/NYGC/myproject_Exome

8.Delivering results through kancero

Requires a T/N metadata file, for example: PROJECT_DIR/compbio/metadata/tumor_normal_pairs.txt

Delivery is currently a manual process.

Appendix

A. File format : subject metadata file

This file contains information about each pair in a project. It is currently used for figures in reports. It is a CSV file whose first line is a required header. The first column must always be pair_name, and its values must have the form TUMOR--NORMAL. The remaining columns can contain integers, strings, or floats. Leave cells with no information blank. An example is shown below and can also be found at /gpfs/commons/groups/nygcfaculty/kancero/data/examples/subject_meta.txt. This file can be used with the --meta flag.

pair_name,patient,platform,depth
COLO-829--COLO-829BL,COLO-829,hiseq,high
COLO-829_80--COLO-829BL_40,COLO-829,hiseq,downsample
COLO-829-NovaSeq--COLO-829BL-NovaSeq,COLO-829,novaseq,high
COLO-829-NovaSeq_80--COLO-829BL-NovaSeq_40,COLO-829,novaseq,downsample
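
A metadata file can be checked against these rules before it is passed to --meta. The following is a minimal sketch: the function name `check_subject_meta` is hypothetical, and the column-count check is an added assumption (blank cells still need their commas) rather than a documented kancero requirement.

```python
import csv
import io


def check_subject_meta(text):
    """Return a list of problems found in subject metadata CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or not rows[0] or rows[0][0] != "pair_name":
        return ["header must start with pair_name"]
    problems = []
    width = len(rows[0])
    for number, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # skip empty lines
        tumor, separator, normal = row[0].partition("--")
        if not (tumor and separator and normal):
            problems.append(f"line {number}: pair_name is not TUMOR--NORMAL")
        if len(row) != width:
            problems.append(f"line {number}: expected {width} columns")
    return problems


example = """pair_name,patient,platform,depth
COLO-829--COLO-829BL,COLO-829,hiseq,high
COLO-829_80--COLO-829BL_40,COLO-829,hiseq,downsample
"""
print(check_subject_meta(example))
```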

B. File format : tumor normal pairs file

Make sure $project/compbio/metadata/tumor_normal_pairs.txt contains the correct pairs:

Let’s say we have three tumor--normal pairs, T1--N1, T2--N2, and T3--N3, then the file should look like this (header line optional but must start with a # if included):

#Tumor	normal
T1	N1
T2	N2
T3	N3

Lines with ‘#’ are ignored, so additional pairs that are not ready yet can stay in the file, but they need to be commented out with ‘#’.
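
The format above is simple enough to read with a few lines of code, which is useful for sanity-checking a pairs file before a run. This sketch assumes that commented lines start with '#' and that the two columns are separated by whitespace (tabs in the example). The function name `read_pairs` is hypothetical.

```python
def read_pairs(lines):
    """Return (tumor, normal) tuples, ignoring '#' lines and blank lines."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # header or commented-out pair
        fields = line.split()  # assumption: tab or other whitespace separator
        if len(fields) != 2:
            raise ValueError(f"expected 2 columns, got: {line!r}")
        pairs.append((fields[0], fields[1]))
    return pairs


example = ["#Tumor\tnormal", "T1\tN1", "#T2\tN2", "T3\tN3"]
print(read_pairs(example))
```

An open file handle can be passed in place of the list, since the function only iterates over lines.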

C. Github best practices

Before publishing (“pushing”) a new version of kancero to the master branch we may ask you to test our updates using our “staging” branch. Please provide any suggestions or feedback on these changes so that we can make improvements before updating the master branch that we all use.

# To get a local copy of the staging branch:
git checkout master # start on the master branch
git pull # pull any updates
git fetch origin # the next two commands allow you to get a list of available branches
git branch -v -a
git checkout -b staging origin/staging # checkout a copy of the staging branch
# Now you can run the newest version of the pipeline scripts 
git checkout master # returns you to the master branch copy of scripts
git pull  # gets you an updated copy of the master branch once we have pushed the new update out 
# To return to your local staging branch:
git checkout staging
git pull
# Now you can run the newest version of the pipeline scripts
git checkout master # returns you to the master branch copy of scripts
git pull # gets you an updated copy of the master branch once we have pushed the new update out 
# If you are making a change in the kancero code or documentation please follow the steps outlined in the “Kancero dev workflow” doc https://docs.google.com/document/d/1qDdChcboXlF6PIA70qODTbvXMuhbndESUBP7dlWlOL8/edit

D. Use Kancero to plot cluster usage

Kancero can plot run time, wait time, core hours, exit status, and memory usage. The required input is a file named <project>/compbio/logs/<anything>_job_ids.txt. New kancero steps create this file automatically, but you can also create your own. Each line of the file should contain a job id and the number of component parts represented by that job id. The number of component parts is typically the number of samples processed under that job id, which lets you see how many core hours are used per sample. The job_ids.txt file should be a comma-separated (CSV) file.
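
As an illustration of the job_ids.txt layout described above, the sketch below parses lines of the form "job id, number of component parts" and shows why the part count matters. It assumes the file has no header line. The function names and the job ids are made up for the example.

```python
import csv


def read_job_ids(lines):
    """Map each job id to the number of component parts it represents."""
    parts_by_job = {}
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        parts_by_job[row[0].strip()] = int(row[1])
    return parts_by_job


def core_hours_per_part(total_core_hours, parts):
    """Spread a job's core hours evenly over its component parts."""
    return total_core_hours / parts


# Hypothetical job ids: the first job processed four samples.
jobs = read_job_ids(["1234567,4", "1234568,1"])
print(jobs)
print(core_hours_per_part(10.0, jobs["1234567"]))
```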

Usage runs automatically after kancero steps, so check its output after any run to confirm that all exit status values are 0.

If you want to run usage manually on any file you have put in a logs directory do the following:

kancero/kancero6 \
--project  /nfs/sweng/validation/data/analysis/NYGC/myproject_WGS  \
pipeline usage

The output plots and tables will have the same prefix as the input job_ids.txt file.

E. Use on partly delivered directories

Kancero can now read input from multiple directories. It first searches for the input BAM or file in the directory given by --project, and then in the additional directories given by the --other_projects flag. Steps search in EastRiver and then in compbio output locations. If kancero still cannot find the file, it posts a warning and begins recursively searching all given project directories for the filename. Note: because the filenames of FASTQ files are not known in advance, the recursive search cannot be used for pre-processing. For pre-processing, all raw FASTQ files (e.g. without trimmed or mmu_filtered) in the EastRiver (/PROJECT_DIR/Sample_SAMPLE/fastq) AND the compbio (/PROJECT_DIR/Sample_SAMPLE/compbio/fastq) directories AND in matching sample directories in additional project directories (/ALT_PROJECT_DIR/Sample_SAMPLE/fastq) will be used.
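
The FASTQ collection rule for pre-processing can be sketched as follows. This is an illustration of the directories and exclusions named above, not kancero's code. The function name `raw_fastqs`, the `*.fastq.gz` glob, and the substring test for trimmed and mmu_filtered files are assumptions.

```python
from pathlib import Path

SKIP_TAGS = ("trimmed", "mmu_filtered")  # processed FASTQs are not raw input


def raw_fastqs(sample, project, other_projects=()):
    """Collect raw FASTQ files for one sample across project directories."""
    sample_dir = f"Sample_{sample}"
    fastq_dirs = [Path(project) / sample_dir / "fastq",              # EastRiver
                  Path(project) / sample_dir / "compbio" / "fastq"]  # compbio
    fastq_dirs += [Path(alt) / sample_dir / "fastq" for alt in other_projects]
    found = []
    for directory in fastq_dirs:
        if not directory.is_dir():
            continue  # a location that does not exist is simply skipped
        for path in sorted(directory.glob("*.fastq.gz")):
            if not any(tag in path.name for tag in SKIP_TAGS):
                found.append(path)
    return found


print(raw_fastqs("SAMPLE", "/PROJECT_DIR", ["/ALT_PROJECT_DIR"]))
```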

Below is an example of a command with multiple directories:

kancero/kancero6 \
--other_projects /data/analysis/CLIA/Project_current_B01_SOM_WGS \
/data/analysis/CLIA/Project_current_B02_SOM_WGS \
/data/analysis/CLIA/Project_current_B03_SOM_WGS \
/data/analysis/DarnellR/Project_current_B04_WGS \
--project /gpfs/internal/analysis/CLIA/Project_current_B06_SOM_WGS  \
pipeline run \
--TN_file /gpfs/commons/home/jshelton/CLIV_10800_B01_SOM_WGS_tumor_normal_pairs.txt \
--genome Human_GRCh37 \
--library WGS \
--out_dir /gpfs/internal/analysis/CLIA/Project_current_B06_SOM_WGS

F. Use Kancero to list available resources

To get a list of genomes:

kancero/kancero6 \
pipeline genomes

G. Preprocess outside FASTQs

When processing outside FASTQ files (e.g. from TCGA), kancero can read run information from several kinds of Illumina headers rather than from the filename if needed. Kancero will prompt you if it fails to find the needed information in the headers and will then instruct you on how to rewrite the headers.

kancero/kancero6 \
--project /gpfs/test/Project_GRCh38_decoy_v6 \
pipeline pre-process \
--library WGS \
--TN_file /gpfs/test/Project_GRCh38_decoy_v6/compbio/metadata/tumor_normal_pairs.txt \
--header True

H. Kancero and the new directory structure

All new features of the somatic preprocessing and calling pipeline support the new directory structure. Alignment steps read FASTQs from both the EastRiver and compbio FASTQ directories. Steps that start from BAMs read the first existing final.bam, looking in EastRiver and then in the compbio directories. All new pipeline steps will also search other directories given with the --other_projects flag (see Use on partly delivered directories).

Below is a list of new features:

DONE: pre-process v2
DONE: somatic calling v6
DONE: pipeline resource usage summary

Some older kancero features for the old preprocessing and v5B pipeline will not be updated for the new directory structure. They can still be run by making your own directory in /gpfs/commons/groups/compbio/projects/Project_PROJECT/.

Below is a list of older features:

DONE: Slide deck script
DONE: Conpair
DONE: Report generation for v5B projects
DO NOT UPDATE: pre-process v1
DO NOT UPDATE: somatic calling v5B
DO NOT UPDATE: Summary generation for v5B projects
DO NOT UPDATE: QC generation for v5B projects

Additional files or directories can still be created by the user in any compbio subdirectory. Kancero often requires a T/N metadata file; use the --TN_file flag to point to a file located anywhere.
