|
| 1 | +--- |
| 2 | +title: "Mutant Proteoform Prediction" |
| 3 | +description: "Call somatic/germline DNA and RNA variants, integrate them, and translate transcripts to mutant proteoforms using long-read DNA + RNA sequencing data." |
| 4 | +--- |
| 5 | + |
| 6 | +This pipeline is Exacto's primary use case: identify germline and |
| 7 | +somatic DNA variants as well as RNA variants from a case sample |
| 8 | +(e.g. a tumor), integrate the DNA and RNA variants, and translate the mutant peptide sequences they encode. |
| 9 | + |
| 10 | +External tools you'll need alongside exacto: |
| 11 | + |
| 12 | +- [`Minimap2`](https://github.com/lh3/minimap2) — long-read DNA/RNA alignment |
| 13 | +- [`Samtools`](https://www.htslib.org/) — BAM manipulation and indexing |
| 14 | +- [`RNA-Bloom2`](https://github.com/bcgsc/RNA-Bloom) — long-read transcriptome assembly |
| 15 | + |
| 16 | +## Workflow |
| 17 | + |
| 18 | +```{mermaid} |
| 19 | +%%{init: {'securityLevel': 'loose', 'flowchart': {'rankSpacing': 20, 'nodeSpacing': 40, 'subGraphTitleMargin': {'top': 5, 'bottom': 20}}}}%% |
| 20 | +flowchart TD |
| 21 | + subgraph DNA["<span style='white-space:nowrap;color:#414a4c'>DNA variant calling</span>"] |
| 22 | + direction LR |
| 23 | + DNAIN[/"<span style='white-space:nowrap;color:#414a4c'>Tumor + normal BAMs</span>"/] --> CALLDNA(call-somatic-dna-vars) |
| 24 | + CALLDNA --> ANN(annotate-vars) |
| 25 | + ANN --> ANNOUT[/"<span style='white-space:nowrap'>Annotated DNA variants TSV</span>"/] |
| 26 | + end |
| 27 | +
|
| 28 | + subgraph RNA["<span style='white-space:nowrap;color:#414a4c'>RNA variant calling</span>"] |
| 29 | + direction LR |
| 30 | + RNAIN[/"<span style='white-space:nowrap;color:#414a4c'>Assembled transcriptome BAM</span>"/] --> RM(remove-unspliced-rnas) |
| 31 | + RM --> CALLRNA(call-rna-vars) |
| 32 | + CALLRNA --> RNAOUT[/"<span style='white-space:nowrap;color:#414a4c'>RNA variants TSV</span>"/] |
| 33 | + CALLRNA --> TS[/"<span style='white-space:nowrap;color:#414a4c'>Transcript structures TSV</span>"/] |
| 34 | + end |
| 35 | +
|
| 36 | + ANNOUT --> INT(integrate-vars) |
| 37 | + RNAOUT --> INT |
| 38 | + INT --> INTOUT[/"<span style='white-space:nowrap;color:#414a4c'>Integrated DNA + RNA variants TSV</span>"/] |
| 39 | +
|
| 40 | + TS --> TR(translate-structs) |
| 41 | + RNAOUT --> TR |
| 42 | + INTOUT --> TR |
| 43 | + TR --> PRIMTSV[/"<span style='white-space:nowrap;color:#414a4c'>Primary structures TSV</span>"/] |
| 44 | + TR --> PRIMFA[/"<span style='white-space:nowrap;color:#414a4c'>Primary structures FASTA</span>"/] |
| 45 | +
|
| 46 | + PRIMTSV --> CPV(call-peptide-vars) |
| 47 | + CPV --> PEPS[/"<span style='white-space:nowrap;color:#414a4c'>Peptide variants TSV</span>"/] |
| 48 | +
|
| 49 | + style DNA fill:#ffffff,stroke:#bbbbbb |
| 50 | + style RNA fill:#ffffff,stroke:#bbbbbb |
| 51 | +
|
| 52 | + click CALLDNA href "../cli/call-somatic-dna-vars.html" "View call-somatic-dna-vars docs" _self |
| 53 | + click ANN href "../cli/annotate-vars.html" "View annotate-vars docs" _self |
| 54 | + click RM href "../cli/remove-unspliced-rnas.html" "View remove-unspliced-rnas docs" _self |
| 55 | + click CALLRNA href "../cli/call-rna-vars.html" "View call-rna-vars docs" _self |
| 56 | + click INT href "../cli/integrate-vars.html" "View integrate-vars docs" _self |
| 57 | + click TR href "../cli/translate-structs.html" "View translate-structs docs" _self |
| 58 | + click CPV href "../cli/call-peptide-vars.html" "View call-peptide-vars docs" _self |
| 59 | +
|
| 60 | + classDef linked text-decoration:underline; |
| 61 | + class CALLDNA,ANN,RM,CALLRNA,INT,TR,CPV linked |
| 62 | +``` |
| 63 | + |
| 64 | +## Step 1. Align long reads |
| 65 | + |
| 66 | +Align tumor and normal long-read DNA to the reference genome with Minimap2, then sort and index with samtools. |
| 67 | +Please make sure `--cs` is specified for Minimap2 as Exacto relies on the `CS` tag to identify variants: |
| 68 | + |
| 69 | +```bash |
| 70 | +# Tumor |
| 71 | +minimap2 -ax map-hifi --cs --eqx -Y -L --secondary=no \ |
| 72 | + reference.fasta tumor_dna.fastq.gz \ |
| 73 | + | samtools sort -o tumor_dna.sorted.bam |
| 74 | +samtools index tumor_dna.sorted.bam |
| 75 | + |
| 76 | +# Normal — repeat with normal_dna.fastq.gz → normal_dna.sorted.bam |
| 77 | +``` |
| 78 | + |
| 79 | +## Step 2. Identify somatic DNA variants |
| 80 | + |
| 81 | +Identify case-specific (somatic) variants in tumor against matched normal: |
| 82 | + |
| 83 | +```bash |
| 84 | +exacto call-somatic-dna-vars \ |
| 85 | + --bam-file tumor_dna.sorted.bam \ |
| 86 | + --bam-bai-file tumor_dna.sorted.bam.bai \ |
| 87 | + --fasta-file reference.fasta \ |
| 88 | + --control-bam-files normal_dna.sorted.bam \ |
| 89 | + --control-bam-bai-files normal_dna.sorted.bam.bai \ |
| 90 | + --output-tsv-file tumor_specific_dna_variants.tsv |
| 91 | +``` |
| 92 | + |
| 93 | +## Step 3. Annotate the somatic DNA variants |
| 94 | + |
| 95 | +Add gene/isoform level contexts using a GENCODE GTF: |
| 96 | + |
| 97 | +```bash |
| 98 | +exacto annotate-vars \ |
| 99 | + --tsv-file tumor_specific_dna_variants.tsv \ |
| 100 | + --reference-gene-annotation-file gencode.gtf.gz \ |
| 101 | + --reference-gene-annotation-source gencode \ |
| 102 | + --reference-gene-annotation-assembly hg38 \ |
| 103 | + --reference-gene-annotation-version v45 \ |
| 104 | + --output-tsv-file tumor_specific_dna_variants.annotated.tsv |
| 105 | +``` |
| 106 | + |
| 107 | +## Step 4. Assemble and align the tumor transcriptome |
| 108 | + |
| 109 | +Assemble long-read RNA with [RNA-Bloom2](https://github.com/BirolLab/RNA-Bloom), then align the assembled |
| 110 | +contigs back to the reference genome with minimap2 and sort/index with samtools. Transcriptome assembly is necessary |
| 111 | +because polyA-capture long-read RNA-seq commonly yields 5'-truncated reads; the assembler stitches them into full-length |
| 112 | +transcripts. |
| 113 | + |
| 114 | +Assemble tumor transcripts using RNA-bloom2: |
| 115 | +```bash |
| 116 | +java -jar RNA-Bloom.jar \ |
| 117 | + -long tumor_rna.fastq.gz \ |
| 118 | + --outdir rnabloom2_outputs/ \ |
| 119 | + -chimera [-lrpb] |
| 120 | +``` |
| 121 | + |
| 122 | +Filter RNA-bloom2 transcripts using [Nexus](https://pirl-unc.github.io/nexus/utilities/filter_rnabloom2_transcripts.html): |
| 123 | +```bash |
| 124 | +nexus_filter_rnabloom2_transcripts \ |
| 125 | + --assembly4-pol-fasta-file rnabloom2_outputs/rnabloom.longreads.assembly4.pol.fa \ |
| 126 | + --assembly3-map-paf-file rnabloom2_outputs/rnabloom.longreads.assembly3.map.paf.gz \ |
| 127 | + --output-reads-tsv-file rnabloom2_outputs/rnalboom_longreads_filtered_reads.tsv \ |
| 128 | + --output-transcripts-tsv-file rnabloom2_outputs/rnalboom_longreads_filtered_transcripts.tsv \ |
| 129 | + --output-fasta-file rnabloom2_outputs/rnalboom_longreads_filtered_transcripts.fasta |
| 130 | +``` |
| 131 | + |
| 132 | +Align the assembled tumor transcriptome. Please make sure `--cs` is specified for Minimap2 as Exacto relies on the `CS` tag to identify variants: |
| 133 | +```bash |
| 134 | +minimap2 -ax splice:hq -uf --cs --eqx -Y -L --secondary=no \ |
| 135 | + reference.fasta rnabloom2_outputs/rnalboom_longreads_filtered_transcripts.fasta \ |
| 136 | + | samtools sort -o tumor_rna_assembly.sorted.bam |
| 137 | +samtools index tumor_rna_assembly.sorted.bam |
| 138 | +``` |
| 139 | + |
| 140 | +## Step 5. Filter unspliced RNAs |
| 141 | + |
| 142 | +Drop assembled transcripts that are likely unspliced RNAs. Note that `remove-unspliced-rnas` keeps transcripts |
| 143 | +overlapping 1-exon reference transcripts: |
| 144 | + |
| 145 | +```bash |
| 146 | +exacto remove-unspliced-rnas \ |
| 147 | + --bam-file tumor_rna_assembly.sorted.bam \ |
| 148 | + --bam-bai-file tumor_rna_assembly.sorted.bam.bai \ |
| 149 | + --fasta-file reference.fasta \ |
| 150 | + --reference-gene-annotation-file gencode.gtf.gz \ |
| 151 | + --reference-gene-annotation-source gencode \ |
| 152 | + --reference-gene-annotation-assembly hg38 \ |
| 153 | + --reference-gene-annotation-version v44 \ |
| 154 | + --output-bam-file tumor_rna_assembly.sorted.filtered.bam \ |
| 155 | + --output-bam-bai-file tumor_rna_assembly.sorted.filtered.bam.bai \ |
| 156 | + --output-fasta-file tumor_rna_assembly.sorted.filtered.fasta |
| 157 | +``` |
| 158 | + |
| 159 | +## Step 6. Identify tumor RNA variants |
| 160 | + |
| 161 | +```bash |
| 162 | +exacto call-rna-vars \ |
| 163 | + --bam-file tumor_rna_assembly.sorted.filtered.bam \ |
| 164 | + --bam-bai-file tumor_rna_assembly.sorted.filtered.bam.bai \ |
| 165 | + --reference-genome-fasta-file hg38.fasta \ |
| 166 | + --reference-gene-annotation-file gencode.gtf.gz \ |
| 167 | + --reference-gene-annotation-source gencode \ |
| 168 | + --reference-gene-annotation-assembly hg38 \ |
| 169 | + --reference-gene-annotation-version v45 \ |
| 170 | + --output-dir rna_variants_outputs/ \ |
| 171 | + --output-prefix tumor |
| 172 | +``` |
| 173 | + |
| 174 | +## Step 7. Integrate DNA and RNA variants |
| 175 | + |
| 176 | +```bash |
| 177 | +exacto integrate-vars \ |
| 178 | + --annotated-dna-vars-tsv-file tumor_specific_dna_variants.annotated.tsv \ |
| 179 | + --rna-vars-tsv-file rna_variants_outputs/tumor_exacto_rna_variant_calls.tsv \ |
| 180 | + --reference-gene-annotation-file gencode.gtf.gz \ |
| 181 | + --reference-gene-annotation-source gencode \ |
| 182 | + --reference-gene-annotation-assembly hg38 \ |
| 183 | + --reference-gene-annotation-version v44 \ |
| 184 | + --output-tsv-file tumor_dna_rna_variants_integrated.tsv |
| 185 | +``` |
| 186 | + |
| 187 | +## Step 8. Translate transcripts to primary structures |
| 188 | + |
| 189 | +```bash |
| 190 | +exacto translate-structs \ |
| 191 | + --transcript-structures-tsv-file rna_variants_outputs/tumor_exacto_transcript_structures.tsv \ |
| 192 | + --rna-variant-calls-tsv-file rna_variants_outputs/tumor_exacto_rna_variant_calls.tsv \ |
| 193 | + --integrated-variants-tsv-file tumor_dna_rna_variants_integrated.tsv \ |
| 194 | + --strategy longest_orf \ |
| 195 | + --output-tsv-file tumor_primary_structures.tsv \ |
| 196 | + --output-fasta-file tumor_primary_structures.fasta |
| 197 | +``` |
| 198 | + |
| 199 | +## Step 9. Identify peptide variants |
| 200 | + |
| 201 | +```bash |
| 202 | +exacto call-peptide-vars \ |
| 203 | + --primary-structures-tsv-file tumor_primary_structures.tsv \ |
| 204 | + --reference-fasta-file reference_proteome.fasta \ |
| 205 | + --output-tsv-file tumor_peptide_variants.tsv \ |
| 206 | + --output-fasta-file tumor_peptide_variants.fasta |
| 207 | +``` |
| 208 | + |
| 209 | +## Outputs |
| 210 | + |
| 211 | +| File | Produced by | Description | |
| 212 | +|:-----|:------------|:------------| |
| 213 | +| `tumor_specific_dna_variants.tsv` | `call-somatic-dna-vars` | Somatic DNA variants | |
| 214 | +| `tumor_specific_dna_variants.annotated.tsv` | `annotate-vars` | Annotated DNA variants | |
| 215 | +| `tumor_exacto_rna_variant_calls.tsv` | `call-rna-vars` | RNA variants | |
| 216 | +| `tumor_exacto_transcript_structures.tsv` | `call-rna-vars` | Per-transcript structural records | |
| 217 | +| `tumor_dna_rna_variants_integrated.tsv` | `integrate-vars` | DNA + RNA variants merged | |
| 218 | +| `tumor_primary_structures.fasta` | `translate-structs` | Mutant proteoform sequences (FASTA) | |
| 219 | +| `tumor_peptide_variants.tsv` | `call-peptide-vars` | Mutant peptide variants | |
| 220 | + |
| 221 | +: {.striped .hover} |
0 commit comments