Skip to content

Commit 307c086

Browse files
authored
Merge pull request #14 from pirl-unc/dev
update docs
2 parents f4554c5 + 0089851 commit 307c086

4 files changed

Lines changed: 275 additions & 4 deletions

File tree

.gitignore

Lines changed: 0 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -14,9 +14,7 @@ target/
1414
docs/_site
1515
docs/.quarto/
1616
docs/cli/*
17-
docs/pipelines/*
1817
!docs/cli/index.qmd
19-
!docs/pipelines/index.qmd
2018
Cargo.lock
2119
.nojekyll
2220
*.nextflow.log*

docs/.gitignore

Lines changed: 0 additions & 2 deletions
This file was deleted.
Lines changed: 221 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,221 @@
1+
---
2+
title: "Mutant Proteoform Prediction"
3+
description: "Call somatic/germline DNA and RNA variants, integrate them, and translate transcripts to mutant proteoforms using long-read DNA + RNA sequencing data."
4+
---
5+
6+
This pipeline is Exacto's primary use case: identify germline and
7+
somatic DNA variants as well as RNA variants from a case sample
8+
(e.g. a tumor), integrate the DNA and RNA variants, and translate the mutant peptide sequences they encode.
9+
10+
External tools you'll need alongside exacto:
11+
12+
- [`Minimap2`](https://github.com/lh3/minimap2) — long-read DNA/RNA alignment
13+
- [`Samtools`](https://www.htslib.org/) — BAM manipulation and indexing
14+
- [`RNA-Bloom2`](https://github.com/bcgsc/RNA-Bloom) — long-read transcriptome assembly
15+
16+
## Workflow
17+
18+
```{mermaid}
19+
%%{init: {'securityLevel': 'loose', 'flowchart': {'rankSpacing': 20, 'nodeSpacing': 40, 'subGraphTitleMargin': {'top': 5, 'bottom': 20}}}}%%
20+
flowchart TD
21+
subgraph DNA["<span style='white-space:nowrap;color:#414a4c'>DNA variant calling</span>"]
22+
direction LR
23+
DNAIN[/"<span style='white-space:nowrap;color:#414a4c'>Tumor + normal BAMs</span>"/] --> CALLDNA(call-somatic-dna-vars)
24+
CALLDNA --> ANN(annotate-vars)
25+
ANN --> ANNOUT[/"<span style='white-space:nowrap'>Annotated DNA variants TSV</span>"/]
26+
end
27+
28+
subgraph RNA["<span style='white-space:nowrap;color:#414a4c'>RNA variant calling</span>"]
29+
direction LR
30+
RNAIN[/"<span style='white-space:nowrap;color:#414a4c'>Assembled transcriptome BAM</span>"/] --> RM(remove-unspliced-rnas)
31+
RM --> CALLRNA(call-rna-vars)
32+
CALLRNA --> RNAOUT[/"<span style='white-space:nowrap;color:#414a4c'>RNA variants TSV</span>"/]
33+
CALLRNA --> TS[/"<span style='white-space:nowrap;color:#414a4c'>Transcript structures TSV</span>"/]
34+
end
35+
36+
ANNOUT --> INT(integrate-vars)
37+
RNAOUT --> INT
38+
INT --> INTOUT[/"<span style='white-space:nowrap;color:#414a4c'>Integrated DNA + RNA variants TSV</span>"/]
39+
40+
TS --> TR(translate-structs)
41+
RNAOUT --> TR
42+
INTOUT --> TR
43+
TR --> PRIMTSV[/"<span style='white-space:nowrap;color:#414a4c'>Primary structures TSV</span>"/]
44+
TR --> PRIMFA[/"<span style='white-space:nowrap;color:#414a4c'>Primary structures FASTA</span>"/]
45+
46+
PRIMTSV --> CPV(call-peptide-vars)
47+
CPV --> PEPS[/"<span style='white-space:nowrap;color:#414a4c'>Peptide variants TSV</span>"/]
48+
49+
style DNA fill:#ffffff,stroke:#bbbbbb
50+
style RNA fill:#ffffff,stroke:#bbbbbb
51+
52+
click CALLDNA href "../cli/call-somatic-dna-vars.html" "View call-somatic-dna-vars docs" _self
53+
click ANN href "../cli/annotate-vars.html" "View annotate-vars docs" _self
54+
click RM href "../cli/remove-unspliced-rnas.html" "View remove-unspliced-rnas docs" _self
55+
click CALLRNA href "../cli/call-rna-vars.html" "View call-rna-vars docs" _self
56+
click INT href "../cli/integrate-vars.html" "View integrate-vars docs" _self
57+
click TR href "../cli/translate-structs.html" "View translate-structs docs" _self
58+
click CPV href "../cli/call-peptide-vars.html" "View call-peptide-vars docs" _self
59+
60+
classDef linked text-decoration:underline;
61+
class CALLDNA,ANN,RM,CALLRNA,INT,TR,CPV linked
62+
```
63+
64+
## Step 1. Align long reads
65+
66+
Align tumor and normal long-read DNA to the reference genome with Minimap2, then sort and index with samtools.
67+
Please make sure `--cs` is specified for Minimap2 as Exacto relies on the `CS` tag to identify variants:
68+
69+
```bash
70+
# Tumor
71+
minimap2 -ax map-hifi --cs --eqx -Y -L --secondary=no \
72+
reference.fasta tumor_dna.fastq.gz \
73+
| samtools sort -o tumor_dna.sorted.bam
74+
samtools index tumor_dna.sorted.bam
75+
76+
# Normal — repeat with normal_dna.fastq.gz → normal_dna.sorted.bam
77+
```
78+
79+
## Step 2. Identify somatic DNA variants
80+
81+
Identify case-specific (somatic) variants in tumor against matched normal:
82+
83+
```bash
84+
exacto call-somatic-dna-vars \
85+
--bam-file tumor_dna.sorted.bam \
86+
--bam-bai-file tumor_dna.sorted.bam.bai \
87+
--fasta-file reference.fasta \
88+
--control-bam-files normal_dna.sorted.bam \
89+
--control-bam-bai-files normal_dna.sorted.bam.bai \
90+
--output-tsv-file tumor_specific_dna_variants.tsv
91+
```
92+
93+
## Step 3. Annotate the somatic DNA variants
94+
95+
Add gene/isoform level contexts using a GENCODE GTF:
96+
97+
```bash
98+
exacto annotate-vars \
99+
--tsv-file tumor_specific_dna_variants.tsv \
100+
--reference-gene-annotation-file gencode.gtf.gz \
101+
--reference-gene-annotation-source gencode \
102+
--reference-gene-annotation-assembly hg38 \
103+
--reference-gene-annotation-version v45 \
104+
--output-tsv-file tumor_specific_dna_variants.annotated.tsv
105+
```
106+
107+
## Step 4. Assemble and align the tumor transcriptome
108+
109+
Assemble long-read RNA with [RNA-Bloom2](https://github.com/BirolLab/RNA-Bloom), then align the assembled
110+
contigs back to the reference genome with minimap2 and sort/index with samtools. Transcriptome assembly is necessary
111+
because polyA-capture long-read RNA-seq commonly yields 5'-truncated reads; the assembler stitches them into full-length
112+
transcripts.
113+
114+
Assemble tumor transcripts using RNA-bloom2:
115+
```bash
116+
java -jar RNA-Bloom.jar \
117+
-long tumor_rna.fastq.gz \
118+
--outdir rnabloom2_outputs/ \
119+
-chimera [-lrpb]
120+
```
121+
122+
Filter RNA-bloom2 transcripts using [Nexus](https://pirl-unc.github.io/nexus/utilities/filter_rnabloom2_transcripts.html):
123+
```bash
124+
nexus_filter_rnabloom2_transcripts \
125+
--assembly4-pol-fasta-file rnabloom2_outputs/rnabloom.longreads.assembly4.pol.fa \
126+
--assembly3-map-paf-file rnabloom2_outputs/rnabloom.longreads.assembly3.map.paf.gz \
127+
--output-reads-tsv-file rnabloom2_outputs/rnalboom_longreads_filtered_reads.tsv \
128+
--output-transcripts-tsv-file rnabloom2_outputs/rnalboom_longreads_filtered_transcripts.tsv \
129+
--output-fasta-file rnabloom2_outputs/rnalboom_longreads_filtered_transcripts.fasta
130+
```
131+
132+
Align the assembled tumor transcriptome. Please make sure `--cs` is specified for Minimap2 as Exacto relies on the `CS` tag to identify variants:
133+
```bash
134+
minimap2 -ax splice:hq -uf --cs --eqx -Y -L --secondary=no \
135+
reference.fasta rnabloom2_outputs/rnalboom_longreads_filtered_transcripts.fasta \
136+
| samtools sort -o tumor_rna_assembly.sorted.bam
137+
samtools index tumor_rna_assembly.sorted.bam
138+
```
139+
140+
## Step 5. Filter unspliced RNAs
141+
142+
Drop assembled transcripts that are likely unspliced RNAs. Note that `remove-unspliced-rnas` keeps transcripts
143+
overlapping 1-exon reference transcripts:
144+
145+
```bash
146+
exacto remove-unspliced-rnas \
147+
--bam-file tumor_rna_assembly.sorted.bam \
148+
--bam-bai-file tumor_rna_assembly.sorted.bam.bai \
149+
--fasta-file reference.fasta \
150+
--reference-gene-annotation-file gencode.gtf.gz \
151+
--reference-gene-annotation-source gencode \
152+
--reference-gene-annotation-assembly hg38 \
153+
--reference-gene-annotation-version v44 \
154+
--output-bam-file tumor_rna_assembly.sorted.filtered.bam \
155+
--output-bam-bai-file tumor_rna_assembly.sorted.filtered.bam.bai \
156+
--output-fasta-file tumor_rna_assembly.sorted.filtered.fasta
157+
```
158+
159+
## Step 6. Identify tumor RNA variants
160+
161+
```bash
162+
exacto call-rna-vars \
163+
--bam-file tumor_rna_assembly.sorted.filtered.bam \
164+
--bam-bai-file tumor_rna_assembly.sorted.filtered.bam.bai \
165+
--reference-genome-fasta-file hg38.fasta \
166+
--reference-gene-annotation-file gencode.gtf.gz \
167+
--reference-gene-annotation-source gencode \
168+
--reference-gene-annotation-assembly hg38 \
169+
--reference-gene-annotation-version v45 \
170+
--output-dir rna_variants_outputs/ \
171+
--output-prefix tumor
172+
```
173+
174+
## Step 7. Integrate DNA and RNA variants
175+
176+
```bash
177+
exacto integrate-vars \
178+
--annotated-dna-vars-tsv-file tumor_specific_dna_variants.annotated.tsv \
179+
--rna-vars-tsv-file rna_variants_outputs/tumor_exacto_rna_variant_calls.tsv \
180+
--reference-gene-annotation-file gencode.gtf.gz \
181+
--reference-gene-annotation-source gencode \
182+
--reference-gene-annotation-assembly hg38 \
183+
--reference-gene-annotation-version v44 \
184+
--output-tsv-file tumor_dna_rna_variants_integrated.tsv
185+
```
186+
187+
## Step 8. Translate transcripts to primary structures
188+
189+
```bash
190+
exacto translate-structs \
191+
--transcript-structures-tsv-file rna_variants_outputs/tumor_exacto_transcript_structures.tsv \
192+
--rna-variant-calls-tsv-file rna_variants_outputs/tumor_exacto_rna_variant_calls.tsv \
193+
--integrated-variants-tsv-file tumor_dna_rna_variants_integrated.tsv \
194+
--strategy longest_orf \
195+
--output-tsv-file tumor_primary_structures.tsv \
196+
--output-fasta-file tumor_primary_structures.fasta
197+
```
198+
199+
## Step 9. Identify peptide variants
200+
201+
```bash
202+
exacto call-peptide-vars \
203+
--primary-structures-tsv-file tumor_primary_structures.tsv \
204+
--reference-fasta-file reference_proteome.fasta \
205+
--output-tsv-file tumor_peptide_variants.tsv \
206+
--output-fasta-file tumor_peptide_variants.fasta
207+
```
208+
209+
## Outputs
210+
211+
| File | Produced by | Description |
212+
|:-----|:------------|:------------|
213+
| `tumor_specific_dna_variants.tsv` | `call-somatic-dna-vars` | Somatic DNA variants |
214+
| `tumor_specific_dna_variants.annotated.tsv` | `annotate-vars` | Annotated DNA variants |
215+
| `tumor_exacto_rna_variant_calls.tsv` | `call-rna-vars` | RNA variants |
216+
| `tumor_exacto_transcript_structures.tsv` | `call-rna-vars` | Per-transcript structural records |
217+
| `tumor_dna_rna_variants_integrated.tsv` | `integrate-vars` | DNA + RNA variants merged |
218+
| `tumor_primary_structures.fasta` | `translate-structs` | Mutant proteoform sequences (FASTA) |
219+
| `tumor_peptide_variants.tsv` | `call-peptide-vars` | Mutant peptide variants |
220+
221+
: {.striped .hover}
Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,54 @@
1+
---
2+
title: "Variation Graph Construction"
3+
description: "Build individualized genome and transcriptome variation graphs from Exacto variant calls."
4+
---
5+
6+
This pipeline produces individualized genome and transcriptome variation graphs that encode an individual's variants
7+
alongside the reference sequence. These graphs are useful for downstream variant-aware analyses that benefit from a
8+
graph reference instead of a linear one.
9+
10+
## Workflow
11+
12+
```{mermaid}
13+
%%{init: {'securityLevel': 'loose'}}%%
14+
flowchart LR
15+
INTV[/"<span style='white-space:nowrap;color:#414a4c'>DNA variants TSV</span>"/] --> BGV(build‑genome‑var‑graph)
16+
REFG[("<span style='white-space:nowrap;color:#414a4c'>Reference genome FASTA</span>")] --> BGV
17+
BGV --> GG[/"<span style='white-space:nowrap;color:#414a4c'>Genome variation graph</span>"/]
18+
19+
PRIMS[/"<span style='white-space:nowrap;color:#414a4c'>Transcript structures TSV</span>"/] --> BTV(build‑transcriptome‑var‑graph)
20+
REFT[("<span style='white-space:nowrap;color:#414a4c'>Reference transcriptome FASTA</span>")] --> BTV
21+
BTV --> TG[/"<span style='white-space:nowrap;color:#414a4c'>Transcriptome variation graph</span>"/]
22+
23+
click BGV href "../cli/build-genome-var-graph.html" "View build-genome-var-graph docs" _self
24+
click BTV href "../cli/build-transcriptome-var-graph.html" "View build-transcriptome-var-graph docs" _self
25+
26+
classDef linked text-decoration:underline;
27+
class BGV,BTV linked
28+
```
29+
30+
DNA variants TSV can be obtained by running [`call-germline-dna-vars`](../cli/call-germline-dna-vars.qmd) or [`call-somatic-dna-vars`](../cli/call-somatic-dna-vars.qmd).
31+
32+
Transcript structures TSV can be obtained by running [`call-rna-vars`](../cli/call-rna-vars.qmd).
33+
34+
## Genome variation graph
35+
36+
Add germline and/or somatic DNA variants to a reference genome to produce a individualized genome graph:
37+
38+
```bash
39+
exacto build-genome-var-graph \
40+
--variants-tsv-file dna_variants.tsv \
41+
--fasta-file reference_genome.fasta \
42+
--output-dir genome_var_graph_outputs/
43+
```
44+
45+
## Transcriptome variation graph
46+
47+
Add RNA variants to a reference transcriptome to produce a individualized transcriptome graph:
48+
49+
```bash
50+
exacto build-transcriptome-var-graph \
51+
--transcript-structures-tsv-file transcript_structures.tsv \
52+
--fasta-file reference_transcriptome.fasta \
53+
--output-fasta-file transcriptome_var_graph.fasta
54+
```

0 commit comments

Comments
 (0)