Skip to content

Commit fcc099a

Browse files
authored
Merge pull request #39 from zavolanlab/aleksei/strand_tags_and_isoforms
implemented UMI-based deduplication, appending of polyA tails, accounting for strandedness
2 parents 2f8db1b + 7d4b6d5 commit fcc099a

8 files changed

Lines changed: 825 additions & 53 deletions

File tree

README.md

Lines changed: 11 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -6,15 +6,19 @@ It is tailored for GPU-accelerated basecalling, producing quality control plots,
66

77
The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Conda environments to automatically manage and isolate software dependencies.
88

9+
The pipeline takes advantage of several functions implemented in [zavolab_pyutils](https://github.com/zavolanlab/zavolab_pyutils/tree/dev) and [SCINPAS](https://github.com/zavolanlab/SCINPAS/tree/making_CLI_interface_for_python_functions) packages.
10+
911
## Pipeline Summary
1012
1. **GPU Basecalling (`dorado`)**: Performs basecalling and poly-A tail estimation natively on NVIDIA GPUs.
11-
2. **Alignment & Merging (`minimap2`)**: Maps reads to the reference genome using minimap2 and merges BAMs (e.g. technical replicates from samples).
12-
3. **De novo transcriptome assembly**: Enriches input transcriptome annotations based on observed read alignments
13-
4. **Isoform Analysis (`custom python`)**: Assigns alignments to specific transcript isoforms.
14-
5. **Annotating raw current data from a subsample of reads**:
15-
5.1 **Signal Extraction (`pod5`)**: Subsamples reads of interest and extracts corresponding raw signal chunks from `.pod5` files.
16-
5.2 **Move-table Emission (`dorado`)**: Re-processes subsetted reads to emit basecaller move-tables.
17-
5.3 **Dataframe Generation & QC Visualization (`seaborn`)**: Synchronizes sequence strings with raw signal variations and renders PDF plots for individual reads.
13+
2. **Reverse-complementing reads from backward strand**: sequencing can be done in both directions in the cDNA protocol (optional).
14+
3. **Alignment & Merging (`minimap2`)**: Maps reads to the reference genome using minimap2 and merges BAMs (e.g. technical replicates from samples).
15+
4. **UMI deduplication (`umi_tools`, `custom python`)**: normalizing UMI lengths -> UMI- and alignment-based deduplication -> correction of NH tag and MAPQ values.
16+
5. **De novo transcriptome assembly (`custom python`)**: Enriches input transcriptome annotations based on observed read alignments
17+
6. **Isoform Analysis (`custom python`)**: Assigns alignments to specific transcript isoforms.
18+
7. **QC: annotating raw current data from a subsample of reads**:
19+
7.1 **Signal Extraction (`pod5`)**: Subsamples reads of interest and extracts corresponding raw signal chunks from `.pod5` files.
20+
7.2 **Move-table Emission (`dorado`)**: Re-processes subsetted reads to emit basecaller move-tables.
21+
7.3 **Dataframe Generation & QC Visualization (`seaborn`)**: Synchronizes sequence strings with raw signal variations and renders PDF plots for individual reads.
1822

1923
## Quick Start
2024

bin/append_polyA_tail.py

Lines changed: 157 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,157 @@
1+
#!/usr/bin/env python3
2+
3+
import warnings
4+
warnings.simplefilter('ignore')
5+
6+
import sys
7+
from argparse import ArgumentParser, RawTextHelpFormatter
8+
import os
9+
import numpy as np
10+
import pysam
11+
import csv
12+
13+
def get_rc(seq):
14+
"""Return the reverse complement of a sequence."""
15+
trans = str.maketrans("ACGTUacgtuNn", "TGCAAtgcaaNn")
16+
return seq.translate(trans)[::-1]
17+
18+
def main():
19+
parser = ArgumentParser(description="Append polyA tails based on pt and fixed cleavage site tags.")
20+
parser.add_argument("--input_bam", required=True)
21+
parser.add_argument("--output_appended_bam", required=True)
22+
parser.add_argument("--output_skipped_bam", required=True)
23+
parser.add_argument("--stats_tsv", required=True)
24+
parser.add_argument("--sample_id", required=True)
25+
parser.add_argument("--tag_orig_cs", default="XO")
26+
parser.add_argument("--tag_fixed_cs", default="XF")
27+
28+
args = parser.parse_args()
29+
30+
almnt_file = pysam.AlignmentFile(args.input_bam, "rb")
31+
bam_appended = pysam.AlignmentFile(args.output_appended_bam, "wb", header=almnt_file.header)
32+
bam_skipped = pysam.AlignmentFile(args.output_skipped_bam, "wb", header=almnt_file.header)
33+
34+
reads_total = 0
35+
reads_appended = 0
36+
reads_skipped_no_pt = 0
37+
reads_skipped_pt_outside_valid_range = 0
38+
reads_skipped_error = 0
39+
40+
for almnt in almnt_file:
41+
reads_total += 1
42+
43+
# 1. Check for valid pt tag
44+
try:
45+
pt = almnt.get_tag('pt')
46+
except KeyError:
47+
pt = -1
48+
reads_skipped_no_pt += 1
49+
bam_skipped.write(almnt)
50+
continue
51+
52+
if pt <= 0:
53+
bam_skipped.write(almnt)
54+
reads_skipped_pt_outside_valid_range += 1
55+
continue
56+
57+
# 2. Get Cleavage Site Shift Difference
58+
try:
59+
OCS = almnt.get_tag(args.tag_orig_cs)
60+
FCS = almnt.get_tag(args.tag_fixed_cs)
61+
# The absolute difference is exactly how many bases were "rescued" from the softclip
62+
difference = abs(OCS - FCS)
63+
except KeyError:
64+
bam_skipped.write(almnt)
65+
reads_skipped_error += 1
66+
continue
67+
68+
# 3. Orient to transcript 5' -> 3'
69+
if almnt.is_forward:
70+
read_seq = almnt.query_sequence
71+
read_qualstr = almnt.query_qualities
72+
cigar = list(almnt.cigar)
73+
else:
74+
read_seq = almnt.get_forward_sequence()
75+
read_qualstr = almnt.query_qualities[::-1] if almnt.query_qualities else None
76+
cigar = list(almnt.cigar)[::-1]
77+
78+
# 4. Calculate exact truncation
79+
old_clip_len = cigar[-1][1] if cigar[-1][0] == 4 else 0
80+
trim_len = old_clip_len - difference
81+
82+
if trim_len < 0:
83+
bam_skipped.write(almnt)
84+
reads_skipped_error += 1
85+
continue
86+
87+
# 5. Modify Sequence
88+
new_read_seq = read_seq[:(-trim_len if trim_len > 0 else None)] + "A" * pt
89+
90+
# 6. Modify Quality String
91+
if read_qualstr is not None:
92+
if trim_len > 0:
93+
trimmed_quals = read_qualstr[-trim_len:]
94+
else:
95+
trimmed_quals = read_qualstr[-(min(5, len(read_seq))):] if len(read_seq) > 0 else [30]
96+
97+
quality_val = int(np.round(np.mean(trimmed_quals), 0)) if len(trimmed_quals) > 0 else 30
98+
99+
new_read_qualstr = list(read_qualstr[:(-trim_len if trim_len > 0 else None)])
100+
new_read_qualstr.extend([quality_val] * pt)
101+
else:
102+
new_read_qualstr = None
103+
104+
# 7. Modify CIGAR
105+
new_cigar = [[op, length] for op, length in cigar]
106+
107+
if new_cigar[-1][0] == 4:
108+
new_cigar.pop() # Remove old 3' soft-clip
109+
110+
if difference > 0:
111+
for idx in range(len(new_cigar)-1, -1, -1):
112+
if new_cigar[idx][0] == 0: # Add rescued bases to the last Match (M) block
113+
new_cigar[idx][1] += difference
114+
break
115+
116+
if pt > 0:
117+
new_cigar.append([4, pt]) # Append new soft-clip of length pt
118+
119+
# Validation Check
120+
cigar_query_len = sum(length for op, length in new_cigar if op in [0, 1, 4, 7, 8])
121+
if len(new_read_seq) != cigar_query_len or (new_read_qualstr and len(new_read_seq) != len(new_read_qualstr)):
122+
bam_skipped.write(almnt)
123+
reads_skipped_error += 1
124+
continue
125+
126+
# 8. Convert back to original BAM orientation
127+
if almnt.is_forward:
128+
almnt.query_sequence = new_read_seq
129+
almnt.query_qualities = new_read_qualstr
130+
almnt.cigar = new_cigar
131+
else:
132+
almnt.query_sequence = get_rc(new_read_seq)
133+
if new_read_qualstr is not None:
134+
almnt.query_qualities = new_read_qualstr[::-1]
135+
almnt.cigar = new_cigar[::-1]
136+
137+
almnt.set_tag('pa', 1, 'i')
138+
bam_appended.write(almnt)
139+
reads_appended += 1
140+
141+
almnt_file.close()
142+
bam_appended.close()
143+
bam_skipped.close()
144+
145+
# 9. Write TSV Stats
146+
with open(args.stats_tsv, 'w', newline='') as tsv_file:
147+
writer = csv.writer(tsv_file, delimiter='\t')
148+
writer.writerow(["sample_id", "chunk_filename", "total_alignments", "appended_alignments", "skipped_no_pt", "skipped_error", "skipped_pt_outside_valid_range"])
149+
chunk_name = os.path.basename(args.input_bam)
150+
writer.writerow([args.sample_id, chunk_name, reads_total, reads_appended, reads_skipped_no_pt, reads_skipped_error, reads_skipped_pt_outside_valid_range])
151+
152+
if __name__ == '__main__':
153+
try:
154+
main()
155+
except KeyboardInterrupt:
156+
sys.stderr.write("User interrupt!")
157+
sys.exit(1)

bin/extract_cs_bigwig.py

Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,37 @@
1+
#!/usr/bin/env python3
2+
3+
import pysam
4+
import argparse
5+
import sys
6+
7+
def main():
8+
parser = argparse.ArgumentParser(description="Extract 1-based cleavage sites from a BAM custom tag and convert to 0-based BED format.")
9+
parser.add_argument("--bam_in", required=True, help="Input BAM file with uniquely mapped reads")
10+
parser.add_argument("--bed_out", required=True, help="Output BED file")
11+
parser.add_argument("--tag", required=True, help="The SAM tag storing the 1-based cleavage site (e.g., XF or XO)")
12+
parser.add_argument("--MAPQ_min", type=int, required=False, default=255, help="minimal MAPQ for an alignment to be included")
13+
args = parser.parse_args()
14+
15+
with pysam.AlignmentFile(args.bam_in, "rb") as bam, open(args.bed_out, "w") as f:
16+
for r in bam:
17+
# MAPQ 255 represents unique alignments in your pipeline
18+
if r.mapping_quality >= args.MAPQ_min and not r.is_unmapped:
19+
try:
20+
cs_1based = r.get_tag(args.tag)
21+
# Convert 1-based tag coordinate to 0-based BED interval (length = 1)
22+
start = cs_1based - 1
23+
end = cs_1based
24+
strand = "-" if r.is_reverse else "+"
25+
26+
# Write in BED6 format: chrom, start, end, name, score, strand
27+
f.write(f"{r.reference_name}\t{start}\t{end}\t{r.query_name}\t0\t{strand}\n")
28+
except KeyError:
29+
# Skip reads that are missing the designated cleavage site tag
30+
pass
31+
32+
if __name__ == "__main__":
33+
try:
34+
main()
35+
except KeyboardInterrupt:
36+
sys.stderr.write("User interrupt!\n")
37+
sys.exit(1)

bin/orient_reads.py

Lines changed: 74 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,74 @@
1+
#!/usr/bin/env python3
2+
3+
import pysam
4+
import argparse
5+
import csv
6+
import sys
7+
import os
8+
9+
def get_rc(seq):
10+
"""Return the reverse complement of a sequence."""
11+
trans = str.maketrans("ACGTUacgtuNn", "TGCAAtgcaaNn")
12+
return seq.translate(trans)[::-1]
13+
14+
def main():
15+
parser = argparse.ArgumentParser(description="Orient unmapped BAM reads to forward strand based on basecaller tag.")
16+
parser.add_argument("--input_bam", required=True, help="Input unmapped BAM file")
17+
parser.add_argument("--output_bam", required=True, help="Output oriented BAM file")
18+
parser.add_argument("--stats_tsv", required=True, help="Output TSV with tag statistics")
19+
parser.add_argument("--sample_id", required=True, help="Sample ID for stats tracking")
20+
21+
# NEW ARGUMENTS FOR DYNAMIC TAGGING
22+
parser.add_argument("--tag_basecaller_ts", required=True, help="Tag storing the basecaller strand (e.g., TS)")
23+
parser.add_argument("--tag_original_ts", required=True, help="Custom tag to store the original basecaller strand (e.g., ZS)")
24+
args = parser.parse_args()
25+
26+
reads_with_ts = 0
27+
reads_without_ts = 0
28+
reads_reverse_complemented = 0
29+
30+
# check_sq=False is critical because unmapped BAMs lack reference contig headers
31+
with pysam.AlignmentFile(args.input_bam, "rb", check_sq=False) as bam_in, \
32+
pysam.AlignmentFile(args.output_bam, "wb", template=bam_in) as bam_out:
33+
34+
for read in bam_in:
35+
try:
36+
# 1. Fetch the original basecaller tag
37+
ts = read.get_tag(args.tag_basecaller_ts)
38+
reads_with_ts += 1
39+
40+
# 2. Store it as a custom tag for traceability
41+
read.set_tag(args.tag_original_ts, ts, value_type="A")
42+
43+
# 3. Safely delete the original tag using pysam's in-place deletion
44+
# This bypasses the set_tags() bug with 'B' type binary arrays!
45+
read.set_tag(args.tag_basecaller_ts, None)
46+
47+
# 4. Flip the sequence and quality scores if Dorado marked it as negative
48+
if ts == "-":
49+
# Reverse complement sequence
50+
orig_seq = read.query_sequence
51+
read.query_sequence = get_rc(orig_seq)
52+
53+
# Reverse quality scores
54+
orig_qual = read.query_qualities
55+
if orig_qual is not None:
56+
read.query_qualities = orig_qual[::-1]
57+
58+
reads_reverse_complemented += 1
59+
60+
except KeyError:
61+
reads_without_ts += 1
62+
63+
bam_out.write(read)
64+
65+
# Write the chunk statistics to a TSV
66+
with open(args.stats_tsv, 'w', newline='') as tsv_file:
67+
writer = csv.writer(tsv_file, delimiter='\t')
68+
# Dynamically set header based on the provided tag name
69+
writer.writerow(["sample_id", "chunk_filename", f"reads_with_{args.tag_basecaller_ts}", f"reads_without_{args.tag_basecaller_ts}", "reads_reverse_complemented"])
70+
chunk_name = os.path.basename(args.input_bam)
71+
writer.writerow([args.sample_id, chunk_name, reads_with_ts, reads_without_ts, reads_reverse_complemented])
72+
73+
if __name__ == "__main__":
74+
main()
Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,18 @@
1+
name: nf-bam_processing_with_python
2+
channels:
3+
- conda-forge
4+
- bioconda
5+
dependencies:
6+
- python=3.10
7+
- bioconda::bedtools # non-python dependency for zavolab_pyutils
8+
- bioconda::samtools # non-python dependency, always needed for BAM processing
9+
- conda-forge::pigz
10+
- bioconda::ucsc-wigtobigwig
11+
- bioconda::ucsc-bedgraphtobigwig
12+
- pip
13+
- pip:
14+
# This tells pip to install directly from the branches of the GitHub repos
15+
- git+https://github.com/zavolanlab/zavolab_pyutils.git@dev
16+
- git+https://github.com/zavolanlab/SCINPAS.git@making_CLI_interface_for_python_functions
17+
18+
# Cache Bust: Added April 21 14:39 to force pip pulling the latest version of zavolab_pyutils

envs/umi_tools.yml

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
name: nf-umi_tools
2+
channels:
3+
- conda-forge
4+
- bioconda
5+
dependencies:
6+
- python=3.10
7+
- bioconda::umi_tools
8+
- bioconda::samtools

0 commit comments

Comments
 (0)