zavolanlab · magmir71 · Apr 28, 2026 · May 5, 2026 · May 6, 2026 · May 20, 2026
diff --git a/README.md b/README.md
@@ -13,12 +13,15 @@ The pipeline takes advantage of several functions implemented in [zavolab_pyutil
 2. **Reverse-complementing reads from backward strand**: sequencing can be done in both directions in the cDNA protocol (optional).
 3. **Alignment & Merging (`minimap2`)**: Maps reads to the reference genome using minimap2 and merges BAMs (e.g. technical replicates from samples).
 4. **UMI deduplication (`umi_tools`, `custom python`)**: normalizing UMI lengths -> UMI- and alignment-based deduplication -> correction of NH tag and MAPQ values.
-5. **De novo transcriptome assembly (`custom python`)**: Enriches input transcriptome annotations based on observed read alignments
-6. **Isoform Analysis (`custom python`)**: Assigns alignments to specific transcript isoforms.
-7. **QC: annotating raw current data from a subsample of reads**: 
-7.1 **Signal Extraction (`pod5`)**: Subsamples reads of interest and extracts corresponding raw signal chunks from `.pod5` files.
-7.2 **Move-table Emission (`dorado`)**: Re-processes subsetted reads to emit basecaller move-tables. 
-7.3 **Dataframe Generation & QC Visualization (`seaborn`)**: Synchronizes sequence strings with raw signal variations and renders PDF plots for individual reads.
+5. **Assignment of reads to genes (`featureCounts`)**: uses featureCounts to produce .bam files where every read is assigned to gene ids from input .gtf file (`XT` tag).
+6. **Identification of CPA cleavage sites (`SCINPAS`)**: uses [SCINPAS](https://github.com/zavolanlab/SCINPAS/tree/making_CLI_interface_for_python_functions) functionalities to identify cleavage-and-polyadenylation sites in reads, and store this information in reads (see [nextflow.config](nextflow.config) for details)
+7. **Generation of `.bigwig` files** storing cleavage site support and mean (or median) polyA tail lengths for every cleavage site
+8. **De novo transcriptome assembly (`custom python`)**: Enriches input transcriptome annotations based on observed read alignments
+9. **Isoform Analysis (`custom python`)**: Assigns alignments to specific transcript isoforms.
+10. **QC: annotating raw current data from a subsample of reads**: 
+10.1 **Signal Extraction (`pod5`)**: Subsamples reads of interest and extracts corresponding raw signal chunks from `.pod5` files.
+10.2 **Move-table Emission (`dorado`)**: Re-processes subsetted reads to emit basecaller move-tables. 
+10.3 **Dataframe Generation & QC Visualization (`seaborn`)**: Synchronizes sequence strings with raw signal variations and renders PDF plots for individual reads.
 
 ## Quick Start
 

diff --git a/bin/annotate_motifs_in_bam.py b/bin/annotate_motifs_in_bam.py
@@ -0,0 +1,110 @@
+#!/usr/bin/env python3
+import argparse
+import pysam
+import re
+
+def get_rc(seq):
+    return seq.translate(str.maketrans("ACGTUacgtuNn", "TGCAAtgcaaNn"))[::-1]
+
+def get_args():
+    parser = argparse.ArgumentParser(description="Annotate BAM with polyA motif presence.")
+    parser.add_argument('--bam_in', required=True)
+    parser.add_argument('--bam_out', required=True)
+    parser.add_argument('--fasta', required=True)
+    parser.add_argument('--motifs', nargs='+', required=True)
+    parser.add_argument('--window_up', type=int, default=40)
+    parser.add_argument('--window_down', type=int, default=20)
+    parser.add_argument('--tag_cs', default="XF")
+    parser.add_argument('--tag_motif', default="YF")
+    return parser.parse_args()
+
+def main():
+    args = get_args()
+    fasta = pysam.FastaFile(args.fasta)
+    bam_in = pysam.AlignmentFile(args.bam_in, "rb")
+    bam_out = pysam.AlignmentFile(args.bam_out, "wb", header=bam_in.header)
+
+    motif_regexes = {m: re.compile(f'(?=({m.upper().replace("U", "T")}))') for m in args.motifs}
+
+    # CACHE: Key: (chrom, strand, cs_pos) -> Value: "AAUAAA:(-21,-10);AUUAAA:(-35)"
+    cs_cache = {}
+
+    for read in bam_in:
+        if read.is_unmapped:
+            bam_out.write(read)
+            continue
+
+        try:
+            # cleavage site tag is expected to be 1-based
+            cs_pos = int(read.get_tag(args.tag_cs))
+        except KeyError:
+            bam_out.write(read)
+            continue
+
+        chrom = read.reference_name
+        strand = '-' if read.is_reverse else '+'
+        cache_key = (chrom, strand, cs_pos)
+
+        if cache_key in cs_cache:
+            motif_str = cs_cache[cache_key]
+        else:
+            try:
+                chrom_len = fasta.get_reference_length(chrom)
+            except KeyError:
+                cs_cache[cache_key] = ""
+                bam_out.write(read)
+                continue
+
+            # Convert to 0-based for precise FASTA fetching
+            cs_0based = cs_pos - 1
+
+            # Fetch strand-aware FASTA coordinates
+            if strand == '+':
+                fetch_start = cs_0based - args.window_up
+                fetch_end = cs_0based + args.window_down + 1
+            else:
+                fetch_start = cs_0based - args.window_down
+                fetch_end = cs_0based + args.window_up + 1
+
+            pad_left = max(0, -fetch_start)
+            pad_right = max(0, fetch_end - chrom_len)
+            safe_start = max(0, fetch_start)
+            safe_end = min(chrom_len, fetch_end)
+
+            try:
+                seq = fasta.fetch(chrom, safe_start, safe_end).upper()
+                seq = ("N" * pad_left) + seq + ("N" * pad_right)
+                if strand == '-':
+                    seq = get_rc(seq)
+
+                # Find Motifs
+                motif_dict = {}
+                for m, regex in motif_regexes.items():
+                    matches = [match.start() for match in regex.finditer(seq)]
+                    if matches:
+                        # Because of symmetric padding, the CS is always exactly at index args.window_up
+                        rel_positions = [str(idx - args.window_up) for idx in matches]
+                        motif_dict[m] = f"({','.join(rel_positions)})"
+
+                if motif_dict:
+                    # Format: AAUAAA:(-38,-10);AUUAAA:(-35)
+                    motif_str = ";".join([f"{k}:{v}" for k, v in motif_dict.items()])
+                else:
+                    motif_str = ""
+
+            except (KeyError, ValueError):
+                motif_str = ""
+
+            cs_cache[cache_key] = motif_str
+
+        # Apply tag and write
+        if motif_str:
+            read.set_tag(args.tag_motif, motif_str, value_type='Z')
+
+        bam_out.write(read)
+
+    bam_in.close()
+    bam_out.close()
+
+if __name__ == "__main__":
+    main()