A Snakemake pipeline to process ONT aa-tRNA-seq data.
This pipeline uses Pixi for environment management. Install it first:
curl -fsSL https://pixi.sh/install.sh | shSee the Pixi installation guide for alternative methods (Homebrew, Windows, etc.).
The pipeline can be configured by editing the config/config.yml file. The config file specifications will
run a small example dataset through the pipeline.
git clone https://github.com/rnabioco/aa-tRNA-seq-pipeline.git
cd aa-tRNA-seq-pipeline
# Install environment
pixi install
# One-time setup: download tools, models, and test data
pixi run setup
pixi run dl-test-data
# Dry run
pixi run dry-run
# Run pipeline locally with test data
pixi run testTo use on your own samples, create a config file and sample file in config/.
Create a TSV sample file with sample IDs and run paths:
sample1 /path/to/run1
sample2 /path/to/run2
Then create a config file pointing to it:
samples: config/samples.tsv
output_directory: "results"For pooled sequencing runs with WarpDemuX barcodes, use a YAML sample file:
runs:
- path: /path/to/pooled/run
barcode_kit: "WDX4_tRNA_rna004_v1_0"
samples:
charged_sample: "barcode03"
uncharged_sample: "barcode04"Enable demultiplexing in your config:
samples: config/samples-demux.yml
output_directory: "results"
warpdemux:
enabled: true
barcode_kit: "WDX4_tRNA_rna004_v1_0"Run with pixi:
pixi run snakemake --configfile=config/config-demux.yml --cores 8See README.md in the config directory for additional details on all configuration options.
flowchart TD
subgraph Input
POD5[POD5 files]
end
subgraph Demux [Optional Demultiplexing]
W[warpdemux<br/>barcode classification]
end
subgraph Processing
A[merge_pods] --> B[rebasecall<br/>Dorado + move tables]
B --> C[ubam_to_fastq]
C --> D[bwa_align<br/>tRNA + adapter reference]
end
subgraph Classification
D --> F[classify_charging<br/>Remora ML model]
B -.-> F
A -.-> F
F --> G[transfer_bam_tags]
end
subgraph Outputs
G --> H[charging_prob<br/>per-read ML scores]
G --> I[get_cca_trna_cpm<br/>CPM counts]
G --> J[bcerror<br/>basecalling errors]
G --> K[align_stats]
G --> L[modkit pileups]
L -.-> M[odds_ratios<br/>pairwise mod ORs]
H -.-> M
K -.-> N[qc_report<br/>Quarto HTML]
H -.-> N
end
POD5 -.-> W
W -.-> A
POD5 --> A
Given a directory of POD5 files, this pipeline:
- (Optional) Demultiplexes pooled runs using WarpDemuX barcode classification
- Merges all POD5 files per sample into a single file
- Rebasecalls with Dorado to generate unmapped BAM with move tables (required for Remora)
- Converts BAM to FASTQ and aligns to tRNA + adapter reference with BWA MEM
- Filters for full-length tRNA reads with proper adapter boundaries
- Classifies charged vs. uncharged reads using a Remora model trained on nanopore signal over the CCA 3' end
The classification generates ML tag values (0-255) indicating the likelihood of aminoacylation. By default, ML values of 200-255 are treated as charged, and values <200 as uncharged. This threshold can be adjusted via the ml-threshold parameter in the get_cca_trna_cpm rule.
The final steps of the pipeline calculate a number of outputs that may be useful for analysis and visualization, including normalized counts for charged and uncharged tRNA (get_cca_trna_cpm), basecalling error values (bcerror), alignment statistics (align_stats), information on raw nanopore signal from Remora (remora_signal_stats), per-tRNA pairwise modification odds ratios (compute_odds_ratios), reference sequence similarity QC (compute_reference_similarity), and a combined Quarto QC report (render_combined_qc_report).
A few notes about Remora classification for charged vs. uncharged tRNA reads
- this step retains only full length tRNA reads (with an allowance for signal loss at the 5´ end of nanopore direct RNA sequencing)
- Additionally, due to the iterative nature of sequencing method development, the present approach does not rely on differences in adapter sequences attached to charged vs. uncharged tRNA molecules (though these sequences are retained as separate entries in the alignment reference and downstream files). While we anticipate being able to leverage this information in the future, the current pipeline relies exclusively on signal data over a 6-nt modification kmer spanning the universal CCA 3′ end of tRNA and the first three nucleotides of the 3′ adapter (CCAGGC) to distinguish charged and uncharged reads.
The pipeline includes cluster profiles for LSF and SLURM schedulers.
# Run test data on LSF cluster
pixi run test-lsf
# Run test data on SLURM cluster
pixi run test-slurm
# Run full preprint analysis on cluster
pixi run run-preprintFor more details on configuring HPC jobs, see cluster/lsf/config.yaml or cluster/slurm/config.yaml.
To cite this work, see:
White LK, Radakovic A, Sajek MP, Dobson K, Riemondy KA, Del Pozo S, Szostak JW, Hesselberth JR. Nanopore sequencing of intact aminoacylated tRNAs. Nat Commun. 2025 Aug 20;16(1):7781. doi: 10.1038/s41467-025-62545-9. PMID: 40835813; PMCID: PMC12368100.