Hi Salmon developers,
We are currently using Salmon to perform transcript quantification on long-read RNA-seq data (Oxford Nanopore). During our analysis, we observed that a substantial proportion of reads aligned to certain transcripts are relatively short and have low alignment identity and/or low coverage against the reference transcriptome/genome.
From manual inspection, many of these reads do not look reliable enough to count as genuine support for the transcripts they map to. However, Salmon still reports relatively high TPM/CPM values for some of these transcripts.
We would therefore like to better understand how Salmon converts mapped reads into transcript abundance estimates in long-read datasets.
Specifically, we are wondering:
Does Salmon apply any minimum threshold on read identity, alignment score, mapping quality, or aligned fraction/coverage before a read is included in quantification?
If no explicit threshold is applied, how are very low-similarity or partially aligned reads handled internally during abundance estimation?
In alignment-based mode, does Salmon rely entirely on the aligner's own filtering decisions, or does it additionally weight or filter alignments based on alignment quality metrics?
Could a large number of low-identity reads artificially inflate TPM/CPM estimates, especially in noisy long-read datasets such as ONT direct RNA sequencing?
Are there recommended preprocessing or alignment filtering strategies before running Salmon on long-read data to avoid potential over-quantification from noisy reads?
For context, we are particularly concerned because many reads mapping to our target genes show relatively low genome/transcript identity upon inspection, yet the final quantified abundance remains unexpectedly high.
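To make "identity" and "aligned fraction" concrete, here is a minimal sketch of one common way these two metrics can be derived for a single alignment from its CIGAR string and NM tag (edit distance). The function name `alignment_metrics` and the example values are hypothetical and purely illustrative. This is not a description of how Salmon or any particular aligner defines or uses these quantities.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def alignment_metrics(cigar, nm):
    """Return (identity, aligned_fraction) for one alignment.

    cigar: CIGAR string of the alignment.
    nm:    value of the NM tag (mismatches + inserted + deleted bases).
    """
    counts = {}
    for length, op in CIGAR_RE.findall(cigar):
        counts[op] = counts.get(op, 0) + int(length)

    # Alignment columns: matches/mismatches plus inserted and deleted bases.
    aligned_cols = sum(counts.get(op, 0) for op in "M=XID")
    # Read bases that take part in the alignment (deletions consume no read bases).
    read_aligned = sum(counts.get(op, 0) for op in "M=XI")
    # Full read length includes soft- and hard-clipped bases.
    read_len = read_aligned + counts.get("S", 0) + counts.get("H", 0)

    # Gap-inclusive (BLAST-like) identity: columns without an edit / all columns.
    identity = (aligned_cols - nm) / aligned_cols if aligned_cols else 0.0
    aligned_fraction = read_aligned / read_len if read_len else 0.0
    return identity, aligned_fraction


# Hypothetical read: 10 bases clipped at each end, 18 edits in the aligned part.
identity, aligned_fraction = alignment_metrics("10S80M5I5D10S", 18)
print(f"identity={identity:.3f} aligned_fraction={aligned_fraction:.3f}")
```

Other definitions of identity exist (for example, excluding gap columns), so any threshold would need to be stated together with the definition it applies to.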
We would greatly appreciate any clarification regarding the internal quantification logic or best practices for handling noisy long-read data with Salmon.
Thank you very much for your help.