Clarification on how low-identity/short reads contribute to transcript quantification in Salmon (long-read data) #1009

@XikunDu615

Hi Salmon developers,

We are currently using Salmon to perform transcript quantification on long-read RNA-seq data (Oxford Nanopore). During our analysis, we observed that a substantial proportion of reads aligned to certain transcripts are relatively short and have low alignment identity and/or low coverage against the reference transcriptome/genome.

From manual inspection, many of these reads do not appear sufficiently reliable to be confidently considered as true transcript-supporting reads. However, Salmon still reports relatively high TPM/CPM values for some of these transcripts.

We would therefore like to better understand how Salmon converts mapped reads into transcript abundance estimates in long-read datasets.

Specifically, we are wondering:

1. Does Salmon apply any minimum threshold on any of the following before a read is included in quantification?
   - read identity
   - alignment score
   - mapping quality
   - aligned fraction/coverage
2. If no explicit threshold is applied, how are very low-similarity or partially aligned reads handled internally during abundance estimation?
3. In alignment-based mode, does Salmon rely entirely on the aligner's filtering decisions, or does it additionally weight or filter alignments based on alignment quality metrics?
4. Could a large number of low-identity reads artificially inflate TPM/CPM estimates, especially in noisy long-read datasets such as ONT direct RNA sequencing?
5. Are there recommended preprocessing or alignment-filtering strategies to apply before running Salmon on long-read data, to avoid over-quantification from noisy reads?

For context, we are particularly concerned because many reads mapping to our target genes show relatively low genome/transcript identity upon inspection, yet the final quantified abundance remains unexpectedly high.
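To make "identity" and "aligned fraction" concrete, here is a minimal sketch of how these two metrics could be computed per alignment from a SAM record's CIGAR string and `NM` tag. The definitions (gap-inclusive identity, aligned fraction relative to the full read length including clipped bases), the function name `alignment_metrics`, and the thresholds in the usage lines are our own assumptions for illustration; they are not Salmon's internal logic.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def alignment_metrics(cigar, nm):
    """Return (identity, aligned_fraction) for one alignment.

    identity         = 1 - NM / (match/mismatch + insertion + deletion columns)
    aligned_fraction = aligned query bases / full read length (clips included)
    Both definitions are assumptions chosen for this sketch.
    """
    counts = {}
    for length, op in CIGAR_OP.findall(cigar):
        counts[op] = counts.get(op, 0) + int(length)

    match_cols = counts.get("M", 0) + counts.get("=", 0) + counts.get("X", 0)
    insertions = counts.get("I", 0)
    deletions = counts.get("D", 0)

    aln_cols = match_cols + insertions + deletions   # alignment columns
    query_aligned = match_cols + insertions          # read bases inside the alignment
    read_len = query_aligned + counts.get("S", 0) + counts.get("H", 0)

    if aln_cols == 0 or read_len == 0:
        return 0.0, 0.0
    return 1.0 - nm / aln_cols, query_aligned / read_len

# Hypothetical thresholds, for illustration only.
MIN_IDENTITY = 0.85
MIN_ALIGNED_FRACTION = 0.50

identity, fraction = alignment_metrics("10S80M5I5D10S", nm=14)
keep = identity >= MIN_IDENTITY and fraction >= MIN_ALIGNED_FRACTION
```

Many of the reads we are worried about would fall below thresholds of this kind, which is why we would like to know whether Salmon already accounts for such alignments in some way.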

We would greatly appreciate any clarification regarding the internal quantification logic or best practices for handling noisy long-read data with Salmon.

Thank you very much for your help.
