Question on CIGAR strings in UTA #266

@budsonjelmont

Lately I've been trying to understand how to interpret CIGAR strings in UTA and running into some confusion. This might just be due to some incorrect assumptions about CIGAR, but any advice is appreciated.

Here I have a query to UTA for an alignment that contains a 3bp deletion:

uta=> select cigar, tx_ac, alt_ac, ord, (tx_end_i - tx_start_i) as tx_ex_len, (alt_end_i - alt_start_i) as alt_ex_len  
from tx_exon_aln_v where tx_ac = 'NM_001256326.1' and cigar !~ '^[0-9]+=$' and alt_ac = 'NC_000017.10' order by ord;
   cigar   |     tx_ac      |    alt_ac    | ord | tx_ex_len | alt_ex_len 
-----------+----------------+--------------+-----+-----------+------------
 1453=3D2= | NM_001256326.1 | NC_000017.10 |  35 |      1458 |       1455

I've been assuming that this alignment means there is a deletion of 3 bases in the transcript relative to the genome (i.e. the transcript is the "query" and the genome is the "reference"). However, based on the tx_ex_len and alt_ex_len columns computed in that query, it seems I have this backwards: there are 1455 bases in the aligned region of the genome and 1458 (1455 + 3) bases in the transcript's aligned region.

So in UTA's transcript-genome alignments, is the genome considered the "query" sequence and the transcript the "reference"? Meaning that, when reading CIGAR strings that are describing indels, should I be assuming that a deletion event means a deletion of bases from the genome that ARE present in the transcript (and vice versa for insertions)?
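
To sanity-check the arithmetic, here is a minimal sketch, assuming the SAM-style convention that `=` consumes both sequences, `D` consumes only the "reference", and `I` consumes only the "query". Whether UTA follows that convention is exactly what I'm unsure about, and `cigar_lengths` is a hypothetical helper of my own, not part of UTA.

```python
import re

# Assumption: SAM-style CIGAR semantics (not confirmed for UTA).
# Operations that advance along the "reference" and "query" sequences.
CONSUMES_REF = set("M=XDN")
CONSUMES_QUERY = set("M=XIS")


def cigar_lengths(cigar):
    """Return (ref_len, query_len) consumed by a CIGAR string."""
    ref_len = query_len = 0
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(count)
        if op in CONSUMES_REF:
            ref_len += n
        if op in CONSUMES_QUERY:
            query_len += n
    return ref_len, query_len


ref_len, query_len = cigar_lengths("1453=3D2=")
print(ref_len, query_len)  # 1458 1455
```

Under that assumption the "reference" length comes out to 1458, which matches tx_ex_len, and the "query" length comes out to 1455, which matches alt_ex_len. That is what makes me think the transcript is acting as the reference here.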
