Getting alignment statistics with vg filter
The mappers in vg (giraffe, map) produce GAM files that include metadata about each alignment. vg stats -a reports high-level summary statistics; for per-read detail, vg filter has a --tsv-out option that writes a TSV with information about each read in a GAM, or in a filtered subset of one.
The general syntax for using --tsv-out is:
vg filter --tsv-out FIELD mappings.gam > statistics.tsv
# Separate fields with semicolons & wrap in quotation marks
vg filter --tsv-out "FIELD1;FIELD2" mappings.gam > statistics.tsv
The output file is a TSV with a header line of column names. The first column name will have a # prefix. Non-header lines have the requested fields for a single read in the GAM.
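Because of the # prefix on the first column name, some downstream tools need the header cleaned up before they can use it. The sketch below shows one way to handle that with standard Unix tools. The sample rows are made up for illustration; in practice the file would be the output of a vg filter --tsv-out "name;score" command.

```shell
# Made-up sample in the layout described above: a header line whose first
# column name carries a '#' prefix, then one line per read.
tsv=$(mktemp)
printf '#name\tscore\n' > "$tsv"
printf 'read1\t150\nread2\t98\nread3\t120\n' >> "$tsv"

# Strip the leading '#' so the header holds plain column names.
header=$(head -n 1 "$tsv" | sed 's/^#//')

# Count the data rows (everything after the header line).
n_reads=$(tail -n +2 "$tsv" | wc -l | tr -d ' ')

# Mean of the second column (score), skipping the header line.
mean_score=$(awk -F '\t' 'NR > 1 { sum += $2; n++ } END { if (n) printf "%.2f", sum / n }' "$tsv")

echo "columns: $header"
echo "reads: $n_reads"
echo "mean score: $mean_score"
```

Skipping the header with NR > 1 (or tail -n +2) is usually simpler than treating # as a comment character, since the header is the only line that carries it.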
Other vg filter options are still applied. For example, this command outputs name and score only for mapped reads whose names begin with hifi:
vg filter --name-prefix hifi --only-mapped \
--tsv-out "name;score" mappings.gam > statistics.tsv
Some statistics are pulled directly from the GAM, though not all GAM fields are available. Others are calculated on the fly from the information in the GAM. Statistics pulled from the GAM aren’t recalculated if missing. For example, unless --add-identity is used during vg inject, the resulting GAM won’t have an identity field. Asking vg filter to output the missing identity field will cause an error.
- name: Read name (pulled from GAM)
- score: Alignment score (pulled from GAM)
  - note that several options in vg filter can affect score, such as --rescore, --frac-score, and --substitutions
- correctly_mapped: True if a read was correctly mapped, False otherwise (pulled from GAM)
  - requires a known-truth mapping location, e.g. for simulated reads
- correctness: correct if a read was correctly mapped, off_reference if it was set to have no truth, incorrect otherwise (pulled from GAM)
  - requires a known-truth mapping location, e.g. for simulated reads
- softclip_start: number of base pairs soft-clipped off the beginning of a read (calculated on the fly)
- softclip_end: number of base pairs soft-clipped off the end of a read (calculated on the fly)
  - NOT the index of a soft-clip position
- cigar: the read's CIGAR string; X is a mismatch and all Ms are true matches
- identity: identity score of mapping (pulled from GAM)
  - calculated as (# matches) / (# matches + mismatches + insertions), ignoring soft clips
- is_perfect: 1 if an alignment is “perfect”, consisting of only matches and no mismatches, indels, or soft clips, 0 otherwise (calculated on the fly)
- mapping_quality: MQ score (pulled from GAM)
- sequence: read sequence (pulled from GAM)
- length: length of read sequence (pulled from GAM)
- time_used: time spent on mapping (pulled from GAM)
- annotation: any annotations (pulled from GAM)
- annotation.X: value of the X annotation (pulled from GAM)
Please request additional fields by opening an issue.
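Once several fields are in one TSV, simple summaries can be computed with awk. The sketch below looks columns up by header name rather than by position, so it does not depend on the order in which fields were requested. The sample rows are made up, and the mapping quality cutoff of 30 is an arbitrary choice for illustration; a real input would come from something like vg filter --tsv-out "name;mapping_quality;is_perfect".

```shell
# Made-up sample with three of the fields listed above.
tsv=$(mktemp)
printf '#name\tmapping_quality\tis_perfect\n' > "$tsv"
printf 'read1\t60\t1\nread2\t0\t0\nread3\t60\t1\nread4\t30\t0\n' >> "$tsv"

summary=$(awk -F '\t' '
    NR == 1 {
        # Drop the "#" prefix, then remember each column index by its name.
        sub(/^#/, "", $1)
        for (i = 1; i <= NF; i++) col[$i] = i
        next
    }
    {
        n++
        if ($(col["is_perfect"]) == 1) perfect++
        # 30 is an illustrative threshold, not a vg default.
        if ($(col["mapping_quality"]) >= 30) confident++
    }
    END {
        printf "reads=%d perfect=%d mapq30=%d", n, perfect, confident
    }
' "$tsv")

echo "$summary"
rm -f "$tsv"
```

The same pattern extends to other numeric fields such as score, softclip_start, or length.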