vg manpage
Adam Novak edited this page Oct 16, 2025
% vg(1) | Variation Graph Toolkit
vg - variation graph tool, vg version v1.69.0 "Bologna".
vg is a toolkit for variation graph data structures, interchange formats, alignment, genotyping, and variant calling methods.
For more in-depth explanations of tools and workflows, see the vg wiki page.
This is an incomplete list of vg subcommands. For a complete list, run vg help.
Graph construction and indexing
See the wiki page for an overview of vg indexes.
- vg autoindex: automatically construct a graph and indexes for a specific workflow (e.g. giraffe, rpvg). (wiki page)
- vg construct: manually construct a graph from a reference and variants. (wiki page)
- vg index: manually build individual indexes (xg, distance, GCSA, etc). (wiki page)
- vg gbwt: manually build and manipulate GBWTs and indexes (GBWTgraph, GBZ, r-index). (wiki page)
- vg minimizer: manually build a minimizer index for mapping.
- vg haplotypes: haplotype sample a graph. Recommended for mapping with giraffe. (wiki page)
Read mapping
Downstream analyses
Working with read alignments
- vg gamsort: sort a GAM/GAF file or index a sorted GAM file.
- vg filter: filter alignments by properties.
- vg surject: project alignments on a graph onto a linear reference (gam/gaf->bam/sam/cram).
- vg inject: project alignments on a linear reference onto a graph (bam/sam/cram->gam/gaf).
- vg sim: simulate reads from a graph. (wiki page)
Graph and read statistics
Manipulating a graph
Conversion between formats
- vg convert: convert between handle graph formats and GFA, and between alignment formats.
- vg view: convert between non-handle graph formats and alignment formats (dot, json, turtle...).
- vg surject: project alignments on a graph onto a linear reference (gam/gaf->bam/sam/cram).
- vg inject: project alignments on a linear reference onto a graph (bam/sam/cram->gam/gaf).
- vg paths: extract a fasta from a graph. (wiki page)
Subgraph extraction
usage: vg annotate [options] >output.{gam,vg,tsv}
graph annotation options:
-x, --xg-name FILE xg index or graph to annotate (required)
-b, --bed-name FILE BED file to convert to GAM (may repeat)
-f, --gff-name FILE GFF3 file to convert to GAM (may repeat)
-g, --ggff output GGFF subgraph annotation file
instead of GAM (requires -s)
-F, --gaf-output output in GAF format rather than GAM
-s, --snarls FILE snarls to expand GFF intervals into
alignment annotation options:
-a, --gam FILE alignments to annotate (required)
-x, --xg-name FILE xg index of the graph against which the
alignments are aligned (required)
-p, --positions annotate alignments with reference positions
-m, --multi-position annotate alignments with multiple reference positions
-l, --search-limit N when annotating with -p, search this far for paths, or
-1 to not search [0 (auto from read length)]
-b, --bed-name FILE annotate alignments with overlapping region names
from this BED (may repeat)
-n, --novelty output TSV table with header
describing how much of each Alignment is novel
-P, --progress show progress
-t, --threads N use the specified number of threads
-h, --help print this help message to stderr and exit
usage: vg autoindex [options]
output:
-p, --prefix PREFIX prefix to use for all output [index]
-w, --workflow NAME workflow to produce indexes for (may repeat) [map]
{map, mpmap, rpvg, giraffe, sr-giraffe, lr-giraffe}
input data:
-r, --ref-fasta FILE FASTA file with the reference sequence (may repeat)
-v, --vcf FILE VCF file with sequence names matching -r (may repeat)
-i, --ins-fasta FILE FASTA file with sequences of INS variants from -v
-g, --gfa FILE GFA file to make a graph from
-G, --gbz FILE GBZ file to make indexes from
-x, --tx-gff FILE GTF/GFF file with transcript annotations (may repeat)
-H, --hap-tx-gff FILE GTF/GFF file with transcript annotations
of a named haplotype (may repeat)
-n, --no-guessing do not guess that pre-existing files are indexes
i.e. force-regenerate any index not explicitly provided
configuration:
-f, --gff-feature STR GTF/GFF feature type (col. 3) to add to graph [exon]
-a, --gff-tx-tag STR GTF/GFF tag (in col. 9) for ID [transcript_id]
logging and computation:
-T, --tmp-dir DIR temporary directory to use for intermediate files
-M, --target-mem MEM target max memory usage (not exact, formatted INT[kMG])
[1/2 of available]
-t, --threads NUM number of threads [all available]
-V, --verbosity NUM log to stderr {0 = none, 1 = basic, 2 = debug} [1]
-h, --help print this help message to stderr and exit
usage: vg call [options] <graph> > output.vcf
Call variants or genotype known variants
support calling options:
-k, --pack FILE supports created from vg pack for given input graph
-m, --min-support M,N min allele (M) and site (N) support to call [2,4]
-e, --baseline-error X,Y baseline error rates for Poisson model for small (X)
and large (Y) variants [0.005,0.01]
-B, --bias-mode use old ratio-based genotyping algorithm
as opposed to probabilistic model
-b, --het-bias M,N homozygous alt/ref allele must have >= M/N times
more support than the next best allele [6,6]
GAF options:
-G, --gaf output GAF genotypes instead of VCF
-T, --traversals output all candidate traversals in GAF
without doing any genotyping
-M, --trav-padding N extend each flank of traversals (from -T) with
reference path by N bases if possible
general options:
-v, --vcf FILE VCF file to genotype (must have been used
to construct input graph with -a)
-a, --genotype-snarls genotype every snarl, including reference calls
(use to compare multiple samples)
-A, --all-snarls genotype all snarls, including nested child snarls
(like deconstruct -a)
-c, --min-length N genotype only snarls with
at least one traversal of length >= N
-C, --max-length N genotype only snarls where
all traversals have length <= N
-f, --ref-fasta FILE reference FASTA
(required if VCF has symbolic deletions/inversions)
-i, --ins-fasta FILE insertions (required if VCF has symbolic insertions)
-s, --sample NAME sample name [SAMPLE]
-r, --snarls FILE snarls (from vg snarls) to avoid recomputing.
-g, --gbwt FILE only call genotypes present in given GBWT index
-z, --gbz only call genotypes present in GBZ index
(applies only if input graph is GBZ)
-N, --translation FILE node ID translation (from vg gbwt --translation)
to apply to snarl names in output
-O, --gbz-translation apply the ID translation from the input GBZ to
snarl names/AT fields in output
-p, --ref-path NAME reference path to call on (may repeat; default all)
-S, --ref-sample NAME call on all paths with this sample
(cannot use with -p)
-o, --ref-offset N offset in reference path (may repeat; 1 per path)
-l, --ref-length N override reference length for output VCF contig
-d, --ploidy N ploidy of sample. {1, 2} [2]
-R, --ploidy-regex RULES use this comma-separated list of colon-delimited
REGEX:PLOIDY rules to assign ploidies to contigs
not visited by the selected samples, or to all
contigs simulated from if no samples are used.
Unmatched contigs get ploidy 2 (or that from -d).
-n, --nested activate nested calling mode (experimental)
-I, --chains call chains instead of snarls (experimental)
--progress show progress
-t, --threads N number of threads to use
-h, --help print this help message to stderr and exit
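The -R/--ploidy-regex option above takes a comma-separated list of colon-delimited REGEX:PLOIDY rules, and unmatched contigs fall back to ploidy 2 (or the value from -d). The Python sketch below shows one way such a rule string could be interpreted. The function names, the first-match-wins order, and the whole-name matching are assumptions made for illustration; they are not taken from the vg source, and a regex containing a comma would not survive this simple split.

```python
import re

def parse_ploidy_rules(rules):
    """Split 'REGEX:PLOIDY,REGEX:PLOIDY' into (compiled regex, ploidy) pairs."""
    parsed = []
    for rule in rules.split(","):
        # Split on the last colon so the regex itself may contain colons.
        pattern, sep, ploidy = rule.rpartition(":")
        if not sep:
            raise ValueError(f"rule without a ploidy: {rule!r}")
        parsed.append((re.compile(pattern), int(ploidy)))
    return parsed

def ploidy_for(contig, rules, default=2):
    """Return the ploidy of the first matching rule, else the default."""
    for pattern, ploidy in rules:
        if pattern.fullmatch(contig):  # assumption: the whole name must match
            return ploidy
    return default

rules = parse_ploidy_rules("chrY.*:1,chrM:1")
for contig in ("chrY", "chrM", "chr1"):
    print(contig, ploidy_for(contig, rules))
```

With these hypothetical rules, chrY and chrM are treated as haploid and every other contig keeps the default ploidy.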
usage: vg chunk [options] > [chunk.vg]
Splits a graph and/or alignment into smaller chunks
Graph chunks are saved to .vg files, read chunks are saved to .gam files,
and haplotype annotations are saved to .annotate.txt files, of the form
<BASENAME>-<i>-<region name or "ids">-<start>-<length>.<ext>.
The BASENAME is specified with -b and defaults to "./chunk".
For a single-range chunk (-p or -r), the graph data is sent to
standard output instead of a file.
options:
-x, --xg-name FILE use this graph or xg index to chunk subgraphs
-G, --gbwt-name FILE use this GBWT haplotype index
for haplotype extraction (for -T)
-a, --aln-name FILE chunk alignments instead of graph (may repeat)
-g, --aln-and-graph when used in combination with -a,
both alignments and graph will be chunked
-F, --in-gaf -a alignment is a sorted bgzipped GAF, not GAM
path chunking:
-p, --path TARGET write the chunk in the specified path range
(0-based inclusive, multiple allowed)
TARGET=path[:pos1[-pos2]] to standard output
-P, --path-list FILE for all paths in line separated file,
write chunks for each as in -p
-e, --input-bed FILE write chunks for (0-based end-exclusive) regions
-S, --snarls FILE write given path-range(s) and all snarls
fully contained in them, as alternative to -c
id range chunking:
-r, --node-range N:M write the chunk for this node range to stdout
-R, --node-ranges FILE write the chunk for each node range in
(newline or whitespace separated) file
-n, --n-chunks N generate N id-range chunks, determined via xg
simple alignment chunking:
-m, --aln-split-size N split alignments (-a, sort/index not required)
up into chunks with at most N reads each
component chunking:
-C, --components create a chunk for each connected component.
If targets given with (-p, -P, -r, -R),
limit to components containing them
-M, --path-components create a chunk for each path
in the graph's connected component
general:
-s, --chunk-size N create chunks spanning N bases
(or nodes with -r/-R) for all input regions.
-o, --overlap N overlap between chunks when using -s [0]
-E, --output-bed FILE write all created chunks to a bed file
-b, --prefix BASENAME write output chunk files [./chunk]
Files for chunk i will be named
<BASENAME>-<i>-<name>-<start>-<length>.<ext>
-c, --context-steps N expand the context of the chunk N node steps [1]
-l, --context-length N expand the context of the chunk by N bp [0]
-T, --trace trace haplotype threads in chunks
(and only expand forward from input coordinates)
Produces .annotate.txt file
with haplotype frequencies for each chunk.
--no-embedded-haplotypes don't load haplotypes from the graph. It is
possible to -T without any haplotypes available.
-f, --fully-contained only return GAM alignments that are
fully contained within chunk
-u, --cut-alignments cut alignments to be fully within the chunk
-O, --output-fmt STR output format {vg, pg, hg, gfa} [pg (vg for -T)]
-t, --threads N for parallel tasks, use this many threads [1]
-h, --help print this help message to stderr and exit
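The vg chunk help above gives the output naming pattern `<BASENAME>-<i>-<name>-<start>-<length>.<ext>` and describes -s/-o as creating chunks of N bases with an optional overlap. The Python sketch below fills in that pattern and cuts a range into windows. Both helper names are hypothetical, and the windowing rule (each window starts `chunk_size - overlap` bases after the previous one) is an assumption about how -s and -o interact, not a description of vg's implementation.

```python
def chunk_file_name(basename, index, name, start, length, ext):
    """Fill in the <BASENAME>-<i>-<name>-<start>-<length>.<ext> pattern."""
    return f"{basename}-{index}-{name}-{start}-{length}.{ext}"

def windows(start, end, chunk_size, overlap=0):
    """Cut the 0-based inclusive range [start, end] into (start, length) windows.

    Assumption: consecutive windows start chunk_size - overlap bases apart.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    result = []
    pos = start
    while pos <= end:
        length = min(chunk_size, end - pos + 1)
        result.append((pos, length))
        if pos + length > end:  # this window reaches the end of the range
            break
        pos += step
    return result

for i, (pos, length) in enumerate(windows(0, 249, 100)):
    print(chunk_file_name("./chunk", i, "chr1", pos, length, "vg"))
```

The region name `chr1` and the coordinates are example values only.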
usage: vg construct [options] >new.vg
options:
construct from a reference and variant calls:
-r, --reference FILE input FASTA reference (may repeat)
-v, --vcf FILE input VCF (may repeat)
-n, --rename V=F match contig V in the VCFs to contig F in the FASTAs
(may repeat)
-a, --alt-paths save paths for alts of variants by SHA1 hash
-A, --alt-paths-plain save paths for alts of variants by variant ID
if possible, otherwise SHA1
(IDs must be unique across all input VCFs)
-R, --region REGION specify a VCF contig name or 1-based inclusive region
(may repeat, if on different contigs)
-C, --region-is-chrom don't attempt to parse the regions (use when reference
sequence name could be parsed as a region)
-z, --region-size N variants per region to parallelize [1024]
-t, --threads N use N threads to construct graph [numCPUs]
-S, --handle-sv include structural variants in construction of graph.
-I, --insertions FILE a FASTA file containing insertion sequences
(referred to in VCF) to add to graph.
-f, --flat-alts don't chop up alternate alleles from input VCF
-l, --parse-max N don't chop up alternate alleles from input VCF
longer than N [100]
-i, --no-trim-indels don't remove the 1bp ref base from indel alt alleles
-N, --in-memory construct entire graph in memory before outputting it
construct from a multiple sequence alignment:
-M, --msa FILE input multiple sequence alignment
-F, --msa-format STR format of the MSA file {fasta, clustal} [fasta]
-d, --drop-msa-paths don't add paths for the MSA sequences into the graph
shared construction options:
-m, --node-max N limit maximum allowable node sequence size [32]
nodes greater than this threshold will be divided
note: nodes larger than ~1024 bp can't be GCSA2-indexed
-p, --progress show progress
-h, --help print this help message to stderr and exit
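The -m/--node-max option above limits node sequence size, and longer nodes are divided. A minimal Python sketch of that idea follows, assuming the simplest possible rule (cut left to right into pieces of at most N bases). The function name is hypothetical and vg may choose its split points differently.

```python
def chop(sequence, node_max=32):
    """Split a sequence into consecutive pieces of at most node_max bases."""
    if node_max <= 0:
        raise ValueError("node_max must be positive")
    return [sequence[i:i + node_max] for i in range(0, len(sequence), node_max)]

pieces = chop("ACGT" * 20)  # an 80 bp example sequence
print([len(p) for p in pieces])
```

Concatenating the pieces in order gives back the original sequence, so chopping changes only the node boundaries, not the sequence the path spells.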
usage: vg convert [options] <input-graph>
input options:
-g, --gfa-in input in GFA format
-r, --in-rgfa-rank N import rgfa tags with rank <= N as paths [0]
-b, --gbwt-in FILE input graph is a GBWTGraph using the GBWT in FILE
--ref-sample STR change haplotypes for this sample to
reference paths (may repeat)
--hap-locus STR change generic paths with this locus
to haplotype paths (must be used with --new-sample)
--new-sample STR when using --hap-locus, give the new haplotype
this sample name (must be used with --hap-locus)
gfa input options (use with -g):
-T, --gfa-trans FILE write gfa id conversions to FILE
output options:
-v, --vg-out output in VG's original Protobuf format
[DEPRECATED: use -p instead].
-a, --hash-out output in HashGraph format
-p, --packed-out output in PackedGraph format (default)
-x, --xg-out output in XG format
-f, --gfa-out output in GFA format
-H, --drop-haplotypes do not include haplotype paths in the output
(useful with GBWTGraph / GBZ inputs)
gfa output options (use with -f):
-P, --rgfa-path STR write given path as rGFA tags instead of lines
(may repeat, only rank-0 supported)
-Q, --rgfa-prefix STR write paths with this prefix as rGFA tags instead
of lines (may repeat, only rank-0 supported)
-B, --rgfa-pline paths written as rGFA tags also written as lines
-W, --no-wline write all paths as GFA P-lines instead of W-lines.
allows handling multiple phase blocks
and subranges used together.
--gbwtgraph-algorithm always use the GBWTGraph library GFA algorithm.
not compatible with other GFA output options
or non-GBWT graphs.
--vg-algorithm always use the VG GFA algorithm. Works with all
options and graph types, but can't preserve
original GFA coordinates
--no-translation when using the GBWTGraph algorithm, convert graph
directly to GFA; do not use the translation
to preserve original coordinates
alignment options:
-G, --gam-to-gaf FILE convert GAM FILE to GAF
-F, --gaf-to-gam FILE convert GAF FILE to GAM
general options:
-t, --threads N use N threads [numCPUs]
-h, --help print this help message to stderr and exit
usage: vg deconstruct [options] [-p|-P] <PATH> <GRAPH>
Output VCF records for Snarls present in a graph (relative to a reference path).
options:
-p, --path NAME a reference path to deconstruct against (may repeat).
-P, --path-prefix NAME all paths [minus GBWT threads / non-ref GBZ paths]
beginning with NAME used as reference (may repeat).
other non-ref paths not considered as samples.
-r, --snarls FILE snarls file (from vg snarls) to avoid recomputing.
-g, --gbwt FILE consider alt traversals for GBWT haplotypes in FILE
(not needed for GBZ graph input).
-T, --translation FILE node ID translation (from vg gbwt --translation)
to apply to snarl names and AT fields in output
-O, --gbz-translation apply the ID translation from the input GBZ to
snarl names and AT fields in output
-a, --all-snarls process all snarls, including nested snarls
(by default only top-level snarls reported).
-c, --context-jaccard N set context mapping size used to disambiguate alleles
at sites with multiple reference traversals [10000]
-u, --untangle-travs use context mapping for reference-relative positions
of each step in allele traversals (AP INFO field).
-K, --keep-conflicted retain conflicted genotypes in output.
-S, --strict-conflicts drop genotypes when we have more than one haplotype
for any given phase (set by default for GBWT input).
-C, --contig-only-ref only use CONTIG name (not SAMPLE#CONTIG#HAPLOTYPE)
for reference if possible (i.e. only one ref sample)
-L, --cluster F cluster traversals whose (handle) Jaccard coefficient
is >= F together [1.0; experimental]
-n, --nested write a nested VCF, plus special tags [experimental]
-R, --star-allele use *-alleles to denote alleles that span
but do not cross the site. Only works with -n
-t, --threads N use N threads
-v, --verbose print some status messages
-h, --help print this help message to stderr and exit
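The -L/--cluster option above groups traversals whose (handle) Jaccard coefficient is at least F. The Python sketch below computes a Jaccard coefficient over sets of handles and applies a simple greedy grouping. Representing a handle as a `(node_id, is_reverse)` tuple, comparing sets rather than multisets, and comparing each traversal against the first member of a cluster are all assumptions for illustration, not vg's actual procedure.

```python
def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two collections of handles."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty traversals are treated as identical
    return len(a & b) / len(a | b)

def cluster(traversals, min_jaccard=1.0):
    """Greedily group traversal indexes by similarity to a cluster's first member."""
    clusters = []
    for i, trav in enumerate(traversals):
        for members in clusters:
            if jaccard(trav, traversals[members[0]]) >= min_jaccard:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Handles as (node_id, is_reverse); a hypothetical three-traversal site.
travs = [
    [(1, False), (2, False), (4, False)],
    [(1, False), (3, False), (4, False)],
    [(1, False), (2, False), (4, False)],
]
print(cluster(travs, 1.0))
print(cluster(travs, 0.5))
```

At the default threshold of 1.0 only traversals with identical handle sets are grouped; lowering the threshold merges traversals that share most of their handles.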
usage: vg filter [options] <alignment.gam> > out.gam
Filter alignments by properties.
options:
-M, --input-mp-alns input is multipath alignments (GAMP), not GAM
-n, --name-prefix NAME keep only reads with this name prefix ['']
-N, --name-prefixes FILE keep reads with names with any of these prefixes,
one per nonempty line
-e, --exact-name match read names exactly instead of by prefix
-a, --subsequence NAME keep reads that contain this subsequence
-A, --subsequences FILE keep reads that contain one of these subsequences
one per nonempty line
-p, --proper-pairs keep reads annotated as being properly paired
-P, --only-mapped keep reads that are mapped
-X, --exclude-contig REGEX drop reads with refpos annotations on contigs
matching the given regex (may repeat)
-F, --exclude-feature NAME drop reads with the given feature
in the "features" annotation (may repeat)
-s, --min-secondary N minimum score to keep secondary alignment
-r, --min-primary N minimum score to keep primary alignment
-L, --max-length N drop reads with length > N
-O, --rescore re-score reads using default parameters
and only alignment information
-f, --frac-score normalize score based on length
-u, --substitutions use substitution count instead of score
-W, --overwrite-score replace stored GAM score with computed/normalized
score
-o, --max-overhang N drop reads whose alignments begin or end
with an insert > N [99999]
-m, --min-end-matches N drop reads without >=N matches on each end
-S, --drop-split remove split reads taking nonexistent edges
-x, --xg-name FILE use this xg index/graph (required for -S and -D)
-v, --verbose print out statistics on numbers of reads dropped
-V, --no-output print out -v statistics and do not write the GAM
-T, --tsv-out FIELD[;FIELD] write TSV of given fields instead of filtered GAM
See wiki page:
"Getting alignment statistics with vg filter"
-q, --min-mapq N drop alignments with mapping quality < N
-E, --repeat-ends N drop reads with tandem repeat (motif size <= 2N,
spanning >= N bases) at either end
-D, --defray-ends N clip back the ends of ambiguously aligned reads
up to N bases
-C, --defray-count N stop defraying after N nodes visited
(used to keep runtime in check) [99999]
-d, --downsample S.P drop all but the given portion 0.P of the reads.
S may be an integer seed as in SAMtools
-R, --max-reads N drop all but N reads. Use on a single thread
-i, --interleaved assume interleaved input; both ends will be dropped
if either fails the filters
-I, --interleaved-all assume interleaved input; both ends will be dropped
if *both* fail the filters
-b, --min-base-quality Q:F drop reads where fewer than fraction F of bases
have base quality >= PHRED score Q.
-G, --annotation K[:V] keep reads if the annotation is present and
not false/empty. If a value is given, keep reads
if the values are equal similar to running
jq 'select(.annotation.K==V)' on the json
-c, --correctly-mapped keep only reads marked as correctly-mapped
-l, --first-alignment keep only the first alignment for each read
Must be run with 1 thread
-U, --complement apply opposite of the filter from other arguments
-B, --batch-size N work in batches of N reads [512]
-t, --threads N number of threads [1]
--progress show progress
-h, --help print this help message to stderr and exit
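The -d/--downsample option above packs two values into one argument: S.P keeps the portion 0.P of the reads, with S as an optional integer seed. The Python sketch below shows how such a value could be split. It works on the string form so that the digits after the dot are read exactly as written. The function name and the choice of seed 0 when S is empty are assumptions for illustration.

```python
def parse_downsample(value):
    """Split 'S.P' into (seed S, fraction 0.P)."""
    seed_text, dot, frac_text = value.partition(".")
    if not dot or not frac_text.isdigit():
        raise ValueError(f"expected S.P, got {value!r}")
    seed = int(seed_text) if seed_text else 0  # assumption: empty seed means 0
    fraction = float("0." + frac_text)
    return seed, fraction

print(parse_downsample("42.25"))  # seed 42, keep a quarter of the reads
print(parse_downsample("0.5"))    # seed 0, keep half of the reads
```

Parsing the argument as text avoids treating `42.25` as a single floating-point number, which would lose the distinction between the seed and the fraction.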
usage: vg find [options] >sub.vg
options:
-h, --help print this help message to stderr and exit
graph features:
-x, --xg-name FILE use this xg index or graph (instead of rocksdb db)
-n, --node ID find node(s), return 1-hop context as graph
-N, --node-list FILE whitespace or line delimited list of nodes to grab
--mapping FILE include nodes mapping to the selected node IDs
-e, --edges-end ID return edges on end of node with ID
-s, --edges-start ID return edges on start of node with ID
-c, --context STEPS expand the context of the subgraph this many steps
-L, --use-length treat STEPS in -c or M in -r as a length in bases
-P, --position-in PATH find the position of -n node in the given path
-I, --list-paths write out the path names in the index
-r, --node-range N:M get nodes from N to M
-G, --gam GAM accumulate the graph touched by GAM's alignments
--connecting-start POS find graph from POS (node ID, + or -, node offset)
connecting to --connecting-end
--connecting-end POS find graph to POS (node ID, + or -, node offset)
connecting from --connecting-start
--connecting-range INT traverse up to INT bases when going
from --connecting-start to --connecting-end [100]
subgraphs by path range:
-p, --path TARGET find the node(s) in the specified path range(s)
TARGET=path[:pos1[-pos2]]
-R, --path-bed FILE read our targets from the given BED FILE
-E, --path-dag with -p or -R, gets any node in the partial order
from pos1 to pos2, assumes id sorted DAG
-W, --save-to PREFIX instead of writing target subgraphs to stdout,
write one per given target to a separate file
named PREFIX[path]:[start]-[end].vg
-K, --subgraph-k K instead of graphs, write kmers from the subgraphs
-H, --gbwt FILE when enumerating kmers from subgraphs, determine
their frequencies in this GBWT haplotype index
alignments:
-l, --sorted-gam FILE use this sorted, indexed GAM file
-F, --sorted-gaf FILE use this sorted, indexed GAF file
-o, --alns-on N:M write alignments which align to any of the
nodes between N and M (inclusive)
-A, --to-graph VG get alignments to the provided subgraph
sequences:
-g, --gcsa FILE use this GCSA2 index of the graph's sequence space
(required for sequence queries)
-S, --sequence STR search for sequence STR using
-M, --mems STR describe the super-maximal exact matches
of the STR (GCSA2) in JSON
-B, --reseed-length N find non-super-maximal MEMs inside SMEMs length>=N
-f, --fast-reseed use fast SMEM reseeding algorithm
-Y, --max-mem N maximum length of the MEM [GCSA2 order]
-Z, --min-mem N minimum length of the MEM [1]
-D, --distance return distance on path between pair of nodes (-n)
if -P not used, best path chosen heuristically
-Q, --paths-named STR return all paths with name prefix STR (may repeat)
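The -p/--path option above takes targets of the form `path[:pos1[-pos2]]`, and -W/--save-to names its files `PREFIX[path]:[start]-[end].vg`. The Python sketch below parses such a target and builds that file name. The helper names are hypothetical, and treating only a trailing numeric `:pos1` or `:pos1-pos2` as a range (so that path names containing colons still parse) is an assumption, not vg's documented behavior.

```python
import re

# path[:pos1[-pos2]]; only a trailing numeric suffix is read as a range.
TARGET_RE = re.compile(r"(?P<path>.+?)(?::(?P<pos1>\d+)(?:-(?P<pos2>\d+))?)?")

def parse_target(target):
    """Return (path, pos1, pos2); positions that are absent are None."""
    match = TARGET_RE.fullmatch(target)
    if match is None:
        raise ValueError(f"not a valid TARGET: {target!r}")
    pos1, pos2 = match.group("pos1"), match.group("pos2")
    return (
        match.group("path"),
        None if pos1 is None else int(pos1),
        None if pos2 is None else int(pos2),
    )

def save_to_name(prefix, path, start, end):
    """Fill in the PREFIX[path]:[start]-[end].vg pattern used by -W."""
    return f"{prefix}{path}:{start}-{end}.vg"

path, start, end = parse_target("chr1:100-200")
print(path, start, end)
print(save_to_name("out_", path, start, end))
```

The prefix `out_` and the target `chr1:100-200` are example values only.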
usage: vg gamsort [options] input > output
Sort a GAM/GAF file, or index a sorted GAM file.
General options:
-p, --progress show progress
-s, --shuffle shuffle reads by hash
-t, --threads N use N worker threads [4 for GAM, 1 for GAF]
GAM sorting options:
-i, --index FILE produce an index of the sorted GAM file
-d, --dumb-sort use naive sorting algorithm
(no tmp files, faster for small GAMs)
GAF sorting options:
-G, --gaf-input input is a GAF file
-c, --chunk-size N number of reads per chunk [1000000]
-m, --merge-width N number of files to merge at once [32]
-S, --stable use stable sorting
-g, --gbwt-output FILE write a GBWT index of the paths to FILE
-b, --bidirectional make the GBWT index bidirectional
-h, --help print this help message to stderr and exit
usage: vg gbwt [options] [args]
Manipulate GBWTs. Input GBWTs are loaded from input args
or built in earlier steps. See wiki page "VG GBWT Subcommand".
The input graph is provided with one of -x, -G, or -Z
General options:
-h, --help print this help message to stderr and exit
-x, --xg-name FILE read the graph from FILE
-o, --output FILE write output GBWT to FILE
-d, --temp-dir DIR use directory DIR for temporary files
-p, --progress show progress and statistics
GBWT construction parameters (for steps 1 and 4):
--buffer-size N construction buffer size in millions of nodes [100]
--id-interval N store path IDs at 1/N positions [1024]
Multithreading:
--num-jobs N use at most N parallel build jobs
(for -v, -G, -A, -l, -P) [4]
--num-threads N use N parallel search threads
(for -b and -r) [8]
Step 1: GBWT construction (requires -o and one of { -v, -G, -Z, -E, -A }):
-v, --vcf-input index the haplotypes in the VCF files specified in
input args in parallel (requires -x, implies -f);
(inputs must be over different contigs,
does not store graph contigs in the GBWT)
--preset X use preset X (available: 1000gp)
--inputs-as-jobs create one build job for each input
instead of using first-fit heuristic
--parse-only store the VCF parses without building GBWTs
(use -o for file name prefix; skips later steps)
--ignore-missing don't warn when variants are missing from the graph
--actual-phasing don't treat unphased homozygous genotypes as phased
--force-phasing replace unphased genotypes with randomly phased ones
--discard-overlaps skip overlapping alternate alleles if the overlap
cannot be resolved instead of creating a phase break
--batch-size N index the haplotypes in batches of N samples [200]
--sample-range X-Y index samples X to Y (inclusive, 0-based)
--rename V=P VCF contig V matches path P in the graph (may repeat)
--vcf-variants variants in graph use VCF contig names, not path names
--vcf-region C:X-Y restrict VCF contig C to coordinates X to Y
(inclusive, 1-based; may repeat)
--exclude-sample X do not index the sample with name X
(faster than -R; may repeat)
-G, --gfa-input index walks or paths in the GFA file (one input arg)
--max-node N chop long segments into nodes of at most N bp
(use 0 to disable) [1024]
--path-regex X parse metadata as haplotypes from path names
using regex X instead of vg-parser-compatible rules
--path-fields X parse metadata as haplotypes, mapping regex submatches
to these fields instead of vg-parser-compatible rules
--translation FILE write the segment to node translation table to FILE
-Z, --gbz-input extract GBWT & GBWTGraph from the GBZ in (one) input arg
-I, --gg-in FILE load GBWTGraph from FILE and GBWT from (one) input arg
-E, --index-paths index the embedded non-alt paths in the graph
(requires -x, no input args)
-A, --alignment-input index the alignments in the GAF files specified
in input args (requires -x)
--gam-format input files are in GAM format instead of GAF format
Step 2: Merge multiple input GBWTs (requires -o):
-m, --merge use the insertion algorithm
-f, --fast fast merging algorithm (node ids must not overlap)
-b, --parallel use the parallel algorithm
--chunk-size N search in chunks of N sequences [1]
--pos-buffer N use N MiB position buffers for each search thread [64]
--thread-buffer N use N MiB thread buffers for each search thread [256]
--merge-buffers N merge 2^N thread buffers into one file per merge [6]
--merge-jobs N run N parallel merge jobs [4]
Step 3: Alter GBWT (requires -o and one input GBWT):
-R, --remove-sample X remove sample X from the index (may repeat)
--set-tag K=V set a GBWT tag (may repeat)
--set-reference X set sample X as the reference (may repeat)
Step 4: Path cover GBWT construction
(requires an input graph, -o, and one of { -a, -l, -P }):
-a, --augment-gbwt add path cover of missing components (one input GBWT)
-l, --local-haplotypes sample local haplotypes (one input GBWT)
-P, --path-cover build a greedy path cover (no input GBWTs)
-n, --num-paths N find N paths per component [64 for -l, 16 otherwise]
-k, --context-length N use N-node contexts [4]
--pass-paths include named graph paths in local haplotype
or greedy path cover GBWT
Step 5: GBWTGraph construction (requires an input graph and one input GBWT):
-g, --graph-name FILE build GBWTGraph and store it in FILE
--gbz-format serialize both GBWT and GBWTGraph in GBZ format
(makes -o unnecessary)
Step 6: R-index construction (one input GBWT):
-r, --r-index FILE build an r-index and store it in FILE
Step 7: Metadata (one input GBWT):
-M, --metadata print basic metadata
-C, --contigs print the number of contigs
-H, --haplotypes print the number of haplotypes
-S, --samples print the number of samples
-L, --list-names list contig/sample names (use with -C or -S)
-T, --path-names list path names
--tags list GBWT tags
Step 8: Paths (one input GBWT):
-c, --count-paths print the number of paths
-e, --extract FILE extract paths in SDSL format to FILE
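Step 1 above combines --sample-range X-Y (inclusive, 0-based) with --batch-size N, which indexes haplotypes in batches of N samples. The Python sketch below shows how an inclusive sample range could be split into such batches. The function names are hypothetical, and the assumption that batches are consecutive and that only the last one may be short is made for illustration.

```python
def parse_sample_range(text):
    """Parse 'X-Y' into the inclusive 0-based pair (X, Y)."""
    first, sep, last = text.partition("-")
    if not sep:
        raise ValueError(f"expected X-Y, got {text!r}")
    return int(first), int(last)

def sample_batches(first, last, batch_size=200):
    """Split the inclusive range first..last into consecutive inclusive batches."""
    if first > last or batch_size <= 0:
        raise ValueError("need first <= last and a positive batch size")
    return [
        (lo, min(lo + batch_size - 1, last))
        for lo in range(first, last + 1, batch_size)
    ]

first, last = parse_sample_range("0-449")
print(sample_batches(first, last))
```

Because both ends of the range are inclusive, the range 0-449 covers 450 samples and yields two full batches of 200 plus one batch of 50.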
usage:
vg giraffe -Z graph.gbz [-d graph.dist [-m graph.withzip.min -z graph.zipcodes]] <input options> [other options] > output.gam
vg giraffe -Z graph.gbz --haplotype-name graph.hapl --kff-name sample.kff <input options> [other options] > output.gam
Fast haplotype-aware read mapper.
basic options:
-Z, --gbz-name FILE map to this GBZ graph
-m, --minimizer-name FILE use this minimizer index
-z, --zipcode-name FILE use these additional distance hints
-d, --dist-name FILE cluster using this distance index
-p, --progress show progress
-t, --threads N number of mapping threads to use
-b, --parameter-preset NAME set computational parameters [default]
(chaining-sr / default / fast / hifi / r10 / srold)
-h, --help print full help with all available options
input options:
-G, --gam-in FILE read and realign these GAM-format reads
-f, --fastq-in FILE read and align these FASTQ/FASTA-format reads
(two are allowed, one for each mate)
-i, --interleaved GAM/FASTQ/FASTA input is interleaved pairs,
for paired-end alignment
--comments-as-tags treat comments in name lines as SAM-style tags
and annotate alignments with them
haplotype sampling:
--haplotype-name FILE sample from haplotype information in FILE
--kff-name FILE sample according to kmer counts in FILE
--index-basename STR name prefix for generated graph/index files
(default: from graph name)
--set-reference STR include this sample as a reference
in the personalized graph (may repeat)
alternate graphs:
-x, --xg-name FILE map to this graph (if no -Z / -g),
or use this graph for HTSlib output
-g, --graph-name FILE map to this GBWTGraph (if no -Z)
-H, --gbwt-name FILE use this GBWT index (when mapping to -x / -g)
output options:
-N, --sample NAME add this sample name
-R, --read-group NAME add this read group
-o, --output-format NAME output the alignments in NAME format [gam]
{gam / gaf / json / tsv / SAM / BAM / CRAM}
--ref-paths FILE ordered list of paths in the graph, one per line
or HTSlib .dict, for HTSlib @SQ headers
--ref-name NAME name of reference in the graph for HTSlib output
--named-coordinates make GAM/GAF output in named-segment (GFA) space
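The first usage line above nests the optional indexes as `[-d graph.dist [-m graph.withzip.min -z graph.zipcodes]]`. The Python sketch below assembles an argument list that follows that nesting, using only flags listed in the help text. The function name and file names are hypothetical, and the checks read the brackets of the synopsis literally; vg itself may accept other combinations. The command is only built and printed, not run.

```python
import shlex

def giraffe_command(gbz, fastqs, dist=None, minimizer=None, zipcodes=None,
                    threads=None, output_format="gam"):
    """Build a vg giraffe argument list that mirrors the usage synopsis."""
    if not 1 <= len(fastqs) <= 2:
        raise ValueError("-f accepts one file, or two (one for each mate)")
    if (minimizer is None) != (zipcodes is None):
        raise ValueError("the synopsis shows -m and -z together")
    if minimizer is not None and dist is None:
        raise ValueError("the synopsis nests -m/-z inside -d")
    cmd = ["vg", "giraffe", "-Z", gbz]
    if dist is not None:
        cmd += ["-d", dist]
    if minimizer is not None:
        cmd += ["-m", minimizer, "-z", zipcodes]
    for fastq in fastqs:
        cmd += ["-f", fastq]
    if threads is not None:
        cmd += ["-t", str(threads)]
    cmd += ["-o", output_format]
    return cmd

cmd = giraffe_command("graph.gbz", ["reads_1.fq", "reads_2.fq"],
                      dist="graph.dist", threads=8, output_format="gaf")
print(shlex.join(cmd))
```

Building the command as a list rather than a string keeps file names with spaces intact if the list is later passed to `subprocess.run`.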
usage:
vg haplotypes [options] -k kmers.kff -g output.gbz graph.gbz
vg haplotypes [options] -H output.hapl graph.gbz
vg haplotypes [options] -i graph.hapl -k kmers.kff -g output.gbz graph.gbz
Haplotype sampling based on kmer counts.
Output files:
-g, --gbz-output FILE write the output GBZ to file (requires -k)
-H, --haplotype-output FILE write haplotype information to file
Input files:
-d, --distance-index FILE use this distance index [<basename>.dist]
-r, --r-index FILE use this r-index [<basename>.ri]
-i, --haplotype-input FILE use this .hapl file (default: generate)
-k, --kmer-input FILE use kmer counts from this KFF file
Options for generating haplotype information:
--kmer-length N kmer length for building minimizer index [29]
--window-length N window length for building minimizer index [11]
--subchain-length N target length (in bp) for subchains [10000]
--linear-structure extend subchains to avoid haplotypes
visiting them multiple times
Options for sampling haplotypes:
--preset STR use preset STR {default, haploid, diploid}
--coverage N kmer coverage in KFF file (default: estimate)
--num-haplotypes N generate N haplotypes [4]
with --diploid-sampling, use N candidates [32]
--present-discount F discount scores for present kmers by factor F
[0.9]
--het-adjustment F adjust scores for heterozygous kmers by F [0.05]
--absent-score F score absent kmers -F/+F [0.8]
--haploid-scoring use a scoring model without heterozygous kmers
--diploid-sampling choose the best pair from the sampled haplotypes
--extra-fragments select all candidates in bad subchains
in --diploid-sampling
--badness F threshold for badness of a subchain [4]
--include-reference include named and reference paths in the output
--set-reference NAME use sample NAME as a reference sample (may repeat)
Other options:
-v, --verbosity N verbosity level [0]
{0 = silent, 1 = basic, 2 = detailed, 3 = debug}
-t, --threads N approximate number of threads [8 on this system]
-h, --help print this help message to stderr and exit
usage: vg ids [options] <graph1.vg> [graph2.vg ...] >new.vg
options:
-c, --compact minimize the space of integers used by the ids
-i, --increment N increase ids by N
-d, --decrement N decrease ids by N
-j, --join make a joint ID space for all supplied graphs
by iterating through the supplied graphs and incrementing
their ids to be non-conflicting (modifies original files)
-m, --mapping FILE create an empty node mapping for vg prune
-s, --sort assign new node IDs in generalized topological sort order
-h, --help print this help message to stderr and exit
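To illustrate what a joint ID space means for `-j`, here is a small Python sketch that computes a per-graph increment so that no two graphs share a node ID. It is an assumption about one reasonable strategy, not a description of vg's implementation; `join_id_spaces` is a hypothetical name and graphs are modeled as plain lists of node IDs.

```python
def join_id_spaces(graphs):
    """Return one ID increment per graph so that all ID ranges are disjoint.

    graphs: list of lists of positive integer node IDs, in the order supplied.
    """
    increments = []
    max_id = 0  # largest ID assigned so far across all earlier graphs
    for ids in graphs:
        if not ids:
            increments.append(0)
            continue
        # Shift this graph just far enough that its smallest ID exceeds max_id.
        inc = max(0, max_id + 1 - min(ids))
        increments.append(inc)
        max_id = max(ids) + inc
    return increments
```

Applying the increments to each graph corresponds to what `-i N` does for a single graph.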
usage: vg index [options] <graph1.vg> [graph2.vg ...]
Creates an index on the specified graph or graphs. All graphs indexed must
already be in a joint ID space.
general options:
-h, --help print this help message to stderr and exit
-b, --temp-dir DIR use DIR for temporary files
-t, --threads N number of threads to use
-p, --progress show progress
xg options:
-x, --xg-name FILE use this file to store a succinct, queryable version
of graph(s), or read for GCSA or distance indexing
-L, --xg-alts include alt paths in xg
gcsa options:
-g, --gcsa-out FILE output a GCSA2 index to the given file
-f, --mapping FILE use this node mapping in GCSA2 construction
-k, --kmer-size N index kmers of size N in the graph [16]
-X, --doubling-steps N use N doubling steps for GCSA2 construction [4]
-Z, --size-limit N limit temp disk space usage to N GB [2048]
-V, --verify-index validate the GCSA2 index using the input kmers
(important for testing)
gam indexing options:
-l, --index-sorted-gam input is sorted .gam format alignments,
store a GAI index of the sorted GAM in INPUT.gam.gai
vg in-place indexing options:
--index-sorted-vg input is ID-sorted .vg format graph chunks
store a VGI index of the sorted vg in INPUT.vg.vgi
snarl distance index options:
-j, --dist-name FILE use this file to store a snarl-based distance index
--snarl-limit N don't store distances for snarls > N nodes [10000]
if 0 then don't store distances, only the snarl tree
--no-nested-distance only store distances along the top-level chain
-w, --upweight-node N upweight the node with ID N to push it to be part
of a top-level chain (may repeat)
usage: vg inject -x graph.xg [options] input.[bam|sam|cram] >output.gam
options:
-x, --xg-name FILE use this graph or xg index (required, non-XG okay)
-i, --add-identity calculate & add 'identity' statistic to output GAM
-r, --rescore re-score alignments
-o, --output-format NAME output alignment format {gam / gaf / json} [gam]
-t, --threads N number of threads to use
-h, --help print this help message to stderr and exit
usage: vg map [options] -d idxbase -f in1.fq [-f in2.fq] >aln.gam
Align reads to a graph.
graph/index:
-d, --base-name BASE use BASE.xg and BASE.gcsa as input indexes
-x, --xg-name FILE use this xg index or graph [<graph>.vg.xg]
-g, --gcsa-name FILE use this GCSA2 index [<graph>.gcsa]
-1, --gbwt-name FILE use this GBWT haplotype index [<graph>.gbwt]
algorithm:
-t, --threads N number of compute threads to use
-k, --min-mem INT minimum MEM length (if 0 estimate via -e) [0]
-e, --mem-chance FLOAT this fraction of -k length hits
will be by chance [5e-4]
-c, --hit-max N ignore MEMs that have >N hits in our index
(0 for no limit) [2048]
-Y, --max-mem INT ignore MEMs longer than INT (unset if 0) [0]
-r, --reseed-x FLOAT look for internal seeds inside a seed
longer than FLOAT*--min-mem [1.5]
-u, --try-up-to INT attempt to align up to the INT best candidate
chains of seeds (1/2 for paired) [128]
-l, --try-at-least INT attempt to align at least the INT best
candidate chains of seeds [1]
-E, --approx-mq-cap INT weight MQ by suffix tree based estimate
when estimate less than INT [0]
-7, --id-mq-weight N scale mapping quality by the alignment score
identity to this power [2]
-W, --min-chain INT discard a chain if seeded bases are
shorter than INT [0]
-C, --drop-chain FLOAT drop chains shorter than FLOAT fraction of
the longest overlapping chain [0.45]
-n, --mq-overlap FLOAT scale MQ by count of alignments with FLOAT
overlap in the query with the primary [0]
-P, --min-ident FLOAT accept alignment only if the alignment
identity is >= FLOAT [0]
-H, --max-target-x N skip cluster subgraphs with
length > N*read_length [100]
-w, --band-width INT band width for long read alignment [256]
-O, --band-overlap INT band overlap for long read alignment [{-w}/8]
-J, --band-jump INT the maximum number of bands of insertion we
consider in the alignment chain model [128]
-B, --band-multi INT consider this many alignments of each band
in banded alignment [16]
-Z, --band-min-mq INT treat bands with < INT MQ as unaligned [0]
-I, --fragment STR fragment length distribution specification
STR=m:μ:σ:o:d [5000:0:0:0:1]
max:mean:stdev:orientation (1=same/0=flip):
direction (1=forward, 0=backward)
-U, --fixed-frag-model don't learn the pair fragment model online,
use -I without update
-p, --print-frag-model suppress alignment output and print the
fragment model on stdout as per -I format
-4, --frag-calc INT update the fragment model
every INT perfect pairs [10]
-3, --fragment-x FLOAT calculate max fragment size as
frag_mean+frag_sd*FLOAT [10]
-0, --mate-rescues INT attempt up to INT mate rescues per pair [64]
-S, --unpaired-cost INT penalty for an unpaired read pair [17]
-8, --no-patch-aln do not patch banded alignments by
locally aligning unaligned regions
--xdrop-alignment use X-drop heuristic
(much faster for long-read alignment)
--max-gap-length INT maximum gap length allowed in each contiguous
alignment (for X-drop alignment) [40]
scoring:
-q, --match INT use this match score [1]
-z, --mismatch INT use this mismatch penalty [4]
--score-matrix FILE use this 4x4 integer substitution scoring
matrix (in the order ACGT)
-o, --gap-open INT use this gap open penalty [6]
-y, --gap-extend INT use this gap extension penalty [1]
-L, --full-l-bonus INT the full-length alignment bonus [5]
-2, --drop-full-l-bonus remove the full length bonus from the score
before sorting and MQ calculation
-a, --hap-exp FLOAT the exponent for haplotype consistency
likelihood in alignment score [1]
--recombination-penalty NUM use this log recombination penalty
for GBWT haplotype scoring [20.7]
-A, --qual-adjust perform base quality adjusted alignments
(requires base quality input)
preset:
-m, --alignment-model STR use a preset alignment scoring model, either
"short" (default) or "long" (ONT/PacBio)
"long" is equivalent to
`-u 2 -L 63 -q 1 -z 2 -o 2 -y 1 -w 128 -O 32`
input:
-s, --sequence STR align a string to the graph in graph.vg
using partial order alignment
-V, --seq-name STR name the sequence STR
(for graph modification with new named paths)
-T, --reads FILE take reads (one per line) from FILE,
write alignments to stdout
-b, --hts-input FILE align reads from stdin htslib-compatible FILE
(BAM/CRAM/SAM), alignments to stdout
-G, --gam-input FILE realign GAM input
-f, --fastq FILE input FASTQ or (2-line format) FASTA, maybe
compressed; two allowed, one for each mate
-F, --fasta FILE align the sequences in a FASTA file that may
have multiple lines per reference sequence
--comments-as-tags interpret comments in name lines as SAM-style
tags and annotate alignments with them
-i, --interleaved FASTQ or GAM is interleaved paired-ended
-N, --sample NAME for --reads input, add this sample
-R, --read-group NAME for --reads input, add this read group
output:
-j, --output-json output JSON rather than an alignment stream
(helpful for debugging)
-%, --gaf output alignments in GAF format
-5, --surject-to TYPE surject the output into the graph's paths,
writing TYPE {bam, sam, cram}
--ref-paths FILE ordered list of paths in graph, one per line
or HTSlib .dict, for HTSLib @SQ headers
--ref-name NAME reference assembly in graph for HTSlib output
-9, --buffer-size INT buffer this many alignments together
before outputting in GAM [512]
-X, --compare realign -G GAM input, writing alignment with
"correct" field set to overlap with input
-v, --refpos-table for efficient testing output a table of
name, chr, pos, mq, score
-K, --keep-secondary produce alignments for secondary input
alignments in addition to primary ones
-M, --max-multimaps INT produce up to INT alignments per read [1]
-Q, --mq-max INT cap the mapping quality at INT [60]
--exclude-unaligned exclude reads with no alignment
-D, --debug print debugging information to stderr
-^, --log-time print runtime to stderr
-h, --help print this help message to stderr and exit
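The `-I` fragment specification packs five fields into one colon-separated string, and `-3` derives the maximum fragment size from the mean and standard deviation. A minimal Python sketch of both, assuming the field order `max:mean:stdev:orientation:direction` given in the help text; the class and function names are hypothetical and this is not vg's parser.

```python
from dataclasses import dataclass


@dataclass
class FragmentModel:
    max_len: int            # m: maximum fragment length
    mean: float             # mu: mean fragment length
    stdev: float            # sigma: standard deviation
    same_orientation: bool  # o: 1 = same, 0 = flip
    forward: bool           # d: 1 = forward, 0 = backward


def parse_fragment_spec(spec):
    """Parse a string such as the default '5000:0:0:0:1'."""
    parts = spec.split(":")
    if len(parts) != 5:
        raise ValueError("expected five colon-separated fields: m:mu:sigma:o:d")
    m, mu, sigma, o, d = parts
    return FragmentModel(int(m), float(mu), float(sigma), o == "1", d == "1")


def max_fragment_size(mean, stdev, x=10.0):
    """frag_mean + frag_sd * FLOAT, as described for --fragment-x."""
    return mean + stdev * x
```

For example, a library with mean 300 bp and standard deviation 50 bp gives a maximum fragment size of 800 bp at the default multiplier of 10.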
usage: vg minimizer [options] -d graph.dist -o graph.min graph
Builds a (w, k)-minimizer index or a (k, s)-syncmer index of the threads in the
GBWT. The graph can be any HandleGraph, which will be made into a GBWTGraph.
The transformation can be avoided by providing a GBWTGraph or a GBZ graph.
Required options:
-d, --distance-index FILE annotate hits with positions in this distance index
-o, --output-name FILE store the index in a file
Minimizer options:
-k, --kmer-length N length of the kmers in the index [29] (max 31)
-w, --window-length N choose minimizer from a window of N kmers [11]
-c, --closed-syncmers index closed syncmers instead of minimizers
-s, --smer-length N use smers of length N in closed syncmers [18]
Weighted minimizers:
-W, --weighted use weighted minimizers
--threshold N downweight kmers with more than N hits [500]
--iterations N downweight frequent kmers by N iterations [3]
--fast-counting use the fast kmer counting algorithm (default)
--save-memory use the space-efficient kmer counting algorithm
--hash-table N use 2^N-cell hash tables for kmer counting
(default: guess)
Other options:
-z, --zipcode-name FILE store the distances that are too big in a file
if no -z, some distances may be discarded
-l, --load-index FILE load this index and insert the new kmers into it
(overrides minimizer / weighted minimizer options)
-g, --gbwt-name FILE use this GBWT index (required with a non-GBZ graph)
-E, --rec-mode use recombination-aware MinimizerIndex
-p, --progress show progress information
-t, --threads N use N threads for index construction [8]
(using more than 16 threads rarely helps)
--no-dist build the index without distance index annotations
(not recommended)
-h, --help print this help message to stderr and exit
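As a rough illustration of the two sampling schemes named above, the Python sketch below selects (w, k)-minimizers and tests for closed syncmers on a plain string. It is simplified under several assumptions: kmers are ordered lexicographically rather than by hash value, reverse complements are ignored, and the input is a single sequence instead of GBWT haplotype paths. The function names are hypothetical.

```python
def minimizers(seq, k, w):
    """Return (offset, kmer) for the smallest kmer in each window of w kmers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        # Break ties by taking the leftmost occurrence in the window.
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        picked.add(best)
    return [(i, kmers[i]) for i in sorted(picked)]


def is_closed_syncmer(kmer, s):
    """A kmer is a closed syncmer if its smallest smer is its first or last smer."""
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    smallest = min(smers)
    return smers[0] == smallest or smers[-1] == smallest
```

Minimizers guarantee at least one selected kmer per window of `w` consecutive kmers, while the syncmer test depends only on the kmer itself and not on its neighbors.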
usage: vg mod [options] <graph.vg> >[mod.vg]
Modifies a graph and writes the modified graph to stdout.
options:
-c, --compact-ids sort and compact the ID space (default: no)
-b, --break-cycles break graph cycles with approximate topological sort
-n, --normalize normalize graph so edges are always non-redundant
(nodes have unique starting and ending bases relative
to neighbors, edges that do not introduce new paths
are removed, and neighboring nodes are merged)
-U, --until-normal N iterate normalization at most N times
-z, --nomerge-pre STR do not let normalize (-n/-U) zip up any pair of nodes
that both belong to path with prefix STR
-E, --unreverse-edges flip doubly-reversing edges so that they are
represented on the forward strand of the graph
-s, --simplify remove redundancy from the graph
that will not change its path space
-d, --dagify-step N copy strongly connected components of graph N times,
forwarding edges from old to new copies
to convert the graph into a DAG
-w, --dagify-to N copy strongly connected components of the graph,
forwarding edges from old to new copies
to convert the graph into a DAG
until shortest path through each SCC is N bases long
-L, --dagify-len-max N stop a dagification step if the unrolling component
has this much sequence
-f, --unfold N represent inversions accessible up to N from
the forward component of the graph
-O, --orient-forward orient the nodes in the graph forward
-N, --remove-non-path keep only nodes and edges which are part of paths
-A, --remove-path keep only nodes and edges which aren't part of a path
-k, --keep-path NAME keep only nodes and edges in the path (may repeat)
-V, --invert-keep-path keep only nodes and edges in paths not passed to -k
-R, --remove-null remove nodes with no sequence, forwarding their edges
-g, --subgraph ID gets the subgraph rooted at node ID (may repeat)
-x, --context N steps the subgraph out by N steps [1]
-p, --prune-complex remove nodes that are reached by paths of --length
which cross more than --edge-max edges
-S, --prune-subgraphs remove subgraphs which are shorter than --length
-l, --length N for pruning complex regions and short subgraphs
-X, --chop N chop nodes in the graph so they are <=N bp long
-u, --unchop where two nodes are only connected to each other and
by only one edge, replace the pair with a single node
that is the concatenation of their labels
-e, --edge-max N consider paths which make edge choices at <= N points
-M, --max-degree N unlink nodes that have edge degree greater than N
-m, --markers join all head and tails nodes to marker nodes
(### starts and $$$ ends) of --length, for debugging
-y, --destroy-node ID remove node with given id
-a, --cactus convert to cactus graph representation
-v, --sample-vcf FILE for a graph with allele paths,
compute the sample graph from the given VCF
-G, --sample-graph FILE subset augmented graph to sample graph via Locus file
-t, --threads N for parallel tasks, use this many threads
-h, --help print this help message to stderr and exit
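The effect of `-X, --chop N` on a single node's sequence can be pictured with a short Python sketch. It assumes pieces are cut greedily from the start of the sequence, which is an illustration only; the exact piece boundaries vg chooses may differ, and `chop` here is a hypothetical helper rather than vg's code.

```python
def chop(sequence, max_len):
    """Split one node sequence into consecutive pieces of at most max_len bp."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [sequence[i:i + max_len] for i in range(0, len(sequence), max_len)]
```

Concatenating the pieces in order restores the original sequence, which mirrors what `-u, --unchop` does for a chain of nodes joined by single edges.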
usage: vg mpmap [options] -x graph.xg -g index.gcsa [-f reads1.fq [-f reads2.fq] | -G reads.gam] > aln.gamp
Multipath align reads to a graph.
basic options:
-h, --help print this help message to stderr and exit
graph/index:
-x, --graph-name FILE graph (required; XG recommended but other formats
are acceptable: see `vg convert`)
-g, --gcsa-name FILE use this GCSA2 (FILE) & LCP (FILE.lcp) index pair
for MEMs (required; see `vg index`)
-d, --dist-name FILE use this snarl distance index for clustering
(recommended, see `vg index`)
-s, --snarls FILE align to alternate paths in these snarls
(unnecessary if providing -d, see `vg snarls`)
input:
-f, --fastq FILE input FASTQ (possibly gzipped), can be given twice
for paired ends (for stdin use -)
-i, --interleaved input contains interleaved paired ends
-C, --comments-as-tags interpret comments in name lines as SAM-style tags
and annotate alignments with them
algorithm presets:
-n, --nt-type TYPE sequence type preset: 'DNA' for genomic data,
'RNA' for transcriptomic data [RNA]
-l, --read-length TYPE read length preset: {very-short, short, long}
(approx. <50bp, 50-500bp, and >500bp) [short]
-e, --error-rate TYPE error rate preset: {low, high}
(approx. PHRED >20 and <20) [low]
output:
-F, --output-fmt TYPE format to output alignments in:
'GAMP' for multipath alignments,
'GAM'/'GAF' for single-path alignments,
'SAM'/'BAM'/'CRAM' for linear reference alignments
(may also require -S) [GAMP]
-S, --ref-paths FILE paths in graph are 1) one per line in a text file
or 2) in an HTSlib .dict, to treat as
reference sequences for HTSlib formats (see -F)
[all reference paths, all generic paths]
--ref-name NAME reference assembly in graph to use for
HTSlib formats (see -F) [all references]
-N, --sample NAME add this sample name to output
-R, --read-group NAME add this read group to output
-p, --suppress-progress do not report progress to stderr
computational parameters:
-t, --threads INT number of compute threads to use [all available]
advanced options:
algorithm:
-X, --not-spliced do not form spliced alignments, even with -n RNA
-M, --max-multimaps INT report up to INT mappings per read [10 RNA / 1 DNA]
-a, --agglomerate-alns combine separate multipath alignments into
one (possibly disconnected) alignment
-r, --intron-distr FILE intron length distribution
(from scripts/intron_length_distribution.py)
-Q, --mq-max INT cap mapping quality estimates at this much [60]
-b, --frag-sample INT look for INT unambiguous mappings to
estimate the fragment length distribution [1000]
-I, --frag-mean FLOAT mean for pre-determined fragment length distribution
(also requires -D)
-D, --frag-stddev FLOAT standard deviation for pre-determined fragment
length distribution (also requires -I)
-G, --gam-input FILE input GAM (for stdin, use -)
-u, --map-attempts INT perform up to INT mappings per read (0 for no limit)
[24 paired / 64 unpaired]
-c, --hit-max INT use at most this many hits for any match seeds
(0 for no limit) [1024 DNA / 100 RNA]
scoring:
-A, --no-qual-adjust do not perform base quality adjusted alignments
even when base qualities are available
-q, --match INT use INT match score [1]
-z, --mismatch INT use INT mismatch penalty [4 low error, 1 high error]
-o, --gap-open INT use INT gap open penalty [6 low error, 1 high error]
-y, --gap-extend INT use INT gap extension penalty [1]
-L, --full-l-bonus INT add INT score to alignments that align each
end of the read [mismatch+1 short, 0 long]
-w, --score-matrix FILE use this 4x4 integer substitution scoring matrix
(in the order ACGT)
-m, --remove-bonuses remove full length alignment bonus in reported score
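Several `vg mpmap` defaults depend on the presets chosen with `-n`, `-l`, and `-e`. The Python sketch below collects only the bracketed defaults stated in the help text into one lookup. `mpmap_defaults` is a hypothetical helper; the presets may adjust other internal parameters that the help text does not list, and no full-length bonus is stated for `very-short`, so it is left as `None`.

```python
def mpmap_defaults(nt_type="RNA", read_length="short", error_rate="low", paired=False):
    """Resolve the preset-dependent defaults listed in the mpmap help text."""
    if nt_type not in ("DNA", "RNA"):
        raise ValueError("nt_type must be 'DNA' or 'RNA'")
    if read_length not in ("very-short", "short", "long"):
        raise ValueError("read_length must be very-short, short, or long")
    if error_rate not in ("low", "high"):
        raise ValueError("error_rate must be low or high")

    mismatch = 4 if error_rate == "low" else 1
    gap_open = 6 if error_rate == "low" else 1
    if read_length == "short":
        full_length_bonus = mismatch + 1
    elif read_length == "long":
        full_length_bonus = 0
    else:
        full_length_bonus = None  # not stated in the help text for very-short

    return {
        "match": 1,
        "mismatch": mismatch,
        "gap_open": gap_open,
        "gap_extend": 1,
        "full_length_bonus": full_length_bonus,
        "max_multimaps": 10 if nt_type == "RNA" else 1,
        "hit_max": 100 if nt_type == "RNA" else 1024,
        "map_attempts": 24 if paired else 64,
    }
```

With no arguments this reproduces the out-of-the-box presets (`RNA`, `short`, `low`), giving a full-length bonus of 5.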
usage: vg pack [options]
options:
-x, --xg FILE use this basis graph (does not have to be xg format)
-o, --packs-out FILE write compressed coverage packs to this output file
-i, --packs-in FILE begin by summing coverage packs from each provided FILE
-g, --gam FILE read alignments from this GAM file ('-' for stdin)
-a, --gaf FILE read alignments from this GAF file ('-' for stdin)
-d, --as-table write table on stdout representing packs
-D, --as-edge-table write table on stdout representing edge coverage
-u, --as-qual-table write table on stdout representing average node mapqs
-e, --with-edits record and write edits
rather than only recording graph-matching coverage
-b, --bin-size N number of sequence bases per CSA bin [inf]
-n, --node ID write table for only specified node(s)
-N, --node-list FILE white space or line delimited list of nodes to collect
-Q, --min-mapq N ignore reads with MAPQ < N
and positions with base quality < N [0]
-c, --expected-cov N expected coverage. used only for memory tuning [128]
-s, --trim-ends N ignore the first and last N bases of each read
-t, --threads N use N threads [numCPUs]
-h, --help print this help message to stderr and exit
usage: vg paths [options]
-h, --help print this help message to stderr and exit
input:
-x, --xg FILE use the paths and haplotypes in this graph FILE
Supports GBZ haplotypes. (also accepts -v, --vg)
-g, --gbwt FILE use the threads in the GBWT index in FILE
(graph also required for most output options;
-g takes priority over -x)
output graph (.vg format):
-V, --extract-vg output a path-only graph covering the selected paths
-d, --drop-paths output a graph with the selected paths removed
-r, --retain-paths output a graph with only the selected paths retained
-n, --normalize-paths output a graph where equivalent paths in a site are
merged (using selected paths to snap to if possible)
output path data:
-X, --extract-gam print (as GAM alignments) stored paths in the graph
-A, --extract-gaf print (as GAF alignments) stored paths in the graph
-L, --list print (one per line) path (or thread) names
-E, --lengths print a list of path names (as with -L)
but paired with their lengths
-M, --metadata print a table of path names and their metadata
-C, --cyclicity print a list of path names (as with -L)
but paired with flag denoting the cyclicity
-F, --extract-fasta print the paths in FASTA format
-c, --coverage print the coverage stats for selected paths
(not including cycles)
path selection:
-p, --paths-file FILE select paths named in a file (one per line)
-Q, --paths-by STR select paths with the given name prefix
-S, --sample STR select haplotypes or reference paths for this sample
-a, --variant-paths select variant paths added by 'vg construct -a'
-G, --generic-paths select generic, non-reference, non-haplotype paths
-R, --reference-paths select reference paths
-H, --haplotype-paths select haplotype paths
configuration:
-o, --overlay apply a ReferencePathOverlayHelper to the graph
-t, --threads N number of threads to use [all available]
applies only to snarl finding within -n
usage: vg prune [options] <graph.vg> >[output.vg]
Prunes the complex regions of the graph for GCSA2 indexing.
Pruning the graph removes embedded paths.
Pruning parameters:
-k, --kmer-length N kmer length used for pruning
defaults: 24 with -P; 24 with -r; 24 with -u
-e, --edge-max N remove the edges on kmers making > N edge choices
defaults: 3 with -P; 3 with -r; 3 with -u
-s, --subgraph-min N remove subgraphs of < N bases
defaults: 33 with -P; 33 with -r; 33 with -u
-M, --max-degree N if N > 0, remove nodes with degree > N before pruning
defaults: 0 with -P; 0 with -r; 0 with -u
Pruning modes (-P, -r, and -u are mutually exclusive):
-P, --prune simply prune the graph (default)
-r, --restore-paths restore the edges on non-alt paths
-u, --unfold-paths unfold non-alt paths and GBWT threads
-v, --verify-paths verify that the paths exist after pruning
(potentially very slow)
Unfolding options:
-g, --gbwt-name FILE unfold the threads from this GBWT index
-m, --mapping FILE store node mapping for duplicates (required with -u)
-a, --append-mapping append to the existing node mapping
Other options:
-p, --progress show progress
-t, --threads N use N threads [8]
-d, --dry-run determine the validity of the combination of options
-h, --help print this help message to stderr and exit
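The help text states two constraints on pruning modes: `-P`, `-r`, and `-u` are mutually exclusive, and `-m` is required with `-u`. A minimal Python sketch of checking just those two rules before launching a job follows. `check_prune_options` is a hypothetical helper; `vg prune --dry-run` is the tool's own validity check and may enforce more than this.

```python
def check_prune_options(prune=False, restore_paths=False, unfold_paths=False, mapping=None):
    """Return the effective mode flag, or raise ValueError on an invalid combination."""
    chosen = [flag for flag, on in (("-P", prune), ("-r", restore_paths), ("-u", unfold_paths)) if on]
    if len(chosen) > 1:
        raise ValueError("-P, -r, and -u are mutually exclusive: " + " ".join(chosen))
    mode = chosen[0] if chosen else "-P"  # -P is the default mode
    if mode == "-u" and mapping is None:
        raise ValueError("-m/--mapping is required with -u")
    return mode
```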
usage: vg rna [options] graph.[vg|pg|hg|gbz] > splicing_graph.[vg|pg|hg]
General options:
-t, --threads INT number of compute threads to use [1]
-p, --progress show progress
-h, --help print this help message to stderr and exit
Input options:
-n, --transcripts FILE transcript file(s) in gtf/gff format (may repeat)
-m, --introns FILE intron file(s) in bed format (may repeat)
-y, --feature-type NAME parse only this feature type in the GTF/GFF
(parses all if empty) [exon]
-s, --transcript-tag NAME use this attribute tag in the GTF/GFF file(s) as ID
to group exons and name paths [transcript_id]
-l, --haplotypes FILE project transcripts onto haplotypes in GBWT index
-z, --gbz-format input graph is GBZ format (has graph & GBWT index)
Construction options:
-j, --use-hap-ref use haplotype paths in GBWT index as references
(disables projection)
-e, --proj-embed-paths project transcripts onto embedded haplotype paths
-c, --path-collapse TYPE collapse identical transcript paths across
no|haplotype|all paths [haplotype]
-k, --max-node-length INT chop nodes longer than INT (disable with 0) [0]
-d, --remove-non-gene remove intergenic and intronic regions
(deletes all paths in the graph)
-o, --do-not-sort do not topologically sort and compact the graph
DON'T FORGET TO EMBED PATHS:
-r, --add-ref-paths add reference transcripts as embedded paths
-a, --add-hap-paths add projected transcripts as embedded paths
Output options:
-b, --write-gbwt FILE write pantranscriptome transcript paths as GBWT
-v, --write-hap-gbwt FILE write input haplotypes as a GBWT
with node IDs matching the output graph
-f, --write-fasta FILE write pantranscriptome transcript sequences to here
-i, --write-info FILE write pantranscriptome transcript info table as TSV
-q, --out-exclude-ref exclude reference transcripts from pantranscriptome
-g, --gbwt-bidirectional use bidirectional paths in GBWT index construction
usage: vg sim [options]
Samples sequences from the xg-indexed graph.
basic options:
-h, --help print this help message to stderr and exit
-x, --xg-name FILE use the graph in FILE (required)
-n, --num-reads N simulate N reads or read pairs
-l, --read-length N simulate reads of length N
-r, --progress show progress information
output options:
-a, --align-out write alignments in GAM-format
-q, --fastq-out write reads in FASTQ format
-J, --json-out write alignments in JSON-format GAM (implies -a)
--multi-position annotate with multiple reference positions
simulation parameters:
-F, --fastq FILE match the error profile of NGS reads in FILE,
repeat for paired reads (ignores -l,-f)
-I, --interleaved reads in FASTQ (-F) are interleaved read pairs
-s, --random-seed N use this specific seed for the PRNG
-e, --sub-rate FLOAT base substitution rate [0.0]
-i, --indel-rate FLOAT indel rate [0.0]
-d, --indel-err-prop FLOAT proportion of trained errors from -F
that are indels [0.01]
-S, --scale-err FLOAT scale trained error probs from -F by FLOAT [1.0]
-f, --forward-only don't simulate from the reverse strand
-p, --frag-len N make paired end reads with fragment length N
-v, --frag-std-dev FLOAT use this standard deviation
for fragment length estimation
-N, --allow-Ns allow reads to be sampled with Ns in them
--max-tries N attempt sampling operations up to N times [100]
-t, --threads N number of compute threads (only when using -F) [1]
simulate from paths:
-P, --path NAME simulate from this path
(may repeat; cannot also give -T)
-A, --any-path simulate from any path (overrides -P)
-m, --sample-name NAME simulate from this sample (may repeat)
-R, --ploidy-regex RULES use this comma-separated list of colon-delimited
REGEX:PLOIDY rules to assign ploidies to contigs
not visited by the selected samples, or to all
contigs simulated from if no samples are used.
Unmatched contigs get ploidy 2
-g, --gbwt-name FILE use samples from this GBWT index
-T, --tx-expr-file FILE simulate from an expression profile formatted as
RSEM output (cannot also give -P)
-H, --haplo-tx-file FILE transcript origin info table from vg rna -i
(required for -T on haplotype transcripts)
-u, --unsheared sample from unsheared fragments
-E, --path-pos-file FILE output a TSV with sampled position on path
of each read (requires -F)
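The `-R, --ploidy-regex` argument is a comma-separated list of `REGEX:PLOIDY` rules, with unmatched contigs getting ploidy 2. The Python sketch below shows one way such a rule string could be interpreted. It assumes integer ploidies, that the first matching rule wins, and that a rule must match the whole contig name; none of these details are stated in the help text, and the function names are hypothetical.

```python
import re


def parse_ploidy_rules(rules):
    """Parse 'REGEX:PLOIDY,REGEX:PLOIDY,...' into (compiled regex, ploidy) pairs."""
    parsed = []
    for rule in rules.split(","):
        # Split on the last colon so the regex itself may contain colons.
        pattern, ploidy = rule.rsplit(":", 1)
        parsed.append((re.compile(pattern), int(ploidy)))
    return parsed


def ploidy_for(contig, rules, default=2):
    """Return the ploidy of the first rule matching the contig, else the default."""
    for pattern, ploidy in rules:
        if pattern.fullmatch(contig):
            return ploidy
    return default
```

A limitation of this sketch is that a regex containing a comma would be split incorrectly.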
usage: vg stats [options] [<graph file>]
options:
-z, --size size of graph
-N, --node-count number of nodes in graph
-E, --edge-count number of edges in graph
-l, --length length of sequences in graph
-L, --self-loops number of self-loops
-s, --subgraphs describe subgraphs of graph
-H, --heads list the head nodes of the graph
-T, --tails list the tail nodes of the graph
-e, --nondeterm list the nondeterministic edge sets
-c, --components print the strongly connected components of the graph
-A, --is-acyclic print if the graph is acyclic or not
-n, --node ID consider node with the given id
-d, --to-head show distance to head for each provided node
-t, --to-tail show distance to tail for each provided node
-a, --alignments FILE compute stats for reads aligned to the graph
-r, --node-id-range X:Y where X and Y are the smallest and largest
node id in the graph, respectively
-o, --overlap PATH for each overlapping path mapping in the graph write:
PATH, other_path, rank1, rank2
multiple allowed; limit comparison to those provided
-O, --overlap-all print overlap table for cartesian product of paths
-R, --snarls print statistics for each snarl
--snarl-contents print table of <snarl, depth, parent, node ids>
--snarl-sample NAME print out reference coordinates on given sample
-C, --chains print statistics for each chain
-F, --format graph type {VG-Protobuf, PackedGraph, HashGraph, XG}
Can't detect Protobuf if graph read from stdin
-D, --degree-dist print degree distribution of the graph.
-b, --dist-snarls FILE print sizes/depths of the snarls in distance index
-p, --threads N number of threads to use [all available]
-v, --verbose output longer reports
-P, --progress show progress
-h, --help print this help message to stderr and exit
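To make the terms used by `-H`, `-T`, and `-r` concrete, here is a Python sketch on a toy graph. It treats the graph as a simple directed graph given by `(from, to)` edge pairs, which is a simplification: vg graphs are bidirected, and this is not how `vg stats` is implemented. Both function names are hypothetical.

```python
def heads_and_tails(nodes, edges):
    """Heads have no incoming edges; tails have no outgoing edges."""
    has_incoming = {to for _, to in edges}
    has_outgoing = {frm for frm, _ in edges}
    heads = sorted(n for n in nodes if n not in has_incoming)
    tails = sorted(n for n in nodes if n not in has_outgoing)
    return heads, tails


def node_id_range(nodes):
    """Format the smallest and largest node ID as X:Y, as in --node-id-range."""
    return f"{min(nodes)}:{max(nodes)}"
```

For a simple bubble with nodes 1 to 4 and edges 1→2, 1→3, 2→4, 3→4, node 1 is the only head and node 4 the only tail.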
usage: vg surject [options] <aln.gam> >[proj.cram]
Transforms alignments to be relative to particular paths.
options:
-x, --xg-name FILE use this graph or xg index (required)
-t, --threads N number of threads to use
-p, --into-path NAME surject into this path or its subpaths (may repeat)
default: reference, then non-alt generic
-F, --into-paths FILE surject into path names listed in
HTSlib sequence dictionary or path list FILE
-n, --into-ref NAME surject into this reference assembly
-i, --interleaved GAM is interleaved paired-ended, so pair reads
when outputting HTS formats
-M, --multimap include secondary alignments to all
overlapping paths instead of just primary
-G, --gaf-input input file is GAF instead of GAM
-m, --gamp-input input file is GAMP instead of GAM
-c, --cram-output write CRAM to stdout
-b, --bam-output write BAM to stdout
-s, --sam-output write SAM to stdout
-u, --supplementary divide into supplementary alignments as necessary
-l, --subpath-local let the multipath mapping surjection produce local
(rather than global) alignments
-T, --max-tail-len N only align up to N bases of read tails [10000]
-g, --max-graph-scale X make reads unmapped if alignment target subgraph
size exceeds read length by a factor of X
(default: 819.2 or 134218 with -S)
-P, --prune-low-cplx prune short/low complexity anchors in realignment
-I, --max-slide N look for offset duplicates of anchors up to N bp
away when pruning (default: 6)
-a, --max-anchors N use <= N anchors per target path [unlimited]
-S, --spliced interpret long deletions against paths
as spliced alignments
-A, --qual-adj adjust scoring for base qualities, if available
-E, --extra-gap-cost N for dynamic programming, add N to the gap open cost
of the 10x-scaled scoring parameters
-N, --sample NAME set this sample name for all reads
-R, --read-group NAME set this read group for all reads
-f, --max-frag-len N reads with fragment lengths greater than N won't be
marked properly paired in SAM/BAM/CRAM
-L, --list-all-paths annotate SAM records with a list of all attempted
re-alignments to paths in SS tag
-H, --graph-aln annotate SAM records with cs-style difference string
of the pre-surjected graph alignment in GR tag
-C, --compression N level for compression [0-9]
-V, --no-validate skip checking whether alignments plausibly are
against the provided graph
-w, --watchdog-timeout N warn when reads take more than N seconds to surject
-r, --progress show progress
-h, --help print this help message to stderr and exit
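The `-g, --max-graph-scale` rule is simple arithmetic: a read becomes unmapped when its alignment target subgraph is more than X times the read length. A minimal Python sketch using the defaults quoted in the help text (819.2, or 134218 with `-S`) follows. The function name is hypothetical, and measuring the subgraph in base pairs is an assumption made for illustration.

```python
def exceeds_graph_scale(subgraph_bp, read_len, spliced=False, max_scale=None):
    """True if the target subgraph is too large relative to the read length."""
    if max_scale is None:
        # Defaults from the help text: 819.2, or 134218 with -S/--spliced.
        max_scale = 134218 if spliced else 819.2
    return subgraph_bp > read_len * max_scale
```

For a 150 bp read the default threshold is about 122,880 bp, so a 200,000 bp target subgraph would leave the read unmapped unless `-S` or a larger `-g` is used.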
usage: vg view [options] [ <graph.vg> | <graph.json> | <aln.gam> | <read1.fq> [<read2.fq>] ]
options:
-g, --gfa output GFA format (default)
-F, --gfa-in input GFA format, reducing overlaps if they occur
-v, --vg output VG format [DEPRECATED, use vg convert]
-V, --vg-in input VG format only
-j, --json output JSON format
-J, --json-in input JSON format (use with e.g. -a as necessary)
-c, --json-stream streaming conversion of a VG format graph
in line delimited JSON format
(this cannot be loaded directly via -J)
-G, --gam output GAM format (vg alignment format)
-Z, --translation-in input is a graph translation description
-t, --turtle output RDF/turtle format (cannot be loaded by VG)
-T, --turtle-in input turtle format.
-r, --rdf-base-uri URI set base uri for the RDF output
-a, --align-in input GAM format, or JSON version of GAM format
-A, --aln-graph GAM add alignments from GAM to the graph
-q, --locus-in input is Locus format, or JSON version of it
-z, --locus-out output is Locus format
-Q, --loci FILE input is Locus format for use by dot output
-d, --dot output dot format
-S, --simple-dot simple alignments & no node labels in dot output
-u, --noseq-dot show size instead of sequence in dot output
-e, --ascii-labels label paths/superbubbles with char/colors vs. emoji
-Y, --ultra-label label nodes with emoji/colors for ultrabubbles
-m, --skip-missing skip mappings to nodes not in the graph
when drawing alignments
-C, --color color nodes not in reference path (DOT OUTPUT ONLY)
-p, --show-paths show paths in dot output
-w, --walk-paths add labeled edges to represent paths in dot output
-n, --annotate-paths add labels to edges to represent paths in dot output
-M, --show-mappings with -p, print the mappings in each path in JSON
-I, --invert-ports invert edge ports in dot so that ne->nw is reversed
-s, --random-seed N use this seed for path symbols in dot output
-b, --bam input BAM or other htslib-parseable alignments
-f, --fastq-in input fastq (output defaults to GAM). Takes two
positional file arguments if paired
-X, --fastq-out output fastq (input defaults to GAM)
-i, --interleaved fastq is interleaved paired-ended
-L, --pileup output VG Pileup format
-l, --pileup-in input VG Pileup format, or JSON version of it
-B, --distance-in input distance index
-R, --snarl-in input VG Snarl format
-E, --snarl-traversal-in input VG SnarlTraversal format
-K, --multipath-in input VG MultipathAlignment format (GAMP),
or JSON version of it
-k, --multipath output VG MultipathAlignment format (GAMP)
-D, --expect-duplicates don't warn about duplicate nodes or edges
-x, --extract-tag TAG extract and concatenate messages with the given tag
--first only extract first message with the requested tag
--verbose explain the file being read with --extract-tag
-7, --threads N for parallel operations use this many threads [1]
-h, --help print this help message to stderr and exit
Bugs can be reported at: https://github.com/vgteam/vg/issues
For technical support, please visit: https://www.biostars.org/tag/vg/