vg manpage

% vg(1) | Variation Graph Toolkit

NAME

vg - variation graph tool, vg version v1.69.0 "Bologna".

DESCRIPTION

vg is a toolkit for variation graph data structures, interchange formats, alignment, genotyping, and variant calling methods.

For more in-depth explanations of tools and workflows, see the vg wiki page.

SYNOPSIS

This is an incomplete list of vg subcommands. For a complete list, run vg help.

Graph construction and indexing (see the wiki page for an overview of vg indexes)
- vg autoindex: automatically construct a graph and indexes for a specific workflow (e.g. giraffe, rpvg). wiki page
- vg construct: manually construct a graph from a reference and variants. wiki page
- vg index: manually build individual indexes (xg, distance, GCSA, etc). wiki page
- vg gbwt: manually build and manipulate GBWTs and indexes (GBWTgraph, GBZ, r-index). wiki page
- vg minimizer: manually build a minimizer index for mapping.
- vg haplotypes: haplotype sample a graph. Recommended for mapping with giraffe. wiki page
Read mapping
- vg giraffe: fast haplotype-aware short read alignment. wiki page
- vg mpmap: splice-aware multipath alignment of short reads. wiki page
- vg map: MEM-based read alignment. wiki page
Downstream analyses
- vg pack: convert alignments to a compact coverage index. Used with vg call
- vg call: call or genotype VCF variants. Uses vg pack. wiki page
- vg rna: construct splicing graphs and pantranscriptomes. wiki page. Also see rpvg
- vg deconstruct: create a VCF from variation in the graph. wiki page
Working with read alignments
- vg gamsort: sort a GAM/GAF file or index a sorted GAM file.
- vg filter: filter alignments by properties.
- vg surject: project alignments on a graph onto a linear reference (gam/gaf->bam/sam/cram).
- vg inject: project alignments on a linear reference onto a graph (bam/sam/cram->gam/gaf).
- vg sim: simulate reads from a graph. wiki page
Graph and read statistics
- vg stats: get stats about the graph.
- vg paths: get stats about the paths. wiki page
- vg gbwt: get stats about a GBWT.
- vg filter: get stats about alignments (use --tsv-out).
Manipulating a graph
- vg mod: filter, transform, and edit the graph.
- vg prune: prune the graph for GCSA2 indexing.
- vg ids: manipulate graph node ids.
- vg paths: manipulate paths in a graph.
- vg gbwt: manipulate GBWTs and associated indexes. wiki page
- vg annotate: annotate a graph or alignments.
Conversion between formats
- vg convert: convert between handle graph formats and GFA, and between alignment formats.
- vg view: convert between non-handle graph formats and alignment formats (dot, json, turtle...).
- vg surject: project alignments on a graph onto a linear reference (gam/gaf->bam/sam/cram).
- vg inject: project alignments on a linear reference onto a graph (bam/sam/cram->gam/gaf).
- vg paths: extract a fasta from a graph. wiki page
Subgraph extraction
- vg chunk: split a graph and/or alignment into smaller chunks.
- vg find: use an index to find nodes, edges, kmers, paths, or positions.

COMMANDS

annotate: annotate alignments with graphs and graphs with alignments

usage: vg annotate [options] >output.{gam,vg,tsv}
graph annotation options:
  -x, --xg-name FILE     xg index or graph to annotate (required)
  -b, --bed-name FILE    BED file to convert to GAM (may repeat)
  -f, --gff-name FILE    GFF3 file to convert to GAM (may repeat)
  -g, --ggff             output GGFF subgraph annotation file
                         instead of GAM (requires -s)
  -F, --gaf-output       output in GAF format rather than GAM
  -s, --snarls FILE      snarls to expand GFF intervals into
alignment annotation options:
  -a, --gam FILE         alignments to annotate (required)
  -x, --xg-name FILE     xg index of the graph against which the
                         alignments are aligned (required)
  -p, --positions        annotate alignments with reference positions
  -m, --multi-position   annotate alignments with multiple reference positions
  -l, --search-limit N   when annotating with -p, search this far for paths, or
                         -1 to not search [0 (auto from read length)]
  -b, --bed-name FILE    annotate alignments with overlapping region names
                         from this BED (may repeat)
  -n, --novelty          output TSV table with header
                         describing how much of each Alignment is novel
  -P, --progress         show progress
  -t, --threads N        use the specified number of threads
  -h, --help             print this help message to stderr and exit
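
As an illustration of the alignment annotation options above, here is a minimal sketch that adds reference positions to a GAM. The function name, the file names in the usage comment, and the VG variable are hypothetical conveniences, not part of vg; only the flags come from the listing above.

```shell
# Hypothetical wrapper. VG is a convenience variable, not part of vg:
# set VG=echo to print the command instead of running it.
VG="${VG:-vg}"

annotate_positions() {
  # $1 = graph or xg index, $2 = GAM file to annotate
  "$VG" annotate --xg-name "$1" --gam "$2" --positions
}

# Usage (file names are placeholders):
#   annotate_positions graph.xg aln.gam > annotated.gam
```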

autoindex: mapping tool-oriented index construction from interchange formats

usage: vg autoindex [options]
output:
  -p, --prefix PREFIX    prefix to use for all output [index]
  -w, --workflow NAME    workflow to produce indexes for (may repeat) [map]
                         {map, mpmap, rpvg, giraffe, sr-giraffe, lr-giraffe}
input data:
  -r, --ref-fasta FILE   FASTA file with the reference sequence (may repeat)
  -v, --vcf FILE         VCF file with sequence names matching -r (may repeat)
  -i, --ins-fasta FILE   FASTA file with sequences of INS variants from -v
  -g, --gfa FILE         GFA file to make a graph from
  -G, --gbz FILE         GBZ file to make indexes from
  -x, --tx-gff FILE      GTF/GFF file with transcript annotations (may repeat)
  -H, --hap-tx-gff FILE  GTF/GFF file with transcript annotations 
                         of a named haplotype (may repeat)
  -n, --no-guessing      do not guess that pre-existing files are indexes
                         i.e. force-regenerate any index not explicitly provided
configuration:
  -f, --gff-feature STR  GTF/GFF feature type (col. 3) to add to graph [exon]
  -a, --gff-tx-tag STR   GTF/GFF tag (in col. 9) for ID [transcript_id]
logging and computation:
  -T, --tmp-dir DIR      temporary directory to use for intermediate files
  -M, --target-mem MEM   target max memory usage (not exact, formatted INT[kMG])
                         [1/2 of available]
  -t, --threads NUM      number of threads [all available]
  -V, --verbosity NUM    log to stderr {0 = none, 1 = basic, 2 = debug} [1]
  -h, --help             print this help message to stderr and exit
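
A typical use of the options above might be building indexes for the giraffe workflow from a reference and a VCF. The sketch below assumes placeholder file names and a hypothetical wrapper function; it makes no claim about which index files are produced beyond their sharing the given prefix.

```shell
# Hypothetical wrapper. VG is a convenience variable, not part of vg:
# set VG=echo to print the command instead of running it.
VG="${VG:-vg}"

build_giraffe_indexes() {
  # $1 = reference FASTA, $2 = VCF with matching sequence names, $3 = output prefix
  "$VG" autoindex --workflow giraffe --ref-fasta "$1" --vcf "$2" --prefix "$3"
}

# Usage (file names are placeholders):
#   build_giraffe_indexes ref.fa variants.vcf.gz myindex
```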

call: call or genotype VCF variants

usage: vg call [options] <graph> > output.vcf
Call variants or genotype known variants

support calling options:
  -k, --pack FILE           supports created from vg pack for given input graph
  -m, --min-support M,N     min allele (M) and site (N) support to call [2,4]
  -e, --baseline-error X,Y  baseline error rates for Poisson model for small (X)
                            and large (Y) variants [0.005,0.01]
  -B, --bias-mode           use old ratio-based genotyping algorithm
                            as opposed to probabilistic model
  -b, --het-bias M,N        homozygous alt/ref allele must have >= M/N times
                            more support than the next best allele [6,6]
GAF options:
  -G, --gaf                 output GAF genotypes instead of VCF
  -T, --traversals          output all candidate traversals in GAF
                            without doing any genotyping
  -M, --trav-padding N      extend each flank of traversals (from -T) with
                            reference path by N bases if possible
general options:
  -v, --vcf FILE            VCF file to genotype (must have been used
                            to construct input graph with -a)
  -a, --genotype-snarls     genotype every snarl, including reference calls
                            (use to compare multiple samples)
  -A, --all-snarls          genotype all snarls, including nested child snarls
                            (like deconstruct -a)
  -c, --min-length N        genotype only snarls with
                            at least one traversal of length >= N
  -C, --max-length N        genotype only snarls where
                            all traversals have length <= N
  -f, --ref-fasta FILE      reference FASTA
                            (required if VCF has symbolic deletions/inversions)
  -i, --ins-fasta FILE      insertions (required if VCF has symbolic insertions)
  -s, --sample NAME         sample name [SAMPLE]
  -r, --snarls FILE         snarls (from vg snarls) to avoid recomputing.
  -g, --gbwt FILE           only call genotypes present in given GBWT index
  -z, --gbz                 only call genotypes present in GBZ index
                            (applies only if input graph is GBZ)
  -N, --translation FILE    node ID translation (from vg gbwt --translation)
                            to apply to snarl names in output
  -O, --gbz-translation     use the ID translation from the input GBZ and
                            apply it to snarl names/AT fields in output
  -p, --ref-path NAME       reference path to call on (may repeat; default all)
  -S, --ref-sample NAME     call on all paths with this sample
                            (cannot use with -p)
  -o, --ref-offset N        offset in reference path (may repeat; 1 per path)
  -l, --ref-length N        override reference length for output VCF contig
  -d, --ploidy N            ploidy of sample. {1, 2} [2]
  -R, --ploidy-regex RULES  use this comma-separated list of colon-delimited
                            REGEX:PLOIDY rules to assign ploidies to contigs
                            not visited by the selected samples, or to all
                            contigs simulated from if no samples are used.
                            Unmatched contigs get ploidy 2 (or that from -d).
  -n, --nested              activate nested calling mode (experimental)
  -I, --chains              call chains instead of snarls (experimental)
      --progress            show progress
  -t, --threads N           number of threads to use
  -h, --help                print this help message to stderr and exit
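
To show how the usage line and the support options fit together, the following sketch calls variants from a pack file. The wrapper name and file names are assumptions; the pack file is expected to have been created beforehand with vg pack for the same graph.

```shell
# Hypothetical wrapper. VG is a convenience variable, not part of vg:
# set VG=echo to print the command instead of running it.
VG="${VG:-vg}"

call_variants() {
  # $1 = graph, $2 = pack file (from vg pack), $3 = sample name for the VCF
  # Options come first and the graph last, following the usage line.
  "$VG" call --pack "$2" --sample "$3" "$1"
}

# Usage (file names are placeholders):
#   call_variants graph.xg aln.pack NA12878 > calls.vcf
```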

chunk: split graph or alignment into chunks

usage: vg chunk [options] > [chunk.vg]
Splits a graph and/or alignment into smaller chunks

Graph chunks are saved to .vg files, read chunks are saved to .gam files,
and haplotype annotations are saved to .annotate.txt files, of the form
<BASENAME>-<i>-<region name or "ids">-<start>-<length>.<ext>.
The BASENAME is specified with -b and defaults to "./chunk".

For a single-range chunk (-p or -r), the graph data is sent to
standard output instead of a file.

options:
  -x, --xg-name FILE            use this graph or xg index to chunk subgraphs
  -G, --gbwt-name FILE          use this GBWT haplotype index
                                for haplotype extraction (for -T)
  -a, --aln-name FILE           chunk alignments instead of graph (may repeat)
  -g, --aln-and-graph           when used in combination with -a,
                                both alignments and graph will be chunked
  -F, --in-gaf                  -a alignment is a sorted bgzipped GAF, not GAM
path chunking:
  -p, --path TARGET             write the chunk in the specified path range
                                (0-based inclusive, multiple allowed)
                                TARGET=path[:pos1[-pos2]] to standard output
  -P, --path-list FILE          for all paths in line separated file, 
                                write chunks for each as in -p
  -e, --input-bed FILE          write chunks for (0-based end-exclusive) regions
  -S, --snarls FILE             write given path-range(s) and all snarls 
                                fully contained in them, as alternative to -c
id range chunking:
  -r, --node-range N:M          write the chunk for this node range to stdout
  -R, --node-ranges FILE        write the chunk for each node range in
                                (newline or whitespace separated) file
  -n, --n-chunks N              generate N id-range chunks, determined via xg
simple alignment chunking:
  -m, --aln-split-size N        split alignments (-a, sort/index not required)
                                up into chunks with at most N reads each
component chunking:
  -C, --components              create a chunk for each connected component.
                                If targets given with (-p, -P, -r, -R),
                                limit to components containing them
  -M, --path-components         create a chunk for each path
                                in the graph's connected component
general:
  -s, --chunk-size N            create chunks spanning N bases
                                (or nodes with -r/-R) for all input regions.
  -o, --overlap N               overlap between chunks when using -s [0]
  -E, --output-bed FILE         write all created chunks to a bed file
  -b, --prefix BASENAME         write output chunk files [./chunk]
                                Files for chunk i will be named
                                <BASENAME>-<i>-<name>-<start>-<length>.<ext> 
  -c, --context-steps N         expand the context of the chunk N node steps [1]
  -l, --context-length N        expand the context of the chunk by N bp [0]
  -T, --trace                   trace haplotype threads in chunks
                                (and only expand forward from input coordinates)
                                Produces .annotate.txt file
                                with haplotype frequencies for each chunk.
      --no-embedded-haplotypes  don't load haplotypes from the graph. It is
                                possible to -T without any haplotypes available.
  -f, --fully-contained         only return GAM alignments that are
                                fully contained within chunk
  -u, --cut-alignments          cut alignments to be fully within the chunk
  -O, --output-fmt STR          output format {vg, pg, hg, gfa} [pg (vg for -T)]
  -t, --threads N               for parallel tasks, use this many threads [1]
  -h, --help                    print this help message to stderr and exit
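
As a sketch of single-range path chunking, the function below extracts one path range with some surrounding context. Because it uses a single -p target, the graph data goes to standard output, as described above. The wrapper name, file names, and coordinates are placeholders.

```shell
# Hypothetical wrapper. VG is a convenience variable, not part of vg:
# set VG=echo to print the command instead of running it.
VG="${VG:-vg}"

chunk_path_range() {
  # $1 = graph or xg index
  # $2 = TARGET in the form path[:pos1[-pos2]] (0-based inclusive)
  # $3 = number of node steps of context to add around the range
  "$VG" chunk --xg-name "$1" --path "$2" --context-steps "$3"
}

# Usage (names and coordinates are placeholders):
#   chunk_path_range graph.xg chr1:0-999 2 > region.chunk
```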

construct: graph construction

usage: vg construct [options] >new.vg
options:
construct from a reference and variant calls:
  -r, --reference FILE   input FASTA reference (may repeat)
  -v, --vcf FILE         input VCF (may repeat)
  -n, --rename V=F       match contig V in the VCFs to contig F in the FASTAs
                         (may repeat)
  -a, --alt-paths        save paths for alts of variants by SHA1 hash
  -A, --alt-paths-plain  save paths for alts of variants by variant ID
                         if possible, otherwise SHA1
                         (IDs must be unique across all input VCFs)
  -R, --region REGION    specify a VCF contig name or 1-based inclusive region
                         (may repeat, if on different contigs)
  -C, --region-is-chrom  don't attempt to parse the regions (use when reference
                         sequence name could be parsed as a region)
  -z, --region-size N    variants per region to parallelize [1024]
  -t, --threads N        use N threads to construct graph [numCPUs]
  -S, --handle-sv        include structural variants in construction of graph.
  -I, --insertions FILE  a FASTA file containing insertion sequences 
                         (referred to in VCF) to add to graph.
  -f, --flat-alts        don't chop up alternate alleles from input VCF
  -l, --parse-max N      don't chop up alternate alleles from input VCF
                         longer than N [100]
  -i, --no-trim-indels   don't remove the 1bp ref base from indel alt alleles
  -N, --in-memory        construct entire graph in memory before outputting it
construct from a multiple sequence alignment:
  -M, --msa FILE         input multiple sequence alignment
  -F, --msa-format STR   format of the MSA file {fasta, clustal} [fasta]
  -d, --drop-msa-paths   don't add paths for the MSA sequences into the graph
shared construction options:
  -m, --node-max N       limit maximum allowable node sequence size [32]
                         nodes greater than this threshold will be divided
                         note: nodes larger than ~1024 bp can't be GCSA2-indexed
  -p, --progress         show progress
  -h, --help             print this help message to stderr and exit
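
The reference-plus-variants mode above can be sketched as follows. The wrapper and file names are hypothetical; the node size falls back to 32, which is the default stated for -m.

```shell
# Hypothetical wrapper. VG is a convenience variable, not part of vg:
# set VG=echo to print the command instead of running it.
VG="${VG:-vg}"

construct_graph() {
  # $1 = reference FASTA, $2 = VCF
  # $3 = optional maximum node length (falls back to the documented default, 32)
  "$VG" construct --reference "$1" --vcf "$2" --node-max "${3:-32}"
}

# Usage (file names are placeholders):
#   construct_graph ref.fa variants.vcf.gz > graph.vg
```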

convert: convert graphs between handle-graph compliant formats as well as GFA

usage: vg convert [options] <input-graph>
input options:
  -g, --gfa-in               input in GFA format
  -r, --in-rgfa-rank N       import rgfa tags with rank <= N as paths [0]
  -b, --gbwt-in FILE         input graph is a GBWTGraph using the GBWT in FILE
      --ref-sample STR       change haplotypes for this sample to
                             reference paths (may repeat)
      --hap-locus STR        change generic paths with this locus
                             to haplotype paths (must be used with --new-sample)
      --new-sample STR       when using --hap-locus, give the new haplotype
                             this sample name (must be used with --hap-locus)
gfa input options (use with -g):
  -T, --gfa-trans FILE       write gfa id conversions to FILE
output options:
  -v, --vg-out               output in VG's original Protobuf format
                             [DEPRECATED: use -p instead].
  -a, --hash-out             output in HashGraph format
  -p, --packed-out           output in PackedGraph format (default)
  -x, --xg-out               output in XG format
  -f, --gfa-out              output in GFA format
  -H, --drop-haplotypes      do not include haplotype paths in the output
                             (useful with GBWTGraph / GBZ inputs)
gfa output options (use with -f):
  -P, --rgfa-path STR        write given path as rGFA tags instead of lines
                             (may repeat, only rank-0 supported)
  -Q, --rgfa-prefix STR      write paths with this prefix as rGFA tags instead
                             of lines (may repeat, only rank-0 supported)
  -B, --rgfa-pline           paths written as rGFA tags also written as lines
  -W, --no-wline             write all paths as GFA P-lines instead of W-lines.
                             allows handling multiple phase blocks 
                             and subranges used together.
      --gbwtgraph-algorithm  always use the GBWTGraph library GFA algorithm.
                             not compatible with other GFA output options
                             or non-GBWT graphs.
      --vg-algorithm         always use the VG GFA algorithm. Works with all
                             options and graph types, but can't preserve
                             original GFA coordinates
      --no-translation       when using the GBWTGraph algorithm, convert graph
                             directly to GFA; do not use the translation
                             to preserve original coordinates
alignment options:
  -G, --gam-to-gaf FILE      convert GAM FILE to GAF
  -F, --gaf-to-gam FILE      convert GAF FILE to GAM
general options:
  -t, --threads N            use N threads [numCPUs]
  -h, --help                 print this help message to stderr and exit
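
Two common conversions, graph to GFA and GAM to GAF, might look like the sketch below. Function and file names are assumptions; the flags and the position of the input graph follow the usage line and option list above.

```shell
# Hypothetical wrappers. VG is a convenience variable, not part of vg:
# set VG=echo to print the commands instead of running them.
VG="${VG:-vg}"

graph_to_gfa() {
  # $1 = input graph in a handle graph format
  "$VG" convert --gfa-out "$1"
}

gam_to_gaf() {
  # $1 = GAM file to convert, $2 = graph the reads were aligned to
  "$VG" convert --gam-to-gaf "$1" "$2"
}

# Usage (file names are placeholders):
#   graph_to_gfa graph.vg > graph.gfa
#   gam_to_gaf aln.gam graph.vg > aln.gaf
```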

deconstruct: create a VCF from variation in the graph

usage: vg deconstruct [options] [-p|-P] <PATH> <GRAPH>
Output VCF records for Snarls present in a graph (relative to a reference path).
options: 
  -p, --path NAME          a reference path to deconstruct against (may repeat).
  -P, --path-prefix NAME   all paths [minus GBWT threads / non-ref GBZ paths]
                           beginning with NAME used as reference (may repeat).
                           other non-ref paths not considered as samples. 
  -r, --snarls FILE        snarls file (from vg snarls) to avoid recomputing.
  -g, --gbwt FILE          consider alt traversals for GBWT haplotypes in FILE
                           (not needed for GBZ graph input).
  -T, --translation FILE   node ID translation (from vg gbwt --translation)
                           to apply to snarl names and AT fields in output
  -O, --gbz-translation    use the ID translation from the input GBZ and
                           apply it to snarl names and AT fields in output
  -a, --all-snarls         process all snarls, including nested snarls
                           (by default only top-level snarls reported).
  -c, --context-jaccard N  set context mapping size used to disambiguate alleles
                           at sites with multiple reference traversals [10000]
  -u, --untangle-travs     use context mapping for reference-relative positions
                           of each step in allele traversals (AP INFO field).
  -K, --keep-conflicted    retain conflicted genotypes in output.
  -S, --strict-conflicts   drop genotypes when we have more than one haplotype
                           for any given phase (set by default for GBWT input).
  -C, --contig-only-ref    only use CONTIG name (not SAMPLE#CONTIG#HAPLOTYPE)
                           for reference if possible (i.e. only one ref sample)
  -L, --cluster F          cluster traversals whose (handle) Jaccard coefficient
                           is >= F together [1.0; experimental]
  -n, --nested             write a nested VCF, plus special tags [experimental]
  -R, --star-allele        use *-alleles to denote alleles that span
                           but do not cross the site. Only works with -n
  -t, --threads N          use N threads
  -v, --verbose            print some status messages
  -h, --help               print this help message to stderr and exit
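
For example, producing a VCF relative to one reference path, including nested snarls, could be sketched like this. The wrapper name, path name, and file names are placeholders.

```shell
# Hypothetical wrapper. VG is a convenience variable, not part of vg:
# set VG=echo to print the command instead of running it.
VG="${VG:-vg}"

deconstruct_path() {
  # $1 = name of the reference path, $2 = graph
  # --all-snarls also reports nested snarls, not only top-level ones.
  "$VG" deconstruct --path "$1" --all-snarls "$2"
}

# Usage (names are placeholders):
#   deconstruct_path chr1 graph.gbz > variants.vcf
```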

filter: filter reads and get statistics by read

usage: vg filter [options] <alignment.gam> > out.gam
Filter alignments by properties.

options:
  -M, --input-mp-alns          input is multipath alignments (GAMP), not GAM
  -n, --name-prefix NAME       keep only reads with this name prefix ['']
  -N, --name-prefixes FILE     keep reads with names with any of these prefixes,
                               one per nonempty line
  -e, --exact-name             match read names exactly instead of by prefix
  -a, --subsequence NAME       keep reads that contain this subsequence
  -A, --subsequences FILE      keep reads that contain one of these subsequences
                               one per nonempty line
  -p, --proper-pairs           keep reads annotated as being properly paired
  -P, --only-mapped            keep reads that are mapped
  -X, --exclude-contig REGEX   drop reads with refpos annotations on contigs
                               matching the given regex (may repeat)
  -F, --exclude-feature NAME   drop reads with the given feature
                               in the "features" annotation (may repeat)
  -s, --min-secondary N        minimum score to keep secondary alignment
  -r, --min-primary N          minimum score to keep primary alignment
  -L, --max-length N           drop reads with length > N
  -O, --rescore                re-score reads using default parameters
                               and only alignment information
  -f, --frac-score             normalize score based on length
  -u, --substitutions          use substitution count instead of score
  -W, --overwrite-score        replace stored GAM score with computed/normalized
                               score
  -o, --max-overhang N         drop reads whose alignments begin or end
                               with an insert > N [99999]
  -m, --min-end-matches N      drop reads without >=N matches on each end
  -S, --drop-split             remove split reads taking nonexistent edges
  -x, --xg-name FILE           use this xg index/graph (required for -S and -D)
  -v, --verbose                print out statistics on numbers of reads dropped
  -V, --no-output              print out -v statistics and do not write the GAM
  -T, --tsv-out FIELD[;FIELD]  write TSV of given fields instead of filtered GAM
                               See wiki page:
                               "Getting alignment statistics with vg filter"
  -q, --min-mapq N             drop alignments with mapping quality < N
  -E, --repeat-ends N          drop reads with tandem repeat (motif size <= 2N,
                               spanning >= N bases) at either end
  -D, --defray-ends N          clip back the ends of ambiguously aligned reads
                               up to N bases
  -C, --defray-count N         stop defraying after N nodes visited
                               (used to keep runtime in check) [99999]
  -d, --downsample S.P         drop all but the given portion 0.P of the reads.
                               S may be an integer seed as in SAMtools
  -R, --max-reads N            drop all but N reads. Use on a single thread
  -i, --interleaved            both ends will be dropped if either fails filter
                               assume interleaved input
  -I, --interleaved-all        both ends will be dropped if *both* fail filters
                               assume interleaved input
  -b, --min-base-quality Q:F   drop reads where fewer than fraction F of bases
                               have base quality >= PHRED score Q.
  -G, --annotation K[:V]       keep reads if the annotation is present and 
                               not false/empty. If a value is given, keep reads
                               if the values are equal similar to running
                               jq 'select(.annotation.K==V)' on the json
  -c, --correctly-mapped       keep only reads marked as correctly-mapped
  -l, --first-alignment        keep only the first alignment for each read
                               Must be run with 1 thread
  -U, --complement             apply opposite of the filter from other arguments
  -B, --batch-size N           work in batches of N reads [512]
  -t, --threads N              number of threads [1]
      --progress               show progress
  -h, --help                   print this help message to stderr and exit
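
A simple filter combining two of the options above, keeping only mapped reads at or above a mapping quality threshold, might be sketched as follows. The wrapper name, threshold, and file names are assumptions.

```shell
# Hypothetical wrapper. VG is a convenience variable, not part of vg:
# set VG=echo to print the command instead of running it.
VG="${VG:-vg}"

filter_mapped() {
  # $1 = minimum mapping quality, $2 = input GAM
  "$VG" filter --only-mapped --min-mapq "$1" "$2"
}

# Usage (values and file names are placeholders):
#   filter_mapped 30 aln.gam > filtered.gam
```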

find: use an index to find nodes, edges, kmers, paths, or positions

usage: vg find [options] >sub.vg
options:
  -h, --help                  print this help message to stderr and exit
graph features:
  -x, --xg-name FILE          use this xg index or graph (instead of rocksdb db)
  -n, --node ID               find node(s), return 1-hop context as graph
  -N, --node-list FILE        whitespace or line delimited list of nodes to grab
      --mapping FILE          include nodes mapping to the selected node IDs
  -e, --edges-end ID          return edges on end of node with ID
  -s, --edges-start ID        return edges on start of node with ID
  -c, --context STEPS         expand the context of the subgraph this many steps
  -L, --use-length            treat STEPS in -c or M in -r as a length in bases
  -P, --position-in PATH      find the position of -n node in the given path
  -I, --list-paths            write out the path names in the index
  -r, --node-range N:M        get nodes from N to M
  -G, --gam GAM               accumulate the graph touched by GAM's alignments
      --connecting-start POS  find graph from POS (node ID, + or -, node offset)
                              connecting to --connecting-end
      --connecting-end POS    find graph to POS (node ID, + or -, node offset)
                              connecting from --connecting-start
      --connecting-range INT  traverse up to INT bases when going 
                              from --connecting-start to --connecting-end [100]
subgraphs by path range:
  -p, --path TARGET           find the node(s) in the specified path range(s)
                              TARGET=path[:pos1[-pos2]]
  -R, --path-bed FILE         read our targets from the given BED FILE
  -E, --path-dag              with -p or -R, gets any node in the partial order
                              from pos1 to pos2, assumes id sorted DAG
  -W, --save-to PREFIX        instead of writing target subgraphs to stdout,
                              write one per given target to a separate file
                              named PREFIX[path]:[start]-[end].vg
  -K, --subgraph-k K          instead of graphs, write kmers from the subgraphs
  -H, --gbwt FILE             when enumerating kmers from subgraphs, determine
                              their frequencies in this GBWT haplotype index
alignments:
  -l, --sorted-gam FILE       use this sorted, indexed GAM file
  -F, --sorted-gaf FILE       use this sorted, indexed GAF file
  -o, --alns-on N:M           write alignments which align to any of the
                              nodes between N and M (inclusive)
  -A, --to-graph VG           get alignments to the provided subgraph
sequences:
  -g, --gcsa FILE             use this GCSA2 index of the graph's sequence space
                              (required for sequence queries)
  -S, --sequence STR          search for sequence STR using
  -M, --mems STR              describe the super-maximal exact matches
                              of the STR (GCSA2) in JSON
  -B, --reseed-length N       find non-super-maximal MEMs inside SMEMs of length >= N
  -f, --fast-reseed           use fast SMEM reseeding algorithm
  -Y, --max-mem N             maximum length of the MEM [GCSA2 order]
  -Z, --min-mem N             minimum length of the MEM [1]
  -D, --distance              return distance on path between pair of nodes (-n)
                              if -P not used, best path chosen heuristically
  -Q, --paths-named STR       return all paths with name prefix STR (may repeat)

gamsort: sort a GAM/GAF file or index a sorted GAM file

usage: vg gamsort [options] input > output

Sort a GAM/GAF file, or index a sorted GAM file.

General options:
  -p, --progress          show progress
  -s, --shuffle           shuffle reads by hash
  -t, --threads N         use N worker threads [4 for GAM, 1 for GAF]

GAM sorting options:
  -i, --index FILE        produce an index of the sorted GAM file
  -d, --dumb-sort         use naive sorting algorithm
                          (no tmp files, faster for small GAMs)

GAF sorting options:
  -G, --gaf-input         input is a GAF file
  -c, --chunk-size N      number of reads per chunk [1000000]
  -m, --merge-width N     number of files to merge at once [32]
  -S, --stable            use stable sorting
  -g, --gbwt-output FILE  write a GBWT index of the paths to FILE
  -b, --bidirectional     make the GBWT index bidirectional
  -h, --help              print this help message to stderr and exit
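
Sorting a GAM and writing an index in the same pass could be sketched like this. The wrapper and all file names, including the index file name, are placeholders chosen for illustration.

```shell
# Hypothetical wrapper. VG is a convenience variable, not part of vg:
# set VG=echo to print the command instead of running it.
VG="${VG:-vg}"

sort_and_index_gam() {
  # $1 = file to write the index to, $2 = unsorted input GAM
  # The sorted GAM is written to standard output.
  "$VG" gamsort --index "$1" "$2"
}

# Usage (file names are placeholders):
#   sort_and_index_gam sorted.gam.index aln.gam > sorted.gam
```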

gbwt: build and manipulate GBWT and GBZ files

usage: vg gbwt [options] [args]

Manipulate GBWTs. Input GBWTs are loaded from input args
or built in earlier steps. See wiki page "VG GBWT Subcommand".
The input graph is provided with one of -x, -G, or -Z

General options:
  -h, --help              print this help message to stderr and exit
  -x, --xg-name FILE      read the graph from FILE
  -o, --output FILE       write output GBWT to FILE
  -d, --temp-dir DIR      use directory DIR for temporary files
  -p, --progress          show progress and statistics

GBWT construction parameters (for steps 1 and 4):
      --buffer-size N     construction buffer size in millions of nodes [100]
      --id-interval N     store path IDs at 1/N positions [1024]

Multithreading:
      --num-jobs N        use at most N parallel build jobs
                          (for -v, -G, -A, -l, -P) [4]
      --num-threads N     use N parallel search threads
                          (for -b and -r) [8]

Step 1: GBWT construction (requires -o and one of { -v, -G, -Z, -E, -A }):
  -v, --vcf-input         index the haplotypes in the VCF files specified in
                          input args in parallel (requires -x, implies -f);
                          (inputs must be over different contigs,
                          does not store graph contigs in the GBWT)
      --preset X          use preset X (available: 1000gp)
      --inputs-as-jobs    create one build job for each input
                          instead of using first-fit heuristic
      --parse-only        store the VCF parses without building GBWTs
                          (use -o for file name prefix; skips later steps)
      --ignore-missing    don't warn when variants are missing from the graph
      --actual-phasing    don't treat unphased homozygous genotypes as phased
      --force-phasing     replace unphased genotypes with randomly phased ones
      --discard-overlaps  skip overlapping alternate alleles if the overlap
                          cannot be resolved instead of creating a phase break
      --batch-size N      index the haplotypes in batches of N samples [200]
      --sample-range X-Y  index samples X to Y (inclusive, 0-based)
      --rename V=P        VCF contig V matches path P in the graph (may repeat)
      --vcf-variants      variants in graph use VCF contig names, not path names
      --vcf-region C:X-Y  restrict VCF contig C to coordinates X to Y
                          (inclusive, 1-based; may repeat)
      --exclude-sample X  do not index the sample with name X
                          (faster than -R; may repeat)
  -G, --gfa-input         index walks or paths in the GFA file (one input arg)
      --max-node N        chop long segments into nodes of at most N bp
                          (use 0 to disable) [1024]
      --path-regex X      parse metadata as haplotypes from path names
                          using regex X instead of vg-parser-compatible rules
      --path-fields X     parse metadata as haplotypes, mapping regex submatches
                          to these fields instead of vg-parser-compatible rules
      --translation FILE  write the segment to node translation table to FILE
  -Z, --gbz-input         extract GBWT & GBWTGraph from GBZ in (one) input arg
  -I, --gg-in FILE        load GBWTGraph from FILE and GBWT from (one) input arg
  -E, --index-paths       index the embedded non-alt paths in the graph
                          (requires -x, no input args)
  -A, --alignment-input   index the alignments in the GAF files specified
                          in input args (requires -x)
      --gam-format        input files are in GAM format instead of GAF format

Step 2: Merge multiple input GBWTs (requires -o):
  -m, --merge             use the insertion algorithm
  -f, --fast              fast merging algorithm (node ids must not overlap)
  -b, --parallel          use the parallel algorithm
      --chunk-size N      search in chunks of N sequences [1]
      --pos-buffer N      use N MiB position buffers for each search thread [64]
      --thread-buffer N   use N MiB thread buffers for each search thread [256]
      --merge-buffers N   merge 2^N thread buffers into one file per merge [6]
      --merge-jobs N      run N parallel merge jobs [4]

Step 3: Alter GBWT (requires -o and one input GBWT):
  -R, --remove-sample X   remove sample X from the index (may repeat)
      --set-tag K=V       set a GBWT tag (may repeat)
      --set-reference X   set sample X as the reference (may repeat)

Step 4: Path cover GBWT construction 
(requires an input graph, -o, and one of { -a, -l, -P }):
  -a, --augment-gbwt      add path cover of missing components (one input GBWT)
  -l, --local-haplotypes  sample local haplotypes (one input GBWT)
  -P, --path-cover        build a greedy path cover (no input GBWTs)
  -n, --num-paths N       find N paths per component [64 for -l, 16 otherwise]
  -k, --context-length N  use N-node contexts [4]
      --pass-paths        include named graph paths in local haplotype
                          or greedy path cover GBWT

Step 5: GBWTGraph construction (requires an input graph and one input GBWT):
  -g, --graph-name FILE   build GBWTGraph and store it in FILE
      --gbz-format        serialize both GBWT and GBWTGraph in GBZ format
                          (makes -o unnecessary)

Step 6: R-index construction (one input GBWT):
  -r, --r-index FILE      build an r-index and store it in FILE

Step 7: Metadata (one input GBWT):
  -M, --metadata          print basic metadata
  -C, --contigs           print the number of contigs
  -H, --haplotypes        print the number of haplotypes
  -S, --samples           print the number of samples
  -L, --list-names        list contig/sample names (use with -C or -S)
  -T, --path-names        list path names
      --tags              list GBWT tags

Step 8: Paths (one input GBWT):
  -c, --count-paths       print the number of paths
  -e, --extract FILE      extract paths in SDSL format to FILE
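
The interaction of --sample-range and --batch-size in Step 1 can be illustrated with a short Python sketch. It assumes that batches are consecutive runs of at most N samples inside the inclusive, 0-based range; vg's actual batching may differ. The helper names parse_sample_range and sample_batches are hypothetical and are not part of vg.

```python
def parse_sample_range(text):
    """Parse an 'X-Y' string into an inclusive, 0-based (first, last) pair."""
    first_text, sep, last_text = text.partition("-")
    if not sep:
        raise ValueError("expected a range of the form X-Y")
    first, last = int(first_text), int(last_text)
    if first < 0 or last < first:
        raise ValueError("range must satisfy 0 <= X <= Y")
    return first, last


def sample_batches(first, last, batch_size=200):
    """Split samples first..last (inclusive) into runs of at most batch_size.

    Assumption: batches are consecutive and only the last one may be short.
    """
    if batch_size <= 0:
        raise ValueError("batch size must be positive")
    batches = []
    start = first
    while start <= last:
        end = min(start + batch_size - 1, last)
        batches.append((start, end))
        start = end + 1
    return batches


if __name__ == "__main__":
    first, last = parse_sample_range("0-449")
    for start, end in sample_batches(first, last):
        print(f"samples {start}..{end}")
```

Under these assumptions, 450 samples with the default batch size of 200 fall into two full batches and one batch of 50.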

giraffe: fast haplotype-aware read alignment

usage:
  vg giraffe -Z graph.gbz [-d graph.dist [-m graph.withzip.min -z graph.zipcodes]] <input options> [other options] > output.gam
  vg giraffe -Z graph.gbz --haplotype-name graph.hapl --kff-name sample.kff <input options> [other options] > output.gam

Fast haplotype-aware read mapper.

basic options:
  -Z, --gbz-name FILE           map to this GBZ graph
  -m, --minimizer-name FILE     use this minimizer index
  -z, --zipcode-name FILE       use these additional distance hints
  -d, --dist-name FILE          cluster using this distance index
  -p, --progress                show progress
  -t, --threads N               number of mapping threads to use
  -b, --parameter-preset NAME   set computational parameters [default]
                                (chaining-sr / default / fast / hifi / r10 / srold)
  -h, --help                    print full help with all available options
input options:
  -G, --gam-in FILE             read and realign these GAM-format reads
  -f, --fastq-in FILE           read and align these FASTQ/FASTA-format reads
                                (two are allowed, one for each mate)
  -i, --interleaved             GAM/FASTQ/FASTA input is interleaved pairs,
                                for paired-end alignment
      --comments-as-tags        treat comments in name lines as SAM-style tags
                                and annotate alignments with them
haplotype sampling:
      --haplotype-name FILE     sample from haplotype information in FILE
      --kff-name FILE           sample according to kmer counts in FILE
      --index-basename STR      name prefix for generated graph/index files
                                (default: from graph name)
      --set-reference STR       include this sample as a reference
                                in the personalized graph (may repeat)
alternate graphs:
  -x, --xg-name FILE            map to this graph (if no -Z / -g),
                                or use this graph for HTSLib output
  -g, --graph-name FILE         map to this GBWTGraph (if no -Z)
  -H, --gbwt-name FILE          use this GBWT index (when mapping to -x / -g)
output options:
  -N, --sample NAME             add this sample name
  -R, --read-group NAME         add this read group
  -o, --output-format NAME      output the alignments in NAME format [gam]
                                {gam / gaf / json / tsv / SAM / BAM / CRAM} 
      --ref-paths FILE          ordered list of paths in the graph, one per line
                                or HTSlib .dict, for HTSLib @SQ headers
      --ref-name NAME           name of reference in the graph for HTSlib output
      --named-coordinates       make GAM/GAF output in named-segment (GFA) space
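
The nesting of optional index files in the first usage line can be sketched as a small Python helper that assembles an argument list. It only uses flags listed above; the function name giraffe_argv and the checks it performs are assumptions made for illustration, and vg itself may accept combinations that this sketch rejects.

```python
def giraffe_argv(gbz, fastqs, dist=None, minimizer=None, zipcodes=None,
                 threads=None, output_format=None):
    """Build an argument list following the first `vg giraffe` usage form."""
    if not 1 <= len(fastqs) <= 2:
        raise ValueError("give one FASTQ file, or two (one for each mate)")
    # The usage line shows -m and -z together, nested inside -d.
    if (minimizer is None) != (zipcodes is None):
        raise ValueError("-m and -z appear together in the usage line")
    if minimizer is not None and dist is None:
        raise ValueError("-m/-z are nested inside -d in the usage line")

    argv = ["vg", "giraffe", "-Z", gbz]
    if dist is not None:
        argv += ["-d", dist]
    if minimizer is not None:
        argv += ["-m", minimizer, "-z", zipcodes]
    for fastq in fastqs:
        argv += ["-f", fastq]
    if threads is not None:
        argv += ["-t", str(threads)]
    if output_format is not None:
        argv += ["-o", output_format]
    return argv


if __name__ == "__main__":
    print(" ".join(giraffe_argv("graph.gbz", ["reads_1.fq", "reads_2.fq"],
                                dist="graph.dist", threads=4)))
```

Alignments are written to standard output, so a caller would redirect the output of the resulting command to a file such as output.gam.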

haplotypes: haplotype sampling based on kmer counts

usage:
    vg haplotypes [options] -k kmers.kff -g output.gbz graph.gbz
    vg haplotypes [options] -H output.hapl graph.gbz
    vg haplotypes [options] -i graph.hapl -k kmers.kff -g output.gbz graph.gbz

Haplotype sampling based on kmer counts.

Output files:
  -g, --gbz-output FILE        write the output GBZ to file (requires -k)
  -H, --haplotype-output FILE  write haplotype information to file

Input files:
  -d, --distance-index FILE    use this distance index [<basename>.dist]
  -r, --r-index FILE           use this r-index [<basename>.ri]
  -i, --haplotype-input FILE   use this .hapl file (default: generate)
  -k, --kmer-input FILE        use kmer counts from this KFF file

Options for generating haplotype information:
      --kmer-length N          kmer length for building minimizer index [29]
      --window-length N        window length for building minimizer index [11]
      --subchain-length N      target length (in bp) for subchains [10000]
      --linear-structure       extend subchains to avoid haplotypes
                               visiting them multiple times

Options for sampling haplotypes:
      --preset STR             use preset STR {default, haploid, diploid}
      --coverage N             kmer coverage in KFF file (default: estimate)
      --num-haplotypes N       generate N haplotypes [4]
                               with --diploid-sampling, use N candidates [32]
      --present-discount F     discount scores for present kmers by factor F
                               [0.9]
      --het-adjustment F       adjust scores for heterozygous kmers by F [0.05]
      --absent-score F         score absent kmers -F/+F [0.8]
      --haploid-scoring        use a scoring model without heterozygous kmers
      --diploid-sampling       choose the best pair from the sampled haplotypes
      --extra-fragments        select all candidates in bad subchains
                               in --diploid-sampling
      --badness F              threshold for badness of a subchain [4]
      --include-reference      include named and reference paths in the output
      --set-reference NAME     use sample NAME as a reference sample (may repeat)

Other options:
  -v, --verbosity N            verbosity level [0]
                               {0 = silent, 1 = basic, 2 = detailed, 3 = debug}
  -t, --threads N              approximate number of threads [8 on this system]
  -h, --help                   print this help message to stderr and exit
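
The usage lines above suggest a two-step workflow: generate haplotype information once, then reuse it when sampling for each set of kmer counts. A minimal Python sketch of the two argument lists follows. The function names are hypothetical, and the file names are the placeholders from the usage lines.

```python
def hapl_generation_argv(graph_gbz, hapl_out):
    """Second usage form: write haplotype information to a .hapl file."""
    return ["vg", "haplotypes", "-H", hapl_out, graph_gbz]


def sampling_argv(graph_gbz, kmers_kff, gbz_out, hapl_in=None):
    """First or third usage form: sample haplotypes into a new GBZ.

    -g requires -k, so the kmer file is mandatory here. Without -i the
    haplotype information is generated on the fly (per the -i default).
    """
    argv = ["vg", "haplotypes"]
    if hapl_in is not None:
        argv += ["-i", hapl_in]
    argv += ["-k", kmers_kff, "-g", gbz_out, graph_gbz]
    return argv


if __name__ == "__main__":
    print(" ".join(hapl_generation_argv("graph.gbz", "graph.hapl")))
    print(" ".join(sampling_argv("graph.gbz", "kmers.kff", "output.gbz",
                                 hapl_in="graph.hapl")))
```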

ids: manipulate node ids

usage: vg ids [options] <graph1.vg> [graph2.vg ...] >new.vg
options:
  -c, --compact        minimize the space of integers used by the ids
  -i, --increment N    increase ids by N
  -d, --decrement N    decrease ids by N
  -j, --join           make a joint ID space for all supplied graphs
                       by iterating through the supplied graphs and incrementing
                       their ids to be non-conflicting (modifies original files)
  -m, --mapping FILE   create an empty node mapping for vg prune
  -s, --sort           assign new node IDs in generalized topological sort order
  -h, --help           print this help message to stderr and exit
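
The idea behind --join can be sketched in Python on toy data. Each graph is modeled as a plain list of positive node ids, and each graph after the first is shifted by the largest id seen so far. This is an assumption about one simple way to make ids non-conflicting, not a description of vg's implementation; join_id_spaces is a hypothetical name.

```python
def join_id_spaces(graphs):
    """Shift the node ids of each graph so that no id is shared.

    graphs: list of lists of positive node ids, one list per graph.
    Returns new lists; the input is left unchanged (unlike --join,
    which modifies the original files).
    """
    joined = []
    max_id = 0  # largest id assigned so far
    for ids in graphs:
        shifted = [node_id + max_id for node_id in ids]
        joined.append(shifted)
        if shifted:
            max_id = max(shifted)
    return joined


if __name__ == "__main__":
    print(join_id_spaces([[1, 2, 3], [1, 2], [2, 4]]))
```

Because every shifted id is larger than all ids assigned earlier, the resulting graphs share no node ids and can be indexed together.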

index: index graphs or alignments for random access or mapping

usage: vg index [options] <graph1.vg> [graph2.vg ...]
Creates an index on the specified graph or graphs. All graphs indexed must 
already be in a joint ID space.
general options:
  -h, --help                print this help message to stderr and exit
  -b, --temp-dir DIR        use DIR for temporary files
  -t, --threads N           number of threads to use
  -p, --progress            show progress
xg options:
  -x, --xg-name FILE        use this file to store a succinct, queryable version
                            of graph(s), or read for GCSA or distance indexing
  -L, --xg-alts             include alt paths in xg
gcsa options:
  -g, --gcsa-out FILE       output a GCSA2 index to the given file
  -f, --mapping FILE        use this node mapping in GCSA2 construction
  -k, --kmer-size N         index kmers of size N in the graph [16]
  -X, --doubling-steps N    use N doubling steps for GCSA2 construction [4]
  -Z, --size-limit N        limit temp disk space usage to N GB [2048]
  -V, --verify-index        validate the GCSA2 index using the input kmers
                            (important for testing)
gam indexing options:
  -l, --index-sorted-gam    input is sorted .gam format alignments,
                            store a GAI index of the sorted GAM in INPUT.gam.gai
vg in-place indexing options:
      --index-sorted-vg     input is ID-sorted .vg format graph chunks
                            store a VGI index of the sorted vg in INPUT.vg.vgi
snarl distance index options:
  -j, --dist-name FILE      use this file to store a snarl-based distance index
      --snarl-limit N       don't store distances for snarls > N nodes [10000]
                            if 0 then don't store distances, only the snarl tree
      --no-nested-distance  only store distances along the top-level chain
  -w, --upweight-node N     upweight the node with ID N to push it to be part
                            of a top-level chain (may repeat)

inject: lift over alignments for the graph

usage: vg inject -x graph.xg [options] input.[bam|sam|cram] >output.gam

options:
  -x, --xg-name FILE        use this graph or xg index (required, non-XG okay)
  -i, --add-identity        calculate & add 'identity' statistic to output GAM
  -r, --rescore             re-score alignments
  -o, --output-format NAME  output alignment format {gam / gaf / json} [gam]
  -t, --threads N           number of threads to use
  -h, --help                print this help message to stderr and exit

map: MEM-based read alignment

usage: vg map [options] -d idxbase -f in1.fq [-f in2.fq] >aln.gam
Align reads to a graph.

graph/index:
  -d, --base-name BASE             use BASE.xg and BASE.gcsa as input indexes
  -x, --xg-name FILE               use this xg index or graph [<graph>.vg.xg]
  -g, --gcsa-name FILE             use this GCSA2 index [<graph>.gcsa]
  -1, --gbwt-name FILE             use this GBWT haplotype index [<graph>.gbwt]
algorithm:
  -t, --threads N                  number of compute threads to use
  -k, --min-mem INT                minimum MEM length (if 0 estimate via -e) [0]
  -e, --mem-chance FLOAT           this fraction of -k length hits
                                   will be by chance [5e-4]
  -c, --hit-max N                  ignore MEMs that have >N hits in our index
                                   (0 for no limit) [2048]
  -Y, --max-mem INT                ignore MEMs longer than INT (unset if 0) [0]
  -r, --reseed-x FLOAT             look for internal seeds inside a seed
                                   longer than FLOAT*--min-mem [1.5]
  -u, --try-up-to INT              attempt to align up to the INT best candidate
                                   chains of seeds (1/2 for paired) [128]
  -l, --try-at-least INT           attempt to align at least the INT best
                                   candidate chains of seeds [1]
  -E, --approx-mq-cap INT          weight MQ by suffix tree based estimate
                                   when estimate less than INT [0]
  -7, --id-mq-weight N             scale mapping quality by the alignment score
                                   identity to this power [2]
  -W, --min-chain INT              discard a chain if seeded bases are
                                   shorter than INT [0]
  -C, --drop-chain FLOAT           drop chains shorter than FLOAT fraction of
                                   the longest overlapping chain [0.45]
  -n, --mq-overlap FLOAT           scale MQ by count of alignments with FLOAT
                                   overlap in the query with the primary [0]
  -P, --min-ident FLOAT            accept alignment only if the alignment
                                   identity is >= FLOAT [0]
  -H, --max-target-x N             skip cluster subgraphs with
                                   length > N*read_length [100]
  -w, --band-width INT             band width for long read alignment [256]
  -O, --band-overlap INT           band overlap for long read alignment [{-w}/8]
  -J, --band-jump INT              the maximum number of bands of insertion we
                                   consider in the alignment chain model [128]
  -B, --band-multi INT             consider this many alignments of each band
                                   in banded alignment [16]
  -Z, --band-min-mq INT            treat bands with < INT MQ as unaligned [0]
  -I, --fragment STR               fragment length distribution specification
                                   STR=m:μ:σ:o:d [5000:0:0:0:1]
                                   max:mean:stdev:orientation (1=same/0=flip):
                                   direction (1=forward, 0=backward)
  -U, --fixed-frag-model           don't learn the pair fragment model online,
                                   use -I without update
  -p, --print-frag-model           suppress alignment output and print the
                                   fragment model on stdout as per -I format
  -4, --frag-calc INT              update the fragment model
                                   every INT perfect pairs [10]
  -3, --fragment-x FLOAT           calculate max fragment size as
                                   frag_mean+frag_sd*FLOAT [10]
  -0, --mate-rescues INT           attempt up to INT mate rescues per pair [64]
  -S, --unpaired-cost INT          penalty for an unpaired read pair [17]
  -8, --no-patch-aln               do not patch banded alignments by
                                   locally aligning unaligned regions
      --xdrop-alignment            use X-drop heuristic
                                   (much faster for long-read alignment)
      --max-gap-length INT         maximum gap length allowed in each contiguous
                                   alignment (for X-drop alignment) [40]
scoring:
  -q, --match INT                  use this match score [1]
  -z, --mismatch INT               use this mismatch penalty [4]
      --score-matrix FILE          use this 4x4 integer substitution scoring
                                   matrix (in the order ACGT)
  -o, --gap-open INT               use this gap open penalty [6]
  -y, --gap-extend INT             use this gap extension penalty [1]
  -L, --full-l-bonus INT           the full-length alignment bonus [5]
  -2, --drop-full-l-bonus          remove the full length bonus from the score
                                   before sorting and MQ calculation
  -a, --hap-exp FLOAT              the exponent for haplotype consistency
                                   likelihood in alignment score [1]
      --recombination-penalty NUM  use this log recombination penalty
                                   for GBWT haplotype scoring [20.7]
  -A, --qual-adjust                perform base quality adjusted alignments
                                   (requires base quality input)
preset:
  -m, --alignment-model STR        use a preset alignment scoring model, either
                                   "short" (default) or "long" (ONT/PacBio)
                                   "long" is equivalent to
                                   `-u 2 -L 63 -q 1 -z 2 -o 2 -y 1 -w 128 -O 32`
input:
  -s, --sequence STR               align a string to the graph in graph.vg
                                   using partial order alignment
  -V, --seq-name STR               name the sequence STR
                                   (for graph modification with new named paths)
  -T, --reads FILE                 take reads (one per line) from FILE,
                                   write alignments to stdout
  -b, --hts-input FILE             align reads from stdin htslib-compatible FILE
                                   (BAM/CRAM/SAM), alignments to stdout
  -G, --gam-input FILE             realign GAM input
  -f, --fastq FILE                 input FASTQ or (2-line format) FASTA, maybe
                                   compressed; two allowed, one for each mate
  -F, --fasta FILE                 align the sequences in a FASTA file that may
                                   have multiple lines per reference sequence
      --comments-as-tags           interpret comments in name lines as SAM-style
                                   tags and annotate alignments with them
  -i, --interleaved                FASTQ or GAM is interleaved paired-ended
  -N, --sample NAME                for --reads input, add this sample
  -R, --read-group NAME            for --reads input, add this read group
output:
  -j, --output-json                output JSON rather than an alignment stream
                                   (helpful for debugging)
  -%, --gaf                        output alignments in GAF format
  -5, --surject-to TYPE            surject the output into the graph's paths,
                                   writing TYPE {bam, sam, cram}
      --ref-paths FILE             ordered list of paths in graph, one per line
                                   or HTSlib .dict, for HTSLib @SQ headers
      --ref-name NAME              reference assembly in graph for HTSlib output
  -9, --buffer-size INT            buffer this many alignments together
                                   before outputting in GAM [512]
  -X, --compare                    realign -G GAM input, writing alignment with
                                   "correct" field set to overlap with input
  -v, --refpos-table               for efficient testing output a table of
                                   name, chr, pos, mq, score
  -K, --keep-secondary             produce alignments for secondary input
                                   alignments in addition to primary ones
  -M, --max-multimaps INT          produce up to INT alignments per read [1]
  -Q, --mq-max INT                 cap the mapping quality at INT [60]
      --exclude-unaligned          exclude reads with no alignment
  -D, --debug                      print debugging information to stderr
  -^, --log-time                   print runtime to stderr
  -h, --help                       print this help message to stderr and exit
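
The fragment model options -I and -3 can be made concrete with a short Python sketch. It parses a specification string in the max:mean:stdev:orientation:direction layout given for -I and applies the frag_mean+frag_sd*FLOAT formula given for -3. The class and function names are hypothetical, and the parsing rules (for example, which fields are integers) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class FragmentModel:
    max_size: int            # m: maximum fragment length
    mean: float              # mean fragment length
    stdev: float             # standard deviation of fragment length
    same_orientation: bool   # o: 1 = same, 0 = flip
    forward_direction: bool  # d: 1 = forward, 0 = backward


def parse_fragment_spec(spec):
    """Parse a -I style string such as the default '5000:0:0:0:1'."""
    parts = spec.split(":")
    if len(parts) != 5:
        raise ValueError("expected five ':'-separated fields")
    max_size, mean, stdev, orientation, direction = parts
    if orientation not in ("0", "1") or direction not in ("0", "1"):
        raise ValueError("orientation and direction must be 0 or 1")
    return FragmentModel(int(max_size), float(mean), float(stdev),
                         orientation == "1", direction == "1")


def max_fragment_size(mean, stdev, fragment_x=10.0):
    """Maximum fragment size as frag_mean + frag_sd * FLOAT (see -3)."""
    return mean + stdev * fragment_x


if __name__ == "__main__":
    print(parse_fragment_spec("5000:0:0:0:1"))
    print(max_fragment_size(400.0, 50.0))
```

With a mean of 400 and a standard deviation of 50, the default multiplier of 10 gives a maximum fragment size of 900.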

minimizer: build a minimizer index or a syncmer index

usage: vg minimizer [options] -d graph.dist -o graph.min graph

Builds a (w, k)-minimizer index or a (k, s)-syncmer index of the threads in the
GBWT. The graph can be any HandleGraph, which will be made into a GBWTGraph.
The transformation can be avoided by providing a GBWTGraph or a GBZ graph.

Required options:
  -d, --distance-index FILE  annotate hits with positions in this distance index
  -o, --output-name FILE     store the index in a file

Minimizer options:
  -k, --kmer-length N        length of the kmers in the index [29] (max 31)
  -w, --window-length N      choose minimizer from a window of N kmers [11]
  -c, --closed-syncmers      index closed syncmers instead of minimizers
  -s, --smer-length N        use smers of length N in closed syncmers [18]

Weighted minimizers:
  -W, --weighted             use weighted minimizers
      --threshold N          downweight kmers with more than N hits [500]
      --iterations N         downweight frequent kmers by N iterations [3]
      --fast-counting        use the fast kmer counting algorithm (default)
      --save-memory          use the space-efficient kmer counting algorithm
      --hash-table N         use 2^N-cell hash tables for kmer counting
                             (default: guess)

Other options:
  -z, --zipcode-name FILE    store the distances that are too big in a file
                             if no -z, some distances may be discarded
  -l, --load-index FILE      load this index and insert the new kmers into it
                             (overrides minimizer / weighted minimizer options)
  -g, --gbwt-name FILE       use this GBWT index (required with a non-GBZ graph)
  -E, --rec-mode             use recombination-aware MinimizerIndex
  -p, --progress             show progress information
  -t, --threads N            use N threads for index construction [8]
                             (using more than 16 threads rarely helps)
      --no-dist              build the index without distance index annotations
                             (not recommended)
  -h, --help                 print this help message to stderr and exit
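
The two seed types named above can be illustrated on a plain string. The Python sketch below picks the smallest kmer in every window of w consecutive kmers, and tests whether a kmer is a closed syncmer, meaning that its smallest s-mer is its first or last s-mer. It orders kmers lexicographically and works on one strand of a linear sequence; these are simplifying assumptions, since a real index works on graph paths and typically orders kmers by a hash. All function names are hypothetical.

```python
def kmers(seq, k):
    """All substrings of length k, in order of their start offset."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def window_minimizers(seq, k, w):
    """Return sorted (offset, kmer) pairs for the (w, k)-minimizers of seq.

    Each window of w consecutive kmers contributes its smallest kmer;
    ties are broken in favor of the leftmost occurrence.
    """
    ks = kmers(seq, k)
    chosen = set()
    for start in range(len(ks) - w + 1):
        best = min(range(start, start + w), key=lambda i: (ks[i], i))
        chosen.add((best, ks[best]))
    return sorted(chosen)


def is_closed_syncmer(kmer, s):
    """True if the smallest s-mer of kmer occurs at its first or last position."""
    smers = kmers(kmer, s)
    smallest = min(smers)
    return smers[0] == smallest or smers[-1] == smallest


if __name__ == "__main__":
    print(window_minimizers("GATTACA", k=3, w=2))
    print(is_closed_syncmer("GATTACA", s=3))
```

Minimizers depend on the neighboring kmers in the window, while the syncmer test looks only at the kmer itself.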

mod: filter, transform, and edit the graph

usage: vg mod [options] <graph.vg> >[mod.vg]
Modifies graph, outputs modified on stdout.

options:
  -c, --compact-ids        should we sort and compact the ID space? (default no)
  -b, --break-cycles       break graph cycles with approximate topological sort
  -n, --normalize          normalize graph so edges are always non-redundant
                           (nodes have unique starting and ending bases relative
                           to neighbors, edges that do not introduce new paths
                           are removed, and neighboring nodes are merged)
  -U, --until-normal N     iterate normalization at most N times
  -z, --nomerge-pre STR    do not let normalize (-n/-U) zip up any pair of nodes
                           that both belong to path with prefix STR
  -E, --unreverse-edges    flip doubly-reversing edges so that they are
                           represented on the forward strand of the graph
  -s, --simplify           remove redundancy from the graph
                           that will not change its path space
  -d, --dagify-step N      copy strongly connected components of graph N times,
                           forwarding edges from old to new copies
                           to convert the graph into a DAG
  -w, --dagify-to N        copy strongly connected components of the graph,
                           forwarding edges from old to new copies
                           to convert the graph into a DAG
                           until shortest path through each SCC is N bases long
  -L, --dagify-len-max N   stop a dagification step if the unrolling component
                           has this much sequence
  -f, --unfold N           represent inversions accessible up to N from
                           the forward component of the graph
  -O, --orient-forward     orient the nodes in the graph forward
  -N, --remove-non-path    keep only nodes and edges which are part of paths
  -A, --remove-path        keep only nodes and edges which aren't part of a path
  -k, --keep-path NAME     keep only nodes and edges in the path (may repeat)
  -V, --invert-keep-path   keep only nodes and edges in paths not passed to -k
  -R, --remove-null        remove nodes with no sequence, forwarding their edges
  -g, --subgraph ID        gets the subgraph rooted at node ID (may repeat)
  -x, --context N          steps the subgraph out by N steps [1]
  -p, --prune-complex      remove nodes that are reached by paths of --length
                           which cross more than --edge-max edges
  -S, --prune-subgraphs    remove subgraphs which are shorter than --length
  -l, --length N           for pruning complex regions and short subgraphs
  -X, --chop N             chop nodes in the graph so they are <=N bp long
  -u, --unchop             where two nodes are only connected to each other and
                           by only one edge, replace the pair with a single node
                           that is the concatenation of their labels
  -e, --edge-max N         consider paths which make edge choices at <= N points
  -M, --max-degree N       unlink nodes that have edge degree greater than N
  -m, --markers            join all head and tail nodes to marker nodes
                           (### starts and $$$ ends) of --length, for debugging
  -y, --destroy-node ID    remove node with given id
  -a, --cactus             convert to cactus graph representation
  -v, --sample-vcf FILE    for a graph with allele paths,
                           compute the sample graph from the given VCF
  -G, --sample-graph FILE  subset augmented graph to sample graph via Locus file
  -t, --threads N          for parallel tasks, use this many threads
  -h, --help               print this help message to stderr and exit
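
The effect of --chop on a single node label can be sketched in Python. The sketch assumes that a long node is cut left to right into pieces of exactly N bp with a shorter final piece; vg may place the cuts differently. Joining the pieces again restores the original label, which is the idea behind --unchop for a simple chain of nodes. The function name chop is hypothetical.

```python
def chop(sequence, max_len):
    """Split a node label into consecutive pieces of at most max_len bp."""
    if max_len <= 0:
        raise ValueError("maximum node length must be positive")
    return [sequence[i:i + max_len]
            for i in range(0, len(sequence), max_len)]


if __name__ == "__main__":
    pieces = chop("ACGTACGTAC", 4)
    print(pieces)
    # Concatenating the pieces restores the original label.
    print("".join(pieces))
```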

mpmap: splice-aware multipath alignment of short reads

usage: vg mpmap [options] -x graph.xg -g index.gcsa [-f reads1.fq [-f reads2.fq] | -G reads.gam] > aln.gamp
Multipath align reads to a graph.

basic options:
  -h, --help                print this help message to stderr and exit
graph/index:
  -x, --graph-name FILE     graph (required; XG recommended but other formats
                            are acceptable: see `vg convert`)
  -g, --gcsa-name FILE      use this GCSA2 (FILE) & LCP (FILE.lcp) index pair
                            for MEMs (required; see `vg index`)
  -d, --dist-name FILE      use this snarl distance index for clustering
                            (recommended, see `vg index`)
  -s, --snarls FILE         align to alternate paths in these snarls
                            (unnecessary if providing -d, see `vg snarls`)
input:
  -f, --fastq FILE          input FASTQ (possibly gzipped), can be given twice
                            for paired ends (for stdin use -)
  -i, --interleaved         input contains interleaved paired ends
  -C, --comments-as-tags    interpret comments in name lines as SAM-style tags
                            and annotate alignments with them
algorithm presets:
  -n, --nt-type TYPE        sequence type preset: 'DNA' for genomic data,
                            'RNA' for transcriptomic data [RNA]
  -l, --read-length TYPE    read length preset: {very-short, short, long}
                            (approx. <50bp, 50-500bp, and >500bp) [short]
  -e, --error-rate TYPE     error rate preset: {low, high}
                            (approx. PHRED >20 and <20) [low]
output:
  -F, --output-fmt TYPE     format to output alignments in:
                            'GAMP' for multipath alignments,
                            'GAM'/'GAF' for single-path alignments,
                            'SAM'/'BAM'/'CRAM' for linear reference alignments
                            (may also require -S) [GAMP]
  -S, --ref-paths FILE      paths in graph are 1) one per line in a text file
                            or 2) in an HTSlib .dict, to treat as
                            reference sequences for HTSlib formats (see -F)
                            [all reference paths, all generic paths]
      --ref-name NAME       reference assembly in graph to use for
                            HTSlib formats (see -F) [all references]
  -N, --sample NAME         add this sample name to output
  -R, --read-group NAME     add this read group to output
  -p, --suppress-progress   do not report progress to stderr
computational parameters:
  -t, --threads INT         number of compute threads to use [all available]

advanced options:
algorithm:
  -X, --not-spliced         do not form spliced alignments, even with -n RNA
  -M, --max-multimaps INT   report up to INT mappings per read [10 RNA / 1 DNA]
  -a, --agglomerate-alns    combine separate multipath alignments into
                            one (possibly disconnected) alignment
  -r, --intron-distr FILE   intron length distribution
                            (from scripts/intron_length_distribution.py)
  -Q, --mq-max INT          cap mapping quality estimates at this much [60]
  -b, --frag-sample INT     look for INT unambiguous mappings to
                            estimate the fragment length distribution [1000]
  -I, --frag-mean FLOAT     mean for pre-determined fragment length distribution
                            (also requires -D)
  -D, --frag-stddev FLOAT   standard deviation for pre-determined fragment
                            length distribution (also requires -I)
  -G, --gam-input FILE      input GAM (for stdin, use -)
  -u, --map-attempts INT    perform up to INT mappings per read (0 for no limit)
                            [24 paired / 64 unpaired]
  -c, --hit-max INT         use at most this many hits for any match seeds
                            (0 for no limit) [1024 DNA / 100 RNA]
scoring:
  -A, --no-qual-adjust      do not perform base quality adjusted alignments
                            even when base qualities are available
  -q, --match INT           use INT match score [1]
  -z, --mismatch INT        use INT mismatch penalty [4 low error, 1 high error]
  -o, --gap-open INT        use INT gap open penalty [6 low error, 1 high error]
  -y, --gap-extend INT      use INT gap extension penalty [1]
  -L, --full-l-bonus INT    add INT score to alignments that align each
                            end of the read [mismatch+1 short, 0 long]
  -w, --score-matrix FILE   use this 4x4 integer substitution scoring matrix
                            (in the order ACGT)
  -m, --remove-bonuses      remove full length alignment bonus in reported score
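
The fragment-length options above are easy to misuse because -I and -D each require the other. The following is a minimal sketch of a paired-end run with a pre-determined fragment length distribution. The function name, file names, and numeric values are hypothetical, and -x, -g, and -f are assumed to be the graph, GCSA2 index, and FASTQ options from the basic options of vg mpmap rather than anything listed in this advanced block.

```shell
# Sketch only: function name and numeric values are illustrative.
# $1 = graph, $2 = GCSA2 index, $3 and $4 = paired FASTQ files.
# -x, -g and -f are assumed from the basic options of vg mpmap.
mpmap_fixed_fragments() {
    # -I (mean) and -D (standard deviation) must be given together.
    vg mpmap -x "$1" -g "$2" -f "$3" -f "$4" -I 300 -D 50 -Q 60 -t 8
}
```

Calling `mpmap_fixed_fragments graph.xg graph.gcsa r1.fq r2.fq > aln.gamp` would skip the fragment length estimation step that -b otherwise controls.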

pack: convert alignments to a compact coverage index

usage: vg pack [options]
options:
  -x, --xg FILE          use this basis graph (does not have to be xg format)
  -o, --packs-out FILE   write compressed coverage packs to this output file
  -i, --packs-in FILE    begin by summing coverage packs from each provided FILE
  -g, --gam FILE         read alignments from this GAM file ('-' for stdin)
  -a, --gaf FILE         read alignments from this GAF file ('-' for stdin)
  -d, --as-table         write table on stdout representing packs
  -D, --as-edge-table    write table on stdout representing edge coverage
  -u, --as-qual-table    write table on stdout representing average node mapqs
  -e, --with-edits       record and write edits
                         rather than only recording graph-matching coverage
  -b, --bin-size N       number of sequence bases per CSA bin [inf]
  -n, --node ID          write table for only specified node(s)
  -N, --node-list FILE   whitespace- or line-delimited list of nodes to collect
  -Q, --min-mapq N       ignore reads with MAPQ < N
                         and positions with base quality < N [0]
  -c, --expected-cov N   expected coverage.  used only for memory tuning [128]
  -s, --trim-ends N      ignore the first and last N bases of each read
  -t, --threads N        use N threads [numCPUs]
  -h, --help             print this help message to stderr and exit
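
As an illustration of how the options above combine, here is a minimal sketch that first computes coverage from a GAM file and then prints a previously written pack file as a table. The function names, file names, and the MAPQ threshold of 5 are hypothetical choices, not recommendations.

```shell
# Sketch only: function names and thresholds are illustrative.
# $1 = graph, $2 = GAM alignments, $3 = output pack file.
pack_coverage() {
    # -Q 5 ignores reads with MAPQ < 5 and bases with quality < 5.
    vg pack -x "$1" -g "$2" -Q 5 -t 4 -o "$3"
}

# $1 = graph, $2 = existing pack file; the table goes to stdout.
pack_table() {
    vg pack -x "$1" -i "$2" -d
}
```

For example, `pack_coverage graph.xg aln.gam aln.pack` followed by `pack_table graph.xg aln.pack > coverage.tsv`.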

paths: traverse paths in the graph

usage: vg paths [options]
  -h, --help               print this help message to stderr and exit
input:
  -x, --xg FILE            use the paths and haplotypes in this graph FILE
                           Supports GBZ haplotypes. (also accepts -v, --vg)
  -g, --gbwt FILE          use the threads in the GBWT index in FILE
                           (graph also required for most output options;
                           -g takes priority over -x)
output graph (.vg format):
  -V, --extract-vg         output a path-only graph covering the selected paths
  -d, --drop-paths         output a graph with the selected paths removed
  -r, --retain-paths       output a graph with only the selected paths retained
  -n, --normalize-paths    output a graph where equivalent paths in a site are
                           merged (using selected paths to snap to if possible)
output path data:
  -X, --extract-gam        print (as GAM alignments) stored paths in the graph
  -A, --extract-gaf        print (as GAF alignments) stored paths in the graph
  -L, --list               print (one per line) path (or thread) names
  -E, --lengths            print a list of path names (as with -L)
                           but paired with their lengths
  -M, --metadata           print a table of path names and their metadata
  -C, --cyclicity          print a list of path names (as with -L)
                           but paired with a flag denoting cyclicity
  -F, --extract-fasta      print the paths in FASTA format
  -c, --coverage           print the coverage stats for selected paths
                           (not including cycles)
path selection:
  -p, --paths-file FILE    select paths named in a file (one per line)
  -Q, --paths-by STR       select paths with the given name prefix
  -S, --sample STR         select haplotypes or reference paths for this sample
  -a, --variant-paths      select variant paths added by 'vg construct -a'
  -G, --generic-paths      select generic, non-reference, non-haplotype paths
  -R, --reference-paths    select reference paths
  -H, --haplotype-paths    select haplotype paths
configuration:
  -o, --overlay            apply a ReferencePathOverlayHelper to the graph
  -t, --threads N          number of threads to use [all available]
                           applies only to snarl finding within -n
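
The input, selection, and output options above are meant to be combined: one input, optionally one selector, and one output mode. A minimal sketch under that reading follows; the function names and file names are hypothetical.

```shell
# Sketch only: function names are illustrative.
# $1 = graph file. Prints one path name per line.
list_paths() {
    vg paths -x "$1" -L
}

# $1 = graph file. Prints reference path names paired with their lengths.
reference_path_lengths() {
    vg paths -x "$1" -R -E
}

# $1 = graph file, $2 = path name prefix. Prints matching paths as FASTA.
paths_fasta_by_prefix() {
    vg paths -x "$1" -Q "$2" -F
}
```

For example, `paths_fasta_by_prefix graph.gbz chr20 > chr20.fa` would write only the paths whose names start with `chr20`.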

prune: prune the graph for GCSA2 indexing

usage: vg prune [options] <graph.vg> >[output.vg]

Prunes the complex regions of the graph for GCSA2 indexing.
Pruning the graph removes embedded paths.

Pruning parameters:
  -k, --kmer-length N    kmer length used for pruning
                         defaults: 24 with -P; 24 with -r; 24 with -u
  -e, --edge-max N       remove the edges on kmers making > N edge choices
                         defaults: 3 with -P; 3 with -r; 3 with -u
  -s, --subgraph-min N   remove subgraphs of < N bases
                         defaults: 33 with -P; 33 with -r; 33 with -u
  -M, --max-degree N     if N > 0, remove nodes with degree > N before pruning
                         defaults: 0 with -P; 0 with -r; 0 with -u

Pruning modes (-P, -r, and -u are mutually exclusive):
  -P, --prune            simply prune the graph (default)
  -r, --restore-paths    restore the edges on non-alt paths
  -u, --unfold-paths     unfold non-alt paths and GBWT threads
  -v, --verify-paths     verify that the paths exist after pruning
                         (potentially very slow)

Unfolding options:
  -g, --gbwt-name FILE   unfold the threads from this GBWT index
  -m, --mapping FILE     store node mapping for duplicates (required with -u)
  -a, --append-mapping   append to the existing node mapping

Other options:
  -p, --progress         show progress
  -t, --threads N        use N threads [8]
  -d, --dry-run          determine the validity of the combination of options
  -h, --help             print this help message to stderr and exit
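
To make the three pruning modes concrete, here is a minimal sketch with one wrapper per mode. The function names and file names are hypothetical. The numeric values in the first wrapper simply restate the defaults listed above.

```shell
# Sketch only: function names are illustrative. Output goes to stdout.
# $1 = input graph. Plain pruning with the listed defaults made explicit.
prune_plain() {
    vg prune -P -k 24 -e 3 -s 33 "$1"
}

# $1 = input graph. Prune, then restore the edges on non-alt paths.
prune_restore() {
    vg prune -r -p "$1"
}

# $1 = input graph, $2 = GBWT index, $3 = node mapping file.
# -m is required with -u, as stated in the unfolding options.
prune_unfold() {
    vg prune -u -g "$2" -m "$3" -p "$1"
}
```

For example, `prune_unfold graph.vg graph.gbwt node_mapping > graph.pruned.vg`.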

rna: construct splicing graphs and pantranscriptomes

usage: vg rna [options] graph.[vg|pg|hg|gbz] > splicing_graph.[vg|pg|hg]

General options:
  -t, --threads INT          number of compute threads to use [1]
  -p, --progress             show progress
  -h, --help                 print this help message to stderr and exit

Input options:
  -n, --transcripts FILE     transcript file(s) in gtf/gff format (may repeat)
  -m, --introns FILE         intron file(s) in bed format (may repeat)
  -y, --feature-type NAME    parse only this feature type in the GTF/GFF
                             (parses all if empty) [exon]
  -s, --transcript-tag NAME  use this attribute tag in the GTF/GFF file(s) as ID
                             to group exons and name paths [transcript_id]
  -l, --haplotypes FILE      project transcripts onto haplotypes in GBWT index
  -z, --gbz-format           input graph is GBZ format (has graph & GBWT index)

Construction options:
  -j, --use-hap-ref          use haplotype paths in GBWT index as references
                             (disables projection)
  -e, --proj-embed-paths     project transcripts onto embedded haplotype paths
  -c, --path-collapse TYPE   collapse identical transcript paths across
                             no|haplotype|all paths [haplotype]
  -k, --max-node-length INT  chop nodes longer than INT (disable with 0) [0]
  -d, --remove-non-gene      remove intergenic and intronic regions
                             (deletes all paths in the graph)
  -o, --do-not-sort          do not topologically sort and compact the graph
DON'T FORGET TO EMBED PATHS:
  -r, --add-ref-paths        add reference transcripts as embedded paths
  -a, --add-hap-paths        add projected transcripts as embedded paths

Output options:
  -b, --write-gbwt FILE      write pantranscriptome transcript paths as GBWT
  -v, --write-hap-gbwt FILE  write input haplotypes as a GBWT
                             with node IDs matching the output graph
  -f, --write-fasta FILE     write pantranscriptome transcript sequences to here
  -i, --write-info FILE      write pantranscriptome transcript info table as TSV
  -q, --out-exclude-ref      exclude reference transcripts from pantranscriptome
  -g, --gbwt-bidirectional   use bidirectional paths in GBWT index construction
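
The options above can be combined into a single pantranscriptome build. The following is a minimal sketch that projects transcripts onto haplotypes, embeds both reference and projected transcript paths (the -r and -a reminder above), and writes the GBWT and info table. The function name, file names, and output suffixes are hypothetical.

```shell
# Sketch only: function name and output suffixes are illustrative.
# $1 = graph, $2 = GTF/GFF transcripts, $3 = haplotype GBWT, $4 = output prefix.
# The splicing graph itself is written to stdout.
build_pantranscriptome() {
    vg rna -p -n "$2" -l "$3" -r -a \
        -b "$4.gbwt" -i "$4.info.tsv" "$1"
}
```

For example, `build_pantranscriptome graph.pg genes.gtf haps.gbwt out > out.pg`.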

sim: simulate reads from a graph

usage: vg sim [options]
Samples sequences from the xg-indexed graph.

basic options:
  -h, --help                  print this help message to stderr and exit
  -x, --xg-name FILE          use the graph in FILE (required)
  -n, --num-reads N           simulate N reads or read pairs
  -l, --read-length N         simulate reads of length N
  -r, --progress              show progress information
output options:
  -a, --align-out             write alignments in GAM-format
  -q, --fastq-out             write reads in FASTQ format
  -J, --json-out              write alignments in JSON-format GAM (implies -a)
      --multi-position        annotate with multiple reference positions
simulation parameters:
  -F, --fastq FILE            match the error profile of NGS reads in FILE,
                              repeat for paired reads (ignores -l,-f)
  -I, --interleaved           reads in FASTQ (-F) are interleaved read pairs
  -s, --random-seed N         use this specific seed for the PRNG
  -e, --sub-rate FLOAT        base substitution rate [0.0]
  -i, --indel-rate FLOAT      indel rate [0.0]
  -d, --indel-err-prop FLOAT  proportion of trained errors from -F
                              that are indels [0.01]
  -S, --scale-err FLOAT       scale trained error probs from -F by FLOAT [1.0]
  -f, --forward-only          don't simulate from the reverse strand
  -p, --frag-len N            make paired end reads with fragment length N
  -v, --frag-std-dev FLOAT    use this standard deviation
                              for fragment length estimation
  -N, --allow-Ns              allow reads to be sampled with Ns in them
      --max-tries N           attempt sampling operations up to N times [100]
  -t, --threads N             number of compute threads (only when using -F) [1]
simulate from paths:
  -P, --path NAME             simulate from this path
                              (may repeat; cannot also give -T)
  -A, --any-path              simulate from any path (overrides -P)
  -m, --sample-name NAME      simulate from this sample (may repeat)
  -R, --ploidy-regex RULES    use this comma-separated list of colon-delimited
                              REGEX:PLOIDY rules to assign ploidies to contigs
                              not visited by the selected samples, or to all
                              contigs simulated from if no samples are used.
                              Unmatched contigs get ploidy 2
  -g, --gbwt-name FILE        use samples from this GBWT index
  -T, --tx-expr-file FILE     simulate from an expression profile formatted as
                              RSEM output (cannot also give -P)
  -H, --haplo-tx-file FILE    transcript origin info table from vg rna -i
                              (required for -T on haplotype transcripts)
  -u, --unsheared             sample from unsheared fragments
  -E, --path-pos-file FILE    output a TSV with sampled position on path
                              of each read (requires -F)
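
As a worked illustration of the basic and simulation options above, here is a minimal sketch for single-end and paired-end simulation with a fixed seed. The function names, file names, error rates, and fragment length values are hypothetical.

```shell
# Sketch only: function names and numeric values are illustrative.
# $1 = graph, $2 = number of reads, $3 = read length. GAM goes to stdout.
sim_single() {
    vg sim -x "$1" -n "$2" -l "$3" -s 42 -e 0.01 -i 0.002 -a
}

# $1 = graph, $2 = number of read pairs, $3 = read length.
# -p and -v set the fragment length and its standard deviation.
sim_paired() {
    vg sim -x "$1" -n "$2" -l "$3" -s 42 -p 500 -v 50 -a
}
```

For example, `sim_paired graph.xg 1000 150 > sim.gam`. Fixing -s makes repeated runs use the same PRNG seed.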

stats: metrics describing graph and alignment properties

usage: vg stats [options] [<graph file>]
options:
  -z, --size               size of graph
  -N, --node-count         number of nodes in graph
  -E, --edge-count         number of edges in graph
  -l, --length             length of sequences in graph
  -L, --self-loops         number of self-loops
  -s, --subgraphs          describe subgraphs of graph
  -H, --heads              list the head nodes of the graph
  -T, --tails              list the tail nodes of the graph
  -e, --nondeterm          list the nondeterministic edge sets
  -c, --components         print the strongly connected components of the graph
  -A, --is-acyclic         print if the graph is acyclic or not
  -n, --node ID            consider node with the given id
  -d, --to-head            show distance to head for each provided node
  -t, --to-tail            show distance to tail for each provided node
  -a, --alignments FILE    compute stats for reads aligned to the graph
  -r, --node-id-range      X:Y where X and Y are the smallest and largest
                           node id in the graph, respectively
  -o, --overlap PATH       for each overlapping path mapping in the graph write:
                              PATH, other_path, rank1, rank2
                           multiple allowed; limit comparison to those provided
  -O, --overlap-all        print overlap table for cartesian product of paths
  -R, --snarls             print statistics for each snarl
      --snarl-contents     print table of <snarl, depth, parent, node ids>
      --snarl-sample NAME  print out reference coordinates on given sample
  -C, --chains             print statistics for each chain
  -F, --format             graph type {VG-Protobuf, PackedGraph, HashGraph, XG}
                           Can't detect Protobuf if graph read from stdin
  -D, --degree-dist        print degree distribution of the graph.
  -b, --dist-snarls FILE   print sizes/depths of the snarls in distance index
  -p, --threads N          number of threads to use [all available]
  -v, --verbose            output longer reports
  -P, --progress           show progress
  -h, --help               print this help message to stderr and exit
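
Two common uses of the options above are a quick size summary of a graph and a summary of reads aligned to it. A minimal sketch follows; the function names and file names are hypothetical.

```shell
# Sketch only: function names are illustrative.
# $1 = graph file. Prints graph size and total sequence length.
graph_summary() {
    vg stats -z -l "$1"
}

# $1 = graph file, $2 = GAM alignments. Prints alignment statistics.
alignment_summary() {
    vg stats -a "$2" "$1"
}
```

For example, `graph_summary graph.vg` or `alignment_summary graph.vg aln.gam`.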

surject: map alignments onto specific paths

usage: vg surject [options] <aln.gam> >[proj.cram]
Transforms alignments to be relative to particular paths.

options:
  -x, --xg-name FILE        use this graph or xg index (required)
  -t, --threads N           number of threads to use
  -p, --into-path NAME      surject into this path or its subpaths (may repeat)
                            default: reference, then non-alt generic
  -F, --into-paths FILE     surject into path names listed in
                            HTSlib sequence dictionary or path list FILE
  -n, --into-ref NAME       surject into this reference assembly
  -i, --interleaved         GAM is interleaved paired-ended, so pair reads
                            when outputting HTS formats
  -M, --multimap            include secondary alignments to all
                            overlapping paths instead of just primary
  -G, --gaf-input           input file is GAF instead of GAM
  -m, --gamp-input          input file is GAMP instead of GAM
  -c, --cram-output         write CRAM to stdout
  -b, --bam-output          write BAM to stdout
  -s, --sam-output          write SAM to stdout
  -u, --supplementary       divide into supplementary alignments as necessary
  -l, --subpath-local       let the multipath mapping surjection produce local
                            (rather than global) alignments
  -T, --max-tail-len N      only align up to N bases of read tails [10000]
  -g, --max-graph-scale X   make reads unmapped if alignment target subgraph
                            size exceeds read length by a factor of X 
                            (default: 819.2 or 134218 with -S)
  -P, --prune-low-cplx      prune short/low complexity anchors in realignment
  -I, --max-slide N         look for offset duplicates of anchors up to N bp
                            away when pruning (default: 6)
  -a, --max-anchors N       use <= N anchors per target path [unlimited]
  -S, --spliced             interpret long deletions against paths
                            as spliced alignments
  -A, --qual-adj            adjust scoring for base qualities, if available
  -E, --extra-gap-cost N    for dynamic programming, add N to the gap open cost
                            of the 10x-scaled scoring parameters
  -N, --sample NAME         set this sample name for all reads
  -R, --read-group NAME     set this read group for all reads
  -f, --max-frag-len N      reads with fragment lengths greater than N won't be
                            marked properly paired in SAM/BAM/CRAM
  -L, --list-all-paths      annotate SAM records with a list of all attempted
                            re-alignments to paths in SS tag
  -H, --graph-aln           annotate SAM records with cs-style difference string
                            of the pre-surjected graph alignment in GR tag
  -C, --compression N       level for compression [0-9]
  -V, --no-validate         skip checking whether alignments plausibly are
                            against the provided graph
  -w, --watchdog-timeout N  warn when reads take more than N seconds to surject
  -r, --progress            show progress
  -h, --help                print this help message to stderr and exit
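
To show how the input, target, and output options above fit together, here is a minimal sketch that surjects a GAM to BAM, once with default target paths and once for interleaved pairs onto a single named path. The function names, file names, and thread count are hypothetical.

```shell
# Sketch only: function names are illustrative. BAM goes to stdout.
# $1 = graph, $2 = GAM alignments.
surject_to_bam() {
    vg surject -x "$1" -b -t 4 "$2"
}

# $1 = graph, $2 = interleaved paired GAM, $3 = target path name.
surject_pairs_to_path() {
    vg surject -x "$1" -p "$3" -i -b "$2"
}
```

For example, `surject_pairs_to_path graph.xg aln.gam chr20 > aln.chr20.bam`.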

view: format conversions for graphs and alignments

usage: vg view [options] [ <graph.vg> | <graph.json> | <aln.gam> | <read1.fq> [<read2.fq>] ]
options:
  -g, --gfa                 output GFA format (default)
  -F, --gfa-in              input GFA format, reducing overlaps if they occur
  -v, --vg                  output VG format [DEPRECATED, use vg convert]
  -V, --vg-in               input VG format only
  -j, --json                output JSON format
  -J, --json-in             input JSON format (use with e.g. -a as necessary)
  -c, --json-stream         streaming conversion of a VG format graph
                            in line delimited JSON format
                            (this cannot be loaded directly via -J)
  -G, --gam                 output GAM format (vg alignment format)
  -Z, --translation-in      input is a graph translation description
  -t, --turtle              output RDF/turtle format (cannot be loaded by VG)
  -T, --turtle-in           input turtle format.
  -r, --rdf-base-uri URI    set base uri for the RDF output
  -a, --align-in            input GAM format, or JSON version of GAM format
  -A, --aln-graph GAM       add alignments from GAM to the graph
  -q, --locus-in            input is Locus format, or JSON version of it
  -z, --locus-out           output is Locus format
  -Q, --loci FILE           input is Locus format for use by dot output
  -d, --dot                 output dot format
  -S, --simple-dot          simple alignments & no node labels in dot output
  -u, --noseq-dot           show size instead of sequence in dot output
  -e, --ascii-labels        label paths/superbubbles with char/colors vs. emoji
  -Y, --ultra-label         label nodes with emoji/colors for ultrabubbles
  -m, --skip-missing        skip mappings to nodes not in the graph
                            when drawing alignments
  -C, --color               color nodes not in reference path (DOT OUTPUT ONLY)
  -p, --show-paths          show paths in dot output
  -w, --walk-paths          add labeled edges to represent paths in dot output
  -n, --annotate-paths      add labels to edges to represent paths in dot output
  -M, --show-mappings       with -p, print the mappings in each path in JSON
  -I, --invert-ports        invert edge ports in dot so that ne->nw is reversed
  -s, --random-seed N       use this seed for path symbols in dot output
  -b, --bam                 input BAM or other htslib-parseable alignments
  -f, --fastq-in            input fastq (output defaults to GAM). Takes two
                            positional file arguments if paired
  -X, --fastq-out           output fastq (input defaults to GAM)
  -i, --interleaved         fastq is interleaved paired-ended
  -L, --pileup              output VG Pileup format
  -l, --pileup-in           input VG Pileup format, or JSON version of it
  -B, --distance-in         input distance index
  -R, --snarl-in            input VG Snarl format
  -E, --snarl-traversal-in  input VG SnarlTraversal format
  -K, --multipath-in        input VG MultipathAlignment format (GAMP),
                            or JSON version of it
  -k, --multipath           output VG MultipathAlignment format (GAMP)
  -D, --expect-duplicates   don't warn about duplicate nodes or edges
  -x, --extract-tag TAG     extract and concatenate messages with the given tag
      --first               only extract first message with the requested tag
      --verbose             explain the file being read with --extract-tag
  -7, --threads N           for parallel operations use this many threads [1]
  -h, --help                print this help message to stderr and exit
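
Because vg view pairs one input flag with one output flag, a few combinations cover most conversions. The following is a minimal sketch of three of them; the function names and file names are hypothetical.

```shell
# Sketch only: function names are illustrative. Output goes to stdout.
# $1 = GAM file. Converts binary GAM to JSON.
gam_to_json() {
    vg view -a -j "$1"
}

# $1 = JSON version of GAM. Converts it back to binary GAM.
json_to_gam() {
    vg view -J -a -G "$1"
}

# $1 = graph file. Writes dot output with paths shown.
graph_to_dot() {
    vg view -d -p "$1"
}
```

For example, `gam_to_json aln.gam > aln.json` and `json_to_gam aln.json > roundtrip.gam`.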

BUGS

Bugs can be reported at: https://github.com/vgteam/vg/issues

For technical support, please visit: https://www.biostars.org/tag/vg/

