vg manpage

% vg(1) | Variation Graph Toolkit

NAME

vg - variation graph tool, version v1.69.0 "Bologna".

DESCRIPTION

vg is a toolkit for variation graph data structures, interchange formats, alignment, genotyping, and variant calling methods.

For more in-depth explanations of tools and workflows, see the vg wiki page.

SYNOPSIS

This is an incomplete list of vg subcommands. For a complete list, run vg help.

  • Graph construction and indexing. See the wiki page for an overview of vg indexes.
  • Read mapping
  • Downstream analyses
  • Working with read alignments
    • vg gamsort: sort a GAM/GAF file or index a sorted GAM file.
    • vg filter: filter alignments by properties.
    • vg surject: project alignments on a graph onto a linear reference (gam/gaf->bam/sam/cram).
    • vg inject: project alignments on a linear reference onto a graph (bam/sam/cram->gam/gaf).
    • vg sim: simulate reads from a graph (see the wiki page).
  • Graph and read statistics
  • Manipulating a graph
  • Conversion between formats
    • vg convert: convert between handle graph formats and GFA, and between alignment formats.
    • vg view: convert between non-handle graph formats and alignment formats (dot, json, turtle...).
    • vg surject: project alignments on a graph onto a linear reference (gam/gaf->bam/sam/cram).
    • vg inject: project alignments on a linear reference onto a graph (bam/sam/cram->gam/gaf).
    • vg paths: extract a FASTA from a graph (see the wiki page).
  • Subgraph extraction
    • vg chunk: split a graph and/or alignment into smaller chunks.
    • vg find: use an index to find nodes, edges, kmers, paths, or positions.

COMMANDS

annotate: annotate alignments with graphs and graphs with alignments

usage: vg annotate [options] >output.{gam,vg,tsv}
graph annotation options:
  -x, --xg-name FILE     xg index or graph to annotate (required)
  -b, --bed-name FILE    BED file to convert to GAM (may repeat)
  -f, --gff-name FILE    GFF3 file to convert to GAM (may repeat)
  -g, --ggff             output GGFF subgraph annotation file
                         instead of GAM (requires -s)
  -F, --gaf-output       output in GAF format rather than GAM
  -s, --snarls FILE      snarls to expand GFF intervals into
alignment annotation options:
  -a, --gam FILE         alignments to annotate (required)
  -x, --xg-name FILE     xg index of the graph against which the
                         alignments are aligned (required)
  -p, --positions        annotate alignments with reference positions
  -m, --multi-position   annotate alignments with multiple reference positions
  -l, --search-limit N   when annotating with -p, search this far for paths, or
                         -1 to not search [0 (auto from read length)]
  -b, --bed-name FILE    annotate alignments with overlapping region names
                         from this BED (may repeat)
  -n, --novelty          output TSV table with header
                         describing how much of each Alignment is novel
  -P, --progress         show progress
  -t, --threads N        use the specified number of threads
  -h, --help             print this help message to stderr and exit
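
For example, typical invocations might look like the following sketches (graph and file names are placeholders):

  # convert BED regions into GAM annotations on the graph
  vg annotate -x graph.xg -b regions.bed > regions.gam
  # annotate alignments with reference path positions
  vg annotate -x graph.xg -a aln.gam -p > aln.annotated.gam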

autoindex: mapping tool-oriented index construction from interchange formats

usage: vg autoindex [options]
output:
  -p, --prefix PREFIX    prefix to use for all output [index]
  -w, --workflow NAME    workflow to produce indexes for (may repeat) [map]
                         {map, mpmap, rpvg, giraffe, sr-giraffe, lr-giraffe}
input data:
  -r, --ref-fasta FILE   FASTA file with the reference sequence (may repeat)
  -v, --vcf FILE         VCF file with sequence names matching -r (may repeat)
  -i, --ins-fasta FILE   FASTA file with sequences of INS variants from -v
  -g, --gfa FILE         GFA file to make a graph from
  -G, --gbz FILE         GBZ file to make indexes from
  -x, --tx-gff FILE      GTF/GFF file with transcript annotations (may repeat)
  -H, --hap-tx-gff FILE  GTF/GFF file with transcript annotations 
                         of a named haplotype (may repeat)
  -n, --no-guessing      do not guess that pre-existing files are indexes
                         i.e. force-regenerate any index not explicitly provided
configuration:
  -f, --gff-feature STR  GTF/GFF feature type (col. 3) to add to graph [exon]
  -a, --gff-tx-tag STR   GTF/GFF tag (in col. 9) for ID [transcript_id]
logging and computation:
  -T, --tmp-dir DIR      temporary directory to use for intermediate files
  -M, --target-mem MEM   target max memory usage (not exact, formatted INT[kMG])
                         [1/2 of available]
  -t, --threads NUM      number of threads [all available]
  -V, --verbosity NUM    log to stderr {0 = none, 1 = basic, 2 = debug}[1]
  -h, --help             print this help message to stderr and exit
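
For example, a sketch of building the Giraffe indexes from a reference FASTA and a matching VCF (file names are placeholders):

  vg autoindex --workflow giraffe -r ref.fa -v variants.vcf.gz -p myindex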

call: call or genotype VCF variants

usage: vg call [options] <graph> > output.vcf
Call variants or genotype known variants

support calling options:
  -k, --pack FILE           supports created from vg pack for given input graph
  -m, --min-support M,N     min allele (M) and site (N) support to call [2,4]
  -e, --baseline-error X,Y  baseline error rates for Poisson model for small (X)
                            and large (Y) variants [0.005,0.01]
  -B, --bias-mode           use old ratio-based genotyping algorithm
                            as opposed to probabilistic model
  -b, --het-bias M,N        homozygous alt/ref allele must have >= M/N times
                            more support than the next best allele [6,6]
GAF options:
  -G, --gaf                 output GAF genotypes instead of VCF
  -T, --traversals          output all candidate traversals in GAF
                            without doing any genotyping
  -M, --trav-padding N      extend each flank of traversals (from -T) with
                            reference path by N bases if possible
general options:
  -v, --vcf FILE            VCF file to genotype (must have been used
                            to construct input graph with -a)
  -a, --genotype-snarls     genotype every snarl, including reference calls
                            (use to compare multiple samples)
  -A, --all-snarls          genotype all snarls, including nested child snarls
                            (like deconstruct -a)
  -c, --min-length N        genotype only snarls with
                            at least one traversal of length >= N
  -C, --max-length N        genotype only snarls where
                            all traversals have length <= N
  -f, --ref-fasta FILE      reference FASTA
                            (required if VCF has symbolic deletions/inversions)
  -i, --ins-fasta FILE      insertions (required if VCF has symbolic insertions)
  -s, --sample NAME         sample name [SAMPLE]
  -r, --snarls FILE         snarls (from vg snarls) to avoid recomputing.
  -g, --gbwt FILE           only call genotypes present in given GBWT index
  -z, --gbz                 only call genotypes present in GBZ index
                            (applies only if input graph is GBZ)
  -N, --translation FILE    node ID translation (from vg gbwt --translation)
                            to apply to snarl names in output
  -O, --gbz-translation     use the ID translation from the input GBZ when
                            writing snarl names/AT fields in output
  -p, --ref-path NAME       reference path to call on (may repeat; default all)
  -S, --ref-sample NAME     call on all paths with this sample
                            (cannot use with -p)
  -o, --ref-offset N        offset in reference path (may repeat; 1 per path)
  -l, --ref-length N        override reference length for output VCF contig
  -d, --ploidy N            ploidy of sample. {1, 2} [2]
  -R, --ploidy-regex RULES  use this comma-separated list of colon-delimited
                            REGEX:PLOIDY rules to assign ploidies to contigs
                            not visited by the selected samples, or to all
                            contigs simulated from if no samples are used.
                            Unmatched contigs get ploidy 2 (or that from -d).
  -n, --nested              activate nested calling mode (experimental)
  -I, --chains              call chains instead of snarls (experimental)
      --progress            show progress
  -t, --threads N           number of threads to use
  -h, --help                print this help message to stderr and exit
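
For example, a sketch of genotyping a sample from read support, where aln.pack is assumed to come from vg pack on the same input graph (file names are placeholders):

  vg call -k aln.pack -s SAMPLE1 graph.xg > sample1.vcf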

chunk: split graph or alignment into chunks

usage: vg chunk [options] > [chunk.vg]
Splits a graph and/or alignment into smaller chunks

Graph chunks are saved to .vg files, read chunks are saved to .gam files,
and haplotype annotations are saved to .annotate.txt files, of the form
<BASENAME>-<i>-<region name or "ids">-<start>-<length>.<ext>.
The BASENAME is specified with -b and defaults to "./chunk".

For a single-range chunk (-p or -r), the graph data is sent to
standard output instead of a file.

options:
  -x, --xg-name FILE            use this graph or xg index to chunk subgraphs
  -G, --gbwt-name FILE          use this GBWT haplotype index
                                for haplotype extraction (for -T)
  -a, --aln-name FILE           chunk alignments instead of graph (may repeat)
  -g, --aln-and-graph           when used in combination with -a,
                                both alignments and graph will be chunked
  -F, --in-gaf                  -a alignment is a sorted bgzipped GAF, not GAM
path chunking:
  -p, --path TARGET             write the chunk in the specified path range
                                (0-based inclusive, multiple allowed)
                                TARGET=path[:pos1[-pos2]] to standard output
  -P, --path-list FILE          for all paths in line separated file, 
                                write chunks for each as in -p
  -e, --input-bed FILE          write chunks for (0-based end-exclusive) regions
  -S, --snarls FILE             write given path-range(s) and all snarls 
                                fully contained in them, as alternative to -c
id range chunking:
  -r, --node-range N:M          write the chunk for this node range to stdout
  -R, --node-ranges FILE        write the chunk for each node range in
                                (newline or whitespace separated) file
  -n, --n-chunks N              generate N id-range chunks, determined via xg
simple alignment chunking:
  -m, --aln-split-size N        split alignments (-a, sort/index not required)
                                up into chunks with at most N reads each
component chunking:
  -C, --components              create a chunk for each connected component.
                                If targets given with (-p, -P, -r, -R),
                                limit to components containing them
  -M, --path-components         create a chunk for each path
                                in the graph's connected component
general:
  -s, --chunk-size N            create chunks spanning N bases
                                (or nodes with -r/-R) for all input regions.
  -o, --overlap N               overlap between chunks when using -s [0]
  -E, --output-bed FILE         write all created chunks to a bed file
  -b, --prefix BASENAME         write output chunk files [./chunk]
                                Files for chunk i will be named
                                <BASENAME>-<i>-<name>-<start>-<length>.<ext> 
  -c, --context-steps N         expand the context of the chunk N node steps [1]
  -l, --context-length N        expand the context of the chunk by N bp [0]
  -T, --trace                   trace haplotype threads in chunks
                                (and only expand forward from input coordinates)
                                Produces .annotate.txt file
                                with haplotype frequencies for each chunk.
      --no-embedded-haplotypes  don't load haplotypes from the graph. It is
                                possible to use -T without any haplotypes available.
  -f, --fully-contained         only return GAM alignments that are
                                fully contained within chunk
  -u, --cut-alignments          cut alignments to be fully within the chunk
  -O, --output-fmt STR          output format {vg, pg, hg, gfa} [pg (vg for -T)]
  -t, --threads N               for parallel tasks, use this many threads [1]
  -h, --help                    print this help message to stderr and exit
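
For example, a sketch of extracting a single path range with extra context (path and file names are placeholders); per the note above, single-range output goes to standard output:

  vg chunk -x graph.xg -p chr20:1000000-2000000 -c 10 > chr20_region.vg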

construct: graph construction

usage: vg construct [options] >new.vg
options:
construct from a reference and variant calls:
  -r, --reference FILE   input FASTA reference (may repeat)
  -v, --vcf FILE         input VCF (may repeat)
  -n, --rename V=F       match contig V in the VCFs to contig F in the FASTAs
                         (may repeat)
  -a, --alt-paths        save paths for alts of variants by SHA1 hash
  -A, --alt-paths-plain  save paths for alts of variants by variant ID
                         if possible, otherwise SHA1
                         (IDs must be unique across all input VCFs)
  -R, --region REGION    specify a VCF contig name or 1-based inclusive region
                         (may repeat, if on different contigs)
  -C, --region-is-chrom  don't attempt to parse the regions (use when reference
                         sequence name could be parsed as a region)
  -z, --region-size N    variants per region to parallelize [1024]
  -t, --threads N        use N threads to construct graph [numCPUs]
  -S, --handle-sv        include structural variants in construction of graph.
  -I, --insertions FILE  a FASTA file containing insertion sequences 
                         (referred to in VCF) to add to graph.
  -f, --flat-alts        don't chop up alternate alleles from input VCF
  -l, --parse-max N      don't chop up alternate alleles from input VCF
                         longer than N [100]
  -i, --no-trim-indels   don't remove the 1bp ref base from indel alt alleles
  -N, --in-memory        construct entire graph in memory before outputting it
construct from a multiple sequence alignment:
  -M, --msa FILE         input multiple sequence alignment
  -F, --msa-format STR   format of the MSA file {fasta, clustal} [fasta]
  -d, --drop-msa-paths   don't add paths for the MSA sequences into the graph
shared construction options:
  -m, --node-max N       limit maximum allowable node sequence size [32]
                         nodes greater than this threshold will be divided
                         note: nodes larger than ~1024 bp can't be GCSA2-indexed
  -p, --progress         show progress
  -h, --help             print this help message to stderr and exit
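
For example, a sketch of building a graph from a reference FASTA and a VCF (file names are placeholders):

  vg construct -r ref.fa -v variants.vcf.gz -t 16 > graph.vg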

convert: convert graphs between handle-graph compliant formats as well as GFA

usage: vg convert [options] <input-graph>
input options:
  -g, --gfa-in               input in GFA format
  -r, --in-rgfa-rank N       import rgfa tags with rank <= N as paths [0]
  -b, --gbwt-in FILE         input graph is a GBWTGraph using the GBWT in FILE
      --ref-sample STR       change haplotypes for this sample to
                             reference paths (may repeat)
      --hap-locus STR        change generic paths with this locus
                             to haplotype paths (must be used with --new-sample)
      --new-sample STR       when using --hap-locus, give the new haplotype
                             this sample name (must be used with --hap-locus)
gfa input options (use with -g):
  -T, --gfa-trans FILE       write gfa id conversions to FILE
output options:
  -v, --vg-out               output in VG's original Protobuf format
                             [DEPRECATED: use -p instead].
  -a, --hash-out             output in HashGraph format
  -p, --packed-out           output in PackedGraph format (default)
  -x, --xg-out               output in XG format
  -f, --gfa-out              output in GFA format
  -H, --drop-haplotypes      do not include haplotype paths in the output
                             (useful with GBWTGraph / GBZ inputs)
gfa output options (use with -f):
  -P, --rgfa-path STR        write given path as rGFA tags instead of lines
                             (may repeat, only rank-0 supported)
  -Q, --rgfa-prefix STR      write paths with this prefix as rGFA tags instead
                             of lines (may repeat, only rank-0 supported)
  -B, --rgfa-pline           paths written as rGFA tags also written as lines
  -W, --no-wline             write all paths as GFA P-lines instead of W-lines.
                             allows handling multiple phase blocks 
                             and subranges used together.
      --gbwtgraph-algorithm  always use the GBWTGraph library GFA algorithm.
                             not compatible with other GFA output options
                             or non-GBWT graphs.
      --vg-algorithm         always use the VG GFA algorithm. Works with all
                             options and graph types, but can't preserve
                             original GFA coordinates
      --no-translation       when using the GBWTGraph algorithm, convert graph
                             directly to GFA; do not use the translation
                             to preserve original coordinates
alignment options:
  -G, --gam-to-gaf FILE      convert GAM FILE to GAF
  -F, --gaf-to-gam FILE      convert GAF FILE to GAM
general options:
  -t, --threads N            use N threads [numCPUs]
  -h, --help                 print this help message to stderr and exit
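
For example, sketches of round-tripping between GFA and PackedGraph (file names are placeholders):

  # GFA in, PackedGraph out (the default output format)
  vg convert -g graph.gfa > graph.pg
  # PackedGraph in, GFA out
  vg convert -f graph.pg > out.gfa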

deconstruct: create a VCF from variation in the graph

usage: vg deconstruct [options] [-p|-P] <PATH> <GRAPH>
Output VCF records for Snarls present in a graph (relative to a reference path).
options: 
  -p, --path NAME          a reference path to deconstruct against (may repeat).
  -P, --path-prefix NAME   all paths [minus GBWT threads / non-ref GBZ paths]
                           beginning with NAME used as reference (may repeat).
                           other non-ref paths not considered as samples. 
  -r, --snarls FILE        snarls file (from vg snarls) to avoid recomputing.
  -g, --gbwt FILE          consider alt traversals for GBWT haplotypes in FILE
                           (not needed for GBZ graph input).
  -T, --translation FILE   node ID translation (from vg gbwt --translation)
                           to apply to snarl names and AT fields in output
  -O, --gbz-translation    use the ID translation from the input GBZ when
                           writing snarl names and AT fields in output
  -a, --all-snarls         process all snarls, including nested snarls
                           (by default only top-level snarls reported).
  -c, --context-jaccard N  set context mapping size used to disambiguate alleles
                           at sites with multiple reference traversals [10000]
  -u, --untangle-travs     use context mapping for reference-relative positions
                           of each step in allele traversals (AP INFO field).
  -K, --keep-conflicted    retain conflicted genotypes in output.
  -S, --strict-conflicts   drop genotypes when we have more than one haplotype
                           for any given phase (set by default for GBWT input).
  -C, --contig-only-ref    only use CONTIG name (not SAMPLE#CONTIG#HAPLOTYPE)
                           for reference if possible (i.e. only one ref sample)
  -L, --cluster F          cluster traversals whose (handle) Jaccard coefficient
                           is >= F together [1.0; experimental]
  -n, --nested             write a nested VCF, plus special tags [experimental]
  -R, --star-allele        use *-alleles to denote alleles that span
                           but do not cross the site. Only works with -n
  -t, --threads N          use N threads
  -v, --verbose            print some status messages
  -h, --help               print this help message to stderr and exit
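
For example, a sketch of exporting a VCF relative to all paths whose names begin with a reference prefix (the prefix and file names are placeholders):

  vg deconstruct -P GRCh38 -t 16 graph.gbz > variants.vcf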

filter: filter reads and get statistics by read

usage: vg filter [options] <alignment.gam> > out.gam
Filter alignments by properties.

options:
  -M, --input-mp-alns          input is multipath alignments (GAMP), not GAM
  -n, --name-prefix NAME       keep only reads with this name prefix ['']
  -N, --name-prefixes FILE     keep reads with names with any of these prefixes,
                               one per nonempty line
  -e, --exact-name             match read names exactly instead of by prefix
  -a, --subsequence NAME       keep reads that contain this subsequence
  -A, --subsequences FILE      keep reads that contain one of these subsequences
                               one per nonempty line
  -p, --proper-pairs           keep reads annotated as being properly paired
  -P, --only-mapped            keep reads that are mapped
  -X, --exclude-contig REGEX   drop reads with refpos annotations on contigs
                               matching the given regex (may repeat)
  -F, --exclude-feature NAME   drop reads with the given feature
                               in the "features" annotation (may repeat)
  -s, --min-secondary N        minimum score to keep secondary alignment
  -r, --min-primary N          minimum score to keep primary alignment
  -L, --max-length N           drop reads with length > N
  -O, --rescore                re-score reads using default parameters
                               and only alignment information
  -f, --frac-score             normalize score based on length
  -u, --substitutions          use substitution count instead of score
  -W, --overwrite-score        replace stored GAM score with computed/normalized
                               score
  -o, --max-overhang N         drop reads whose alignments begin or end
                               with an insert > N [99999]
  -m, --min-end-matches N      drop reads without >=N matches on each end
  -S, --drop-split             remove split reads taking nonexistent edges
  -x, --xg-name FILE           use this xg index/graph (required for -S and -D)
  -v, --verbose                print out statistics on numbers of reads dropped
  -V, --no-output              print out -v statistics and do not write the GAM
  -T, --tsv-out FIELD[;FIELD]  write TSV of given fields instead of filtered GAM
                               See wiki page:
                               "Getting alignment statistics with vg filter"
  -q, --min-mapq N             drop alignments with mapping quality < N
  -E, --repeat-ends N          drop reads with tandem repeat (motif size <= 2N,
                               spanning >= N bases) at either end
  -D, --defray-ends N          clip back the ends of ambiguously aligned reads
                               up to N bases
  -C, --defray-count N         stop defraying after N nodes visited
                               (used to keep runtime in check) [99999]
  -d, --downsample S.P         drop all but the given portion 0.P of the reads.
                               S may be an integer seed as in SAMtools
  -R, --max-reads N            drop all but N reads. Use on a single thread
  -i, --interleaved            assume interleaved input; both ends will be
                               dropped if either end fails the filter
  -I, --interleaved-all        assume interleaved input; both ends will be
                               dropped only if *both* ends fail filters
  -b, --min-base-quality Q:F   drop reads where fewer than fraction F of bases
                               have base quality >= PHRED score Q.
  -G, --annotation K[:V]       keep reads if the annotation is present and 
                               not false/empty. If a value is given, keep reads
                               if the values are equal, similar to running
                               jq 'select(.annotation.K==V)' on the JSON
  -c, --correctly-mapped       keep only reads marked as correctly-mapped
  -l, --first-alignment        keep only the first alignment for each read
                               Must be run with 1 thread
  -U, --complement             apply opposite of the filter from other arguments
  -B, --batch-size N           work in batches of N reads [512]
  -t, --threads N              number of threads [1]
      --progress               show progress
  -h, --help                   print this help message to stderr and exit
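
For example, a sketch that keeps only mapped, properly paired reads with mapping quality at least 30 (file names are placeholders):

  vg filter -P -p -q 30 aln.gam > filtered.gam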

find: use an index to find nodes, edges, kmers, paths, or positions

usage: vg find [options] >sub.vg
options:
  -h, --help                  print this help message to stderr and exit
graph features:
  -x, --xg-name FILE          use this xg index or graph (instead of rocksdb db)
  -n, --node ID               find node(s), return 1-hop context as graph
  -N, --node-list FILE        whitespace or line delimited list of nodes to grab
      --mapping FILE          include nodes mapping to the selected node IDs
  -e, --edges-end ID          return edges on end of node with ID
  -s, --edges-start ID        return edges on start of node with ID
  -c, --context STEPS         expand the context of the subgraph this many steps
  -L, --use-length            treat STEPS in -c or M in -r as a length in bases
  -P, --position-in PATH      find the position of -n node in the given path
  -I, --list-paths            write out the path names in the index
  -r, --node-range N:M        get nodes from N to M
  -G, --gam GAM               accumulate the graph touched by GAM's alignments
      --connecting-start POS  find graph from POS (node ID, + or -, node offset)
                              connecting to --connecting-end
      --connecting-end POS    find graph to POS (node ID, + or -, node offset)
                              connecting from --connecting-start
      --connecting-range INT  traverse up to INT bases when going 
                              from --connecting-start to --connecting-end [100]
subgraphs by path range:
  -p, --path TARGET           find the node(s) in the specified path range(s)
                              TARGET=path[:pos1[-pos2]]
  -R, --path-bed FILE         read our targets from the given BED FILE
  -E, --path-dag              with -p or -R, gets any node in the partial order
                              from pos1 to pos2, assumes id sorted DAG
  -W, --save-to PREFIX        instead of writing target subgraphs to stdout,
                              write one per given target to a separate file
                              named PREFIX[path]:[start]-[end].vg
  -K, --subgraph-k K          instead of graphs, write kmers from the subgraphs
  -H, --gbwt FILE             when enumerating kmers from subgraphs, determine
                              their frequencies in this GBWT haplotype index
alignments:
  -l, --sorted-gam FILE       use this sorted, indexed GAM file
  -F, --sorted-gaf FILE       use this sorted, indexed GAF file
  -o, --alns-on N:M           write alignments which align to any of the
                              nodes between N and M (inclusive)
  -A, --to-graph VG           get alignments to the provided subgraph
sequences:
  -g, --gcsa FILE             use this GCSA2 index of the graph's sequence space
                              (required for sequence queries)
  -S, --sequence STR          search for sequence STR using
  -M, --mems STR              describe the super-maximal exact matches
                              of the STR (GCSA2) in JSON
  -B, --reseed-length N       find non-super-maximal MEMs inside SMEMs length>=N
  -f, --fast-reseed           use fast SMEM reseeding algorithm
  -Y, --max-mem N             maximum length of the MEM [GCSA2 order]
  -Z, --min-mem N             minimum length of the MEM [1]
  -D, --distance              return distance on path between pair of nodes (-n)
                              if -P not used, best path chosen heuristically
  -Q, --paths-named STR       return all paths with name prefix STR (may repeat)
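
For example, sketches of pulling out subgraphs (node IDs, path names, and file names are placeholders):

  # 3-step context around node 1234
  vg find -x graph.xg -n 1234 -c 3 > sub.vg
  # subgraph covering a path interval
  vg find -x graph.xg -p chr1:10000-20000 > region.vg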

gamsort: sort a GAM/GAF file or index a sorted GAM file

usage: vg gamsort [options] input > output

Sort a GAM/GAF file, or index a sorted GAM file.

General options:
  -p, --progress          show progress
  -s, --shuffle           shuffle reads by hash
  -t, --threads N         use N worker threads [4 for GAM, 1 for GAF]

GAM sorting options:
  -i, --index FILE        produce an index of the sorted GAM file
  -d, --dumb-sort         use naive sorting algorithm
                          (no tmp files, faster for small GAMs)

GAF sorting options:
  -G, --gaf-input         input is a GAF file
  -c, --chunk-size N      number of reads per chunk [1000000]
  -m, --merge-width N     number of files to merge at once [32]
  -S, --stable            use stable sorting
  -g, --gbwt-output FILE  write a GBWT index of the paths to FILE
  -b, --bidirectional     make the GBWT index bidirectional
  -h, --help              print this help message to stderr and exit
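
For example, a sketch of sorting a GAM and writing its index alongside it (file names are placeholders):

  vg gamsort -p -i aln.sorted.gam.gai aln.gam > aln.sorted.gam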


gbwt: build and manipulate GBWT and GBZ files

usage: vg gbwt [options] [args]

Manipulate GBWTs. Input GBWTs are loaded from input args
or built in earlier steps. See wiki page "VG GBWT Subcommand".
The input graph is provided with one of -x, -G, or -Z

General options:
  -h, --help              print this help message to stderr and exit
  -x, --xg-name FILE      read the graph from FILE
  -o, --output FILE       write output GBWT to FILE
  -d, --temp-dir DIR      use directory DIR for temporary files
  -p, --progress          show progress and statistics

GBWT construction parameters (for steps 1 and 4):
      --buffer-size N     construction buffer size in millions of nodes [100]
      --id-interval N     store path IDs at 1/N positions [1024]

Multithreading:
      --num-jobs N        use at most N parallel build jobs
                          (for -v, -G, -A, -l, -P) [4]
      --num-threads N     use N parallel search threads
                          (for -b and -r) [8]

Step 1: GBWT construction (requires -o and one of { -v, -G, -Z, -E, -A }):
  -v, --vcf-input         index the haplotypes in the VCF files specified in
                          input args in parallel (requires -x, implies -f);
                          (inputs must be over different contigs,
                          does not store graph contigs in the GBWT)
      --preset X          use preset X (available: 1000gp)
      --inputs-as-jobs    create one build job for each input
                          instead of using first-fit heuristic
      --parse-only        store the VCF parses without building GBWTs
                          (use -o for file name prefix; skips later steps)
      --ignore-missing    don't warn when variants are missing from the graph
      --actual-phasing    don't treat unphased homozygous genotypes as phased
      --force-phasing     replace unphased genotypes with randomly phased ones
      --discard-overlaps  skip overlapping alternate alleles if the overlap
                          cannot be resolved instead of creating a phase break
      --batch-size N      index the haplotypes in batches of N samples [200]
      --sample-range X-Y  index samples X to Y (inclusive, 0-based)
      --rename V=P        VCF contig V matches path P in the graph (may repeat)
      --vcf-variants      variants in graph use VCF contig names, not path names
      --vcf-region C:X-Y  restrict VCF contig C to coordinates X to Y
                          (inclusive, 1-based; may repeat)
      --exclude-sample X  do not index the sample with name X
                          (faster than -R; may repeat)
  -G, --gfa-input         index walks or paths in the GFA file (one input arg)
      --max-node N        chop long segments into nodes of at most N bp
                          (use 0 to disable) [1024]
      --path-regex X      parse metadata as haplotypes from path names
                          using regex X instead of vg-parser-compatible rules
      --path-fields X     parse metadata as haplotypes, mapping regex submatches
                          to these fields instead of vg-parser-compatible rules
      --translation FILE  write the segment to node translation table to FILE
  -Z, --gbz-input         extract GBWT & GBWTGraph from GBZ (one input arg)
  -I, --gg-in FILE        load GBWTGraph from FILE and GBWT from (one) input arg
  -E, --index-paths       index the embedded non-alt paths in the graph
                          (requires -x, no input args)
  -A, --alignment-input   index the alignments in the GAF files specified
                          in input args (requires -x)
      --gam-format        input files are in GAM format instead of GAF format

Step 2: Merge multiple input GBWTs (requires -o):
  -m, --merge             use the insertion algorithm
  -f, --fast              fast merging algorithm (node ids must not overlap)
  -b, --parallel          use the parallel algorithm
      --chunk-size N      search in chunks of N sequences [1]
      --pos-buffer N      use N MiB position buffers for each search thread [64]
      --thread-buffer N   use N MiB thread buffers for each search thread [256]
      --merge-buffers N   merge 2^N thread buffers into one file per merge [6]
      --merge-jobs N      run N parallel merge jobs [4]

Step 3: Alter GBWT (requires -o and one input GBWT):
  -R, --remove-sample X   remove sample X from the index (may repeat)
      --set-tag K=V       set a GBWT tag (may repeat)
      --set-reference X   set sample X as the reference (may repeat)

Step 4: Path cover GBWT construction 
(requires an input graph, -o, and one of { -a, -l, -P }):
  -a, --augment-gbwt      add path cover of missing components (one input GBWT)
  -l, --local-haplotypes  sample local haplotypes (one input GBWT)
  -P, --path-cover        build a greedy path cover (no input GBWTs)
  -n, --num-paths N       find N paths per component [64 for -l, 16 otherwise]
  -k, --context-length N  use N-node contexts [4]
      --pass-paths        include named graph paths in local haplotype
                          or greedy path cover GBWT

Step 5: GBWTGraph construction (requires an input graph and one input GBWT):
  -g, --graph-name FILE   build GBWTGraph and store it in FILE
      --gbz-format        serialize both GBWT and GBWTGraph in GBZ format
                          (makes -o unnecessary)

Step 6: R-index construction (one input GBWT):
  -r, --r-index FILE      build an r-index and store it in FILE

Step 7: Metadata (one input GBWT):
  -M, --metadata          print basic metadata
  -C, --contigs           print the number of contigs
  -H, --haplotypes        print the number of haplotypes
  -S, --samples           print the number of samples
  -L, --list-names        list contig/sample names (use with -C or -S)
  -T, --path-names        list path names
      --tags              list GBWT tags

Step 8: Paths (one input GBWT):
  -c, --count-paths       print the number of paths
  -e, --extract FILE      extract paths in SDSL format to FILE
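
For example, sketches of step 1 (building a GBWT from phased VCFs) and step 5 (packing it with its graph as GBZ); file names are placeholders:

  # step 1: index the haplotypes of a phased VCF
  vg gbwt -x graph.vg -o haplotypes.gbwt -v phased.vcf.gz
  # step 5: build the GBWTGraph and serialize both as GBZ
  vg gbwt -x graph.vg -g graph.gbz --gbz-format haplotypes.gbwt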


giraffe: fast haplotype-aware read alignment

usage:
  vg giraffe -Z graph.gbz [-d graph.dist [-m graph.withzip.min -z graph.zipcodes]] <input options> [other options] > output.gam
  vg giraffe -Z graph.gbz --haplotype-name graph.hapl --kff-name sample.kff <input options> [other options] > output.gam

Fast haplotype-aware read mapper.

basic options:
  -Z, --gbz-name FILE           map to this GBZ graph
  -m, --minimizer-name FILE     use this minimizer index
  -z, --zipcode-name FILE       use these additional distance hints
  -d, --dist-name FILE          cluster using this distance index
  -p, --progress                show progress
  -t, --threads N               number of mapping threads to use
  -b, --parameter-preset NAME   set computational parameters [default]
                                (chaining-sr / default / fast / hifi / r10 / srold)
  -h, --help                    print full help with all available options
input options:
  -G, --gam-in FILE             read and realign these GAM-format reads
  -f, --fastq-in FILE           read and align these FASTQ/FASTA-format reads
                                (two are allowed, one for each mate)
  -i, --interleaved             GAM/FASTQ/FASTA input is interleaved pairs,
                                for paired-end alignment
      --comments-as-tags        treat comments in name lines as SAM-style tags
                                and annotate alignments with them
haplotype sampling:
      --haplotype-name FILE     sample from haplotype information in FILE
      --kff-name FILE           sample according to kmer counts in FILE
      --index-basename STR      name prefix for generated graph/index files
                                (default: from graph name)
      --set-reference STR       include this sample as a reference
                                in the personalized graph (may repeat)
alternate graphs:
  -x, --xg-name FILE            map to this graph (if no -Z / -g),
                                or use this graph for HTSLib output
  -g, --graph-name FILE         map to this GBWTGraph (if no -Z)
  -H, --gbwt-name FILE          use this GBWT index (when mapping to -x / -g)
output options:
  -N, --sample NAME             add this sample name
  -R, --read-group NAME         add this read group
  -o, --output-format NAME      output the alignments in NAME format [gam]
                                {gam / gaf / json / tsv / SAM / BAM / CRAM} 
      --ref-paths FILE          ordered list of paths in the graph, one per line
                                or HTSlib .dict, for HTSLib @SQ headers
      --ref-name NAME           name of reference in the graph for HTSlib output
      --named-coordinates       make GAM/GAF output in named-segment (GFA) space
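
For example, a sketch of mapping paired-end reads against a GBZ graph (file names are placeholders):

  vg giraffe -Z graph.gbz -f reads_1.fq.gz -f reads_2.fq.gz -t 16 > aln.gam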

haplotypes: haplotype sampling based on kmer counts

usage:
    vg haplotypes [options] -k kmers.kff -g output.gbz graph.gbz
    vg haplotypes [options] -H output.hapl graph.gbz
    vg haplotypes [options] -i graph.hapl -k kmers.kff -g output.gbz graph.gbz

Haplotype sampling based on kmer counts.

Output files:
  -g, --gbz-output FILE        write the output GBZ to file (requires -k)
  -H, --haplotype-output FILE  write haplotype information to file

Input files:
  -d, --distance-index FILE    use this distance index [<basename>.dist]
  -r, --r-index FILE           use this r-index [<basename>.ri]
  -i, --haplotype-input FILE   use this .hapl file (default: generate)
  -k, --kmer-input FILE        use kmer counts from this KFF file

Options for generating haplotype information:
      --kmer-length N          kmer length for building minimizer index [29]
      --window-length N        window length for building minimizer index [11]
      --subchain-length N      target length (in bp) for subchains [10000]
      --linear-structure       extend subchains to avoid haplotypes
                               visiting them multiple times

Options for sampling haplotypes:
      --preset STR             use preset STR {default, haploid, diploid}
      --coverage N             kmer coverage in KFF file (default: estimate)
      --num-haplotypes N       generate N haplotypes [4]
                               with --diploid-sampling, use N candidates [32]
      --present-discount F     discount scores for present kmers by factor F
                               [0.9]
      --het-adjustment F       adjust scores for heterozygous kmers by F [0.05]
      --absent-score F         score absent kmers -F/+F [0.8]
      --haploid-scoring        use a scoring model without heterozygous kmers
      --diploid-sampling       choose the best pair from the sampled haplotypes
      --extra-fragments        select all candidates in bad subchains
                               in --diploid-sampling
      --badness F              threshold for badness of a subchain [4]
      --include-reference      include named and reference paths in the output
      --set-reference NAME     use sample NAME as a reference sample (may repeat)

Other options:
  -v, --verbosity N            verbosity level [0]
                               {0 = silent, 1 = basic, 2 = detailed, 3 = debug}
  -t, --threads N              approximate number of threads [8 on this system]
  -h, --help                   print this help message to stderr and exit
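
For example, a sketch matching the first usage form above, where sample.kff holds the sample's kmer counts in KFF format (file names are placeholders):

  vg haplotypes -k sample.kff -g sampled.gbz graph.gbz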


ids: manipulate node ids

usage: vg ids [options] <graph1.vg> [graph2.vg ...] >new.vg
options:
  -c, --compact        minimize the space of integers used by the ids
  -i, --increment N    increase ids by N
  -d, --decrement N    decrease ids by N
  -j, --join           make a joint ID space for all supplied graphs
                       by iterating through the supplied graphs and incrementing
                       their ids to be non-conflicting (modifies original files)
  -m, --mapping FILE   create an empty node mapping for vg prune
  -s, --sort           assign new node IDs in generalized topological sort order
  -h, --help           print this help message to stderr and exit
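
For example, a sketch of putting two graphs into a joint ID space (note that -j modifies the input files in place; file names are placeholders):

  vg ids -j part1.vg part2.vg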

index: index graphs or alignments for random access or mapping

usage: vg index [options] <graph1.vg> [graph2.vg ...]
Creates an index on the specified graph or graphs. All graphs indexed must 
already be in a joint ID space.
general options:
  -h, --help                print this help message to stderr and exit
  -b, --temp-dir DIR        use DIR for temporary files
  -t, --threads N           number of threads to use
  -p, --progress            show progress
xg options:
  -x, --xg-name FILE        use this file to store a succinct, queryable version
                            of graph(s), or read for GCSA or distance indexing
  -L, --xg-alts             include alt paths in xg
gcsa options:
  -g, --gcsa-out FILE       output a GCSA2 index to the given file
  -f, --mapping FILE        use this node mapping in GCSA2 construction
  -k, --kmer-size N         index kmers of size N in the graph [16]
  -X, --doubling-steps N    use N doubling steps for GCSA2 construction [4]
  -Z, --size-limit N        limit temp disk space usage to N GB [2048]
  -V, --verify-index        validate the GCSA2 index using the input kmers
                            (important for testing)
gam indexing options:
  -l, --index-sorted-gam    input is sorted .gam format alignments,
                            store a GAI index of the sorted GAM in INPUT.gam.gai
vg in-place indexing options:
      --index-sorted-vg     input is ID-sorted .vg format graph chunks
                            store a VGI index of the sorted vg in INPUT.vg.vgi
snarl distance index options:
  -j, --dist-name FILE      use this file to store a snarl-based distance index
      --snarl-limit N       don't store distances for snarls > N nodes [10000]
                            if 0 then don't store distances, only the snarl tree
      --no-nested-distance  only store distances along the top-level chain
  -w, --upweight-node N     upweight the node with ID N to push it to be part
                            of a top-level chain (may repeat)
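
For example, sketches of common indexing runs (file names are placeholders; GCSA2 indexing is typically run on a graph first simplified with vg prune):

  # xg index
  vg index -x graph.xg graph.vg
  # snarl distance index
  vg index -j graph.dist graph.vg
  # GCSA2 index for vg map
  vg index -g graph.gcsa -k 16 pruned.vg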

inject: lift over alignments for the graph

usage: vg inject -x graph.xg [options] input.[bam|sam|cram] >output.gam

options:
  -x, --xg-name FILE        use this graph or xg index (required, non-XG okay)
  -i, --add-identity        calculate & add 'identity' statistic to output GAM
  -r, --rescore             re-score alignments
  -o, --output-format NAME  output alignment format {gam / gaf / json} [gam]
  -t, --threads N           number of threads to use
  -h, --help                print this help message to stderr and exit
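
For example, a sketch of lifting linear-reference alignments into graph space (file names are placeholders):

  vg inject -x graph.xg aln.bam > aln.gam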

map: MEM-based read alignment

usage: vg map [options] -d idxbase -f in1.fq [-f in2.fq] >aln.gam
Align reads to a graph.

graph/index:
  -d, --base-name BASE             use BASE.xg and BASE.gcsa as input indexes
  -x, --xg-name FILE               use this xg index or graph [<graph>.vg.xg]
  -g, --gcsa-name FILE             use this GCSA2 index [<graph>.gcsa]
  -1, --gbwt-name FILE             use this GBWT haplotype index [<graph>.gbwt]
algorithm:
  -t, --threads N                  number of compute threads to use
  -k, --min-mem INT                minimum MEM length (if 0 estimate via -e) [0]
  -e, --mem-chance FLOAT           this fraction of -k length hits
                                   will be by chance [5e-4]
  -c, --hit-max N                  ignore MEMs who have >N hits in our index
                                   (0 for no limit) [2048]
  -Y, --max-mem INT                ignore MEMs longer than INT (unset if 0) [0]
  -r, --reseed-x FLOAT             look for internal seeds inside a seed
                                   longer than FLOAT*--min-seed [1.5]
  -u, --try-up-to INT              attempt to align up to the INT best candidate
                                   chains of seeds (1/2 for paired) [128]
  -l, --try-at-least INT           attempt to align at least the INT best
                                   candidate chains of seeds [1]
  -E, --approx-mq-cap INT          weight MQ by suffix tree based estimate
                                   when estimate less than FLOAT [0]
  -7, --id-mq-weight N             scale mapping quality by the alignment score
                                   identity to this power [2]
  -W, --min-chain INT              discard a chain if seeded bases are
                                   shorter than INT [0]
  -C, --drop-chain FLOAT           drop chains shorter than FLOAT fraction of
                                   the longest overlapping chain [0.45]
  -n, --mq-overlap FLOAT           scale MQ by count of alignments with FLOAT
                                   overlap in the query with the primary [0]
  -P, --min-ident FLOAT            accept alignment only if the alignment
                                   identity is >= FLOAT [0]
  -H, --max-target-x N             skip cluster subgraphs with
                                   length > N*read_length [100]
  -w, --band-width INT             band width for long read alignment [256]
  -O, --band-overlap INT           band overlap for long read alignment [{-w}/8]
  -J, --band-jump INT              the maximum number of bands of insertion we
                                   consider in the alignment chain model [128]
  -B, --band-multi INT             consider this many alignments of each band
                                   in banded alignment [16]
  -Z, --band-min-mq INT            treat bands with < INT MQ as unaligned [0]
  -I, --fragment STR               fragment length distribution specification
                                   STR=m:μ:σ:o:d [5000:0:0:0:1]
                                   max:mean:stdev:orientation (1=same/0=flip):
                                   direction (1=forward, 0=backward)
  -U, --fixed-frag-model           don't learn the pair fragment model online,
                                   use -I without update
  -p, --print-frag-model           suppress alignment output and print the
                                   fragment model on stdout as per -I format
  -4, --frag-calc INT              update the fragment model
                                   every INT perfect pairs [10]
  -3, --fragment-x FLOAT           calculate max fragment size as
                                   frag_mean+frag_sd*FLOAT [10]
  -0, --mate-rescues INT           attempt up to INT mate rescues per pair [64]
  -S, --unpaired-cost INT          penalty for an unpaired read pair [17]
  -8, --no-patch-aln               do not patch banded alignments by
                                   locally aligning unaligned regions
      --xdrop-alignment            use X-drop heuristic
                                   (much faster for long-read alignment)
      --max-gap-length INT         maximum gap length allowed in each contiguous
                                   alignment (for X-drop alignment) [40]
scoring:
  -q, --match INT                  use this match score [1]
  -z, --mismatch INT               use this mismatch penalty [4]
      --score-matrix FILE          use this 4x4 integer substitution scoring
                                   matrix (in the order ACGT)
  -o, --gap-open INT               use this gap open penalty [6]
  -y, --gap-extend INT             use this gap extension penalty [1]
  -L, --full-l-bonus INT           the full-length alignment bonus [5]
  -2, --drop-full-l-bonus          remove the full length bonus from the score
                                   before sorting and MQ calculation
  -a, --hap-exp FLOAT              the exponent for haplotype consistency
                                   likelihood in alignment score [1]
      --recombination-penalty NUM  use this log recombination penalty
                                   for GBWT haplotype scoring [20.7]
  -A, --qual-adjust                perform base quality adjusted alignments
                                   (requires base quality input)
preset:
  -m, --alignment-model STR        use a preset alignment scoring model, either
                                   "short" (default) or "long" (ONT/PacBio)
                                   "long" is equivalent to
                                   `-u 2 -L 63 -q 1 -z 2 -o 2 -y 1 -w 128 -O 32`
input:
  -s, --sequence STR               align a string to the graph in graph.vg
                                   using partial order alignment
  -V, --seq-name STR               name the sequence STR
                                   (for graph modification with new named paths)
  -T, --reads FILE                 take reads (one per line) from FILE,
                                   write alignments to stdout
  -b, --hts-input FILE             align reads from stdin htslib-compatible FILE
                                   (BAM/CRAM/SAM), alignments to stdout
  -G, --gam-input FILE             realign GAM input
  -f, --fastq FILE                 input FASTQ or (2-line format) FASTA, maybe
                                   compressed; two allowed, one for each mate
  -F, --fasta FILE                 align the sequences in a FASTA file that may
                                   have multiple lines per reference sequence
      --comments-as-tags           interpret comments in name lines as SAM-style
                                   tags and annotate alignments with them
  -i, --interleaved                FASTQ or GAM is interleaved paired-ended
  -N, --sample NAME                for --reads input, add this sample
  -R, --read-group NAME            for --reads input, add this read group
output:
  -j, --output-json                output JSON rather than an alignment stream
                                   (helpful for debugging)
  -%, --gaf                        output alignments in GAF format
  -5, --surject-to TYPE            surject the output into the graph's paths,
                                   writing TYPE {bam, sam, cram}
      --ref-paths FILE             ordered list of paths in graph, one per line
                                   or HTSlib .dict, for HTSLib @SQ headers
      --ref-name NAME              reference assembly in graph for HTSlib output
  -9, --buffer-size INT            buffer this many alignments together
                                   before outputting in GAM [512]
  -X, --compare                    realign -G GAM input, writing alignment with
                                   "correct" field set to overlap with input
  -v, --refpos-table               for efficient testing output a table of
                                   name, chr, pos, mq, score
  -K, --keep-secondary             produce alignments for secondary input
                                   alignments in addition to primary ones
  -M, --max-multimaps INT          produce up to INT alignments per read [1]
  -Q, --mq-max INT                 cap the mapping quality at INT [60]
      --exclude-unaligned          exclude reads with no alignment
  -D, --debug                      print debugging information to stderr
  -^, --log-time                   print runtime to stderr
  -h, --help                       print this help message to stderr and exit
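
For example, a sketch of aligning paired FASTQ reads with indexes named index_base.xg and index_base.gcsa (names are placeholders):

  vg map -d index_base -f reads_1.fq.gz -f reads_2.fq.gz -t 16 > aln.gam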

minimizer: build a minimizer index or a syncmer index

usage: vg minimizer [options] -d graph.dist -o graph.min graph

Builds a (w, k)-minimizer index or a (k, s)-syncmer index of the threads in the
GBWT. The graph can be any HandleGraph, which will be made into a GBWTGraph.
The transformation can be avoided by providing a GBWTGraph or a GBZ graph.

Required options:
  -d, --distance-index FILE  annotate hits with positions in this distance index
  -o, --output-name FILE     store the index in a file

Minimizer options:
  -k, --kmer-length N        length of the kmers in the index [29] (max 31)
  -w, --window-length N      choose minimizer from a window of N kmers [11]
  -c, --closed-syncmers      index closed syncmers instead of minimizers
  -s, --smer-length N        use smers of length N in closed syncmers [18]

Weighted minimizers:
  -W, --weighted             use weighted minimizers
      --threshold N          downweight kmers with more than N hits [500]
      --iterations N         downweight frequent kmers by N iterations [3]
      --fast-counting        use the fast kmer counting algorithm (default)
      --save-memory          use the space-efficient kmer counting algorithm
      --hash-table N         use 2^N-cell hash tables for kmer counting
                             (default: guess)

Other options:
  -z, --zipcode-name FILE    store the distances that are too big in a file
                             if no -z, some distances may be discarded
  -l, --load-index FILE      load this index and insert the new kmers into it
                             (overrides minimizer / weighted minimizer options)
  -g, --gbwt-name FILE       use this GBWT index (required with a non-GBZ graph)
  -E, --rec-mode             use recombination-aware MinimizerIndex
  -p, --progress             show progress information
  -t, --threads N            use N threads for index construction [8]
                             (using more than 16 threads rarely helps)
      --no-dist              build the index without distance index annotations
                             (not recommended)
  -h, --help                 print this help message to stderr and exit
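
For example, a sketch matching the usage line above, with the oversized distances written to a separate zipcode file (file names are placeholders):

  vg minimizer -d graph.dist -z graph.zipcodes -o graph.withzip.min graph.gbz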


mod: filter, transform, and edit the graph

usage: vg mod [options] <graph.vg> >[mod.vg]
Modifies graph, outputs modified on stdout.

options:
  -c, --compact-ids        should we sort and compact the ID space? (default no)
  -b, --break-cycles       break graph cycles with approximate topological sort
  -n, --normalize          normalize graph so edges are always non-redundant
                           (nodes have unique starting and ending bases relative
                           to neighbors, edges that do not introduce new paths
                           are removed, and neighboring nodes are merged)
  -U, --until-normal N     iterate normalization at most N times
  -z, --nomerge-pre STR    do not let normalize (-n/-U) zip up any pair of nodes
                           that both belong to a path with prefix STR
  -E, --unreverse-edges    flip doubly-reversing edges so that they are
                           represented on the forward strand of the graph
  -s, --simplify           remove redundancy from the graph
                           that will not change its path space
  -d, --dagify-step N      copy strongly connected components of graph N times,
                           forwarding edges from old to new copies
                           to convert the graph into a DAG
  -w, --dagify-to N        copy strongly connected components of the graph,
                           forwarding edges from old to new copies
                           to convert the graph into a DAG
                           until shortest path through each SCC is N bases long
  -L, --dagify-len-max N   stop a dagification step if the unrolling component
                           has this much sequence
  -f, --unfold N           represent inversions accessible up to N from
                           the forward component of the graph
  -O, --orient-forward     orient the nodes in the graph forward
  -N, --remove-non-path    keep only nodes and edges which are part of paths
  -A, --remove-path        keep only nodes and edges which aren't part of a path
  -k, --keep-path NAME     keep only nodes and edges in the path (may repeat)
  -V, --invert-keep-path   keep only nodes and edges in paths not passed to -k
  -R, --remove-null        remove nodes with no sequence, forwarding their edges
  -g, --subgraph ID        gets the subgraph rooted at node ID (may repeat)
  -x, --context N          steps the subgraph out by N steps [1]
  -p, --prune-complex      remove nodes that are reached by paths of --length
                           which cross more than --edge-max edges
  -S, --prune-subgraphs    remove subgraphs which are shorter than --length
  -l, --length N           for pruning complex regions and short subgraphs
  -X, --chop N             chop nodes in the graph so they are <=N bp long
  -u, --unchop             where two nodes are only connected to each other and
                           by only one edge, replace the pair with a single node
                           that is the concatenation of their labels
  -e, --edge-max N         consider paths which make edge choices at <= N points
  -M, --max-degree N       unlink nodes that have edge degree greater than N
  -m, --markers            join all head and tail nodes to marker nodes
                           (### starts and $$$ ends) of --length, for debugging
  -y, --destroy-node ID    remove node with given id
  -a, --cactus             convert to cactus graph representation
  -v, --sample-vcf FILE    for a graph with allele paths,
                           compute the sample graph from the given VCF
  -G, --sample-graph FILE  subset augmented graph to sample graph via Locus file
  -t, --threads N          for parallel tasks, use this many threads
  -h, --help               print this help message to stderr and exit
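
For example, to chop long nodes to at most 32 bp and compact the node ID space
in one pass (an illustrative invocation; file names are placeholders):

  vg mod -X 32 -c graph.vg > chopped.vg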

mpmap: splice-aware multipath alignment of short reads

usage: vg mpmap [options] -x graph.xg -g index.gcsa [-f reads1.fq [-f reads2.fq] | -G reads.gam] > aln.gamp
Multipath align reads to a graph.

basic options:
  -h, --help                print this help message to stderr and exit
graph/index:
  -x, --graph-name FILE     graph (required; XG recommended but other formats
                            are acceptable: see `vg convert`)
  -g, --gcsa-name FILE      use this GCSA2 (FILE) & LCP (FILE.lcp) index pair
                            for MEMs (required; see `vg index`)
  -d, --dist-name FILE      use this snarl distance index for clustering
                            (recommended, see `vg index`)
  -s, --snarls FILE         align to alternate paths in these snarls
                            (unnecessary if providing -d, see `vg snarls`)
input:
  -f, --fastq FILE          input FASTQ (possibly gzipped), can be given twice
                            for paired ends (for stdin use -)
  -i, --interleaved         input contains interleaved paired ends
  -C, --comments-as-tags    interpret comments in name lines as SAM-style tags
                            and annotate alignments with them
algorithm presets:
  -n, --nt-type TYPE        sequence type preset: 'DNA' for genomic data,
                            'RNA' for transcriptomic data [RNA]
  -l, --read-length TYPE    read length preset: {very-short, short, long}
                            (approx. <50bp, 50-500bp, and >500bp) [short]
  -e, --error-rate TYPE     error rate preset: {low, high}
                            (approx. PHRED >20 and <20) [low]
output:
  -F, --output-fmt TYPE     format to output alignments in:
                            'GAMP' for multipath alignments,
                            'GAM'/'GAF' for single-path alignments,
                            'SAM'/'BAM'/'CRAM' for linear reference alignments
                            (may also require -S) [GAMP]
  -S, --ref-paths FILE      paths in graph are 1) one per line in a text file
                            or 2) in an HTSlib .dict, to treat as
                            reference sequences for HTSlib formats (see -F)
                            [all reference paths, all generic paths]
      --ref-name NAME       reference assembly in graph to use for
                            HTSlib formats (see -F) [all references]
  -N, --sample NAME         add this sample name to output
  -R, --read-group NAME     add this read group to output
  -p, --suppress-progress   do not report progress to stderr
computational parameters:
  -t, --threads INT         number of compute threads to use [all available]

advanced options:
algorithm:
  -X, --not-spliced         do not form spliced alignments, even with -n RNA
  -M, --max-multimaps INT   report up to INT mappings per read [10 RNA / 1 DNA]
  -a, --agglomerate-alns    combine separate multipath alignments into
                            one (possibly disconnected) alignment
  -r, --intron-distr FILE   intron length distribution
                            (from scripts/intron_length_distribution.py)
  -Q, --mq-max INT          cap mapping quality estimates at this much [60]
  -b, --frag-sample INT     look for INT unambiguous mappings to
                            estimate the fragment length distribution [1000]
  -I, --frag-mean FLOAT     mean for pre-determined fragment length distribution
                            (also requires -D)
  -D, --frag-stddev FLOAT   standard deviation for pre-determined fragment
                            length distribution (also requires -I)
  -G, --gam-input FILE      input GAM (for stdin, use -)
  -u, --map-attempts INT    perform up to INT mappings per read (0 for no limit)
                            [24 paired / 64 unpaired]
  -c, --hit-max INT         use at most this many hits for any match seeds
                            (0 for no limit) [1024 DNA / 100 RNA]
scoring:
  -A, --no-qual-adjust      do not perform base quality adjusted alignments
                            even when base qualities are available
  -q, --match INT           use INT match score [1]
  -z, --mismatch INT        use INT mismatch penalty [4 low error, 1 high error]
  -o, --gap-open INT        use INT gap open penalty [6 low error, 1 high error]
  -y, --gap-extend INT      use INT gap extension penalty [1]
  -L, --full-l-bonus INT    add INT score to alignments that align each
                            end of the read [mismatch+1 short, 0 long]
  -w, --score-matrix FILE   use this 4x4 integer substitution scoring matrix
                            (in the order ACGT)
  -m, --remove-bonuses      remove full length alignment bonus in reported score
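
An illustrative paired-end invocation using the indexes named in the usage line
above (file names are placeholders; the distance index -d is optional but
recommended):

  vg mpmap -t 16 -x graph.xg -g index.gcsa -d graph.dist \
      -f reads_1.fq.gz -f reads_2.fq.gz > aln.gamp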

pack: convert alignments to a compact coverage index

usage: vg pack [options]
options:
  -x, --xg FILE          use this basis graph (does not have to be xg format)
  -o, --packs-out FILE   write compressed coverage packs to this output file
  -i, --packs-in FILE    begin by summing coverage packs from each provided FILE
  -g, --gam FILE         read alignments from this GAM file ('-' for stdin)
  -a, --gaf FILE         read alignments from this GAF file ('-' for stdin)
  -d, --as-table         write table on stdout representing packs
  -D, --as-edge-table    write table on stdout representing edge coverage
  -u, --as-qual-table    write table on stdout representing average node mapqs
  -e, --with-edits       record and write edits
                         rather than only recording graph-matching coverage
  -b, --bin-size N       number of sequence bases per CSA bin [inf]
  -n, --node ID          write table for only specified node(s)
  -N, --node-list FILE   whitespace- or newline-delimited list of nodes to collect
  -Q, --min-mapq N       ignore reads with MAPQ < N
                         and positions with base quality < N [0]
  -c, --expected-cov N   expected coverage, used only for memory tuning [128]
  -s, --trim-ends N      ignore the first and last N bases of each read
  -t, --threads N        use N threads [numCPUs]
  -h, --help             print this help message to stderr and exit
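
For example, to build a coverage pack from a GAM and then dump it as a per-node
table (illustrative; file names and the MAPQ cutoff are placeholders):

  vg pack -t 16 -x graph.xg -g aln.gam -Q 5 -o aln.pack
  vg pack -x graph.xg -i aln.pack -d > coverage.tsv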

paths: traverse paths in the graph

usage: vg paths [options]
  -h, --help               print this help message to stderr and exit
input:
  -x, --xg FILE            use the paths and haplotypes in this graph FILE
                           Supports GBZ haplotypes. (also accepts -v, --vg)
  -g, --gbwt FILE          use the threads in the GBWT index in FILE
                           (graph also required for most output options;
                           -g takes priority over -x)
output graph (.vg format):
  -V, --extract-vg         output a path-only graph covering the selected paths
  -d, --drop-paths         output a graph with the selected paths removed
  -r, --retain-paths       output a graph with only the selected paths retained
  -n, --normalize-paths    output a graph where equivalent paths in a site are
                           merged (using selected paths to snap to if possible)
output path data:
  -X, --extract-gam        print (as GAM alignments) stored paths in the graph
  -A, --extract-gaf        print (as GAF alignments) stored paths in the graph
  -L, --list               print (one per line) path (or thread) names
  -E, --lengths            print a list of path names (as with -L)
                           but paired with their lengths
  -M, --metadata           print a table of path names and their metadata
  -C, --cyclicity          print a list of path names (as with -L)
                           but paired with a flag denoting cyclicity
  -F, --extract-fasta      print the paths in FASTA format
  -c, --coverage           print the coverage stats for selected paths
                           (not including cycles)
path selection:
  -p, --paths-file FILE    select paths named in a file (one per line)
  -Q, --paths-by STR       select paths with the given name prefix
  -S, --sample STR         select haplotypes or reference paths for this sample
  -a, --variant-paths      select variant paths added by 'vg construct -a'
  -G, --generic-paths      select generic, non-reference, non-haplotype paths
  -R, --reference-paths    select reference paths
  -H, --haplotype-paths    select haplotype paths
configuration:
  -o, --overlay            apply a ReferencePathOverlayHelper to the graph
  -t, --threads N          number of threads to use [all available]
                           applies only to snarl finding within -n
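
For example, to list the reference path names in a graph, or to export a chosen
set of paths as FASTA (illustrative; file names are placeholders):

  vg paths -x graph.gbz -R -L
  vg paths -x graph.vg -p path_names.txt -F > paths.fa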

prune: prune the graph for GCSA2 indexing

usage: vg prune [options] <graph.vg> >[output.vg]

Prunes the complex regions of the graph for GCSA2 indexing.
Pruning the graph removes embedded paths.

Pruning parameters:
  -k, --kmer-length N    kmer length used for pruning
                         defaults: 24 with -P; 24 with -r; 24 with -u
  -e, --edge-max N       remove the edges on kmers making > N edge choices
                         defaults: 3 with -P; 3 with -r; 3 with -u
  -s, --subgraph-min N   remove subgraphs of < N bases
                         defaults: 33 with -P; 33 with -r; 33 with -u
  -M, --max-degree N     if N > 0, remove nodes with degree > N before pruning
                         defaults: 0 with -P; 0 with -r; 0 with -u

Pruning modes (-P, -r, and -u are mutually exclusive):
  -P, --prune            simply prune the graph (default)
  -r, --restore-paths    restore the edges on non-alt paths
  -u, --unfold-paths     unfold non-alt paths and GBWT threads
  -v, --verify-paths     verify that the paths exist after pruning
                         (potentially very slow)

Unfolding options:
  -g, --gbwt-name FILE   unfold the threads from this GBWT index
  -m, --mapping FILE     store node mapping for duplicates (required with -u)
  -a, --append-mapping   append to the existing node mapping

Other options:
  -p, --progress         show progress
  -t, --threads N        use N threads [8]
  -d, --dry-run          determine the validity of the combination of options
  -h, --help             print this help message to stderr and exit
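
A typical pruning run before GCSA2 indexing, here unfolding GBWT threads and
recording the node mapping required by -u (illustrative; file names are
placeholders):

  vg prune -u -g graph.gbwt -m node_mapping -p graph.vg > pruned.vg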


rna: construct splicing graphs and pantranscriptomes

usage: vg rna [options] graph.[vg|pg|hg|gbz] > splicing_graph.[vg|pg|hg]

General options:
  -t, --threads INT          number of compute threads to use [1]
  -p, --progress             show progress
  -h, --help                 print this help message to stderr and exit

Input options:
  -n, --transcripts FILE     transcript file(s) in gtf/gff format (may repeat)
  -m, --introns FILE         intron file(s) in bed format (may repeat)
  -y, --feature-type NAME    parse only this feature type in the GTF/GFF
                             (parses all if empty) [exon]
  -s, --transcript-tag NAME  use this attribute tag in the GTF/GFF file(s) as ID
                             to group exons and name paths [transcript_id]
  -l, --haplotypes FILE      project transcripts onto haplotypes in GBWT index
  -z, --gbz-format           input graph is GBZ format (has graph & GBWT index)

Construction options:
  -j, --use-hap-ref          use haplotype paths in GBWT index as references
                             (disables projection)
  -e, --proj-embed-paths     project transcripts onto embedded haplotype paths
  -c, --path-collapse TYPE   collapse identical transcript paths across
                             no|haplotype|all paths [haplotype]
  -k, --max-node-length INT  chop nodes longer than INT (disable with 0) [0]
  -d, --remove-non-gene      remove intergenic and intronic regions
                             (deletes all paths in the graph)
  -o, --do-not-sort          do not topologically sort and compact the graph
DON'T FORGET TO EMBED PATHS:
  -r, --add-ref-paths        add reference transcripts as embedded paths
  -a, --add-hap-paths        add projected transcripts as embedded paths

Output options:
  -b, --write-gbwt FILE      write pantranscriptome transcript paths as GBWT
  -v, --write-hap-gbwt FILE  write input haplotypes as a GBWT
                             with node IDs matching the output graph
  -f, --write-fasta FILE     write pantranscriptome transcript sequences to here
  -i, --write-info FILE      write pantranscriptome transcript info table as TSV
  -q, --out-exclude-ref      exclude reference transcripts from pantranscriptome
  -g, --gbwt-bidirectional   use bidirectional paths in GBWT index construction
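
An illustrative construction of a splicing graph and pantranscriptome from a
GTF and a GBWT of haplotypes, remembering to embed the reference transcript
paths (file names are placeholders):

  vg rna -t 16 -p -n transcripts.gtf -l haplotypes.gbwt \
      -b pantranscriptome.gbwt -r graph.pg > spliced.pg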


sim: simulate reads from a graph

usage: vg sim [options]
Samples sequences from the xg-indexed graph.

basic options:
  -h, --help                  print this help message to stderr and exit
  -x, --xg-name FILE          use the graph in FILE (required)
  -n, --num-reads N           simulate N reads or read pairs
  -l, --read-length N         simulate reads of length N
  -r, --progress              show progress information
output options:
  -a, --align-out             write alignments in GAM format
  -q, --fastq-out             write reads in FASTQ format
  -J, --json-out              write alignments in JSON-format GAM (implies -a)
      --multi-position        annotate with multiple reference positions
simulation parameters:
  -F, --fastq FILE            match the error profile of NGS reads in FILE,
                              repeat for paired reads (ignores -l,-f)
  -I, --interleaved           reads in FASTQ (-F) are interleaved read pairs
  -s, --random-seed N         use this specific seed for the PRNG
  -e, --sub-rate FLOAT        base substitution rate [0.0]
  -i, --indel-rate FLOAT      indel rate [0.0]
  -d, --indel-err-prop FLOAT  proportion of trained errors from -F
                              that are indels [0.01]
  -S, --scale-err FLOAT       scale trained error probs from -F by FLOAT [1.0]
  -f, --forward-only          don't simulate from the reverse strand
  -p, --frag-len N            make paired end reads with fragment length N
  -v, --frag-std-dev FLOAT    use this standard deviation
                              for fragment length estimation
  -N, --allow-Ns              allow reads to be sampled with Ns in them
      --max-tries N           attempt sampling operations up to N times [100]
  -t, --threads N             number of compute threads (only when using -F) [1]
simulate from paths:
  -P, --path NAME             simulate from this path
                              (may repeat; cannot also give -T)
  -A, --any-path              simulate from any path (overrides -P)
  -m, --sample-name NAME      simulate from this sample (may repeat)
  -R, --ploidy-regex RULES    use this comma-separated list of colon-delimited
                              REGEX:PLOIDY rules to assign ploidies to contigs
                              not visited by the selected samples, or to all
                              contigs simulated from if no samples are used.
                              Unmatched contigs get ploidy 2
  -g, --gbwt-name FILE        use samples from this GBWT index
  -T, --tx-expr-file FILE     simulate from an expression profile formatted as
                              RSEM output (cannot also give -P)
  -H, --haplo-tx-file FILE    transcript origin info table from vg rna -i
                              (required for -T on haplotype transcripts)
  -u, --unsheared             sample from unsheared fragments
  -E, --path-pos-file FILE    output a TSV with sampled position on path
                              of each read (requires -F)
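
For example, to simulate one million 150 bp read pairs with a simple error
model and GAM output (illustrative; all parameter values are placeholders):

  vg sim -x graph.xg -n 1000000 -l 150 -p 500 -v 50 \
      -e 0.002 -i 0.0002 -a > sim.gam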

stats: metrics describing graph and alignment properties

usage: vg stats [options] [<graph file>]
options:
  -z, --size               size of graph
  -N, --node-count         number of nodes in graph
  -E, --edge-count         number of edges in graph
  -l, --length             length of sequences in graph
  -L, --self-loops         number of self-loops
  -s, --subgraphs          describe subgraphs of graph
  -H, --heads              list the head nodes of the graph
  -T, --tails              list the tail nodes of the graph
  -e, --nondeterm          list the nondeterministic edge sets
  -c, --components         print the strongly connected components of the graph
  -A, --is-acyclic         print whether or not the graph is acyclic
  -n, --node ID            consider node with the given id
  -d, --to-head            show distance to head for each provided node
  -t, --to-tail            show distance to tail for each provided node
  -a, --alignments FILE    compute stats for reads aligned to the graph
  -r, --node-id-range      X:Y where X and Y are the smallest and largest
                           node id in the graph, respectively
  -o, --overlap PATH       for each overlapping path mapping in the graph write:
                              PATH, other_path, rank1, rank2
                           multiple allowed; limit comparison to those provided
  -O, --overlap-all        print overlap table for cartesian product of paths
  -R, --snarls             print statistics for each snarl
      --snarl-contents     print table of <snarl, depth, parent, node ids>
      --snarl-sample NAME  print out reference coordinates on given sample
  -C, --chains             print statistics for each chain
  -F, --format             graph type {VG-Protobuf, PackedGraph, HashGraph, XG}
                           (Protobuf cannot be detected when reading from stdin)
  -D, --degree-dist        print degree distribution of the graph
  -b, --dist-snarls FILE   print sizes/depths of the snarls in distance index
  -p, --threads N          number of threads to use [all available]
  -v, --verbose            output longer reports
  -P, --progress           show progress
  -h, --help               print this help message to stderr and exit
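
For example, to print basic graph metrics, or to summarize how a set of reads
aligned to the graph (illustrative; file names are placeholders):

  vg stats -z -l graph.vg
  vg stats -a aln.gam graph.vg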

surject: map alignments onto specific paths

usage: vg surject [options] <aln.gam> >[proj.cram]
Transforms alignments to be relative to particular paths.

options:
  -x, --xg-name FILE        use this graph or xg index (required)
  -t, --threads N           number of threads to use
  -p, --into-path NAME      surject into this path or its subpaths (may repeat)
                            default: reference, then non-alt generic
  -F, --into-paths FILE     surject into path names listed in
                            HTSlib sequence dictionary or path list FILE
  -n, --into-ref NAME       surject into this reference assembly
  -i, --interleaved         GAM is interleaved paired-ended, so pair reads
                            when outputting HTS formats
  -M, --multimap            include secondary alignments to all
                            overlapping paths instead of just primary
  -G, --gaf-input           input file is GAF instead of GAM
  -m, --gamp-input          input file is GAMP instead of GAM
  -c, --cram-output         write CRAM to stdout
  -b, --bam-output          write BAM to stdout
  -s, --sam-output          write SAM to stdout
  -u, --supplementary       divide into supplementary alignments as necessary
  -l, --subpath-local       let the multipath mapping surjection produce local
                            (rather than global) alignments
  -T, --max-tail-len N      only align up to N bases of read tails [10000]
  -g, --max-graph-scale X   make reads unmapped if alignment target subgraph
                            size exceeds read length by a factor of X 
                            (default: 819.2 or 134218 with -S)
  -P, --prune-low-cplx      prune short/low complexity anchors in realignment
  -I, --max-slide N         look for offset duplicates of anchors up to N bp
                            away when pruning (default: 6)
  -a, --max-anchors N       use <= N anchors per target path [unlimited]
  -S, --spliced             interpret long deletions against paths
                            as spliced alignments
  -A, --qual-adj            adjust scoring for base qualities, if available
  -E, --extra-gap-cost N    for dynamic programming, add N to the gap open cost
                            of the 10x-scaled scoring parameters
  -N, --sample NAME         set this sample name for all reads
  -R, --read-group NAME     set this read group for all reads
  -f, --max-frag-len N      reads with fragment lengths greater than N won't be
                            marked properly paired in SAM/BAM/CRAM
  -L, --list-all-paths      annotate SAM records with a list of all attempted
                            re-alignments to paths in SS tag
  -H, --graph-aln           annotate SAM records with cs-style difference string
                            of the pre-surjected graph alignment in GR tag
  -C, --compression N       level for compression [0-9]
  -V, --no-validate         skip checking whether the alignments are plausibly
                            against the provided graph
  -w, --watchdog-timeout N  warn when reads take more than N seconds to surject
  -r, --progress            show progress
  -h, --help                print this help message to stderr and exit
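
An illustrative surjection of a GAM onto the paths of a graph, producing BAM on
stdout (file names are placeholders):

  vg surject -t 16 -x graph.xg -b aln.gam > aln.bam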

view: format conversions for graphs and alignments

usage: vg view [options] [ <graph.vg> | <graph.json> | <aln.gam> | <read1.fq> [<read2.fq>] ]
options:
  -g, --gfa                 output GFA format (default)
  -F, --gfa-in              input GFA format, reducing overlaps if they occur
  -v, --vg                  output VG format [DEPRECATED, use vg convert]
  -V, --vg-in               input VG format only
  -j, --json                output JSON format
  -J, --json-in             input JSON format (use with e.g. -a as necessary)
  -c, --json-stream         streaming conversion of a VG format graph
                            in line delimited JSON format
                            (this cannot be loaded directly via -J)
  -G, --gam                 output GAM format (vg alignment format)
  -Z, --translation-in      input is a graph translation description
  -t, --turtle              output RDF/Turtle format (cannot be loaded by VG)
  -T, --turtle-in           input RDF/Turtle format
  -r, --rdf-base-uri URI    set base uri for the RDF output
  -a, --align-in            input GAM format, or JSON version of GAM format
  -A, --aln-graph GAM       add alignments from GAM to the graph
  -q, --locus-in            input is Locus format, or JSON version of it
  -z, --locus-out           output is Locus format
  -Q, --loci FILE           input is Locus format for use by dot output
  -d, --dot                 output dot format
  -S, --simple-dot          simple alignments & no node labels in dot output
  -u, --noseq-dot           show size instead of sequence in dot output
  -e, --ascii-labels        label paths/superbubbles with char/colors vs. emoji
  -Y, --ultra-label         label nodes with emoji/colors for ultrabubbles
  -m, --skip-missing        skip mappings to nodes not in the graph
                            when drawing alignments
  -C, --color               color nodes not in reference path (DOT OUTPUT ONLY)
  -p, --show-paths          show paths in dot output
  -w, --walk-paths          add labeled edges to represent paths in dot output
  -n, --annotate-paths      add labels to edges to represent paths in dot output
  -M, --show-mappings       with -p, print the mappings in each path in JSON
  -I, --invert-ports        invert edge ports in dot so that ne->nw is reversed
  -s, --random-seed N       use this seed for path symbols in dot output
  -b, --bam                 input BAM or other htslib-parseable alignments
  -f, --fastq-in            input fastq (output defaults to GAM). Takes two
                            positional file arguments if paired
  -X, --fastq-out           output fastq (input defaults to GAM)
  -i, --interleaved         fastq is interleaved paired-ended
  -L, --pileup              output VG Pileup format
  -l, --pileup-in           input VG Pileup format, or JSON version of it
  -B, --distance-in         input distance index
  -R, --snarl-in            input VG Snarl format
  -E, --snarl-traversal-in  input VG SnarlTraversal format
  -K, --multipath-in        input VG MultipathAlignment format (GAMP),
                            or JSON version of it
  -k, --multipath           output VG MultipathAlignment format (GAMP)
  -D, --expect-duplicates   don't warn about duplicate nodes or edges
  -x, --extract-tag TAG     extract and concatenate messages with the given tag
      --first               only extract first message with the requested tag
      --verbose             explain the file being read with --extract-tag
  -7, --threads N           for parallel operations use this many threads [1]
  -h, --help                print this help message to stderr and exit
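
For example, to convert a graph to GFA (the default output), or to print a GAM
file as JSON (illustrative; file names are placeholders):

  vg view graph.vg > graph.gfa
  vg view -a aln.gam > aln.json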

BUGS

Bugs can be reported at: https://github.com/vgteam/vg/issues

For technical support, please visit: https://www.biostars.org/tag/vg/
