Construction Examples

Many users will not need to construct their own graphs and will instead be using a publicly-released reference, such as from the HRPC. However, if you wish to construct your own graphs, this page will walk you through the process.

You'll need to install vg first and have it on your PATH. This page will walk you through indexing an example graph using files in test/wiki-data/mock-hs37d5/data. If you want to substitute in your own data, then make sure it has the same shape as the test data (README). All chromosomes have chr<#/X/Y/MT>.vcf.gz and associated .tbi indexes, along with files hs37d5.fa, hs37d5.fa.fai, and sim.fq.

If actually working with whole-genome data set ${LARGE_DISK_TMP} to a temporary directory with at least 1.5 TB space.

Graph construction

In vg construct, we take a VCF file and the reference sequence it was based on and generate a variation graph suitable for use by other tools in vg. In the construction process we interpret the VCF+ref combination as a directed acyclic sequence graph. Variants in the input are decomposed using pairwise alignment between the reference and alternate, so that the input is interpreted as if it had been passed through vcfallelicprimitives.

These filepaths assume that test is the current working directory:

CONSTRUCTION=./wiki-data/mock-hs37d5
GRAPHS=${CONSTRUCTION}/graphs
INDEXES=${CONSTRUCTION}/indexes
MAPPING=${GRAPHS}/node_mapping
LOGS=${CONSTRUCTION}/logs

# Create necessary directories
mkdir -p ${GRAPHS}
mkdir -p ${INDEXES}
mkdir -p ${LOGS}
LARGE_DISK_TMP="$(dirname "$(mktemp -u)")"

# Make sure logfile is open
LOGFILE=${LOGS}/vg_all.log
rm -f $LOGFILE

The construction process can be distributed, as we'll construct the whole genome graph one chromosome at a time. We specify which chromosome we want to work on by giving vg construct the -R/--region parameter, which limits the constructed graph to a particular reference range. Any samtools-format region specifier will work, but here we'll use whole chromosomes. Presently, it's advisable to break the construction apart due to memory requirements of vg--- it is optimized for high performance mutable graphs, and stores them in memory in a very non-succinct way.

Now we make a file $i.vg for each chromosome $i. The result will be around 4.5G of .vg files, which are compressed protobufv3 defined by the vg format specification with a chunked on-disk format implemented by stream.hpp.

START=$(date +%s)
echo "Start time: $START" >> $LOGFILE

# Build the VG graphs
(seq 1 22; echo X; echo Y) | parallel -j 24 \
    "vg construct -r $REFERENCE -v ${DATA}/chr{}.vcf.gz -R {} -C -a > ${GRAPHS}/chr{}.vg"

STOP=$(date +%s)
echo >> $LOGFILE
echo "Checkpoint: $STOP" >> $LOGFILE
echo "Construction: $(($STOP-$START)) seconds" >> $LOGFILE

However, if we wish to use these chromosome graphs in concert, it is useful to avoid node ID overlap. Currently each graph will have auto-generated node IDs which are only guaranteed to not conflict within that graph. The solution is vg ids -j. This command modifies all graph files provided by incrementing IDs as necessary to avoid cross-file conflicts.

# Harmonize node IDs across files
vg ids -j -m $MAPPING $(for i in $(seq 1 22; echo X; echo Y); do echo ${GRAPHS}/chr${i}.vg; done)

STOP=$(date +%s)
echo >> $LOGFILE
echo "Finish time: $STOP" >> $LOGFILE
echo "Total time: $(($STOP-$START)) seconds" >> $LOGFILE

XG construction

You may want to index those VG graphs, creating a compact XG-format representation for quick access by other algorithms. This is a job for vg index -x. We can pass all the VG graphs as input and get back a single XG index.

OUTPUT=all
XG=${INDEXES}/${OUTPUT}.xg

# Set up log file
LOGFILE=${LOGS}/xg_${OUTPUT}.log
rm -f $LOGFILE

START=$(date +%s)
echo "Start time: $START" >> $LOGFILE
echo >> $LOGFILE

# Perform XG indexing
vg index -x $XG $(for i in $(seq 1 22; echo X; echo Y); do echo ${GRAPHS}/chr${i}.vg; done)

STOP=$(date +%s)
echo "Finish time: $STOP" >> $LOGFILE
echo "Total time: $(($STOP-$START)) seconds" >> $LOGFILE

GBWT construction

#!/bin/bash

LOGS=${CONSTRUCTION}/logs
DATA=${CONSTRUCTION}/data
GRAPHS=${CONSTRUCTION}/graphs
INDEXES=${CONSTRUCTION}/indexes
JOBS=14

export TMPDIR=${LARGE_DISK_TMP}

OUTPUT=all
GBWT=${INDEXES}/${OUTPUT}.gbwt

LOGFILE=${LOGS}/gbwt_${OUTPUT}.log
rm -f $LOGFILE

START=$(date +%s)
echo "Start time: $START" >> $LOGFILE
echo "Jobs: $JOBS" >> $LOGFILE
echo >> $LOGFILE

XG_ALTS=${TMPDIR}/graph-with-alts.xg

# Build a temporary XG index with alt paths.
vg index -x $XG_ALTS -L -p \
    $(for i in $(seq 1 22; echo X; echo Y); do echo ${GRAPHS}/chr${i}.vg; done) 2>> $LOGFILE

MID=$(date +%s)
echo >> $LOGFILE
echo "Checkpoint: $MID" >> $LOGFILE
echo "Temporary XG construction: $(($MID-$START)) seconds" >> $LOGFILE
echo >> $LOGFILE

# Build the GBWT.
vg gbwt -x $XG_ALTS -o $GBWT --preset 1000gp --num-jobs $JOBS -p -v \
    $(for i in $(seq 1 22; echo X; echo Y); do echo ${DATA}/chr${i}.vcf.gz; done) 2>> $LOGFILE

rm -f $XG_ALTS

STOP=$(date +%s)
echo "Finish time: $STOP" >> $LOGFILE
echo "GBWT construction: $(($STOP-$MID)) seconds" >> $LOGFILE
echo "Total time: $(($STOP-$START)) seconds" >> $LOGFILE

Graph pruning

Using a whole-genome GBWT for unfolding haplotypes requires VG version 1.30.0.

#!/bin/bash

GRAPHS=${CONSTRUCTION}/graphs
INDEXES=${CONSTRUCTION}/indexes
INPUT_MAPPING=${GRAPHS}/node_mapping
OUTPUT_MAPPING=${INDEXES}/node_mapping

export TMPDIR=${LARGE_DISK_TMP}

INPUT=all
GBWT=${INDEXES}/${INPUT}.gbwt

LOGS=${CONSTRUCTION}/logs
LOGFILE=${LOGS}/prune_${INPUT}.log
rm -f $LOGFILE

START=$(date +%s)
echo "Start time: $START" >> $LOGFILE
echo >> $LOGFILE

# Unfold haplotypes
cp $INPUT_MAPPING $OUTPUT_MAPPING
for i in $(seq 1 22; echo X; echo Y); do
  CHR=${GRAPHS}/chr${i}
  GBWT_INDEX=${INDEXES}/${INPUT}.gbwt
  $VG prune -u -g $GBWT -a -m $OUTPUT_MAPPING -p ${CHR}.vg > ${CHR}_unfolded.vg 2>> $LOGFILE
  echo >> $LOGFILE
done

STOP=$(date +%s)
echo "Finish time: $STOP" >> $LOGFILE
echo "Total time: $(($STOP-$START)) seconds" >> $LOGFILE

GCSA construction

#!/bin/bash

GRAPHS=${CONSTRUCTION}/graphs
INDEXES=${CONSTRUCTION}/indexes
MAPPING=${INDEXES}/node_mapping
OUTPUT=all

export TMPDIR=${LARGE_DISK_TMP}

LOGS=${CONSTRUCTION}/logs
LOGFILE=${LOGS}/gcsa_${OUTPUT}.log
rm -f $LOGFILE

START=$(date +%s)
echo "Start time: $START" >> $LOGFILE
echo >> $LOGFILE

vg index -g ${INDEXES}/${OUTPUT}.gcsa -f $MAPPING -p \
    $(for i in $(seq 1 22; echo X; echo Y); do echo ${GRAPHS}/chr${i}_unfolded.vg; done) 2>> $LOGFILE
rm -f ${TMPDIR}/vg-kmers-tmp-*
rm -f ${TMPDIR}/gcsa_*

STOP=$(date +%s)
echo >> $LOGFILE
echo "Finish time: $STOP" >> $LOGFILE
echo "Total time: $(($STOP-$START)) seconds" >> $LOGFILE

Construction Examples

Graph construction

XG construction

GBWT construction

Graph pruning

GCSA construction

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally