-
Notifications
You must be signed in to change notification settings - Fork 215
Construction Examples
Many users will not need to construct their own graphs and will instead be using a publicly-released reference, such as from the HRPC. However, if you wish to construct your own graphs, this page will walk you through the process.
You'll need to install vg first and have it on your PATH. This page will walk you through indexing an example graph using files in test/wiki-data/mock-hs37d5/data. If you want to substitute in your own data, then make sure it has the same shape as the test data (README). All chromosomes have chr<#/X/Y/MT>.vcf.gz and associated .tbi indexes, along with files hs37d5.fa, hs37d5.fa.fai, and sim.fq.
If actually working with whole-genome data set ${LARGE_DISK_TMP} to a temporary directory with at least 1.5 TB space.
In vg construct, we take a VCF file and the reference sequence it was based on and generate a variation graph suitable for use by other tools in vg. In the construction process we interpret the VCF+ref combination as a directed acyclic sequence graph. Variants in the input are decomposed using pairwise alignment between the reference and alternate, so that the input is interpreted as if it had been passed through vcfallelicprimitives.
These filepaths assume that test is the current working directory:
CONSTRUCTION=./wiki-data/mock-hs37d5
GRAPHS=${CONSTRUCTION}/graphs
INDEXES=${CONSTRUCTION}/indexes
MAPPING=${GRAPHS}/node_mapping
LOGS=${CONSTRUCTION}/logs
# Create necessary directories
mkdir -p ${GRAPHS}
mkdir -p ${INDEXES}
mkdir -p ${LOGS}
LARGE_DISK_TMP="$(dirname "$(mktemp -u)")"
# Make sure logfile is open
LOGFILE=${LOGS}/vg_all.log
rm -f $LOGFILE
The construction process can be distributed, as we'll construct the whole genome graph one chromosome at a time. We specify which chromosome we want to work on by giving vg construct the -R/--region parameter, which limits the constructed graph to a particular reference range. Any samtools-format region specifier will work, but here we'll use whole chromosomes. Presently, it's advisable to break the construction apart due to memory requirements of vg--- it is optimized for high performance mutable graphs, and stores them in memory in a very non-succinct way.
Now we make a file $i.vg for each chromosome $i. The result will be around 4.5G of .vg files, which are compressed protobufv3 defined by the vg format specification with a chunked on-disk format implemented by stream.hpp.
START=$(date +%s)
echo "Start time: $START" >> $LOGFILE
# Build the VG graphs
(seq 1 22; echo X; echo Y) | parallel -j 24 \
"vg construct -r $REFERENCE -v ${DATA}/chr{}.vcf.gz -R {} -C -a > ${GRAPHS}/chr{}.vg"
STOP=$(date +%s)
echo >> $LOGFILE
echo "Checkpoint: $STOP" >> $LOGFILE
echo "Construction: $(($STOP-$START)) seconds" >> $LOGFILE
However, if we wish to use these chromosome graphs in concert, it is useful to avoid node ID overlap. Currently each graph will have auto-generated node IDs which are only guaranteed to not conflict within that graph. The solution is vg ids -j. This command modifies all graph files provided by incrementing IDs as necessary to avoid cross-file conflicts.
# Harmonize node IDs across files
vg ids -j -m $MAPPING $(for i in $(seq 1 22; echo X; echo Y); do echo ${GRAPHS}/chr${i}.vg; done)
STOP=$(date +%s)
echo >> $LOGFILE
echo "Finish time: $STOP" >> $LOGFILE
echo "Total time: $(($STOP-$START)) seconds" >> $LOGFILE
You may want to index those VG graphs, creating a compact XG-format representation for quick access by other algorithms. This is a job for vg index -x. We can pass all the VG graphs as input and get back a single XG index.
OUTPUT=all
XG=${INDEXES}/${OUTPUT}.xg
# Set up log file
LOGFILE=${LOGS}/xg_${OUTPUT}.log
rm -f $LOGFILE
START=$(date +%s)
echo "Start time: $START" >> $LOGFILE
echo >> $LOGFILE
# Perform XG indexing
vg index -x $XG $(for i in $(seq 1 22; echo X; echo Y); do echo ${GRAPHS}/chr${i}.vg; done)
STOP=$(date +%s)
echo "Finish time: $STOP" >> $LOGFILE
echo "Total time: $(($STOP-$START)) seconds" >> $LOGFILE
#!/bin/bash
LOGS=${CONSTRUCTION}/logs
DATA=${CONSTRUCTION}/data
GRAPHS=${CONSTRUCTION}/graphs
INDEXES=${CONSTRUCTION}/indexes
JOBS=14
export TMPDIR=${LARGE_DISK_TMP}
OUTPUT=all
GBWT=${INDEXES}/${OUTPUT}.gbwt
LOGFILE=${LOGS}/gbwt_${OUTPUT}.log
rm -f $LOGFILE
START=$(date +%s)
echo "Start time: $START" >> $LOGFILE
echo "Jobs: $JOBS" >> $LOGFILE
echo >> $LOGFILE
XG_ALTS=${TMPDIR}/graph-with-alts.xg
# Build a temporary XG index with alt paths.
vg index -x $XG_ALTS -L -p \
$(for i in $(seq 1 22; echo X; echo Y); do echo ${GRAPHS}/chr${i}.vg; done) 2>> $LOGFILE
MID=$(date +%s)
echo >> $LOGFILE
echo "Checkpoint: $MID" >> $LOGFILE
echo "Temporary XG construction: $(($MID-$START)) seconds" >> $LOGFILE
echo >> $LOGFILE
# Build the GBWT.
vg gbwt -x $XG_ALTS -o $GBWT --preset 1000gp --num-jobs $JOBS -p -v \
$(for i in $(seq 1 22; echo X; echo Y); do echo ${DATA}/chr${i}.vcf.gz; done) 2>> $LOGFILE
rm -f $XG_ALTS
STOP=$(date +%s)
echo "Finish time: $STOP" >> $LOGFILE
echo "GBWT construction: $(($STOP-$MID)) seconds" >> $LOGFILE
echo "Total time: $(($STOP-$START)) seconds" >> $LOGFILE
Using a whole-genome GBWT for unfolding haplotypes requires VG version 1.30.0.
#!/bin/bash
GRAPHS=${CONSTRUCTION}/graphs
INDEXES=${CONSTRUCTION}/indexes
INPUT_MAPPING=${GRAPHS}/node_mapping
OUTPUT_MAPPING=${INDEXES}/node_mapping
export TMPDIR=${LARGE_DISK_TMP}
INPUT=all
GBWT=${INDEXES}/${INPUT}.gbwt
LOGS=${CONSTRUCTION}/logs
LOGFILE=${LOGS}/prune_${INPUT}.log
rm -f $LOGFILE
START=$(date +%s)
echo "Start time: $START" >> $LOGFILE
echo >> $LOGFILE
# Unfold haplotypes
cp $INPUT_MAPPING $OUTPUT_MAPPING
for i in $(seq 1 22; echo X; echo Y); do
CHR=${GRAPHS}/chr${i}
GBWT_INDEX=${INDEXES}/${INPUT}.gbwt
$VG prune -u -g $GBWT -a -m $OUTPUT_MAPPING -p ${CHR}.vg > ${CHR}_unfolded.vg 2>> $LOGFILE
echo >> $LOGFILE
done
STOP=$(date +%s)
echo "Finish time: $STOP" >> $LOGFILE
echo "Total time: $(($STOP-$START)) seconds" >> $LOGFILE
#!/bin/bash
GRAPHS=${CONSTRUCTION}/graphs
INDEXES=${CONSTRUCTION}/indexes
MAPPING=${INDEXES}/node_mapping
OUTPUT=all
export TMPDIR=${LARGE_DISK_TMP}
LOGS=${CONSTRUCTION}/logs
LOGFILE=${LOGS}/gcsa_${OUTPUT}.log
rm -f $LOGFILE
START=$(date +%s)
echo "Start time: $START" >> $LOGFILE
echo >> $LOGFILE
vg index -g ${INDEXES}/${OUTPUT}.gcsa -f $MAPPING -p \
$(for i in $(seq 1 22; echo X; echo Y); do echo ${GRAPHS}/chr${i}_unfolded.vg; done) 2>> $LOGFILE
rm -f ${TMPDIR}/vg-kmers-tmp-*
rm -f ${TMPDIR}/gcsa_*
STOP=$(date +%s)
echo >> $LOGFILE
echo "Finish time: $STOP" >> $LOGFILE
echo "Total time: $(($STOP-$START)) seconds" >> $LOGFILE