Long read giraffe demo

Xian Chang edited this page Oct 27, 2025 · 2 revisions

This demo walks through mapping long reads to a pangenome graph with vg giraffe, using a small dataset that runs on an ordinary laptop. The data come from https://github.com/vgteam/vg_snakemake/tree/master/testdata.

For a larger example and more details about how to run on real data, see the wiki page and the manpage. To reproduce the results from our preprint, see https://github.com/vgteam/long-read-giraffe-experiments/tree/main.

Install vg

The long read giraffe paper was tested on version 1.68.0. To use the pre-built binary for this version, go to the link above, click on the big green "Download" button, then run:

chmod +x vg

to make the binary file executable. You can now run vg using the command ./vg.

Alternatively, install vg following the instructions here.

Prepare the graph

For this demo, you can download a pangenome graph in GBZ format from here. If you want to build a graph using your own data, you can find instructions in the wiki.

The GBZ file contains the graph structure (the nodes and edges) as well as the haplotype paths through the graph. To simulate reads, we need the haplotypes in a separate GBWT file. Extract the GBWT file by running:

vg gbwt -o mhc.gbwt -g mhc.gg -Z mhc.gbz

Simulate reads

Now we are going to simulate long reads to map to the pangenome. Simulate reads by running:

vg sim -x mhc.gbz -l 10000 -n 5000 -e .002 -i .00001 -m MHC-HG00438 -g mhc.gbwt -a | vg view -aX - > reads.fastq

We note that vg sim has not been optimized for simulating long reads and will not produce realistic reads. If you want to simulate reads that better mimic real reads, we suggest using a different read simulator such as pbsim.
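As a quick sanity check on the simulation step, the number of reads in the FASTQ can be counted before mapping. The sketch below assumes that reads.fastq uses the usual four-lines-per-record FASTQ layout; `count_fastq_reads` is a hypothetical helper name, not part of vg.

```shell
# Hypothetical helper: count records in a FASTQ file,
# assuming exactly four lines per read (header, sequence, '+', qualities).
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Only run the check if the simulated reads are present.
if [ -f reads.fastq ]; then
    count_fastq_reads reads.fastq
fi
```

With the vg sim command above (`-n 5000`), the count should come out to 5000 if the assumption about the record layout holds.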

Map the reads with Giraffe

We are now ready to map the reads to the pangenome. Run giraffe:

vg giraffe -Z mhc.gbz -b hifi -f reads.fastq -p -t 1 > mapped.gam

When giraffe is done running, you should see a message that looks something like this:

Mapped 5000 reads across 1 threads in 93.4668 seconds with 0 additional single-threaded seconds.
Mapping speed: 53.4949 reads per second per thread
Used 186.828 CPU-seconds (including output).
Achieved 26.7626 reads per CPU-second (including output)
Memory footprint: 0.418903 GB

On a normal laptop computer running Giraffe with one thread, this took about 30 seconds to index the graph and a minute and a half to map the reads.
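The rates in the example summary can be reproduced from the other numbers it reports. The sketch below assumes that the per-thread speed is reads divided by wall-clock seconds and thread count, and that the per-CPU-second rate is reads divided by CPU-seconds; this matches the example figures, but it is an inference from the output rather than a statement of how Giraffe computes them. `giraffe_rates` is a hypothetical helper name.

```shell
# Hypothetical helper: recompute the two rates from a Giraffe summary.
# Arguments: reads, wall-clock seconds, threads, CPU-seconds.
giraffe_rates() {
    awk -v n="$1" -v w="$2" -v t="$3" -v c="$4" 'BEGIN {
        printf "reads per second per thread: %.4f\n", n / w / t
        printf "reads per CPU-second: %.4f\n", n / c
    }'
}

# Values copied from the example summary above.
giraffe_rates 5000 93.4668 1 186.828
```

For the example values this prints 53.4949 and 26.7626, in agreement with the reported summary.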

vg giraffe uses several indexes of the graph. The first time that vg giraffe is run, it will build these indexes and write them to automatically-named files, in this case mhc.dist, mhc.longread.withzip.min, and mhc.longread.zipcodes. The next time giraffe is run, it will look for these files instead of rebuilding them. They can also be specified with command line options.
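To see whether a later run will reuse existing indexes or rebuild them, it can help to check which of the automatically-named files are present. The sketch below only tests for the three file names listed above; `list_indexes` is a hypothetical helper, and the prefix `mhc` is taken from the graph file name used in this demo.

```shell
# Hypothetical helper: report which automatically-named index files
# exist for a given graph prefix (for this demo, "mhc").
list_indexes() {
    prefix="$1"
    for suffix in dist longread.withzip.min longread.zipcodes; do
        if [ -e "$prefix.$suffix" ]; then
            echo "found   $prefix.$suffix"
        else
            echo "missing $prefix.$suffix"
        fi
    done
}

list_indexes mhc
```

Before the first run all three should be reported as missing; after a successful run they should all be found.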

Output

The output of vg giraffe was written to mapped.gam, which represents alignments of the reads to the graph. This file is not human-readable, but there are a few ways to view it.

To convert the GAM file to JSON, run:

vg view -aj mapped.gam > mapped.json

To convert the GAM to GAF (a human-readable graph alignment format), run:

vg convert --gam-to-gaf mapped.gam mhc.gbz > mapped.gaf

To convert the GAM to BAM by projecting onto the reference, run:

vg surject -x mhc.gbz -b mapped.gam > mapped.bam

The GAF file should contain 5000 lines that look something like this:

6e6e45f10b27909e	10000	0	10000	+	>36717>36718>36720>36721>36723>36724>36726>36728>36729>36731>36733>36734>36735>36737>36738>36739>36741>36742>36743>36745>36747>36748>36750>36751>36753>36754>36756>36758>36759>36761>36762>36763>36765>36766>36767>36769>36770>36772>36773>36775>36777>36778>36780>36781>36783>36784	10625	436	10436	9976	10000	59	AS:i:9890	bq:Z:?????...??????	cs:Z::790*AG:266*CA:130*AT:597*GT:161*AG:892*TC:1105*AT:450*AC:482*CT:123*AG:561*AT:102*GA:108*CT:1102*CA:584*CT:116*GT:95*CA:98*AC:85*CG:26*TG:177*CT:105*CA:510*CT:165*CA:1146	dv:f:0.0024
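The fixed GAF columns can be summarized with a short awk command. The sketch below assumes the standard GAF column layout (column 1 is the read name, column 10 the number of matching bases, column 11 the alignment block length, column 12 the mapping quality) and reports matches divided by block length as an identity estimate. `gaf_summary` is a hypothetical helper name, and the sample record reuses the fixed columns of the example line above with the path shortened for illustration.

```shell
# Hypothetical helper: print read name, identity (matches / block length),
# and mapping quality for each GAF record, assuming the standard column order.
gaf_summary() {
    awk -F '\t' '$11 > 0 { printf "%s\t%.4f\t%d\n", $1, $10 / $11, $12 }' "$1"
}

# Illustrative record: fixed columns from the example line, path shortened.
sample=$(mktemp)
printf '%s\t' 6e6e45f10b27909e 10000 0 10000 + '>36717>36718' 10625 436 10436 9976 10000 > "$sample"
printf '%s\n' 59 >> "$sample"
gaf_summary "$sample"
rm -f "$sample"
```

For the example record this gives an identity of 0.9976 (9976 matches over a block length of 10000), which is consistent with the dv:f:0.0024 tag on that line. To apply it to the real output, run `gaf_summary mapped.gaf | head`.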