Mapping long reads with Giraffe

This tutorial covers mapping DNA reads from long-read sequencers, such as those from Oxford Nanopore or PacBio, to a pangenome reference with vg giraffe. For RNA-seq, see Long‐read RNA‐seq with pre‐existing pangenome. For short reads like Illumina or Element, see Mapping short reads with Giraffe.

Installation

Long-read support was released in vg 1.63.0. Make sure you have vg 1.63.0 or later installed.
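If you are not sure which version is on your PATH, a check along these lines may help. This is a minimal sketch: the version_ge helper is a hypothetical name, and it assumes that the first line of `vg version` output contains a string like v1.63.0, which may not hold for every build.

```shell
# Hypothetical helper: succeeds when dotted version $1 is >= dotted version $2.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

REQUIRED="1.63.0"
if command -v vg >/dev/null 2>&1; then
    # Assumption: the first line of `vg version` looks like: vg version v1.63.0 ...
    installed="$(vg version 2>/dev/null | head -n 1 |
        sed -n 's/.*v\([0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\).*/\1/p')"
    if version_ge "${installed:-0}" "$REQUIRED"; then
        echo "vg $installed is new enough for long-read Giraffe"
    else
        echo "vg ${installed:-unknown} is older than $REQUIRED; please upgrade"
    fi
else
    echo "vg was not found on PATH"
fi
```

The comparison is numeric per component, so 1.100.0 is correctly treated as newer than 1.63.0.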

Obtaining Data

You will need a file of reads to align. For this example, we will use longread/hifi.fq. You will also need a pangenome to map to. For this example, we will use the "AF-Filtered VG indexes" graph from the Human Pangenome Reference Consortium version 1.1 release, based on a CHM13 linear reference. You can get it here:

https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-chm13/hprc-v1.1-mc-chm13.d9.gbz

You can also pre-download the distance index for the graph to save on indexing time:

https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-chm13/hprc-v1.1-mc-chm13.d9.dist

Don't download any other index files; the HPRC 1.1 graphs pre-date long-read support in Giraffe, and the other index files need to be re-generated.
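One way to fetch just these two files is sketched below. It assumes curl is installed; fetch_if_missing is a hypothetical helper name, and the calls are left commented out because the files are large.

```shell
# Hypothetical helper: download a URL into the current directory unless a
# non-empty file with the same name is already there.
fetch_if_missing() {
    url="$1"
    file="$(basename "$url")"
    if [ -s "$file" ]; then
        echo "$file already present, skipping"
        return 0
    fi
    # -f fails on HTTP errors, -L follows redirects. Downloading to a .part
    # name keeps an interrupted transfer from looking like a finished file.
    curl -fL -o "$file.part" "$url" && mv "$file.part" "$file"
}

# Usage (large downloads, so left commented out here):
# base=https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-chm13
# fetch_if_missing "$base/hprc-v1.1-mc-chm13.d9.gbz"
# fetch_if_missing "$base/hprc-v1.1-mc-chm13.d9.dist"
```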

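If you don't want to use the full-size graph, you can follow along with a small test graph included with vg. The commands below build a GBZ under the same file name from the test graph longread/graph.gfa, then extract its test version of the CHM13 reference as a FASTA (CHM13.pansn.fa), which the BAM steps later in this tutorial use: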

vg gbwt --gbz-format -g hprc-v1.1-mc-chm13.d9.gbz -G longread/graph.gfa
vg paths --extract-fasta --sample CHM13 -x hprc-v1.1-mc-chm13.d9.gbz >CHM13.pansn.fa

Indexing the Graph

While vg giraffe can do its own indexing, you can pre-index the graph to prevent your first vg giraffe run from needing to do it (and then distribute the indexes to your cluster nodes, if operating on a cluster). You can do this with:

vg autoindex --workflow lr-giraffe --prefix hprc-v1.1-mc-chm13.d9 --gbz hprc-v1.1-mc-chm13.d9.gbz

This creates the long-read index files alongside the graph:

ls hprc-v1.1-mc-chm13.d9.*

hprc-v1.1-mc-chm13.d9.dist
hprc-v1.1-mc-chm13.d9.gbz
hprc-v1.1-mc-chm13.d9.longread.withzip.min
hprc-v1.1-mc-chm13.d9.longread.zipcodes

The new files are:

hprc-v1.1-mc-chm13.d9.dist (the distance index, if you don't have it already)
hprc-v1.1-mc-chm13.d9.longread.withzip.min (the "minimizers" used to find seeds, with embedded "zipcodes")
hprc-v1.1-mc-chm13.d9.longread.zipcodes (the zipcodes too large to store in the minimizer file)
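Before copying indexes to cluster nodes, it can be useful to confirm that the full set is present. The sketch below only checks that the four files listed above exist for a given prefix; missing_indexes is a hypothetical helper name, and the check says nothing about whether the files are valid or up to date.

```shell
# Hypothetical helper: print the long-read Giraffe files that are missing
# for the graph prefix given as $1 (prints nothing when all are present).
missing_indexes() {
    for suffix in gbz dist longread.withzip.min longread.zipcodes; do
        [ -e "$1.$suffix" ] || echo "$1.$suffix"
    done
}

missing="$(missing_indexes hprc-v1.1-mc-chm13.d9)"
if [ -n "$missing" ]; then
    echo "Missing files:"
    echo "$missing"
else
    echo "All long-read index files are present"
fi
```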

Mapping Long Reads

To invoke vg giraffe in long-read mode, use the --parameter-preset/-b option to specify hifi (for PacBio HiFi reads) or r10 (for Oxford Nanopore R10 chemistry reads), as appropriate for your data. Use the --gbz-name/-Z option to specify the .gbz graph file, and the --fastq-in/-f option to specify the input reads in FASTQ format. You can also add --progress/-p for informative progress messages, and remember to redirect the aligned reads to a GAM file.

vg giraffe -b hifi -Z hprc-v1.1-mc-chm13.d9.gbz -f longread/hifi.fq -p >hifi.mapped.gam

Giraffe will automatically find the correct long-read indexes to go with the .gbz graph, or generate them if they are missing. If you want to pass them explicitly, you can use the --minimizer-name/-m, --zipcode-name/-z, and --dist-name/-d options.

Giraffe will also guess the number of threads to use, but you can override this with the --threads/-t option.
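Putting the explicit index options and the thread count together might look like the sketch below. The wrapper name map_with_explicit_indexes is hypothetical, the thread count of 16 in the usage line is an arbitrary example, and the index file names assume the prefix-based layout produced by the autoindex step above.

```shell
# Hypothetical wrapper around vg giraffe that passes every index explicitly.
# Arguments: preset (hifi or r10), graph prefix, reads FASTQ, thread count.
# Alignments go to standard output, so redirect them to a GAM file.
map_with_explicit_indexes() {
    preset="$1"
    prefix="$2"
    reads="$3"
    threads="$4"
    vg giraffe -b "$preset" \
        -Z "$prefix.gbz" \
        -m "$prefix.longread.withzip.min" \
        -z "$prefix.longread.zipcodes" \
        -d "$prefix.dist" \
        -t "$threads" \
        -f "$reads" \
        -p
}

# Usage:
# map_with_explicit_indexes hifi hprc-v1.1-mc-chm13.d9 longread/hifi.fq 16 >hifi.mapped.gam
```

In most cases the automatic index discovery is simpler; passing the files explicitly is mainly useful when the indexes live somewhere other than next to the .gbz file.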

For more on the many available vg giraffe options, see the giraffe section of the vg manpage.

Mapping in Other Formats

If you prefer standard GAF-format output, you can ask for that with the --output-format/-o option:

vg giraffe -b hifi -Z hprc-v1.1-mc-chm13.d9.gbz -f longread/hifi.fq -p -o GAF >hifi.mapped.gaf

If you want to go straight to linear-reference BAM, that is available too. When working with long reads, add --prune-low-cplx/-P to the command when surjecting to BAM. It re-computes slippery parts of the alignment against the linear target reference, which matters because with long reads you won't be able to do indel realignment later. You also probably want to select a single reference assembly to surject to, by using the --ref-paths option. The HPRC graphs contain both CHM13 and GRCh38 assemblies, and by default each read in the BAM will be aligned to whichever contig in whichever reference assembly it matches best.

samtools dict CHM13.pansn.fa >CHM13.pansn.dict
vg giraffe -b hifi -Z hprc-v1.1-mc-chm13.d9.gbz -f longread/hifi.fq -p -o BAM --ref-paths CHM13.pansn.dict -P >hifi.mapped.bam

Adapting BAM Sequence Names

After making a BAM, check the headers with samtools view -H hifi.mapped.bam and confirm that the sequence names are what you want. They will probably be in PanSN format (like CHM13#0#chr1). If your downstream tools expect your BAM to have sequence names without the assembly name in them, like just chr1, you will need to reheader your BAM:

samtools reheader -c "sed s/CHM13#0#//g" hifi.mapped.bam >hifi.mapped.reheadered.bam
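To preview what a header edit will do before touching the BAM, you can run the sed expression on a sample header first. The sketch below uses a slightly more conservative variant that only rewrites the SN: field of @SQ lines. The helper name strip_pansn_prefix and the two header lines (including the made-up length) are illustrative assumptions, not output from a real file.

```shell
# Hypothetical helper: remove a PanSN prefix (given as $1) from the SN: field
# of @SQ header lines read on standard input. Other header lines pass through
# unchanged. The prefix must not contain "/" or regex metacharacters.
strip_pansn_prefix() {
    sed "/^@SQ/ s/SN:$1/SN:/"
}

# Made-up header lines for illustration only.
printf '@SQ\tSN:CHM13#0#chr1\tLN:1000\n@CO\tsurjected onto CHM13#0#chr1\n' |
    strip_pansn_prefix 'CHM13#0#'
```

Here the @SQ line comes out with SN:chr1 and the @CO line is left as it was. A real header can be previewed the same way with samtools view -H hifi.mapped.bam | strip_pansn_prefix 'CHM13#0#'.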

Working with Long Read Alignments

Once you have aligned your reads, you can check, convert, and surject the alignments as described below.

Collecting GAM Statistics

If you have a GAM, you can get some quality control statistics with:

vg stats -a hifi.mapped.gam

If you're following along with the single-read example input, you might see something like this:

Total alignments: 1
Total primary: 1
Total secondary: 0
Total aligned: 1
Total perfect: 1
Total gapless (softclips allowed): 1
Total paired: 0
Total properly paired: 0
Alignment score: mean 15580, median 15580, stdev 0, max 15580 (1 reads)
Mapping quality: mean 60, median 60, stdev 0, max 60 (1 reads)
Insertions: 0 bp in 0 read events
Deletions: 0 bp in 0 read events
Substitutions: 0 bp in 0 read events
Softclips: 0 bp in 0 read events
Total time: 0.00598654 seconds
Speed: 167.041 reads/second

If you have unusually few reads with mapping quality 60, or an unusually large number of softclipped bases, that could be a sign of trouble. To pull more granular statistics (e.g. mapping quality by read name), use vg filter --tsv-out.
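For a quick pass over many samples, the headline numbers can be pulled out of the text report. The sketch below assumes the report lines are formatted exactly as in the example above, which may change between vg versions; stats_summary is a hypothetical helper name, and the demo input is copied from that example.

```shell
# Hypothetical helper: read `vg stats -a` text on standard input and print the
# alignment count, mean mapping quality, and softclipped bases on one line.
stats_summary() {
    awk -F'[ ,]+' '
        /^Total alignments:/ { total = $3 }
        /^Mapping quality:/  { mapq = $4 }
        /^Softclips:/        { clipped = $2 }
        END { printf "alignments=%s mean_mapq=%s softclipped_bp=%s\n", total, mapq, clipped }
    '
}

# Demo on lines copied from the example report above.
stats_summary <<'EOF'
Total alignments: 1
Mapping quality: mean 60, median 60, stdev 0, max 60 (1 reads)
Softclips: 0 bp in 0 read events
EOF
```

With a real GAM, the same helper would be used as vg stats -a hifi.mapped.gam | stats_summary.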

Converting Formats

If you mapped your reads in one format, you might need to convert them to a different one for analysis.

You can convert GAM to GAF:

vg convert --gam-to-gaf hifi.mapped.gam hprc-v1.1-mc-chm13.d9.gbz >hifi.mapped.gaf

You can also go the other way and convert GAF to GAM:

vg convert --gaf-to-gam hifi.mapped.gaf hprc-v1.1-mc-chm13.d9.gbz >hifi.mapped.gam

Note that these conversions require the graph file!

Stripping GAM Metadata

If you are trying to use a vg giraffe GAM file with an older non-vg tool that reads GAM, or an old version of vg, and you get messages like:

what(): [vg::io::MessageIterator] obsolete, invalid, or corrupt input at message 12345 group 45678

Then you may be dealing with a tool that does not understand the run-level metadata embedded in the GAM file by newer versions of Giraffe. In that case, converting from GAM to GAF and back to GAM again can be used to remove that metadata.
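The round trip can be wrapped up as in the sketch below, which reuses the two vg convert commands from the previous section. The function name strip_gam_metadata is hypothetical, and because GAF does not carry every GAM field, the round trip may drop information other than the run-level metadata.

```shell
# Hypothetical helper: rewrite a GAM by converting it to GAF and back.
# Arguments: input GAM, graph file (.gbz), output GAM.
strip_gam_metadata() {
    tmp_gaf="$(mktemp)" || return 1
    vg convert --gam-to-gaf "$1" "$2" >"$tmp_gaf" &&
        vg convert --gaf-to-gam "$tmp_gaf" "$2" >"$3"
    status=$?
    # Remove the intermediate GAF whether or not the conversion succeeded.
    rm -f "$tmp_gaf"
    return "$status"
}

# Usage:
# strip_gam_metadata hifi.mapped.gam hprc-v1.1-mc-chm13.d9.gbz hifi.stripped.gam
```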

Surjecting Long Reads to BAM

If you have GAM or GAF alignments and you want linear-reference BAM files, for tools like DeepVariant, you can use vg surject to "surject" your alignments and squash them down to a linear reference. With long reads, it is important to use the --prune-low-cplx/-P option of vg surject, because it is common for the reads to span low-complexity regions such as tandem repeats, and most downstream tools (such as variant callers) will expect local alignments to such regions to be optimal against the linear reference.

To specify the particular linear reference to surject to, use the --ref-paths option to provide a path list or dict file. (If you forget this, you will get reads surjected to whichever reference they match better.) The file should match the contig naming in the graph. To check this, use:

vg paths --list --reference-paths -x hprc-v1.1-mc-chm13.d9.gbz
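To compare that listing against a .dict file automatically, something like the following sketch may help. Both helper names are hypothetical, and the sketch assumes the .dict file uses tab-separated @SQ lines with an SN: field, as samtools dict writes them.

```shell
# Hypothetical helper: print the SN: value of every @SQ line in the .dict
# file given as $1.
dict_names() {
    awk -F'\t' '/^@SQ/ {
        for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4)
    }' "$1"
}

# Hypothetical helper: print names that appear in the .dict file ($1) but not
# among the reference paths of the graph ($2). No output means every name in
# the .dict file matches a path in the graph.
names_missing_from_graph() {
    dict_names "$1" | sort >"$1.names.tmp"
    vg paths --list --reference-paths -x "$2" | sort | comm -13 - "$1.names.tmp"
    rm -f "$1.names.tmp"
}

# Usage:
# names_missing_from_graph CHM13.pansn.dict hprc-v1.1-mc-chm13.d9.gbz
```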

If your paths are in PanSN name format (like CHM13#0#chr1), you need a path list or Samtools-style .dict file in PanSN format. Once you have it, you can do surjection:

vg surject -b -x hprc-v1.1-mc-chm13.d9.gbz hifi.mapped.gam --prune-low-cplx --ref-paths CHM13.pansn.dict >hifi.mapped.bam

Note that the surjected BAM will use sequence names that match the path names in the graph (i.e. PanSN names), and may need to be processed with samtools reheader for use with a non-PanSN FASTA file. See "Adapting BAM Sequence Names" above.

If you want to surject GAF alignments, you also need the --gaf-input/-G option:

vg surject -b -x hprc-v1.1-mc-chm13.d9.gbz -G hifi.mapped.gaf --prune-low-cplx --ref-paths CHM13.pansn.dict >hifi.mapped.bam

Conclusion

After following along with the tutorial, remember to clean up:

rm hifi.mapped.{gaf,gam,bam,reheadered.bam} hprc-v1.1-mc-chm13.d9.{dist,gbz,longread.withzip.min,longread.zipcodes} CHM13.pansn.{fa,dict}
