Extracting a FASTA from a Graph
Graph references often contain linear references within them. You might want a copy of one of these, for example to call variants with a linear-reference-based caller such as Google's DeepVariant.
If you don't already have a FASTA file for an assembly that is included in a graph, you can use vg to extract the assembly FASTA directly from the graph, like this:
vg paths --extract-fasta -x test/graphs/rgfa_with_reference.rgfa --paths-by GRCh38
Here, the argument to -x should be the graph file, in rGFA, GFA, .vg, .gbz, or any other graph file format that vg can read (see File Formats). The argument to --paths-by should be the prefix of the set of paths you would like to extract; generally you can use a sample or assembly name here. You can use vg paths --list -x <the graph> to get a list of all paths available.
This will produce a FASTA file on standard output:
>GRCh38#0#chr1
GGGGTACA
In most cases, the sequence names in the FASTA will be in PanSN format (see Path Metadata Model); these will match the names used by vg surject, and so a FASTA extracted like this is easy to use with a BAM file produced by vg surject.
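To illustrate the naming scheme, the sketch below splits a PanSN-style header into its sample, haplotype, and contig fields with awk. It recreates the example output above in a scratch file (`example.fa` is a hypothetical file name, not something vg produces) and assumes all three `#`-separated fields are present, which may not hold for every path in every graph.

```shell
# Recreate the example FASTA shown above so the snippet is self-contained.
# example.fa is a hypothetical scratch file name.
printf '>GRCh38#0#chr1\nGGGGTACA\n' > example.fa

# Split each header line on '#' (PanSN: sample#haplotype#contig),
# strip the leading '>' and print the three fields tab-separated.
pansn_fields=$(awk -F'#' '/^>/ { sub(/^>/, "", $1); print $1 "\t" $2 "\t" $3 }' example.fa)

echo "$pansn_fields"
# prints: GRCh38	0	chr1   (tab-separated)
```

A check like this can be handy for confirming that the extracted sequence names carry the sample and contig you expect before pairing the FASTA with other files.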
To save the FASTA to a file, redirect standard output with >.
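As a minimal sketch of the redirection, the snippet below reuses the command and graph path from the example above and writes the result to `GRCh38.fa`, a hypothetical output file name. The guard around the call is an addition for illustration only: it skips the extraction when vg or the graph file is not available, rather than leaving an empty output file behind.

```shell
graph=test/graphs/rgfa_with_reference.rgfa   # graph path from the example above
out=GRCh38.fa                                # hypothetical output file name

if command -v vg >/dev/null 2>&1 && [ -f "$graph" ]; then
    # Same command as above, with standard output redirected to a file.
    vg paths --extract-fasta -x "$graph" --paths-by GRCh38 > "$out"
    # Count the FASTA records that were written.
    grep -c '^>' "$out"
else
    echo "vg or $graph not found; skipping extraction" >&2
fi
```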
If you are interested in extracting haplotype paths from a .gbwt file, you can pass the .gbwt file with the -g option to vg paths, and the corresponding .gg file or any matching graph with -x.