
Snarls and chains

Faith Okamoto edited this page Dec 31, 2025 · 5 revisions

Papers, code, documentation, and other vg outputs refer to "snarls" and "chains". This page explains those concepts and provides examples.

Variation graphs can become very large and complicated. It is often useful to break down the graph and describe the commonly occurring nested substructures. Variations in the graph often have a common topological motif, with the variant described by one or more nodes that are flanked by two nodes representing conserved sequence. These topological motifs are called snarls. Also see our explainer videos.

Representing variants in a variation graph

In the following example, the variation graph represents a single nucleotide variant/polymorphism (SNV/P). The two alleles of the SNP are represented by nodes 2 and 3, and the conserved sequence flanking the SNP is represented by nodes 1 and 4.

SNP: node 1 connects to nodes 2 and 3, each of which connect to node 4

Similarly, for an insertion/deletion (indel), the inserted sequence is represented by node 2, while the deletion is represented by the edge between nodes 1 and 3. This is an example of a "regular" deletion allele. Another option is a "flat" allele where every allele is allocated at least one base. Doing so makes it easier to label deletions via a node. A "flat" allele could have, for example, deleted the C from the end of node 1, added a C to the start of node 2, and created a new node 3 consisting of just a C that would go in the middle of the deletion edge.

indel: node 1 connects to node 2 and node 3, with node 2 also connected to node 3

A duplication can be represented with a node that is connected to itself by an edge joining the end of the node back to its start. A path can traverse the node multiple times by taking this looping edge.

duplication: node 1's start connects to node 2's start, and node 2's end is connected both to its start and to node 3's start

An inversion is represented by a node that can be traversed in either direction in a left-to-right (or right-to-left) traversal of the graph.

inversion: node 2's start and end are each connected to both node 1's start and node 3's end (4 edges total)

Each of these graphs shares a common motif: the variant is a subgraph representing the variable sequence, flanked by two nodes representing conserved sequence. In other words, there are two or more alternative paths between two nodes (entrance/exit) in the graph. We refer to this topological motif as a "snarl". The term "bubble" is also common in the literature; a snarl is a generalization of a bubble. Generally, we use the term snarl to refer to the subgraph not including the two flanking nodes.
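The "two or more alternative paths between two nodes" idea can be illustrated with a minimal Python sketch. It assumes a simplified directed view of the SNP graph (node 1 to nodes 2 and 3, each to node 4) and uses invented node sequences; it is not how vg stores graphs.

```python
# Toy directed view of the SNP graph: node 1 -> {2, 3} -> node 4.
# The node sequences below are invented purely for illustration.
snp_edges = {1: [2, 3], 2: [4], 3: [4], 4: []}
snp_seqs = {1: "GATT", 2: "A", 3: "C", 4: "ACA"}


def walks(edges, start, end):
    """Yield every walk from start to end in an acyclic directed graph."""
    if start == end:
        yield [start]
        return
    for nxt in edges[start]:
        for rest in walks(edges, nxt, end):
            yield [start] + rest


# Each walk between the flanking nodes spells out one allele in context.
snp_walks = list(walks(snp_edges, 1, 4))
alleles = ["".join(snp_seqs[n] for n in walk) for walk in snp_walks]
print(snp_walks)  # [[1, 2, 4], [1, 3, 4]]
print(alleles)    # ['GATTAACA', 'GATTCACA']
```

The same function applied to the indel graph (edges 1 to 2, 1 to 3, 2 to 3) would yield one walk through the inserted node and one that skips it.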

In vg, there is almost no difference in the way variants of any size are represented. It is best to first think about our reference, which is a Path in the graph, or a string of nodes connected by edges in a linear fashion. A variant's size does not affect the way it appears in the graph. If we did not impose a limit on the number of basepairs in a node, the graph structure representing a very large SV and a single SNP would be indistinguishable.

The snarl decomposition

Snarls

Formally, a snarl is defined by a pair of node sides, $\{x_{\{l,r\}}, y_{\{l,r\}}\}$, that delimit a subgraph between them. The nodes x and y are referred to as the boundary nodes of the snarl. Two node sides define a snarl if they are:

  1. separable: splitting the boundary nodes into their two node sides disconnects the snarl from the rest of the graph,

and

  2. minimal: there is no node z within the snarl that is separable with the boundary nodes.

a deletion (connections between nodes 4 to 5, 4 to 6, and 5 to 6) is separated out to show just node 5 and the edges of 4 and 6 around it

In this example, the subgraph between nodes 4 and 6 is a snarl, highlighted in green. When the boundary nodes are split into their two node sides, the subgraph containing the snarl becomes unreachable from the rest of the graph. This snarl is delimited by the right side of node 4 and the left side of node 6. For specificity, the snarl may be written as "Snarl $4_r-6_l$" to refer to the node sides, or "Snarl 4fd-6fd" to refer to the traversal of the nodes. However, as there is only one snarl with nodes 4 and 6 as boundary nodes, we will generally refer to it as "Snarl 4-6".
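The two conditions can be checked directly on a small graph. The following is a rough Python sketch, not vg's snarl-finding algorithm: node sides are modelled as hypothetical `(node_id, "l" or "r")` tuples, the edge list encodes a SNP (nodes 1 to 4) followed by the deletion (nodes 4 to 6), and the case where both boundaries are the same node is not handled.

```python
def build(edges):
    """Undirected adjacency between node sides, e.g. (4, "r") -- (5, "l")."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def other(side):
    """Return the opposite side of the same node."""
    node, end = side
    return (node, "r" if end == "l" else "l")


def component(adj, start, split_nodes):
    """Sides reachable from start once the nodes in split_nodes are split in two."""
    seen, stack = {start}, [start]
    while stack:
        side = stack.pop()
        neighbours = set(adj.get(side, ()))
        if side[0] not in split_nodes:
            neighbours.add(other(side))  # unsplit nodes can be crossed
        for nxt in neighbours - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen


def separable(adj, x, y):
    """True if splitting the boundary nodes isolates the subgraph between x and y."""
    comp = component(adj, x, {x[0], y[0]})
    return y in comp and other(x) not in comp and other(y) not in comp


def is_snarl(adj, x, y):
    """Separable, and no interior node z splits the candidate into two separable parts."""
    if not separable(adj, x, y):
        return False
    interior = {n for n, _ in component(adj, x, {x[0], y[0]})} - {x[0], y[0]}
    for z in interior:
        for z_side in ((z, "l"), (z, "r")):
            if separable(adj, x, z_side) and separable(adj, other(z_side), y):
                return False
    return True


edges = [((1, "r"), (2, "l")), ((1, "r"), (3, "l")), ((2, "r"), (4, "l")),
         ((3, "r"), (4, "l")), ((4, "r"), (5, "l")), ((4, "r"), (6, "l")),
         ((5, "r"), (6, "l"))]
adj = build(edges)
print(is_snarl(adj, (4, "r"), (6, "l")))  # True: snarl 4-6
print(is_snarl(adj, (1, "r"), (6, "l")))  # False: separable but not minimal
```

The second call fails the minimality test because node 4 divides the region into two smaller separable pieces.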

Chains

Snarls often occur successively with a shared boundary node between them. A sequence of consecutive nodes and snarls is called a "chain".

a SNP and deletion next to each other, with each highlighted

In this example, the graph contains two snarls, snarl 1-4 and snarl 4-6. The two snarls are part of the same chain, chain 1-6. This chain comprises the two snarls and nodes 1, 4, and 6.
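As a hedged illustration of how snarls that share boundary nodes line up into a chain, the sketch below stitches `(start, end)` boundary pairs together. The function name `build_chains` is hypothetical, and the sketch assumes each node begins at most one snarl and that there are no looping chains.

```python
def build_chains(snarls):
    """Link snarls that share a boundary node into alternating node/snarl lists."""
    by_start = {start: end for start, end in snarls}
    ends = {end for _, end in snarls}
    chains = []
    for start in by_start:
        if start in ends:
            continue  # this snarl continues a chain that begins elsewhere
        chain, node = [start], start
        while node in by_start:
            nxt = by_start[node]
            chain.append((node, nxt))  # the snarl between two boundary nodes
            chain.append(nxt)          # the shared boundary node
            node = nxt
        chains.append(chain)
    return chains


chains = build_chains([(1, 4), (4, 6)])
print(chains)  # [[1, (1, 4), 4, (4, 6), 6]]
```

The single chain returned runs from node 1 to node 6, matching chain 1-6 above.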

Nesting

Snarls and chains may be nested inside of each other. Conceptually, this occurs when multiple variants affect the same locus, such as a SNP that is nested within an insertion. In a graph, a snarl contains another snarl if all of the nodes in the latter snarl are contained in the subgraph of the former. A snarl contains a chain if it contains all of the chain's component nodes and snarls.

insertion with nested SNP in snarl 1-6: connections 1 to 2, 2 to 3, 2 to 4, 3 to 5, 4 to 5, 5 to 6, and 1 to 6

The snarl tree

The nesting relationship of snarls and chains in a graph is described by its "snarl decomposition" or "snarl tree". Each node, snarl, and chain in the graph is represented as a node in the snarl tree. A chain is a child of a snarl if it is contained in the snarl and no other descendant of the snarl. A node or snarl is a child of a chain if it is a component of the chain. Every node and snarl is a child of a chain; a chain containing just one node is called a “trivial chain”. Because of this, the snarl tree is composed of alternating layers of snarls and chains, usually starting with a top-level chain as the root.

snarl tree for a simple variation graph as described below

In this example, chain 1-10 is the top-level chain containing the entire variation graph. Chain 1-10 has 6 children: nodes 1, 4, 9, and 10, and snarls 1-4 and 4-9. Snarl 1-4 contains two children, chains 2 and 3, each of which is a trivial chain containing only one node. Between nodes 9 and 10 is a "trivial snarl" containing only an edge between the boundary nodes. Trivial snarls are generally omitted from the snarl tree as they are implied by two consecutive nodes.
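The alternating structure of this tree can be written down as a small nested data structure. This is only an illustrative Python sketch with a hypothetical `TreeNode` class, not vg's Snarl Manager or Distance Index; the children of snarl 4-9 are not listed in the text, so they are left out.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    kind: str   # "chain", "snarl", or "node"
    name: str
    children: list = field(default_factory=list)


def alternates(item):
    """Check that chains hold only nodes/snarls and snarls hold only chains."""
    allowed = {"chain": {"node", "snarl"}, "snarl": {"chain"}, "node": set()}
    return all(child.kind in allowed[item.kind] and alternates(child)
               for child in item.children)


def trivial_chain(node_id):
    """A chain containing just one node."""
    return TreeNode("chain", node_id, [TreeNode("node", node_id)])


root = TreeNode("chain", "1-10", [
    TreeNode("node", "1"),
    TreeNode("snarl", "1-4", [trivial_chain("2"), trivial_chain("3")]),
    TreeNode("node", "4"),
    TreeNode("snarl", "4-9"),  # children not given in the text, so omitted
    TreeNode("node", "9"),
    TreeNode("node", "10"),    # the trivial snarl 9-10 is implied, not stored
])

print(len(root.children))  # 6
print(alternates(root))    # True
```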

Netgraphs

Nodes, snarls, and chains all have the property that they are connected to the rest of the graph by two node sides. Based on this property, snarls and chains can be treated as nodes within their parents, obscuring the presence of a subgraph between the boundaries of the child. A “netgraph” is a view of a snarl where its child chains are replaced by nodes.

example of collapsing snarls inside of subchains to make a netgraph as described below

This example shows the netgraph view of snarl 1-12. In this view, chains 2-7 and 8-11 are each replaced by a single node that represents the subgraph of the chain. Because each chain is connected to the rest of the graph by only two node sides, it can be replaced by a single node without changing the topology of the snarl.
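A rough sketch of the collapsing step, in Python: every node that belongs to a child chain is relabelled with the chain's name, and edges that fall entirely inside one chain disappear. The edge list below is a hypothetical layout for snarl 1-12 (the figure's actual edges are not given in the text), and node sides are ignored for brevity.

```python
def netgraph(edges, chain_of):
    """Collapse each child chain to one node; drop edges internal to a chain."""
    net = set()
    for a, b in edges:
        ca, cb = chain_of.get(a, a), chain_of.get(b, b)
        if ca != cb:
            net.add((ca, cb))
    return net


# Hypothetical edges: two child chains sit side by side inside snarl 1-12.
snarl_edges = [(1, 2), (2, 3), (2, 4), (3, 5), (4, 5), (5, 6), (6, 7), (7, 12),
               (1, 8), (8, 9), (8, 10), (9, 11), (10, 11), (11, 12)]
chain_of = {n: "2-7" for n in range(2, 8)}
chain_of.update({n: "8-11" for n in range(8, 12)})

net = netgraph(snarl_edges, chain_of)
print(sorted(net, key=str))
```

With this layout the netgraph keeps four edges: from node 1 into each chain, and from each chain to node 12.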

Complex cases

In general, variation graphs built from human data tend to have very long top-level chains with few, simple snarls and shallow nesting. However, there can be larger, more complicated snarl structures that need to be taken into account when building data structures and algorithms based on the snarl decomposition.

  • Snarls/chains that are not start-end connected: It is possible for a valid snarl or chain to not have a valid path connecting its boundary nodes. This means that there is no valid path within its parent chain from the start of the chain to the end of the chain.

snarl 4-7 is not start-end connected; edges are 4's end to 6's start, 6's end to 5's end, and 5's end to 7's start

  • Looping chains: It is possible for the top-level chain to have the same node as both of its boundary nodes. This can happen for cyclic chromosomes.

variation graph where last node's end connects to first node's start

In this example, the chain 1-1 can loop by taking the edge from 11 to 1.

  • Looping chains with non start-end connected snarls: This is usually an artifact of the way snarls and chains are found in the graph.

normal insertion with nested SNP graph, but with one node of the insertion having a very long sequence

Intuitively, in this example, there is a top-level chain between nodes 1 and 7. However, the snarl-finding algorithm will try to root the snarl tree on the longest sequence of nodes in the chain, which can actually be found within the snarl. In this example, $6_r-3_l$ is a valid snarl, even though its two boundary nodes are not reachable from each other by a valid path through the snarl.

zoom in on a SNP with a snarl tree showing the allele nodes are the chain bounds

This can also happen when the boundary nodes are chosen to be the two alleles of a SNP.

  • Changing direction within a chain in a snarl: If there are edges that allow a path to turn around within a chain, they won't be taken into account in the connectivity of the netgraph of its parent snarl. This isn't really an edge case, but it can be confusing and is easy to forget about.

example snarl with paths that can turn around, due to a nested insertion

In this example, there is a path from the right side of node 1 back to the right side of node 1, but it can't be seen in the netgraph view.
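The first case above, a snarl that is not start-end connected, can be tested with a short traversal. The Python sketch below assumes node sides are `(node_id, "l" or "r")` tuples and that a valid path must leave a node from the side opposite the one it entered; the function name is hypothetical and this is not vg's implementation. The first edge list is the one from the snarl 4-7 example.

```python
def start_end_connected(edges, x, y):
    """True if a valid path leaves boundary side x and arrives at boundary side y."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    frontier = list(adj.get(x, ()))
    seen = set(frontier)
    while frontier:
        arrived = frontier.pop()
        if arrived == y:
            return True
        if arrived[0] in (x[0], y[0]):
            continue  # reached a boundary node: do not walk out of the snarl
        # A path that enters a node on one side must exit from the other.
        exit_side = (arrived[0], "r" if arrived[1] == "l" else "l")
        for nxt in adj.get(exit_side, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False


# Snarl 4-7: 4's end to 6's start, 6's end to 5's end, 5's end to 7's start.
snarl_4_7 = [((4, "r"), (6, "l")), ((6, "r"), (5, "r")), ((5, "r"), (7, "l"))]
print(start_end_connected(snarl_4_7, (4, "r"), (7, "l")))  # False

# The deletion snarl 4-6 from earlier, for comparison.
snarl_4_6 = [((4, "r"), (5, "l")), ((4, "r"), (6, "l")), ((5, "r"), (6, "l"))]
print(start_end_connected(snarl_4_6, (4, "r"), (6, "l")))  # True
```

In snarl 4-7 the only way into node 5 is through its end, so a path is forced out of its start, which has no edges; the edge from 5's end to 7's start can never be taken.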

Implementation details

In vg, we have two different data structures for representing snarls: the Snarl Manager and the Distance Index.

In the Distance Index, the root-level structure of the snarl tree is a "snarl", which contains the entire graph. Each top-level chain is a child of this root snarl. There may be connectivity within the root snarl if a connected component of the graph doesn't decompose into a single chain. Each child of the root snarl is therefore not necessarily a separate connected component.

The Distance Index's definition of snarl does not include the boundary nodes of the snarl. The bound of a snarl is a "sentinel" that represents the inner node side of the boundary node, but not technically the node itself.

Paths and Snarls and SnarlTraversals

We may also wish to store and manipulate the underlying haplotype paths which are formed from following certain alleles in each snarl.

Paths in vg are annotations on Nodes of the graph. This may include mismatches, insertions, and deletions relative to the basepairs of the node. Paths are permitted to be empty (e.g. in the case of a flat genomic deletion); however, these empty paths are not useful in practice as they do not provide any coordinate reference to the graph. Paths are composed of Mappings, each of which is a position (a node ID and a basepair where the mapping begins within that node) and a list of Edits (the operations performed to transform the reference sequence to that described by the path). Tracing along the path's Mappings and replacing the "from" field of each Edit with its "to" field would yield the sequence the path intends to spell out.
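To make the "replace from with to" idea concrete, here is a small Python sketch. The `Edit` and `Mapping` classes are plain dataclasses modelled loosely on the description above, with assumed field names (`from_length`, `to_length`, `sequence`, `node_id`, `offset`); they are not vg's API, node orientation is ignored, and the node sequence is invented.

```python
from dataclasses import dataclass, field


@dataclass
class Edit:
    from_length: int    # bases consumed from the node
    to_length: int      # bases produced in the path's sequence
    sequence: str = ""  # replacement bases; empty for a perfect match


@dataclass
class Mapping:
    node_id: int
    offset: int         # basepair within the node where the mapping begins
    edits: list = field(default_factory=list)


def spell(mappings, node_seqs):
    """Spell out the sequence a path describes, one Mapping at a time."""
    out = []
    for mapping in mappings:
        node_seq, pos = node_seqs[mapping.node_id], mapping.offset
        for edit in mapping.edits:
            if not edit.sequence and edit.from_length == edit.to_length:
                out.append(node_seq[pos:pos + edit.from_length])  # match
            else:
                out.append(edit.sequence)  # substitution, insertion, deletion
            pos += edit.from_length
    return "".join(out)


node_seqs = {1: "GATTACA"}  # invented sequence for illustration
path = [Mapping(1, 0, [
    Edit(3, 3),        # match "GAT"
    Edit(1, 1, "G"),   # substitution T -> G
    Edit(1, 0),        # deletion of "A"
    Edit(0, 2, "CC"),  # insertion of "CC"
    Edit(2, 2),        # match "CA"
])]
print(spell(path, node_seqs))  # GATGCCCA
```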

Paths in vg only represent alleles, excluding the start and end nodes of a bubble. SnarlTraversals are exactly equivalent to Paths, except they are restricted to only have perfect matches to the underlying graph (i.e. no Edit fields exist nor would they be permissible except as exact matches). SnarlTraversals must be associated with a Snarl.

Snarls and SnarlTraversals solve the problem of empty paths. By requiring an entrance/exit for each Snarl, we can represent a deleted genomic region as a SnarlTraversal containing no nodes but associated with a Snarl describing the flanking entrance/exit nodes.
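A minimal sketch of why the pairing helps, using hypothetical Python dataclasses rather than vg's own types, and invented node sequences for the deletion snarl 4-6: the traversal for the deletion allele visits no nodes, yet it can still be placed in the graph because the Snarl records the flanking nodes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snarl:
    start: int  # entrance boundary node
    end: int    # exit boundary node


@dataclass
class SnarlTraversal:
    snarl: Snarl
    visits: list = field(default_factory=list)  # interior nodes, in order


node_seqs = {4: "AC", 5: "GT", 6: "TA"}  # invented sequences


def allele(traversal):
    """Sequence of the interior nodes only; empty for a deletion."""
    return "".join(node_seqs[n] for n in traversal.visits)


def with_flanks(traversal):
    """Allele in context, anchored by the snarl's boundary nodes."""
    snarl = traversal.snarl
    return node_seqs[snarl.start] + allele(traversal) + node_seqs[snarl.end]


snarl = Snarl(4, 6)
reference = SnarlTraversal(snarl, [5])
deletion = SnarlTraversal(snarl)  # no nodes visited

print(repr(allele(deletion)))  # ''
print(with_flanks(reference))  # ACGTTA
print(with_flanks(deletion))   # ACTA
```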
