Graphs, a.k.a. networks, are useful ways to represent relational data. Perhaps the most intuitive example is a social network. The nodes of the network represent users and the edges between pairs of nodes represent which users follow one another. Networks come up in many other domains as well. For example, in neuroscience we use networks to model connectivity between neurons. In chemistry, networks are used to represent how atoms bond together to form molecules. There is a rich line of work in statistics, social science, and machine learning on modeling random networks and making predictions from network-valued data.
Let
- An undirected network is one in which
$X_{ij} = 1 \implies X_{ji} = 1$ . That is,$\mbX$ is symmetric. - A directed network may have an asymmetric adjacency matrix.
- A network has no self-loops if
$X_{ii} = 0 \forall i=1,\ldots,n$ . - A network is simple if it is undirected and has no self loops.
- A network is connected if for for all pairs of nodes
$(i,j)$ there exists a path from node$i$ to node$j$ . - The degree of a node is its number of neighbors. In a directed graph, we may distinguish between the in-degree and out-degree.
- A graph is called sparse if the number of edges
$E$ is$\cO(n)$ . Otherwise it is called dense.
Sometimes we want to model graphs where the edges come with some metadata, like a weight. We can represent a weighted graph as a matrix
These definitions are just a brief overview of terms from graph theory. There is an enormous literature in mathematics, computer science, probability, etc. dealing with these objects.
There are many questions we might like to answer with graphs. We will focus on the following three:
-
Edge prediction: given observations of a subset of the adjacency matrix, predict the missing values of
$\mbX$ . - Community discovery: given an adjacency matrix, assign nodes to clusters based on their connectivity.
- Feature prediction (or supervised learning, or simply regression): given an adjacency matrix representing a graph (of possibly varying number of nodes), predict some property of that graph. For example, given a graph representing a molecule, predict whether it will be fluorescent.
One way to answer these questions is by building a generative model for networks. It is not strictly necessary to have a generative model, as we'll talk about at the end of this lecture, but having one will allow us to answer these kinds of questions and more.
Let's start by considering a distribution on binary adjacency matrices with
:::{admonition} How many undirected graphs without self loops are there?
:class: dropdown
There are
Random network models are distributions over this space. One of the design considerations for such models is balancing
- expressivity: the need for a model that captures a range of realistic connectivity patterns, with
- interpretability: the want for a model to have relatively few parameters that govern its generative process.
The Erdős-Rényi (ER) model for random graphs is perhaps the simplest non-trivial model. Under this model, each edge is an iid Bernoulli random variable,
\begin{align*}
X_{ij} &\iid\sim \mathrm{Bern}(\rho)
\end{align*}
with a single parameter
The expected number of edges in an undirected model without self loops is
For example, Erdős and Rényi showed that
While mathematically attractive, the ER model is a rather simple model of networks. One way to enrich our models is by introducing auxiliary variables.
In a stochastic block model (SBM), each node
:::{admonition} SBM is a generalization of the Erdős-Rényi model
:class: tip
Note that when
Stochastic block models are like the graph version of a discrete mixture model. They are often used for community discovery — clustering nodes into groups based on their connectivity patterns. In this application, the community assignments are treated as latent variables drawn from a categorical prior, \begin{align*} z_i &\sim \mathrm{Cat}(\mbpi). \end{align*} We will give the model parameters conjugate priors as well, \begin{align*} \rho_{k,k'} &\iid{\sim} \mathrm{Beta}(\alpha, \beta) &\text{for } k,k'&= 1,\ldots,K \ \mbpi &\sim \mathrm{Dir}(\gamma \mbone_K) \end{align*} The goal is to infer the posterior distribution of community assignments and parameters given the observed adjacency matrix.
One simple way to approch this problem is Gibbs sampling. In this model, the latent variables and parameters are conditionally conjugate. The complete conditional distributions of the community assignment variables are, \begin{align*} \Pr(z_i = k \mid \mbX, \mbpi, \mbR, \mbz_{\neg i}) &\propto p(z_i = k \mid \mbpi) \prod_{j \neq i} \mathrm{Bern}(X_{ij} \mid \rho_{k,z_j}) \ &\propto \pi_k \prod_{j \neq i} \mathrm{Bern}(X_{ij} \mid \rho_{k,z_j}). \end{align*} This is a simple categorical distribution.
Given the community assignments, the connection probabilities
:::{admonition} Exercise
Derive expressions for
Just as the SBM is analogous to a discrete mixture model, a latent space model (LSM) is analogous to a continuous factor model.
In an LSM, each node is associated with a continuous latent variable
Such models are widely used in the analysis of social network data {cite:p}hoff2002latent. The latent variables
:::{admonition} Connection to Mixed Effects Models :class: tip Think back to HW2: Bayesian GLMs, where you developed Bayesian generalized linear mixed effects models. We have the same sort of model here, but for graphs! :::
To complete the generative model, assume a Gaussian prior on the latent variables,
\begin{align*}
\mbz_i &\iid\sim \mathrm{N}(\mbzero, \mbI).
\end{align*}
Unlike the SBM, the conditional distributions are not conjugate in the LSM. Instead,
\begin{align*}
p(\mbz_i \mid \mbX) &\propto \mathrm{N}(\mbz_i \mid \mbzero, \mbI) \prod_{j \neq i} \mathrm{Bern}(\sigma(\mbz_i^\top \mbLambda \mbz_j + b)).
\end{align*}
One way to approach the posterior inference problem is MCMC. For example, we could use a Metropolis-Hastings algorithm to update the latent variables one at a time. This is what {cite:t}hoff2002latent suggest.
Alternatively, we could use Pólya-gamma augmentation, like we did in HW2, to render the model conditionally conjugate. In that case, each latent variable would have a Gaussian conditional, holding the others fixed.
Finally, a latent position model (sometimes called a latent distance model) follows a similar form to the LSM above. Let hoff2007modeling showed that the LSM weakly generalizes the latent position model, in the sense that an LSM with
The models above all assume that the edges
Consider modeling a collection of variables
de Finetti's theorem states that as
Extensions of de Finetti's theorem have been proven for partially exchangeable arrays, like graphs {cite:p}aldous1981representations,hoover1979relations. A random graph is exchangeable if its distribution is invariant to permutations of the node labels. That is, the distribution of the matrix
As discussed by {cite:t}orbanz2013bayesian, a simple graph is exchangeable if and only if there is a random symmetric function
The random function
While exchangeability and invariance to node relabeling seem intuitive properties of many graphs, they have some unrealistic consequences. For example, exchangeable random graphs are either dense or empty (Fact VII.2 in {cite:p}orbanz2013bayesian).
To see why, consider a simple random graph on
&= \sum_{i=1}^n \sum_{j=1}^{i-1} \bbE[X_{ij}] \
&= \sum_{i=1}^n \sum_{j=1}^{i-1} \bbE[W(U_i, U_j)] \
&= \sum_{i=1}^n \sum_{j=1}^{i-1} \bbE[\bbE[W(U_i, U_j) \mid U_i, U_j]] \
&= {n \choose 2} \epsilon \
\text{where } \epsilon &= \frac{1}{2} \int_{[0,1]^2} W(u,v) \dif u \dif v.
\end{align*}
The
There are two possibilities:
-
$\epsilon = 0$ in which case the graph is empty, or -
$\epsilon > 0$ in which case the expected number of edges is${n \choose 2} \epsilon = \Theta(n^2)$ , which implies that the graph is almost surely dense.
So how can we generate sparse random graphs? One way is to imagine nodes arriving in a sequence. {cite:t}bloem2018random propose the following random walk graph model,
- Initialize the graph
$G_1$ with a single edge connecting two vertices. - For
$t=2,\ldots,T$ generate$G_t$ from$G_{t-1}$ as follows: a. Select a vertex$V$ from$G_{t-1}$ at random b. With probability$\alpha$ , attach a new vertex to$V$ c. Else, run a simple random walk, starting at$V$ , for a Poisson distributed number of time steps. Connect the terminal vertex$V'$ to$V$ if they are not already connected; otherwise add a new vertex to$V$ .
By construction, each step of the algorithm adds one edge and at most one node, so the graph is sparse — total number of edges is
Unfortunately, inference in this model is considerably more challenging. Typically, we would observe the final graph, but not the order in which the nodes arrived. Thus, we need to perform inference over the underlying permutation of nodes. Methods like sequential Monte Carlo could be used for this purpose, but it is a nontrivial inference problem {cite:p}bloem2018random.
The models discussed thus far involve relatively simple generative processes that produce rich distributions over random graphs. The marginal distributions
Another alternative is to directly parameterize a marginal distribution over graphs. Exponential random graph models (ERGMs) assume the marginal distribution belongs to an exponential family,
\begin{align*}
p(\mbX) &= \exp \left{ \sum_{k=1}^K \eta_k t_k(\mbX) - A(\mbeta) \right}.
\end{align*}
The model is defined by sufficient statistics
For example, the sufficient statistic
In the machine learning literature, models like these are called energy based models because they prescribe the form of the log probability up to an unknown normalizing constant. The challenge is that for ERGMs, the normalizing constant is typically intractable,
\begin{align*}
A(\mbeta) &= \log \sum_{\mbX'} \exp \left{ \sum_{k=1}^K \eta_k t_k(\mbX') \right}.
\end{align*}
since the sum ranges over
Even generating a sample of an ERGM can be difficult. Typically, we need to resort to approximate methods like MCMC.
:::{admonition} Exercise
Come up with a Metropolis-Hastings algorithm to draw a sample from an ERGM with
Learning the parameters of an ERGM is even more challenging. There have been many proposed algorithms. Here, we discuss a method called persistent contrastive divergence {cite:p}tieleman2008training, which was originally proposed for training another class of energy based models called restricted Boltzman machines.
The idea is to maximize the log likelihood using an approximate gradient, \begin{align*} \frac{\partial}{\partial \eta_k} \cL(\mbeta) &= \frac{\partial}{\partial \eta_k} \left[\sum_{k=1}^K \eta_k t_k(\mbX) - A(\mbeta) \right] \ &= t_k(\mbX) - \frac{\partial}{\partial \eta_k} A(\mbeta) \ &= t_k(\mbX) - \bbE[t_k(\mbX')], \end{align*} where we used the fact that gradients of the log normalizer yield expected sufficient statistics.
Persistent contrastive divergence is stochastic gradient ascent on the marginal likelihood using a simple approximation to the expectation.
-
Initialize an MCMC chain at
$\mbX^{(0)} = \mbX$ , the observed graph. Initialize parameters$\mbeta^{(0)}$ . -
For iterations
$i=1,2\ldots$ a. Update
$\mbX^{(i)}$ by starting at$\mbX^{(i-1)}$ and applying a Markov transition operator with stationary distribution$p(\mbX \mid \mbeta^{(i-1)})$ .b. Use the sample
$\mbX^{(i)}$ to obtain a one-sample Monte Carlo estimate of the expected sufficient statistics,$\hat{t}_k^{(i)} = t_k(\mbX^{(i)})$ .c. Update the parameters
$\eta_k^{(i)} \leftarrow \eta_k^{(i-1)} + \alpha_i (t_k(\mbX) - \hat{t}_k^{(i)})$