Understanding Convolutions on Graphs

Understanding the building blocks and design choices of graph neural networks.

Published

Sept. 2, 2021

DOI

10.23915/distill.00032

This article is one of two Distill publications about graph neural networks. Take a look at A Gentle Introduction to Graph Neural Networks for a companion view on many things graph and neural network related.

Many systems and interactions - social networks, molecules, organizations, citations, physical models, transactions - can be represented quite naturally as graphs. How can we reason about and make predictions within these systems?

One idea is to look at tools that have worked well in other domains: neural networks have shown immense predictive power in a variety of learning tasks. However, neural networks have been traditionally used to operate on fixed-size and/or regular-structured inputs (such as sentences, images and video). This makes them unable to elegantly process graph-structured data.

Neural networks generally operate on fixed-size input vectors. How do we input a graph to a neural network?

Graph neural networks (GNNs) are a family of neural networks that can operate naturally on graph-structured data. By extracting and utilizing features from the underlying graph, GNNs can make more informed predictions about entities in these interactions, as compared to models that consider individual entities in isolation.

GNNs are not the only tools available to model graph-structured data: graph kernels and random-walk methods were some of the most popular ones. Today, however, GNNs have largely replaced these techniques because of their inherent flexibility to model the underlying systems better.

In this article, we will illustrate the challenges of computing over graphs, describe the origin and design of graph neural networks, and explore the most popular GNN variants in recent times. Particularly, we will see that many of these variants are composed of similar building blocks.

First, let’s discuss some of the complications that graphs come with.

The Challenges of Computation on Graphs

Lack of Consistent Structure

Graphs are extremely flexible mathematical models; but this means they lack consistent structure across instances. Consider the task of predicting whether a given chemical molecule is toxic  :

The molecular structure of non-toxic 1,2,6-trigalloyl-glucose. The molecular structure of toxic caramboxin.
Left: A non-toxic 1,2,6-trigalloyl-glucose molecule.
Right: A toxic caramboxin molecule.

Looking at a few examples, the following issues quickly become apparent:

Representing graphs in a format that can be computed over is non-trivial, and the final representation chosen often depends significantly on the actual problem.

Node-Order Equivariance

Extending the point above: graphs often have no inherent ordering present amongst the nodes. Compare this to images, where every pixel is uniquely determined by its absolute position within the image!

Representing the graph as one vector requires us to fix an order on the nodes. But what do we do when the nodes have no inherent order?
Representing the graph as one vector requires us to fix an order on the nodes. But what do we do when the nodes have no inherent order? Above: The same graph labelled in two different ways. The alphabets indicate the ordering of the nodes.

As a result, we would like our algorithms to be node-order equivariant: they should not depend on the ordering of the nodes of the graph. If we permute the nodes in some way, the resulting representations of the nodes as computed by our algorithms should also be permuted in the same way.

Scalability

Graphs can be really large! Think about social networks like Facebook and Twitter, which have over a billion users. Operating on data this large is not easy.

Luckily, most naturally occuring graphs are ‘sparse’: they tend to have their number of edges linear in their number of vertices. We will see that this allows the use of clever methods to efficiently compute representations of nodes within the graph. Further, the methods that we look at here will have significantly fewer parameters in comparison to the size of the graphs they operate on.

Problem Setting and Notation

There are many useful problems that can be formulated over graphs:

Examples of problems that can be defined over graphs.
Examples of problems that can be defined over graphs. This list is not exhaustive!

A common precursor in solving many of these problems is node representation learning: learning to map individual nodes to fixed-size real-valued vectors (called ‘representations’ or ‘embeddings’).

In Learning GNN Parameters, we will see how the learnt embeddings can be used for these tasks.

Different GNN variants are distinguished by the way these representations are computed. Generally, however, GNNs compute node representations in an iterative process. We will use the notation hv(k)h_v^{(k)} to indicate the representation of node vv after the kthk^{\text{th}} iteration. Each iteration can be thought of as the equivalent of a ‘layer’ in standard neural networks.

We will define a graph GG as a set of nodes, VV, with a set of edges EE connecting them. Nodes can have individual features as part of the input: we will denote by xvx_v the individual feature for node vVv \in V. For example, the ‘node features’ for a pixel in a color image would be the red, green and blue channel (RGB) values at that pixel.

For ease of exposition, we will assume GG is undirected, and all nodes are of the same type. These kinds of graphs are called ‘homogeneous’. Many of the same ideas we will see here apply to other kinds of graphs: we will discuss this later in Different Kinds of Graphs.

Sometimes we will need to denote a graph property by a matrix MM, where each row MvM_v represents a property corresponding to a particular vertex vv.

Extending Convolutions to Graphs

Convolutional Neural Networks have been seen to be quite powerful in extracting features from images. However, images themselves can be seen as graphs with a very regular grid-like structure, where the individual pixels are nodes, and the RGB channel values at each pixel as the node features.

A natural idea, then, is to consider generalizing convolutions to arbitrary graphs. Recall, however, the challenges listed out in the previous section: in particular, ordinary convolutions are not node-order invariant, because they depend on the absolute positions of pixels. It is initially unclear as how to generalize convolutions over grids to convolutions over general graphs, where the neighbourhood structure differs from node to node. The curious reader may wonder if performing some sort of padding and ordering could be done to ensure the consistency of neighbourhood structure across nodes. This has been attempted with some success , but the techniques we will look at here are more general and powerful.

Convolutions in CNNs are inherently localized. Neighbours participating in the convolution at the center pixel are highlighted in gray.
GNNs can perform localized convolutions mimicking CNNs. Hover over a node to see its immediate neighbourhood highlighted on the left. The structure of this neighbourhood changes from node to node.

We begin by introducing the idea of constructing polynomial filters over node neighbourhoods, much like how CNNs compute localized filters over neighbouring pixels. Then, we will see how more recent approaches extend on this idea with more powerful mechanisms. Finally, we will discuss alternative methods that can use ‘global’ graph-level information for computing node representations.

Polynomial Filters on Graphs

The Graph Laplacian

Given a graph GG, let us fix an arbitrary ordering of the nn nodes of GG. We denote the 010-1 adjacency matrix of GG by AA, we can construct the diagonal degree matrix DD of GG as:

Dv=uAvu. D_v = \sum_u A_{vu}.
The degree of node vv is the number of edges incident at vv.

where AvuA_{vu} denotes the entry in the row corresponding to vv and the column corresponding to uu in the matrix AA. We will use this notation throughout this section.

Then, the graph Laplacian LL is the square n×nn \times n matrix defined as: L=DA. L = D - A.

The Laplacian LL for an undirected graph GG, with the row corresponding to node C\textsf{C} highlighted. Zeros in LL are not displayed above. The Laplacian LL depends only on the structure of the graph GG, not on any node features.

The graph Laplacian gets its name from being the discrete analog of the Laplacian operator from calculus.

Although it encodes precisely the same information as the adjacency matrix AA In the sense that given either of the matrices AA or LL, you can construct the other. , the graph Laplacian has many interesting properties of its own. The graph Laplacian shows up in many mathematical problems involving graphs: random walks, spectral clustering, and diffusion, to name a few. We will see some of these properties in a later section, but will instead point readers to this tutorial for greater insight into the graph Laplacian.

Polynomials of the Laplacian

Now that we have understood what the graph Laplacian is, we can build polynomials of the form: pw(L)=w0In+w1L+w2L2++wdLd=i=0dwiLi. p_w(L) = w_0 I_n + w_1 L + w_2 L^2 + \ldots + w_d L^d = \sum_{i = 0}^d w_i L^i. Each polynomial of this form can alternately be represented by its vector of coefficients w=[w0,,wd]w = [w_0, \ldots, w_d]. Note that for every ww, pw(L)p_w(L) is an n×nn \times n matrix, just like LL.

These polynomials can be thought of as the equivalent of ‘filters’ in CNNs, and the coefficients ww as the weights of the ‘filters’.

For ease of exposition, we will focus on the case where nodes have one-dimensional features: each of the xvx_v for vVv \in V is just a real number. The same ideas hold when each of the xvx_v are higher-dimensional vectors, as well.

Using the previously chosen ordering of the nodes, we can stack all of the node features xvx_v to get a vector xRnx \in \mathbb{R}^n.

Fixing a node order and collecting all node features into a single vector.
Fixing a node order (indicated by the alphabets) and collecting all node features into a single vector xx.

Once we have constructed the feature vector xx, we can define its convolution with a polynomial filter pwp_w as: x=pw(L) x x’ = p_w(L) \ x To understand how the coefficients ww affect the convolution, let us begin by considering the ‘simplest’ polynomial: when w0=1w_0 = 1 and all of the other coefficients are 00. In this case, xx’ is just xx: x=pw(L) x=i=0dwiLix=w0Inx=x. x’ = p_w(L) \ x = \sum_{i = 0}^d w_i L^ix = w_0 I_n x = x. Now, if we increase the degree, and consider the case where instead w1=1w_1 = 1 and and all of the other coefficients are 00. Then, xx’ is just LxLx, and so: xv=(Lx)v=Lvx=uGLvuxu=uG(DvuAvu)xu=Dv xvuN(v)xu \begin{aligned} x’_v = (Lx)_v &= L_v x \\ &= \sum_{u \in G} L_{vu} x_u \\ &= \sum_{u \in G} (D_{vu} - A_{vu}) x_u \\ &= D_v \ x_v - \sum_{u \in \mathcal{N}(v)} x_u \end{aligned} We see that the features at each node vv are combined with the features of its immediate neighbours uN(v)u \in \mathcal{N}(v). For readers familiar with Laplacian filtering of images, this is the exact same idea. When xx is an image, x=Lxx’ = Lx is exactly the result of applying a ‘Laplacian filter’ to xx.

At this point, a natural question to ask is: How does the degree dd of the polynomial influence the behaviour of the convolution? Indeed, it is not too hard to show that: This is Lemma 5.2 from . distG(v,u)>iLvui=0. \text{dist}_G(v, u) > i \quad \Longrightarrow \quad L_{vu}^i = 0. This implies, when we convolve xx with pw(L)p_w(L) of degree dd to get xx’: xv=(pw(L)x)v=(pw(L))vx=i=0dwiLvix=i=0dwiuGLvuixu=i=0dwiuGdistG(v,u)iLvuixu. \begin{aligned} x’_v = (p_w(L)x)_v &= (p_w(L))_v x \\ &= \sum_{i = 0}^d w_i L_v^i x \\ &= \sum_{i = 0}^d w_i \sum_{u \in G} L_{vu}^i x_u \\ &= \sum_{i = 0}^d w_i \sum_{u \in G \atop \text{dist}_G(v, u) \leq i} L_{vu}^i x_u. \end{aligned}

Effectively, the convolution at node vv occurs only with nodes uu which are not more than dd hops away. Thus, these polynomial filters are localized. The degree of the localization is governed completely by dd.

To help you understand these ‘polynomial-based’ convolutions better, we have created the visualization below. Vary the polynomial coefficients and the input grid xx to see how the result xx’ of the convolution changes. The grid under the arrow shows the equivalent convolutional kernel applied at the highlighted pixel in xx to get the resulting pixel in xx’. The kernel corresponds to the row of pw(L)p_w(L) for the highlighted pixel. Note that even after adjusting for position, this kernel is different for different pixels, depending on their position within the grid.

Hover over a pixel in the input grid (left, representing xx) to highlight it and see the equivalent convolutional kernel for that pixel under the arrow. The result xx’ of the convolution is shown on the right: note that different convolutional kernels are applied at different pixels, depending on their location.

Click on the input grid to toggle pixel values between 00 (white) and 11 (blue). To randomize the input grid, press ‘Randomize Grid’. To reset all pixels to 00, press ‘Reset Grid’. Use the sliders at the bottom to change the coefficients ww. To reset all coefficients ww to 00, press ‘Reset Coefficients.’

ChebNet

ChebNet refines this idea of polynomial filters by looking at polynomial filters of the form: pw(L)=i=1dwiTi(L~) p_w(L) = \sum_{i = 1}^d w_i T_i(\tilde{L}) where TiT_i is the degree-ii Chebyshev polynomial of the first kind and L~\tilde{L} is the normalized Laplacian defined using the largest eigenvalue of LL: We discuss the eigenvalues of the Laplacian LL in more detail in a later section. L~=2Lλmax(L)In. \tilde{L} = \frac{2L}{\lambda_{\max}(L)} - I_n.

What is the motivation behind these choices?

Polynomial Filters are Node-Order Equivariant

The polynomial filters we considered here are actually independent of the ordering of the nodes. This is particularly easy to see when the degree of the polynomial pwp_w is 11: where each node’s feature is aggregated with the sum of its neighbour’s features. Clearly, this sum does not depend on the order of the neighbours. A similar proof follows for higher degree polynomials: the entries in the powers of LL are equivariant to the ordering of the nodes.

Details for the Interested Reader

As above, let’s assume an arbitrary node-order over the nn nodes of our graph. Any other node-order can be thought of as a permutation of this original node-order. We can represent any permutation by a permutation matrix PP. PP will always be an orthogonal 010-1 matrix: PPT=PTP=In. PP^T = P^TP = I_n. Then, we call an algorithm ff node-order equivariant iff for all permutations PP: f(Px)=Pf(x). f(Px) = P f(x). When switching to the new node-order using the permutation PP, the quantities below transform in the following way: xPxLPLPTLiPLiPT \begin{aligned} x &\to Px \\ L &\to PLP^T \\ L^i &\to PL^iP^T \end{aligned} and so, for the case of polynomial filters, we can see that: f(Px)=i=0dwi(PLiPT)(Px)=Pi=0dwiLix=Pf(x). \begin{aligned} f(Px) & = \sum_{i = 0}^d w_i (PL^iP^T) (Px) \\ & = P \sum_{i = 0}^d w_i L^i x’ \\ & = P f(x). \end{aligned} as needed.

Embedding Computation

We now describe how we can build a graph neural network by stacking ChebNet (or any polynomial filter) layers one after the other with non-linearities, much like a standard CNN. In particular, if we have KK different polynomial filter layers, the kthk^{\text{th}} of which has its own learnable weights w(k)w^{(k)}, we would perform the following computation:

Note that these networks reuse the same filter weights across different nodes, exactly mimicking weight-sharing in Convolutional Neural Networks (CNNs) which reuse weights for convolutional filters across a grid.

Modern Graph Neural Networks

ChebNet was a breakthrough in learning localized filters over graphs, and it motivated many to think of graph convolutions from a different perspective.

We return back to the result of convolving xx by the polynomial kernel pw(L)=Lp_w(L) = L, focussing on a particular vertex vv: (Lx)v=Lvx=uGLvuxu=uG(DvuAvu)xu=Dv xvuN(v)xu \begin{aligned} (Lx)_v &= L_v x \\ &= \sum_{u \in G} L_{vu} x_u \\ &= \sum_{u \in G} (D_{vu} - A_{vu}) x_u \\ &= D_v \ x_v - \sum_{u \in \mathcal{N}(v)} x_u \end{aligned}

As we noted before, this is a 11-hop localized convolution. But more importantly, we can think of this convolution as arising of two steps:

Key Idea: What if we consider different kinds of ‘aggregation’ and ‘combination’ steps, beyond what are possible using polynomial filters?

By ensuring that the aggregation is node-order equivariant, the overall convolution becomes node-order equivariant.

These convolutions can be thought of as ‘message-passing’ between adjacent nodes: after each step, every node receives some ‘information’ from its neighbours.

By iteratively repeating the 11-hop localized convolutions KK times (i.e., repeatedly ‘passing messages’), the receptive field of the convolution effectively includes all nodes upto KK hops away.

Embedding Computation

Message-passing forms the backbone of many GNN architectures today. We describe the most popular ones in depth below:

Thoughts

An interesting point is to assess different aggregation functions: are some better and others worse? demonstrates that aggregation functions indeed can be compared on how well they can uniquely preserve node neighbourhood features; we recommend the interested reader take a look at the detailed theoretical analysis there.

Here, we’ve talk about GNNs where the computation only occurs at the nodes. More recent GNN models such as Message-Passing Neural Networks and Graph Networks perform computation over the edges as well; they compute edge embeddings together with node embeddings. This is an even more general framework - but the same ‘message passing’ ideas from this section apply.

Interactive Graph Neural Networks

Below is an interactive visualization of these GNN models on small graphs. For clarity, the node features are just real numbers here, shown inside the squares next to each node, but the same equations hold when the node features are vectors.

Choose a GNN model using the tabs at the top. Click on a node to see the update equation at that node for the next iteration. Use the sliders on the left to change the weights for the current iteration, and watch how the update equation changes.

In practice, each iteration above is generally thought of as a single ‘neural network layer’. This ideology is followed by many popular Graph Neural Network libraries, For example: PyTorch Geometric and StellarGraph. allowing one to compose different types of graph convolutions in the same model.

From Local to Global Convolutions

The methods we’ve seen so far perform ‘local’ convolutions: every node’s feature is updated using a function of its local neighbours’ features.

While performing enough steps of message-passing will eventually ensure that information from all nodes in the graph is passed, one may wonder if there are more direct ways to perform ‘global’ convolutions.

The answer is yes; we will now describe an approach that was actually first put forward in the context of neural networks by , much before any of the GNN models we looked at above.

Spectral Convolutions

As before, we will focus on the case where nodes have one-dimensional features. After choosing an arbitrary node-order, we can stack all of the node features to get a ‘feature vector’ xRnx \in \mathbb{R}^n.

Key Idea: Given a feature vector xx, the Laplacian LL allows us to quantify how smooth xx is, with respect to GG.

How?

After normalizing xx such that i=1nxi2=1\sum_{i = 1}^n x_i^2 = 1, if we look at the following quantity involving LL: RLR_L is formally called the Rayleigh quotient. RL(x)=xTLxxTx=(i,j)E(xixj)2ixi2=(i,j)E(xixj)2. R_L(x) = \frac{x^T L x}{x^T x} = \frac{\sum_{(i, j) \in E} (x_i - x_j)^2}{\sum_i x_i^2} = \sum_{(i, j) \in E} (x_i - x_j)^2. we immediately see that feature vectors xx that assign similar values to adjacent nodes in GG (hence, are smooth) would have smaller values of RL(x)R_L(x).

LL is a real, symmetric matrix, which means it has all real eigenvalues λ1λn\lambda_1 \leq \ldots \leq \lambda_{n}. An eigenvalue λ\lambda of a matrix AA is a value satisfying the equation Au=λuAu = \lambda u for a certain vector uu, called an eigenvector. For a friendly introduction to eigenvectors, please see this tutorial. Further, the corresponding eigenvectors u1,,unu_1, \ldots, u_{n} can be taken to be orthonormal: uk1Tuk2={1 if k1=k2.0 if k1k2. u_{k_1}^T u_{k_2} = \begin{cases} 1 \quad \text{ if } {k_1} = {k_2}. \\ 0 \quad \text{ if } {k_1} \neq {k_2}. \end{cases} It turns out that these eigenvectors of LL are successively less smooth, as RLR_L indicates: This is the min-max theorem for eigenvalues. argminx, x{u1,,ui1}RL(x)=ui.minx, x{u1,,ui1}RL(x)=λi. \underset{x, \ x \perp \{u_1, \ldots, u_{i - 1}\}}{\text{argmin}} R_L(x) = u_i. \qquad \qquad \qquad \min_{x, \ x \perp \{u_1, \ldots, u_{i - 1}\}} R_L(x) = \lambda_i. The set of eigenvalues of LL are called its ‘spectrum’, hence the name! We denote the ‘spectral’ decomposition of LL as: L=UΛUT. L = U \Lambda U^T. where Λ\Lambda is the diagonal matrix of sorted eigenvalues, and UU denotes the matrix of the eigenvectors (sorted corresponding to increasing eigenvalues): Λ=[λ1λn]U=[u1  un]. \Lambda = \begin{bmatrix} \lambda_{1} & & \\ & \ddots & \\ & & \lambda_{n} \end{bmatrix} \qquad \qquad \qquad \qquad U = \begin{bmatrix} \\ u_1 \ \cdots \ u_n \\ \end{bmatrix}. The orthonormality condition between eigenvectors gives us that UTU=IU^T U = I, the identity matrix. As these nn eigenvectors form a basis for Rn\mathbb{R}^n, any feature vector xx can be represented as a linear combination of these eigenvectors: x=i=1nxi^ui=Ux^. x = \sum_{i = 1}^n \hat{x_i} u_i = U \hat{x}. where x^\hat{x} is the vector of coefficients [x0,xn][x_0, \ldots x_n]. We call x^\hat{x} as the spectral representation of the feature vector xx. The orthonormality condition allows us to state: x=Ux^UTx=x^. x = U \hat{x} \quad \Longleftrightarrow \quad U^T x = \hat{x}. This pair of equations allows us to interconvert between the ‘natural’ representation xx and the ‘spectral’ representation x^\hat{x} for any vector xRnx \in \mathbb{R}^n.

Spectral Representations of Natural Images

As discussed before, we can consider any image as a grid graph, where each pixel is a node, connected by edges to adjacent pixels. Thus, a pixel can have either 3,5,3, 5, or 88 neighbours, depending on its location within the image grid. Each pixel gets a value as part of the image. If the image is grayscale, each value will be a single real number indicating how dark the pixel is. If the image is colored, each value will be a 33-dimensional vector, indicating the values for the red, green and blue (RGB) channels. We use the alpha channel as well in the visualization below, so this is actually RGBA.

This construction allows us to compute the graph Laplacian and the eigenvector matrix UU. Given an image, we can then investigate what its spectral representation looks like.

To shed some light on what the spectral representation actually encodes, we perform the following experiment over each channel of the image independently:

Finally, we stack the resulting channels back together to get back an image. We can now see how the resulting image changes with choices of mm. Note that when m=nm = n, the resulting image is identical to the original image, as we can reconstruct each channel exactly.

Use the radio buttons at the top to chose one of the four sample images. Each of these images has been taken from the ImageNet dataset and downsampled to 5050 pixels wide and 4040 pixels tall. As there are n=50×40=2000n = 50 \times 40 = 2000 pixels in each image, there are 20002000 Laplacian eigenvectors. Use the slider at the bottom to change the number of spectral components to keep, noting how images get progressively blurrier as the number of components decrease.

As mm decreases, we see that the output image xmx_m gets blurrier. If we decrease mm to 11, the output image xmx_m is entirely the same color throughout. We see that we do not need to keep all nn components; we can retain a lot of the information in the image with significantly fewer components. We can relate this to the Fourier decomposition of images: the more eigenvectors we use, the higher frequencies we can represent on the grid.

To complement the visualization above, we additionally visualize the first few eigenvectors on a smaller 8×88 \times 8 grid below. We change the coefficients of the first 1010 out of 6464 eigenvectors in the spectral representation and see how the resulting image changes:

Move the sliders to change the spectral representation x^\hat{x} (right), and see how xx itself changes on the image (left). Note how the first eigenvectors are much ‘smoother’ than the later ones, and the many patterns we can make with only 1010 eigenvectors.

These visualizations should convince you that the first eigenvectors are indeed smooth, and the smoothness correspondingly decreases as we consider later eigenvectors.

For any image xx, we can think of the initial entries of the spectral representation x^\hat{x} as capturing ‘global’ image-wide trends, which are the low-frequency components, while the later entries as capturing ‘local’ details, which are the high-frequency components.

Embedding Computation

We now have the background to understand spectral convolutions and how they can be used to compute embeddings/feature representations of nodes.

As before, the model we describe below has KK layers: each layer kk has learnable parameters w^(k)\hat{w}^{(k)}, called the ‘filter weights’. These weights will be convolved with the spectral representations of the node features. As a result, the number of weights needed in each layer is equal to mm, the number of eigenvectors used to compute the spectral representations. We had shown in the previous section that we can take mnm \ll n and still not lose out on significant amounts of information.

Thus, convolution in the spectral domain enables the use of significantly fewer parameters than just direct convolution in the natural domain. Further, by virtue of the smoothness of the Laplacian eigenvectors across the graph, using spectral representations automatically enforces an inductive bias for neighbouring nodes to get similar representations.

Assuming one-dimensional node features for now, the output of each layer is a vector of node representations h(k)h^{(k)}, where each node’s representation corresponds to a row of the vector.

We fix an ordering of the nodes in GG. This gives us the adjacency matrix AA and the graph Laplacian LL, allowing us to compute UmU_m. Finally, we can describe the computation that the layers perform, one after the other:

The method above generalizes easily to the case where each h(k)Rdkh^{(k)} \in \mathbb{R}^{d_k}, as well: see for details.

With the insights from the previous section, we see that convolution in the spectral-domain of graphs can be thought of as the generalization of convolution in the frequency-domain of images.

Spectral Convolutions are Node-Order Equivariant

We can show spectral convolutions are node-order equivariant using a similar approach as for Laplacian polynomial filters.

Details for the Interested Reader

As in our proof before, let’s fix an arbitrary node-order. Then, any other node-order can be represented by a permutation of this original node-order. We can associate this permutation with its permutation matrix PP. Under this new node-order, the quantities below transform in the following way: xPxAPAPTLPLPTUmPUm \begin{aligned} x &\to Px \\ A &\to PAP^T \\ L &\to PLP^T \\ U_m &\to PU_m \end{aligned} which implies that, in the embedding computation: x^(PUm)T(Px)=UmTx=x^w^(PUm)T(Pw)=UmTw=w^g^g^g(PUm)g^=P(Umg^)=Pg \begin{aligned} \hat{x} &\to \left(PU_m\right)^T (Px) = U_m^T x = \hat{x} \\ \hat{w} &\to \left(PU_m\right)^T (Pw) = U_m^T w = \hat{w} \\ \hat{g} &\to \hat{g} \\ g &\to (PU_m)\hat{g} = P(U_m\hat{g}) = Pg \end{aligned} Hence, as σ\sigma is applied elementwise: f(Px)=σ(Pg)=Pσ(g)=Pf(x) f(Px) = \sigma(Pg) = P \sigma(g) = P f(x) as required. Further, we see that the spectral quantities x^,w^\hat{x}, \hat{w} and g^\hat{g} are unchanged by permutations of the nodes. Formally, they are what we would call node-order invariant.

The theory of spectral convolutions is mathematically well-grounded; however, there are some key disadvantages that we must talk about:

While spectral convolutions have largely been superseded by ‘local’ convolutions for the reasons discussed above, there is still much merit to understanding the ideas behind them. Indeed, a recently proposed GNN model called Directional Graph Networks actually uses the Laplacian eigenvectors and their mathematical properties extensively.

Global Propagation via Graph Embeddings

A simpler way to incorporate graph-level information is to compute embeddings of the entire graph by pooling node (and possibly edge) embeddings, and then using the graph embedding to update node embeddings, following an iterative scheme similar to what we have looked at here. This is an approach used by Graph Networks . We will briefly discuss how graph-level embeddings can be constructed in Pooling. However, such approaches tend to ignore the underlying topology of the graph that spectral convolutions can capture.

Learning GNN Parameters

All of the embedding computations we’ve described here, whether spectral or spatial, are completely differentiable. This allows GNNs to be trained in an end-to-end fashion, just like a standard neural network, once a suitable loss function L\mathcal{L} is defined:

The broad success of pre-training for natural language processing models such as ELMo and BERT has sparked interest in similar techniques for GNNs . The key idea in each of these papers is to train GNNs to predict local (eg. node degrees, clustering coefficient, masked node attributes) and/or global graph properties (eg. pairwise distances, masked global attributes).

Another self-supervised technique is to enforce that neighbouring nodes get similar embeddings, mimicking random-walk approaches such as node2vec and DeepWalk :

LG=vuNR(v)logexpzvTzuuexpzuTzu. L_{G} = \sum_{v} \sum_{u \in N_R(v)} \log\frac{\exp{z_v^T z_u}}{\sum\limits_{u’} \exp{z_{u’}^T z_u}}.

where NR(v)N_R(v) is a multi-set of nodes visited when random walks are started from vv. For large graphs, where computing the sum over all nodes may be computationally expensive, techniques such as Noise Contrastive Estimation are especially useful.

Conclusion and Further Reading

While we have looked at many techniques and ideas in this article, the field of Graph Neural Networks is extremely vast. We have been forced to restrict our discussion to a small subset of the entire literature, while still communicating the key ideas and design principles behind GNNs. We recommend the interested reader take a look at for a more comprehensive survey.

We end with pointers and references for additional concepts readers might be interested in:

GNNs in Practice

It turns out that accomodating the different structures of graphs is often hard to do efficiently, but we can still represent many GNN update equations using as sparse matrix-vector products (since generally, the adjacency matrix is sparse for most real-world graph datasets.) For example, the GCN variant discussed here can be represented as: h(k)=D1Ah(k1)W(k)T+h(k1)B(k)T. h^{(k)} = D^{-1} A \cdot h^{(k - 1)} {W^{(k)}}^T + h^{(k - 1)} {B^{(k)}}^T. Restructuring the update equations in this way allows for efficient vectorized implementations of GNNs on accelerators such as GPUs.

Regularization techniques for standard neural networks, such as Dropout , can be applied in a straightforward manner to the parameters (for example, zero out entire rows of W(k)W^{(k)} above). However, there are graph-specific techniques such as DropEdge that removes entire edges at random from the graph, that also boost the performance of many GNN models.

Different Kinds of Graphs

Here, we have focused on undirected graphs, to avoid going into too many unnecessary details. However, there are some simple variants of spatial convolutions for:

There do exist more sophisticated techniques that can take advantage of the different structures of these graphs: see for more discussion.

Pooling

This article discusses how GNNs compute useful representations of nodes. But what if we wanted to compute representations of graphs for graph-level tasks (for example, predicting the toxicity of a molecule)?

A simple solution is to just aggregate the final node embeddings and pass them through another neural network PREDICTG\text{PREDICT}_G: hG=PREDICTG(AGGvG({hv})) h_G = \text{PREDICT}_G \Big( \text{AGG}_{v \in G}\left(\{ h_v \} \right) \Big) However, there do exist more powerful techniques for ‘pooling’ together node representations:

Supplementary Material

Reproducing Experiments

The experiments from Spectral Representations of Natural Images can be reproduced using the following Colab Google Colaboratory notebook: Spectral Representations of Natural Images.

Recreating Visualizations

To aid in the creation of future interactive articles, we have created ObservableHQ ObservableHQ notebooks for each of the interactive visualizations here:

Acknowledgments

We are deeply grateful to ObservableHQ, a wonderful platform for developing interactive visualizations. The static visualizations would not have been possible without Inkscape and Alexander Lenail’s Neural Network SVG Generator. The molecule diagrams depicted above were obtained and modified from Wikimedia Commons, available in the public domain.

We would like to acknowledge the following Distill articles for inspiration on article design:

We would like to thank Thomas Kipf for his valuable feedback on the technical content within this article.

We would like to thank David Nichols for creating Coloring for Colorblindness which helped us improve the accessibility of this article’s color scheme.

We would also like to acknowledge CS224W: Machine Learning with Graphs as an excellent reference from which the authors benefitted significantly.

Ashish Tendulkar from Google Research India provided significant feedback on the content within this article, helping its readability. He also helped with identifying the topics this article should cover, and with brainstorming the experiments here.

Adam Pearce from Google Research helped us immensely with article hosting and rendering.

Finally, we would like to thank Anirban Santara, Sujoy Paul and Ansh Khurana from Google Research India for their help with setting up and running experiments.

Author Contributions

Ameya Daigavane drafted most of the text, designed experiments and created the interactive visualizations in this article. Balaraman Ravindran and Gaurav Aggarwal extensively guided the overall direction of the article, deliberated over the design and scope of experiments, provided much feedback on the interactive visualizations, edited the text, and described improvements to make the article more accessible to readers.

Discussion and Review

Review 1 - Chaitanya K. Joshi
Review 2 - Nick Moran
Review 3 - Anonymous

References

  1. A Gentle Introduction to Graph Neural Networks
    Sanchez-Lengeling, B., Reif, E., Pearce, A. and Wiltschko, A., 2021. Distill. DOI: 10.23915/distill.00033
  2. Graph Kernels[HTML]
    Vishwanathan, S., Schraudolph, N.N., Kondor, R. and Borgwardt, K.M., 2010. Journal of Machine Learning Research, Vol 11(40), pp. 1201-1242.
  3. Node2vec: Scalable Feature Learning for Networks[link]
    Grover, A. and Leskovec, J., 2016. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864. Association for Computing Machinery. DOI: 10.1145/2939672.2939754
  4. DeepWalk: Online Learning of Social Representations[link]
    Perozzi, B., Al-Rfou, R. and Skiena, S., 2014. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710. Association for Computing Machinery. DOI: 10.1145/2623330.2623732
  5. Convolutional Networks on Graphs for Learning Molecular Fingerprints[PDF]
    Duvenaud, D.K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A. and Adams, R.P., 2015. Advances in Neural Information Processing Systems, Vol 28, pp. 2224-2232. Curran Associates, Inc.
  6. Neural Message Passing for Quantum Chemistry[HTML]
    Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O. and Dahl, G.E., 2017. Proceedings of the 34th International Conference on Machine Learning, Vol 70, pp. 1263-1272. PMLR.
  7. Learning Convolutional Neural Networks for Graphs
    Niepert, M., Ahmed, M. and Kutzkov, K., 2016. Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, pp. 2014–2023. JMLR.org.
  8. A Tutorial on Spectral Clustering[PDF]
    Luxburg, U.v., 2007. CoRR, Vol abs/0711.0189.
  9. Wavelets on Graphs via Spectral Graph Theory[link]
    Hammond, D.K., Vandergheynst, P. and Gribonval, R., 2011. Applied and Computational Harmonic Analysis, Vol 30(2), pp. 129 - 150. DOI: https://doi.org/10.1016/j.acha.2010.04.005
  10. Chebyshev Polynomials[link]
    Mason, J. and Handscomb, D., 2002. CRC Press.
  11. Semi-Supervised Classification with Graph Convolutional Networks[link]
    Kipf, T.N. and Welling, M., 2017. 5th International Conference on Learning Representations (ICLR) 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.
  12. Graph Attention Networks[link]
    Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P. and Bengio, Y., 2018. International Conference on Learning Representations.
  13. Inductive Representation Learning on Large Graphs[PDF]
    Hamilton, W., Ying, Z. and Leskovec, J., 2017. Advances in Neural Information Processing Systems, Vol 30, pp. 1024-1034. Curran Associates, Inc.
  14. How Powerful are Graph Neural Networks?[link]
    Xu, K., Hu, W., Leskovec, J. and Jegelka, S., 2019. International Conference on Learning Representations.
  15. Relational inductive biases, deep learning, and graph networks[PDF]
    Battaglia, P.W., Hamrick, J.B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V.F., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., Gulcehre, C., Song, H.F., Ballard, A.J., Gilmer, J., Dahl, G.E., Vaswani, A., Allen, K.R., Nash, C., Langston, V., Dyer, C., Heess, N., Wierstra, D., Kohli, P., Botvinick, M., Vinyals, O., Li, Y. and Pascanu, R., 2018. CoRR, Vol abs/1806.01261.
  16. Spectral Networks and Locally Connected Networks on Graphs[PDF]
    Bruna, J., Zaremba, W., Szlam, A. and LeCun, Y., 2014. International Conference on Learning Representations (ICLR 2014), CBLS, April 2014.
  17. ImageNet: A Large-Scale Hierarchical Image Database
    Deng, J., Dong, W., Socher, R., Li, L., Li, K. and Fei-Fei, L., 2009. CVPR09.
  18. On the Transferability of Spectral Graph Filters
    Levie, R., Isufi, E. and Kutyniok, G., 2019. 2019 13th International conference on Sampling Theory and Applications (SampTA), Vol (), pp. 1-5. DOI: 10.1109/SampTA45681.2019.9030932
  19. Directional Graph Networks
    Beaini, D., Passaro, S., Létourneau, V., Hamilton, W.L., Corso, G. and Liò, P., 2021.
  20. Deep contextualized word representations
    Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K. and Zettlemoyer, L., 2018. Proc. of NAACL.
  21. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[link]
    Devlin, J., Chang, M., Lee, K. and Toutanova, K., 2019. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171-4186. Association for Computational Linguistics. DOI: 10.18653/v1/N19-1423
  22. Strategies for Pre-training Graph Neural Networks[link]
    Hu*, W., Liu*, B., Gomes, J., Zitnik, M., Liang, P., Pande, V. and Leskovec, J., 2020. International Conference on Learning Representations.
  23. Multi-Stage Self-Supervised Learning for Graph Convolutional Networks on Graphs with Few Labeled Nodes[link]
    Sun, K., Lin, Z. and Zhu, Z., 2020. The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 5892-5899. AAAI Press.
  24. When Does Self-Supervision Help Graph Convolutional Networks?[PDF]
    You, Y., Chen, T., Wang, Z. and Shen, Y., 2020.
  25. Self-supervised Learning on Graphs: Deep Insights and New Direction[PDF]
    Jin, W., Derr, T., Liu, H., Wang, Y., Wang, S., Liu, Z. and Tang, J., 2020.
  26. Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics[HTML]
    Gutmann, M.U. and Hyvärinen, A., 2012. Journal of Machine Learning Research, Vol 13(11), pp. 307-361.
  27. Learning word embeddings efficiently with noise-contrastive estimation[PDF]
    Mnih, A. and Kavukcuoglu, K., 2013. Advances in Neural Information Processing Systems, Vol 26, pp. 2265-2273. Curran Associates, Inc.
  28. A Comprehensive Survey on Graph Neural Networks[link]
    Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C. and Yu, P.S., 2020. IEEE Transactions on Neural Networks and Learning Systems, pp. 1-21. DOI: 10.1109/TNNLS.2020.2978386
  29. Graph Neural Networks: A Review of Methods and Applications[PDF]
    Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z. and Sun, M., 2018. CoRR, Vol abs/1812.08434.
  30. Dropout: A Simple Way to Prevent Neural Networks from Overfitting[HTML]
    Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Journal of Machine Learning Research, Vol 15(56), pp. 1929-1958.
  31. DropEdge: Towards Deep Graph Convolutional Networks on Node Classification[link]
    Rong, Y., Huang, W., Xu, T. and Huang, J., 2020. International Conference on Learning Representations.
  32. An End-to-End Deep Learning Architecture for Graph Classification[link]
    Zhang, M., Cui, Z., Neumann, M. and Chen, Y., 2018. Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pp. 4438-4445. AAAI Press.
  33. Hierarchical Graph Representation Learning with Differentiable Pooling[PDF]
    Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W. and Leskovec, J., 2018. Advances in Neural Information Processing Systems, Vol 31, pp. 4800-4810. Curran Associates, Inc.
  34. Self-Attention Graph Pooling[HTML]
    Lee, J., Lee, I. and Kang, J., 2019. Proceedings of the 36th International Conference on Machine Learning, Vol 97, pp. 3734-3743. PMLR.

Updates and Corrections

If you see mistakes or want to suggest changes, please create an issue on GitHub.

Reuse

Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.

Citation

For attribution in academic contexts, please cite this work as

Daigavane, et al., "Understanding Convolutions on Graphs", Distill, 2021.

BibTeX citation

@article{daigavane2021understanding,
  author = {Daigavane, Ameya and Ravindran, Balaraman and Aggarwal, Gaurav},
  title = {Understanding Convolutions on Graphs},
  journal = {Distill},
  year = {2021},
  note = {https://distill.pub/2021/understanding-gnns},
  doi = {10.23915/distill.00032}
}