On graphs and sets we study:
- Permutation invariance
- Permutation equivariance
Guiding principles to synthesise and steer the design of deep learning architectures:
- Symmetries
- $$\boldsymbol{\mathfrak{G}}$$-Invariance
- $$\boldsymbol{\mathfrak{G}}$$-Equivariance
- Locality
- Scale Separation
The following statements summarize the general geometric deep learning framework.
Let $$\Omega$$ and $$\Omega'$$ be domains, and let $$\boldsymbol{\mathfrak{G}}$$ be a symmetry group over $$\Omega$$; we write $$\Omega' \subseteq \Omega$$ when $$\Omega'$$ can be considered a coarser version of $$\Omega$$.
We define the following building blocks:
Linear $$\boldsymbol{\mathfrak{G}}$$-equivariant layer $$B: \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{X}(\Omega', \mathcal{C}')$$, satisfying $$B(\boldsymbol{\mathfrak{g}}.x) = \boldsymbol{\mathfrak{g}}.B(x)$$ for all $$\boldsymbol{\mathfrak{g}} \in \boldsymbol{\mathfrak{G}}$$.
When we compose linear layers alone we can only represent linear maps, so we interleave them with a nonlinearity $$\sigma: \mathcal{C} \to \mathcal{C}'$$ applied element-wise, which preserves equivariance.
Local pooling (coarsening)
$$
P: \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{X}(\Omega', \mathcal{C})
$$
such that $$\Omega' \subseteq \Omega$$.
Coarsening: Sometimes it might be useful to coarsen the domain. This means going from an initial domain $$\Omega$$ to a coarser domain $$\Omega' \subseteq \Omega$$.
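As a sketch of how these pieces fit together, and assuming the standard composition used in the geometric deep learning blueprint (the $$\boldsymbol{\mathfrak{G}}$$-invariant global pooling layer $$A$$ below is part of that blueprint and only used for domain-level outputs), the blocks are typically chained as

$$
f = A \circ \sigma_J \circ B_J \circ P_{J-1} \circ \cdots \circ \sigma_1 \circ B_1,
$$

where the $$B_j$$ are linear $$\boldsymbol{\mathfrak{G}}$$-equivariant layers, the $$\sigma_j$$ are point-wise nonlinearities, the $$P_j$$ are local pooling (coarsening) operators, and $$A$$ is a $$\boldsymbol{\mathfrak{G}}$$-invariant global pooling layer.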
We are going to start by demonstrating the blueprint on graphs, because graphs and sets give us a very nice discrete domain with minimal geometric assumptions, which makes it easy to analyse.
Also with some assumptions and transformations, any domain can be seen as a graph.
Question: Why might we want to study data that lives on graphs?
Answer: Graphs are ubiquitous structures around us; they can be thought of as one of the main modalities of data we encounter in nature.
e.g.:
- Molecular graphs
- Transportation networks
- Social networks
“The image of the world around us, which we carry in our head, is just a model. Nobody in his head imagines all the world, government or country. He has only selected concepts, and relationships between them, and uses those to represent the real system.” Jay Wright Forrester 1971
- Graphs without connections, called unordered sets, have only nodes and no edges between them. This structure constitutes a simpler domain, so it is much easier to analyse the geometric architectures that arise on it.
- A significant part of the conclusions we achieve from sets will naturally carry over to graphs.
- Even if we restrict ourselves entirely to processing data on sets with this kind of structure, it would still be a relevant setting, e.g. point cloud data.
Assume our graph has no edges, so the domain is just the set of nodes: $$\textcolor{Red}{\Omega} = \mathcal{V}, \quad E = \emptyset$$.
We typically want to featurize this domain in some way. Thus we will assume that every node $$i$$ comes equipped with a feature vector $$\boldsymbol{x}_i$$.
Our feature space then is
$$
\textcolor{Red}{\mathcal{C}} = \R^{k}
$$
Once we have these features, a typical way to process data like this is to stack the node features into a node feature matrix $$\boldsymbol{X}$$, one row per node.
NOTE: The moment we build this feature matrix, we have already introduced some notion of order, breaking our initial assumption that the set is completely unordered. We would like a neural network that operates on $$\boldsymbol{X}$$ but whose output does not depend on the order in which the rows were stacked.
e.g:
Given an unordered set of four nodes, with their features $$\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_3, \boldsymbol{x}_4$$ stacked row-wise into $$\boldsymbol{X}$$.
It's useful to think about operators that change the node order.
e.g:
This example permutation maps the following actions over the set:
$$y_1 \leftarrow x_2, \quad y_2 \leftarrow x_4, \quad y_3 \leftarrow x_1, \quad y_4 \leftarrow x_3$$
Concretised in the form of a single expression: $$\boldsymbol{\mathfrak{g}} = (1 \mapsto 3,\ 2 \mapsto 1,\ 3 \mapsto 4,\ 4 \mapsto 2)$$
Formally, a permutation on a set $$S$$ can be defined as:
- a bijection (one-to-one correspondence) from $$S$$ to itself,
- a pair of linear orderings on $$S$$,
A linear order on a set $$S$$ is a binary relation $$<$$ satisfying the following properties:
- irreflexivity: $$x \nless x$$,
- asymmetry: if $$x < y$$, then $$y \nless x$$,
- transitivity: if $$x < y < z$$, then $$x < z$$,
- comparison: if $$x < z$$, then $$x < y$$ or $$y < z$$,
- connectedness: if $$x \nless y$$ and $$y \nless x$$, then $$x = y$$.
- an element of the symmetric group $$\boldsymbol{\mathfrak{G}}$$ of $$S$$.
As automorphisms of $$S$$, permutations can be composed and inverted, and together they form the symmetric group.
Within linear algebra, each permutation defines an $$n \times n$$ matrix. Such matrices, called permutation matrices (the group action written in matrix form), have exactly one entry equal to $$1$$ in every row and every column, and zeros elsewhere.
e.g: For our initial example
$$
\boldsymbol{X} = \begin{bmatrix} \boldsymbol{x}_1 \\ \boldsymbol{x}_2 \\ \boldsymbol{x}_3 \\ \boldsymbol{x}_4 \end{bmatrix}
$$
the permutation matrix looks as follows:
$$
\boldsymbol{P}_{(2,4,1,3)} =
\begin{bmatrix}
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0
\end{bmatrix}
$$
Its effect when left-multiplied is to permute the rows of $$\boldsymbol{X}$$: $$\boldsymbol{P}_{(2,4,1,3)}\boldsymbol{X}$$ stacks the rows in the order $$\boldsymbol{x}_2, \boldsymbol{x}_4, \boldsymbol{x}_1, \boldsymbol{x}_3$$.
Conclusion: This example illustrates that the permutation matrix is exactly the matrix representation of the group action of the permutation group. Thus we can use this representation to reason about permutations entirely within linear algebra, which makes all of the subsequent analysis a lot easier.
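A minimal NumPy sketch of this left-multiplication (the feature values below are hypothetical):

```python
import numpy as np

# Hypothetical feature matrix X with rows x_1, x_2, x_3, x_4 (4 nodes, 3 features each).
X = np.arange(12, dtype=float).reshape(4, 3)

# Permutation matrix for (2, 4, 1, 3): row i of P selects the node that lands in slot i.
P = np.array([[0, 1, 0, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)

PX = P @ X
# Rows of PX are now x_2, x_4, x_1, x_3.
assert np.allclose(PX, X[[1, 3, 0, 2]])
```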
We want functions whose output does not depend on the ordering of the rows, i.e. permutation invariant functions: $$f(\boldsymbol{P}\boldsymbol{X}) = f(\boldsymbol{X})$$ for every permutation matrix $$\boldsymbol{P}$$.
We can compare this with our requirement in the geometric deep learning framework for $$\boldsymbol{\mathfrak{G}}$$-invariance.
A general procedure to perform representation learning on sets in a permutation-invariant way is the following: apply a point-wise neural network $$\textcolor{Green}{\psi}$$ to every node's features independently, sum the results, and pass the sum through a second network $$\textcolor{Orange}{\phi}$$, i.e. $$f(\boldsymbol{X}) = \textcolor{Orange}{\phi}\big(\sum_{i \in \mathcal{V}} \textcolor{Green}{\psi}(\boldsymbol{x}_i)\big)$$.
The sum aggregation is what achieves permutation invariance: no matter how we permute the rows of a matrix, their sum is always the same. Other aggregators, such as the maximum or the average, also preserve permutation invariance. Thus we can generalize the previous expression to
$$
f(\boldsymbol{X}) = \textcolor{Orange}{\phi \Bigg (}
\bigoplus_{i \in \mathcal{V}}
\textcolor{Green}{\psi(} x_i \textcolor{Green}{)}
\textcolor{Orange}{\Bigg )}
$$
From this moment on, whenever we see the generic aggregator operator, denoted by $$\bigoplus$$, it stands for any permutation-invariant aggregation, such as sum, average, or maximum.
Permutation invariant models are suited to set-level outputs: this is the property we want when the prediction concerns the entire set.
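A minimal NumPy sketch of such a permutation-invariant set model; the networks $$\psi$$ and $$\phi$$ are stood in for by hypothetical random linear maps with a tanh nonlinearity:

```python
import numpy as np

rng = np.random.default_rng(0)
W_psi = rng.normal(size=(3, 8))    # hypothetical weights for the point-wise network psi
W_phi = rng.normal(size=(8, 2))    # hypothetical weights for the readout phi

def f(X):
    """Permutation-invariant set function: f(X) = phi(sum_i psi(x_i))."""
    psi_X = np.tanh(X @ W_psi)      # apply psi to every node independently
    pooled = psi_X.sum(axis=0)      # sum aggregation: independent of row order
    return np.tanh(pooled @ W_phi)  # phi on the pooled representation

X = rng.normal(size=(4, 3))
P = np.eye(4)[[1, 3, 0, 2]]         # permutation matrix for (2, 4, 1, 3)
assert np.allclose(f(P @ X), f(X))  # permuting the rows leaves the output unchanged
```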
Question: What if we would like answers at the node level?
Answer: To preserve and identify node-level outputs we cannot make use of the permutation-invariant summation: the summation would destroy node-level information, or at least make it very hard to recover.
Now, rather than restricting the output to be unaffected by permutations, we let the permutation change the result, but in a predictable way: permuting the input should permute the output identically. In other words, it does not matter whether we apply the permutation before or after the function: $$ \boldsymbol{F}(\boldsymbol{PX}) = \boldsymbol{PF}(\boldsymbol{X}) $$ NOTE: To stay consistent with the linear-algebra notation, the function is capitalized and in bold because it returns a matrix as output.
Compare with GDL framework: this matches the $$\boldsymbol{\mathfrak{G}}$$-equivariance condition of the linear $$\boldsymbol{\mathfrak{G}}$$-equivariant layer, $$B(\boldsymbol{\mathfrak{g}}.x) = \boldsymbol{\mathfrak{g}}.B(x)$$.
Strict resistance to exact symmetries is not the only property we want. It is also desirable that our neural network's predictions do not collapse under transformations that are only approximately symmetries.
In practice, data is rarely transformed by an exact symmetry element; noise is acquired and added during the transformation, producing distortions. We would therefore like the signal to be stable under slight deformations of the domain.
If we want this kind of resistance to distortions, it is crucial to compose local operations to model large-scale ones, as local operations do not propagate errors globally. This implies that we want graph neural network (GNN) layers that behave locally with respect to the structure of the domain.
Question: What does it mean for an equivariant layer on sets to be local?
Answer: One way we can enforce locality in equivariant set functions is through a shared function $$\textcolor{Green}{\psi}$$ applied to every node's features in isolation: $$\boldsymbol{h}_i = \psi(\boldsymbol{x}_i)$$.
Stacking the $$\boldsymbol{h}_i$$ in the same way we stacked the original $$\boldsymbol{x}_i$$ yields a new matrix, denoted by $$\boldsymbol{H} = \boldsymbol{F}(\boldsymbol{X})$$.
Compare with GDL framework: stacking the shared $$\psi$$ across all nodes (stacking) is what yields the equivariant layer $$\boldsymbol{F}$$.
Conclusion: This is probably the best and most general form of a neural network that operates on sets, without assuming or inferring additional structure or relations (edges).
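A minimal sketch of this equivariant layer, again with a hypothetical shared $$\psi$$, together with a numerical check of $$\boldsymbol{F}(\boldsymbol{PX}) = \boldsymbol{P}\boldsymbol{F}(\boldsymbol{X})$$:

```python
import numpy as np

rng = np.random.default_rng(0)
W_psi = rng.normal(size=(3, 8))         # hypothetical weights of the shared function psi

def F(X):
    """Permutation-equivariant set layer: the same psi is applied to every row of X."""
    return np.tanh(X @ W_psi)           # H, with h_i = psi(x_i) stacked row-wise

X = rng.normal(size=(4, 3))
P = np.eye(4)[[1, 3, 0, 2]]
assert np.allclose(F(P @ X), P @ F(X))  # equivariance: permuting inputs permutes outputs
```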
Now we can generalize beyond an unordered set of nodes by assuming there are edges between them, forming relations. Thus we consider graphs $$\mathcal{G} = (\mathcal{V}, E)$$, with nodes $$\mathcal{V}$$ and edges $$E \subseteq \mathcal{V} \times \mathcal{V}$$.
Further information about graphs definition and properties can be found at chapter:
{% content-ref url="Graphs I.md" %} [Graphs I.md](Graphs I.md) {% endcontent-ref %}
Our representation of edges will stay within the realm of linear algebra: we represent them with an adjacency matrix, a binary matrix denoted by $$\boldsymbol{A}$$, with $$a_{ij} = 1$$ if there is an edge between nodes $$i$$ and $$j$$ and $$a_{ij} = 0$$ otherwise.
Nevertheless, we are going to limit ourselves to working without any additional information attached to edges, although all the math derived for edges as a binary matrix carries over to more complex structures. So we will conceive every edge as just a binary indicator of whether or not two nodes are connected.
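A small sketch (with a hypothetical edge list) of how such a binary adjacency matrix can be built:

```python
import numpy as np

# Hypothetical undirected graph on 4 nodes, edges given as (i, j) pairs (0-indexed).
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]

n = 4
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0   # a binary indicator of whether two nodes are connected

print(A)
```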
Our main desire from the set setting carries over to graphs: we also want our graph neural networks to be resistant to permutations of the nodes and edges that constitute the graph.
### Permutation invariance & equivariance
We assume there is some connectivity structure between our nodes. Therefore, when we permute the nodes, we expect the edges to be permuted accordingly, so that after applying the permutation we obtain two isomorphic graphs.
When the nodes are permuted, we need to permute both the rows and the columns of the adjacency matrix accordingly, i.e. map $$\boldsymbol{A}$$ to $$\boldsymbol{P}\boldsymbol{A}\boldsymbol{P}^\top$$.
- Invariance: $$f(\boldsymbol{P}\boldsymbol{X}, \boldsymbol{P}\boldsymbol{A}\boldsymbol{P}^\top) = f(\boldsymbol{X}, \boldsymbol{A})$$
- Equivariance: $$\boldsymbol{F}(\boldsymbol{P}\boldsymbol{X}, \boldsymbol{P}\boldsymbol{A}\boldsymbol{P}^\top) = \boldsymbol{P}\boldsymbol{F}(\boldsymbol{X}, \boldsymbol{A})$$
Conclusion: These two properties carry over very naturally from sets.
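As a sanity check of these two conditions, here is a sketch using a deliberately simple graph function (a one-hop neighbourhood sum, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
P = np.eye(4)[[1, 3, 0, 2]]

def F(X, A):                    # simple equivariant graph function: sum over neighbours
    return A @ X

def f(X, A):                    # simple invariant graph function: global sum of the above
    return F(X, A).sum(axis=0)

X_p, A_p = P @ X, P @ A @ P.T   # permute nodes: rows of X, rows *and* columns of A
assert np.allclose(F(X_p, A_p), P @ F(X, A))  # equivariance
assert np.allclose(f(X_p, A_p), f(X, A))      # invariance
```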
Locality doesn't carry over from sets as naturally as invariance and equivariance do. We now assume there may be relations between nodes, which means we must not transform every node purely in isolation. Graphs give us broader context through the so-called node neighbourhood.
Node's neighbourhood: For a node $$i$$, its (one-hop) neighbourhood is $$\mathcal{N}_i = \{ j : (i, j) \in E \text{ or } (j, i) \in E \}$$, and the multiset of its neighbours' features is denoted $$\boldsymbol{X}_{\mathcal{N}_i}$$.
To ensure equivariance, it is sufficient that $$\phi$$ does not depend on the order of the nodes inside the neighbourhood, i.e. that $$\phi$$ is permutation invariant over $$\boldsymbol{X}_{\mathcal{N}_i}$$.
$$
\boldsymbol{X}_{\mathcal{N}_i} = \{\!\{ \boldsymbol{x}_a, \boldsymbol{x}_b, \boldsymbol{x}_c, \boldsymbol{x}_d, \boldsymbol{x}_e \}\!\}
$$
We have built an equivariant function on graphs by using another local, permutation-invariant, learnable function, denoted by $$\phi$$, acting at the node level. The graph neural network computes a new feature representation for a node given its neighbourhood of features, denoted by $$\boldsymbol{X}_{\mathcal{N}_i}$$, which for the sake of extra context can contain the node itself. Applying the local function $$\phi$$ corresponds to the expression:
$$
\boldsymbol{h}_i = \phi(\boldsymbol{x}_i, \boldsymbol{X}_{\mathcal{N}_i})
$$
We then apply this local function to every single node of the graph in parallel and in isolation.
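A minimal NumPy sketch of this local scheme; the particular $$\phi$$ used here (concatenating a node's features with the sum of its neighbours' features) is a hypothetical choice for illustration:

```python
import numpy as np

def gnn_layer(X, A, phi):
    """Apply a shared, local, permutation-invariant phi to every node's neighbourhood."""
    H = []
    for i in range(X.shape[0]):
        neighbours = np.nonzero(A[i])[0]   # indices j with an edge between i and j
        X_neigh = X[neighbours]            # the multiset X_{N_i}, stacked as rows
        H.append(phi(X[i], X_neigh))       # h_i = phi(x_i, X_{N_i})
    return np.stack(H)                     # stack the h_i row-wise into H

# Hypothetical phi: concatenate x_i with an order-independent sum of its neighbours.
phi = lambda x_i, X_neigh: np.concatenate([x_i, X_neigh.sum(axis=0)])
```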
Given as inputs to a graph neural network a graph with node features, denoted by $$(\boldsymbol{X}, \boldsymbol{A})$$, the network computes latent features $$\boldsymbol{h}_i$$ for every node.
From this general procedure several possible tasks emerge (a minimal sketch of the three corresponding prediction heads follows the list below):
- $$\textcolor{Red}{Node}$$ classification: we can learn a classifier acting at the node level to predict a per-node output. $$\boldsymbol{z}_i = f(\boldsymbol{h}_i)$$
- $$\textcolor{Green}{Graph}$$ classification: classification at the level of the entire domain forces us to reduce the node features in a permutation-invariant way. Generally this is achieved using a generic aggregator operator, $$\bigoplus$$, such as a sum or an average. $$\boldsymbol{z}_G = f\big(\bigoplus_{i \in \mathcal{V}} \boldsymbol{h}_i\big)$$
- $$\textcolor{Blue}{Link}$$ prediction (edge classification): because edges are now an important part of our domain, we can also make predictions over edges, e.g. edge classification or edge regression, or estimate the likelihood of an edge existing between nodes $$i$$ and $$j$$. We need a function that takes into account both nodes, as well as any information attached to their edge, denoted by $$\boldsymbol{e}_{ij}$$. $$\boldsymbol{z}_{ij} = f(\boldsymbol{h}_i, \boldsymbol{h}_j, \boldsymbol{e}_{ij})$$
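A minimal sketch of the three prediction heads, with hypothetical latent features, edge features, and linear readouts:

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))        # hypothetical latent node features h_i from a GNN
E = rng.normal(size=(4, 4, 2))     # hypothetical edge features e_ij

W_node  = rng.normal(size=(8, 3))          # node-level classifier (3 classes)
W_graph = rng.normal(size=(8, 3))          # graph-level classifier
W_link  = rng.normal(size=(8 + 8 + 2, 1))  # link predictor

z_node  = H @ W_node                       # z_i = f(h_i), one prediction per node
z_graph = H.sum(axis=0) @ W_graph          # z_G = f(sum_i h_i), one prediction per graph
i, j = 0, 2
z_link  = np.concatenate([H[i], H[j], E[i, j]]) @ W_link  # z_ij = f(h_i, h_j, e_ij)
```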
The local permutation-invariant function $$\phi(\boldsymbol{x}_i, \boldsymbol{X}_{\mathcal{N}_i})$$, shared and applied to every node, guarantees that we can in fact construct permutation-equivariant functions on graphs. This local function is often referred to as:
- Message passing
- Diffusion
- Propagation
Whereas how we concretely choose and parametrize $$\phi$$ determines the flavour of the GNN layer:
- Convolutional
- Attentional
- Message-passing
As we go from convolutional to attentional to message-passing layers, the space of representable functions grows: convolutional layers are a special case of attentional layers, and attentional layers a special case of message-passing layers. The further right we go, the more complex the problems we can fit, but at the expense of scalability and an increased risk of overfitting the data. There is room for trade-offs depending on our context.
We have not yet defined how to aggregate all the neighbours in a neighbourhood. The neighbourhood size of a given node can vary hugely compared to the other nodes of the graph, so we cannot rely on size-dependent kernels (filters), i.e. filters whose behaviour and configuration strictly depend on the size of their input.
Question: What do convolutional GNNs accomplish?
Answer: They quantify how meaningful an individual neighbour is to a specific node. To express how a particular neighbour influences a node we use fixed weights, denoted by $$\textcolor{Orange}{c_{ij}}$$. Consequently, we apply a point-wise transformation $$\psi$$ to all of our node features, and then every node computes a weighted combination of its neighbours based on the $$\textcolor{Orange}{c_{ij}}$$ coefficients:
$$
\boldsymbol{h}_i = \phi \Bigg( \boldsymbol{x}_i, \bigoplus_{j \in \mathcal{N}_i} \textcolor{Orange}{c_{ij}}\, \psi(\boldsymbol{x}_j) \Bigg)
$$
e.g.
We sum all our neighbors, where each one is multiplied by the corresponding weight.
The essential part of this whole procedure is that these weights are predefined before the graph neural network is applied: the adjacency matrix dictates them according to the graph topology.
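A minimal sketch of a convolutional-flavour layer. The choice $$c_{ij} = 1/\deg(i)$$ (a mean over the neighbourhood) is one hypothetical, topology-derived normalization used here for illustration, not the exact scheme of any particular paper (GCN, for instance, uses a symmetric normalization):

```python
import numpy as np

def conv_gnn_layer(X, A, W_psi, W_phi):
    """Convolutional flavour: coefficients c_ij are fixed by the graph topology."""
    deg = A.sum(axis=1, keepdims=True)         # node degrees
    C = A / np.maximum(deg, 1.0)               # c_ij = 1/deg(i) for neighbours, 0 otherwise
    messages = C @ np.tanh(X @ W_psi)          # weighted sum of psi(x_j) over each N_i
    H = np.concatenate([X, messages], axis=1)  # phi sees both x_i and the aggregate
    return np.tanh(H @ W_phi)
```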
Widespread instances of these kinds of layers include, but are not limited to:
- ChebyNet (Defferrard et al., NeurIPS'16)
- Graph Convolutional Network (Kipf & Welling, ICLR'17): This is currently the top-cited graph neural network paper.
- Simplified Graph Convolution (Wu et al., ICML'19): showed how much these layers can be simplified while still performing well on most real-world graphs.
Because convolutional GNNs specify the weights in advance, usually as some function of the graph topology, these models are helpful when our graphs are homophilous.
Homophilous graphs: graphs whose edges encode label similarity, i.e. the majority of a node's neighbours are expected to share its label. Averaging over the neighbourhood is then a robust regularizer, since the feature representation of a node relies heavily on the features of its neighbours.
As the coefficients are predefined, this equation can be computed efficiently using sparse matrix multiplication. This explains why this type of GNN is the most widely deployed in industry today: it can handle large graphs of practical value.
Given a particular neighborhood, we do not need to assume that edges necessarily encode similarity, as we did for convolutional GNNs. That precondition of fixing the weights between nodes in the same neighbourhood does not generalize; it is not suitable for learning on graphs with more complex relations.
e.g. social network interactions
Retweeting somebody else's opinion does not strictly mean that we share the same thoughts; we may even completely disagree.
More suitable graph neural network architectures exist to capture more complex relations among nodes. They quantify node relations with a function computed in a feature-dependent manner, rather than purely from the graph topology:
$$
\boldsymbol{h}_i = \phi \Bigg( \boldsymbol{x}_i, \bigoplus_{j \in \mathcal{N}_i} \textcolor{Green}{a(}\boldsymbol{x}_i, \boldsymbol{x}_j\textcolor{Green}{)}\, \psi(\boldsymbol{x}_j) \Bigg)
$$
One very natural way of achieving this is with an attention mechanism. In convolutional graph neural networks we had fixed interaction coefficients, whereas now these coefficients are computed from the sender and receiver node features.
We have replaced the fixed coefficients $$\textcolor{Orange}{c_{ij}}$$ with a learned, feature-dependent attention score $$\textcolor{Green}{a(}\boldsymbol{x}_i, \boldsymbol{x}_j\textcolor{Green}{)}$$, a single scalar per edge.
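A minimal sketch of an attentional-flavour layer. The dot-product score used here is a hypothetical simplification (GAT, for example, scores pairs with a small MLP over concatenated features); the point is only that the coefficients are computed from sender and receiver features and normalized over each neighbourhood:

```python
import numpy as np

def attn_gnn_layer(X, A, W_q, W_k, W_psi, W_phi):
    """Attentional flavour: a(x_i, x_j) is computed from features, one scalar per edge."""
    scores = (X @ W_q) @ (X @ W_k).T             # raw pairwise scores
    scores = np.where(A > 0, scores, -np.inf)    # only attend over actual neighbours
    scores -= scores.max(axis=1, keepdims=True)  # assumes every node has >= 1 neighbour
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)    # softmax over each neighbourhood N_i
    messages = alpha @ np.tanh(X @ W_psi)        # weighted sum of psi(x_j)
    return np.tanh(np.concatenate([X, messages], axis=1) @ W_phi)
```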
Some popular models that apply this attention idea within the graph domain:
- MoNet (Monti et al., CVPR’17)
- Graph ATtention network (Veličković et al., ICLR’18)
- GATv2 (Brody et al., 2021): an improved successor of GAT that solves the static attention problem of graph attention networks.
We are no longer obliged to assume that our graph is homophilous; we can now learn a radically more complex set of interactions between nodes. The computational cost remains feasible, with just one scalar per edge, so we increase expressivity without sacrificing scalability. Once these scalars have been computed, we are back to the sparse matrix multiplication used in the convolutional case.
Attentional architectures are the de facto standard today for learning complex interactions. They shine on graphs where we expect complex behaviour to emerge along the edges. More about this field, including the acclaimed transformer architecture, in the upcoming chapter.
Message-passing GNNs are the most generic form of graph neural network layer we can realistically achieve. We no longer assume that our edges prescribe a particular weighted combination to be computed. Edges give us a recipe for passing data, and the message function determines what data is actually sent to which node. We can compute arbitrary vectors, the so-called message vectors, to be sent across edges:
$$
\boldsymbol{h}_i = \phi \Bigg( \boldsymbol{x}_i, \bigoplus_{j \in \mathcal{N}_i} \textcolor{Red}{\psi(}\boldsymbol{x}_i, \boldsymbol{x}_j\textcolor{Red}{)} \Bigg)
$$
There is an important difference with respect to the procedures employed in the convolutional and attentional mechanisms. Previously we computed something on each neighbour and then aggregated a weighted combination; now the neighbour (sender) decides what it contributes to the aggregation, but the receiver also shapes the message, since both are inputs to $$\psi$$.
The $$\textcolor{Red}{\psi}$$ function now plays the role of a message function: given the features of the receiver $$\boldsymbol{x}_i$$ and the sender $$\boldsymbol{x}_j$$, it computes an arbitrary message vector to be sent along the edge.
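A minimal sketch of a message-passing layer; the one-layer MLP message function over concatenated sender/receiver features is a hypothetical choice, not the formulation of any particular paper:

```python
import numpy as np

def mpnn_layer(X, A, W_msg, W_phi):
    """Message-passing flavour: an arbitrary message vector psi(x_i, x_j) per edge."""
    H = []
    for i in range(X.shape[0]):
        neighbours = np.nonzero(A[i])[0]
        # psi sees both the receiver x_i and each sender x_j (concatenated).
        pairs = np.concatenate([np.repeat(X[i][None, :], len(neighbours), axis=0),
                                X[neighbours]], axis=1)
        messages = np.tanh(pairs @ W_msg)                 # one message vector per edge
        agg = messages.sum(axis=0)                        # permutation-invariant aggregation
        H.append(np.tanh(np.concatenate([X[i], agg]) @ W_phi))  # phi(x_i, aggregated)
    return np.stack(H)
```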
There exist plenty of popular layers that implement message passing computation:
- Interaction Nets (Battaglia et al., NeurIPS’16)
- Message Passing Neural Networks (Gilmer et al., ICML’17)
- Deepmind GraphNets (Battaglia et al., 2018)
Nevertheless, this immense expressivity does not come for free: these graph neural networks are harder to interpret, scale, and train. However, this expressivity is unavoidable in tasks related to computational chemistry, reasoning, and simulation, which require edges to act as a recipe for passing messages. There are many parameters to learn within this mechanism.
Conclusion: The set of functions representable by convolutional GNNs is contained in the set representable by attentional GNNs, which in turn is contained in the set representable by message-passing GNNs. They are in increasing order of generality.