index.xml

<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Graph Deep Learning Lab</title>
    <link>https://graphdeeplearning.github.io/</link>
      <atom:link href="https://graphdeeplearning.github.io/index.xml" rel="self" type="application/rss+xml" />
    <description>Graph Deep Learning Lab</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>Xavier Bresson © 2020 · Made with &amp;hearts; by [Chaitanya Joshi](https://chaitjo.github.io/)</copyright><lastBuildDate>Thu, 28 Jun 2018 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://graphdeeplearning.github.io/images/icon_hu027d87ac1e37f4f802995042c9999554_21044_512x512_fill_lanczos_center_2.png</url>
      <title>Graph Deep Learning Lab</title>
      <link>https://graphdeeplearning.github.io/</link>
    </image>
    
    <item>
      <title>Benchmarking Graph Neural Networks</title>
      <link>https://graphdeeplearning.github.io/post/benchmarking-gnns/</link>
      <pubDate>Mon, 15 Jun 2020 03:03:47 +0800</pubDate>
      <guid>https://graphdeeplearning.github.io/post/benchmarking-gnns/</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog is based on the paper &lt;a href=&#34;https://arxiv.org/abs/2003.00982&#34;&gt;Benchmarking Graph Neural Networks&lt;/a&gt; which is a joint work with &lt;a href=&#34;https://chaitjo.github.io&#34;&gt;Chaitanya K. Joshi&lt;/a&gt;, &lt;a href=&#34;http://thomaslaurent.lmu.build/homepage.html&#34;&gt;Thomas Laurent&lt;/a&gt;, &lt;a href=&#34;https://mila.quebec/en/person/bengio-yoshua/&#34;&gt;Yoshua Bengio&lt;/a&gt; and &lt;a href=&#34;https://www.ntu.edu.sg/home/xbresson/&#34;&gt;Xavier Bresson&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&#34;https://graphdeeplearning.github.io/project/spatial-convnets/&#34;&gt;Graph Neural Networks (GNNs)&lt;/a&gt; are widely used today in diverse applications of &lt;a href=&#34;https://arxiv.org/abs/1609.02907&#34;&gt;social&lt;/a&gt; &lt;a href=&#34;https://arxiv.org/abs/1902.06673&#34;&gt;sciences&lt;/a&gt;, &lt;a href=&#34;https://arxiv.org/abs/1703.06103&#34;&gt;knowledge&lt;/a&gt; &lt;a href=&#34;https://arxiv.org/abs/2005.00545&#34;&gt;graphs&lt;/a&gt;, &lt;a href=&#34;https://arxiv.org/abs/1704.01212&#34;&gt;chemistry&lt;/a&gt;, &lt;a href=&#34;https://arxiv.org/abs/2002.09405&#34;&gt;physics&lt;/a&gt;, &lt;a href=&#34;https://infoscience.epfl.ch/record/229954?ln=en&#34;&gt;neuroscience&lt;/a&gt;, etc., and accordingly there has been a great surge of interest and growth in the number of papers in the literature.&lt;/p&gt;
&lt;p&gt;However, it has been increasingly difficult to gauge the effectiveness of new models and validate new ideas that generalize universally to larger and complex datasets &lt;strong&gt;in the absence of&lt;/strong&gt; a standard and widely-adopted &lt;strong&gt;benchmark&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;To address&lt;/strong&gt; this paramount concern existing in graph learning research, we develop an open-source, easy-to-use and reproducible &lt;a href=&#34;https://github.com/graphdeeplearning/benchmarking-gnns&#34;&gt;benchmarking framework&lt;/a&gt; with a rigorous experimental protocol that is representative of the categorical advances in GNNs.&lt;/p&gt;
&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    This post outlines the &lt;a href=&#34;https://arxiv.org/abs/1912.09893&#34;&gt;issues&lt;/a&gt; &lt;a href=&#34;https://arxiv.org/abs/1905.09550&#34;&gt;in&lt;/a&gt; &lt;a href=&#34;https://arxiv.org/abs/1905.04682&#34;&gt;the&lt;/a&gt; &lt;a href=&#34;https://arxiv.org/abs/1905.04579&#34;&gt;GNN&lt;/a&gt; literature suggesting the need of a benchmark, the framework proposed in the &lt;a href=&#34;https://arxiv.org/abs/2003.00982&#34;&gt;paper&lt;/a&gt;, the broad classes of widely used and powerful GNNs benchmarked and the insights learnt from the extensive experiments.
  &lt;/div&gt;
&lt;/div&gt;
&lt;hr&gt;
&lt;h3 id=&#34;why-benchmark&#34;&gt;Why benchmark?&lt;/h3&gt;
&lt;p&gt;In any core research or application area in &lt;a href=&#34;https://www.nature.com/articles/nature14539&#34;&gt;deep learning&lt;/a&gt;, a benchmark helps to identify and quantify what types of &lt;a href=&#34;https://arxiv.org/abs/1409.4842&#34;&gt;architectures&lt;/a&gt;, &lt;a href=&#34;https://arxiv.org/abs/1512.03385&#34;&gt;principles&lt;/a&gt;, or &lt;a href=&#34;https://arxiv.org/abs/1502.03167&#34;&gt;mechanisms&lt;/a&gt; are universal and generalizable to real-world tasks and large datasets. Particularly, the &lt;a href=&#34;https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf&#34;&gt;recent&lt;/a&gt; &lt;a href=&#34;https://cacm.acm.org/magazines/2017/6/217744-technical-perspective-what-led-computer-vision-to-deep-learning/fulltext&#34;&gt;revolution&lt;/a&gt; in this AI field is often credited, &lt;em&gt;to a possibly large extent&lt;/em&gt;, to be triggered by the large-scale benchmark image dataset, &lt;a href=&#34;http://www.image-net.org&#34;&gt;ImageNet&lt;/a&gt;. (Obviously, other driving factors include increase in the volume of research, more datasets, compute, wide-adoptance, etc.)&lt;/p&gt;


&lt;figure&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;imagenet_leaderboard.png&#34; data-caption=&#34;Fig 1: ImageNet Classification Leaderboard from paperswithcode.com&#34;&gt;
&lt;img data-src=&#34;imagenet_leaderboard.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; &gt;&lt;/a&gt;


  &lt;figcaption&gt;
    Fig 1: ImageNet Classification Leaderboard from &lt;a href=&#34;https://paperswithcode.com&#34;&gt;paperswithcode.com&lt;/a&gt;
  &lt;/figcaption&gt;


&lt;/figure&gt;

&lt;p&gt;Benchmarking has been proved to be beneficial for &lt;strong&gt;driving progress&lt;/strong&gt;, identifying &lt;strong&gt;essential ideas&lt;/strong&gt;, and solving domain-related problems in &lt;a href=&#34;https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1738-8&#34;&gt;many&lt;/a&gt; sub-fields of science. This project was conceived with this fundamental motivation.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;need-of-a-benchmarking-framework-for-gnns&#34;&gt;Need of a benchmarking framework for GNNs&lt;/h3&gt;
&lt;h4 id=&#34;a-datasets&#34;&gt;a. Datasets:&lt;/h4&gt;
&lt;p&gt;Many of the widely cited papers in the GNN literature contain experiments that are evaluated on &lt;strong&gt;small graph datasets&lt;/strong&gt; which have only a few hundreds (or, thousand) of graphs.&lt;/p&gt;


&lt;figure&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;tu_datasets.png&#34; data-caption=&#34;Fig 2: Statistics of the widely used TU datasets. Source Errica et al., 2020&#34;&gt;
&lt;img data-src=&#34;tu_datasets.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; &gt;&lt;/a&gt;


  &lt;figcaption&gt;
    Fig 2: Statistics of the widely used TU datasets. Source &lt;a href=&#34;https://openreview.net/forum?id=HygDF6NFPB&#34;&gt;Errica et al., 2020&lt;/a&gt;
  &lt;/figcaption&gt;


&lt;/figure&gt;

&lt;p&gt;&lt;strong&gt;Take for example&lt;/strong&gt;, the ENZYMES dataset, which is almost seen in every work on a GNN for classification task. If one uses a random $10$-fold cross validation (in most papers), the test set would have $60$ graphs (i.e. $10$% of $600$ total graphs). That would mean a correct classification (or, alternatively a misclassification) would change $1.67$% of test accuracy score. &lt;strong&gt;A couple of samples could determine a $3.33$% difference in performance measure&lt;/strong&gt;, which is usually a significant gain score stated when one validates a new idea in literature. You see there, the number of samples is unreliable to concretely acknowledge the advances. &lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Our experiments, too, show that the standard deviation of performance on such datasets is large, making it difficult to make substantial conclusions on a research idea. Moreover, most GNNs perform statistically the same on these datasets.
The &lt;strong&gt;quality&lt;/strong&gt; of these datasets also leads one to question if you should use them while validating ideas on GNNs. On several of these datasets, &lt;a href=&#34;https://openreview.net/forum?id=HygDF6NFPB&#34;&gt;simpler models&lt;/a&gt;, sometimes, &lt;a href=&#34;https://openreview.net/forum?id=rJlUhhVYvS&#34;&gt;perform as good&lt;/a&gt;, or even beats GNNs.&lt;/p&gt;
&lt;p&gt;Consequently, &lt;strong&gt;it has become difficult&lt;/strong&gt; to differentiate &lt;a href=&#34;https://arxiv.org/abs/1905.09550&#34;&gt;complex&lt;/a&gt;, &lt;a href=&#34;https://arxiv.org/abs/1905.04579&#34;&gt;simple&lt;/a&gt; and &lt;a href=&#34;https://openreview.net/forum?id=HygDF6NFPB&#34;&gt;graph-agnostic&lt;/a&gt; architectures for graph machine learning.&lt;/p&gt;
&lt;!-- Most papers do not even use the same splits when comparing with an existing literature.  --&gt;
&lt;h4 id=&#34;b-consistent-experimental-protocol&#34;&gt;b. Consistent experimental protocol:&lt;/h4&gt;
&lt;p&gt;Several papers in the GNN literature do not have consensus on a unifying and robust experimental setting which leads to &lt;a href=&#34;https://arxiv.org/abs/1811.05868&#34;&gt;discussing&lt;/a&gt; the inconsistencies and &lt;a href=&#34;https://openreview.net/forum?id=HygDF6NFPB&#34;&gt;re-evaluating&lt;/a&gt; several papers&amp;rsquo; experiments.&lt;/p&gt;
&lt;p&gt;For a couple of examples to highlight here, &lt;a href=&#34;https://papers.nips.cc/paper/7729-hierarchical-graph-representation-learning-with-differentiable-pooling&#34;&gt;Ying et al., 2018&lt;/a&gt; performed training on $10$-fold split data for a fixed number of epochs and reported the performance of the epoch which has the &lt;em&gt;&amp;ldquo;highest average validation accuracy across the splits at any epoch&amp;rdquo;&lt;/em&gt; whereas &lt;a href=&#34;http://proceedings.mlr.press/v97/lee19c.html&#34;&gt;Lee et al., 2019&lt;/a&gt; used an &lt;em&gt;&amp;ldquo;early stopping criterion&amp;rdquo;&lt;/em&gt;  by monitoring the epoch-wise validation loss and report &lt;em&gt;&amp;ldquo;average test accuracy at last epoch&amp;rdquo;&lt;/em&gt; over $10$-fold split.&lt;/p&gt;
&lt;p&gt;Now, if we extract results of both these papers to put together in the same table and claim that the model with the highest performance score is the promising of all, &lt;strong&gt;can we get convinced&lt;/strong&gt; that the comparison is fair?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There are other issues related to hyperparamter selection, comparison in an unfair budgets of trainable paramters, use of different train-validation-test splits, etc.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The existence of such problems pushed us to develop a GNN benchmarking framework which &lt;strong&gt;standardizes GNN research&lt;/strong&gt; and help researchers make more meaningful advances.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;challenges-of-building-a-gnn-benchmark&#34;&gt;Challenges of building a GNN benchmark&lt;/h3&gt;
&lt;p&gt;The lack of benchmarks have been a major issue in GNN literature as the &lt;strong&gt;aforementioned requirements have not been strictly enforced&lt;/strong&gt;.&lt;/p&gt;
&lt;!-- We believe that a standard and unified benchmark framework should have --

1. an easy to use and reproducible coding framework
2. rigorous and fair experimental setting
3. appropriate datasets that can statistically separate model performance
4. comprehensive in terms of the fundamental tasks (applications) that the research can be applied to.  --&gt;
&lt;p&gt;Designing benchmarks is highly challenging as we must make robust decisions for coding framework, experimental settings and appropriate datasets. The benchmark should also be comprehensive to cover most of the fundamental tasks which is indicative of the application area the research can be applied to. For instance, graph learning problems include predicting properties at the node-level, edge-level and graph-level. A benchmark should attempt to cover many, if not all, of these.&lt;/p&gt;
&lt;p&gt;Similarly, it is &lt;strong&gt;challenging to collect real and representative large-scale datasets&lt;/strong&gt;. The lack of theoretical tools that can define the quality of a dataset or, validate its statistical representativeness for a given task makes it difficult to decide on datasets. Furthermore, there are arbitrary choices required on the features of nodes and edges for graphs and the scale of graph sizes as most of the popular graph learning frameworks do not cater &lt;em&gt;‘very efficiently’&lt;/em&gt; to large graphs.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There has been a promising effort recently, &lt;a href=&#34;https://ogb.stanford.edu&#34;&gt;The Open Graph Benchmark (OGB)&lt;/a&gt;, to collect meaningful medium-to-large scale dataset in order to steer graph learning research. The initiative is complementary to the goals of this project.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;h3 id=&#34;proposed-benchmarking-framework&#34;&gt;Proposed benchmarking framework:&lt;/h3&gt;
&lt;!-- We include each of the four characteristics listed out in the previous section to propose a benchmarking framework.  --&gt;
&lt;p&gt;We propose a benchmarking framework for graph neural networks with the following key characteristics:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;We develop a modular coding infrastructure which can be used to speed up the development of new ideas&lt;/li&gt;
&lt;li&gt;Our framework adopts a rigorous and fair experimental protocol,&lt;/li&gt;
&lt;li&gt;We propose appropriate medium-scale datasets that can be used a plug-ins for later research. &lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ol start=&#34;4&#34;&gt;
&lt;li&gt;Four fundamental tasks in graph machine learning are covered, i.e. graph classification, graph regression, node classification, and edge classification.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4 id=&#34;a-coding-infrastructure&#34;&gt;a. Coding infrastructure:&lt;/h4&gt;
&lt;p&gt;Our benchmarking code infrastructure is based on &lt;a href=&#34;http://pytorch.org&#34;&gt;Pytorch&lt;/a&gt;/&lt;a href=&#34;http://dgl.ai&#34;&gt;DGL&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;From a high-level view&lt;/strong&gt;, our framework unifies independent components for i) Data pipelines, ii) GNN layers and models, iii Training and evaluation functions, iv) Network and hyperparameters configurations, and v) Single execution scripts for reproducibility.&lt;/p&gt;


&lt;figure&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;coding_infrastructure.png&#34; data-caption=&#34;Fig 3: Snapshot of our modular coding framework open-sourced on GitHub&#34;&gt;
&lt;img data-src=&#34;coding_infrastructure.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; &gt;&lt;/a&gt;


  &lt;figcaption&gt;
    Fig 3: Snapshot of our modular coding framework open-sourced on &lt;a href=&#34;https://github.com/graphdeeplearning/benchmarking-gnns&#34;&gt;GitHub&lt;/a&gt;
  &lt;/figcaption&gt;


&lt;/figure&gt;

&lt;p&gt;The detailed user instructions on use of each of these components is described on &lt;a href=&#34;https://github.com/graphdeeplearning/benchmarking-gnns&#34;&gt;GitHub README&lt;/a&gt;. &lt;sup id=&#34;fnref:3&#34;&gt;&lt;a href=&#34;#fn:3&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h4 id=&#34;b-datasets&#34;&gt;b. Datasets:&lt;/h4&gt;
&lt;p&gt;We include 8 datasets from diverse domains of chemistry, mathematical modeling, computer vision, combinatorial optimization and social networks.&lt;/p&gt;


&lt;figure&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;dataset_summary.png&#34; data-caption=&#34;Fig 4: Summary statistics of the datasets included in the proposed benchmark&#34;&gt;
&lt;img data-src=&#34;dataset_summary.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; &gt;&lt;/a&gt;


  &lt;figcaption&gt;
    Fig 4: Summary statistics of the datasets included in the proposed benchmark
  &lt;/figcaption&gt;


&lt;/figure&gt;

&lt;p&gt;The steps for datasets&amp;rsquo; preparation and their relevance to benchmarking graph neural networks are described in the paper.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It is worth mentioning that we include &lt;a href=&#34;https://ogb.stanford.edu/docs/linkprop/#ogbl-collab&#34;&gt;OGBL-COLLAB&lt;/a&gt; from OGB which demonstrates the we can &lt;strong&gt;flexibly incorporate&lt;/strong&gt; any of the current and future datasets from the OGB initiative.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;!-- **Brief Description and Relevance**: ZINC is one of the most popular real-world molecular dataset of 250K graphs, out of which we select 12K for efficiency. The task is to regress a graph property of constrained solubility which is an important chemical property for designing generative molecules. PATTERN and CLUSTER are node classification datasets generated from Stochastic Block Models (SBMs), which are widely used to model communities in social networks. MNIST and CIFAR10 datasets are the superpixels graphs from the respective standard computer vision datasets. These datasets are important from a sanity-check perspective since we expect to perform close to 100% for MNIST and perform well enough for CIFAR10. TSP dataset is based on the traveling salesman problem, which is modeled for the edge classification task. The dataset is a collection of 2D Euclidean graphs where each graph is a TSP instance and the problem is to predict the edges that would form part of the optimal TSP tour. OGBL-COLLAB is a link prediction dataset from OGB, which is from a collaboration network dataset indexed by Microsoft Academic Graph. The task here is to predict authorship collaboration, i.e. link prediction between two authors represented by nodes. Finally, we include a small-scale dataset of Circular Skip Link Graphs. The CSL is a synthetic mathematical dataset used for the graph isomorphism problem and check expressivity of GNNs. Note that we include OGBL-COLLAB from OGB which demonstrates the we can flexibly incorporate any of the current and future datasets from the OGB initiative. --&gt;
&lt;hr&gt;
&lt;h4 id=&#34;c-experimental-protocol&#34;&gt;c. Experimental Protocol:&lt;/h4&gt;
&lt;p&gt;We define a rigorous and fair experimental protocol for benchmarking graph neural network models.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dataset splits:&lt;/strong&gt; Given the literature has issues with using different train-val-test splits for different models, we make sure our data pipelines provide the same training, validation and test splits for every GNN model compared. We follow standard splits for the datasets available. For synthetic datasets with no standard splits, we ensure the class distribution or the synthetic properties are the same across the splits. Please refer to the paper on more details.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Training:&lt;/strong&gt; We use the same training setup and reporting protocol for all experiments. We use the Adam optimizer to train the GNNs with a learning rate decay strategy based on the validation loss. We train each experiment for an unspecified number of epochs where the model stops to train at a minimum learning rate at which there is no significant learning.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Importantly, this strategy makes it easy for the users to not fathom on choosing how many epochs to train their model for.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Each experiment is run on $4$ different seeds for a maximum of $12$ hours of training time and the summary statistics of the last epoch score of the $4$ experiments is reported.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Parameter budget:&lt;/strong&gt; We decide on using two trainable parameter budgets: (i) $100k$ parameters for each GNNs for all the tasks, and (ii) $500k$ parameters for GNNs for which we investigate scaling a model to larger parameters and deeper layers. The number of hidden layers and hidden dimensions are selected accordingly to match these budgets.&lt;/p&gt;
&lt;p&gt;We make this choice of having a similar parameter budget for fair comparison because it becomes otherwise difficult to rigorously evaluate different models. In GNN literature, it is often seen the a new model is compared to the existing literature without any detail of the number of parameters, or any attempt to have the same size of the model. Having said that, our goal is not to find the optimal set of hyperparameters for each of the models which is a compute-intensive task.&lt;/p&gt;
&lt;hr&gt;
&lt;h4 id=&#34;d-graph-neural-networks&#34;&gt;d. Graph Neural Networks:&lt;/h4&gt;
&lt;p&gt;We benchmark two broad classes of GNNs that represent the categorical advances in the architectures of a graph neural network witnessed in the most recent literature. We call the two classes, for nomenclature, as &lt;strong&gt;GCNs (Graph Convolutional Networks)&lt;/strong&gt; and &lt;strong&gt;WL-GNNs (Weisfeiler-Lehman GNNs)&lt;/strong&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GCNs refer to the popular message-passing based GNNs which leverage sparse tensor computation and WL-GNNs are the theoretically expressive GNNs based on the WL-test to distinguish non-isomorphic graphs which require dense tensor computation at each layer.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Accordingly, our experimental pipeline is shown in Fig 5 for GCNs and Fig 6 for WL-GNNs.&lt;/p&gt;
&lt;p&gt;


&lt;figure&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;mpgcns.png&#34; data-caption=&#34;Fig 5: Our standard experimental pipeline for GCNs which operate on sparse rank-$2$ tensors.&#34;&gt;
&lt;img data-src=&#34;mpgcns.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; &gt;&lt;/a&gt;


  &lt;figcaption&gt;
    Fig 5: Our standard experimental pipeline for GCNs which operate on &lt;em&gt;sparse&lt;/em&gt; rank-$2$ tensors.
  &lt;/figcaption&gt;


&lt;/figure&gt;


&lt;figure&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;wlgnns.png&#34; data-caption=&#34;Fig 6: Our standard experimental pipeline for WL-GNNs which operate on dense rank-$2$ tensors.&#34;&gt;
&lt;img data-src=&#34;wlgnns.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; &gt;&lt;/a&gt;


  &lt;figcaption&gt;
    Fig 6: Our standard experimental pipeline for WL-GNNs which operate on &lt;em&gt;dense&lt;/em&gt; rank-$2$ tensors.
  &lt;/figcaption&gt;


&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;We direct the readers to our paper and the corresponding works for more details on the mathematical formulations of the GNNs. To interested readers, we also include in paper the &lt;strong&gt;block diagrams of layer updates&lt;/strong&gt; of each GNN benchmarked.&lt;/p&gt;
&lt;hr&gt;
&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    For a quick recap at this stage, we discussed the &lt;strong&gt;need of a benchmark&lt;/strong&gt;, the &lt;strong&gt;challenges&lt;/strong&gt; in building such a framework and &lt;strong&gt;details on our proposed benchmarking framework&lt;/strong&gt;. We now delve into the experiments.
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We perform a principled investigation into the message passing based GCNs and the WL-GNNs to reveal important insights and highlight critical underlying challenges in building a powerful GNN model.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;benchmarking-gnns-on-the-proposed-datasets&#34;&gt;Benchmarking GNNs on the proposed datasets.&lt;/h3&gt;
&lt;p&gt;We perform exhaustive experiments on all datasets using every GNN models included currently in our benchmarking framework. The experiments help us draw many insights, few of which are discussed here. We recommend reading the paper for details on the experimental results.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The GNNs that we benchmark are: &lt;a href=&#34;https://arxiv.org/abs/1609.02907&#34;&gt;&lt;em&gt;Vanilla&lt;/em&gt; Graph Convolutional Network (GCN)&lt;/a&gt;, &lt;a href=&#34;https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf&#34;&gt;GraphSage&lt;/a&gt;, &lt;a href=&#34;https://arxiv.org/abs/1710.10903&#34;&gt;Graph Attention Network (GAT)&lt;/a&gt;, &lt;a href=&#34;https://arxiv.org/abs/1611.08402&#34;&gt;Gaussian Mixture Model (MoNet)&lt;/a&gt;, &lt;a href=&#34;https://arxiv.org/abs/1711.07553&#34;&gt;GatedGCN&lt;/a&gt;, &lt;a href=&#34;https://arxiv.org/abs/1810.00826&#34;&gt;Graph Isomorphism Network (GIN)&lt;/a&gt;, &lt;a href=&#34;https://papers.nips.cc/paper/9718-on-the-equivalence-between-graph-isomorphism-testing-and-function-approximation-with-gnns&#34;&gt;RingGNN&lt;/a&gt; and &lt;a href=&#34;https://arxiv.org/abs/1905.11136&#34;&gt;3WL-GNN&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;1. Graph-agnostic NNs perform poorly on the proposed datasets&lt;/strong&gt;: We compare all GNNs to a simple MLP which updates each node’s features independent of one-another, i.e. ignoring the graph structure.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;MLP node update equation at layer $\ell$ is:
$$
h_{i}^{\ell+1} =  \sigma \left( W^{\ell} \ h_{i}^{\ell} \right)
$$&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;MLP evaluates to consistently low scores on each of the datasets which shows the necessity to consider graph structure for these tasks. This result is also indicative of how appropriate these datasets are for GNN research as they statistically separate model’s performance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. GCNs outperform WL-GNNs on the proposed datasets&lt;/strong&gt;: Although WL-GNNs are provably powerful in terms of graph isomorphism and expressiveness, the WL-GNN models that we consider were not able to outperform GCNs. These models are limited in scaling to larger datasets as their space/time complexity are inefficient as compared to the GCNs which leverage sparse tensors.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GCNs are seen to conveniently scale to $16$ layers and provide the best results on all datasets, whereas the WL-GNNs face loss divergence and/or out-of-memory errors when trying to build deeper networks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;3. Anisotropic mechanisms improve message-passing GCNs architectures&lt;/strong&gt;: Among the models in the message-passing GCNs, we can classify them into &lt;strong&gt;isotropic&lt;/strong&gt; and &lt;strong&gt;anisotropic&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;A GCN model whose node update equation treats every edge direction equally, is considered &lt;strong&gt;isotropic&lt;/strong&gt;; and a GCN model whose node update equation treats every edge direction differently, is considered &lt;strong&gt;anisotropic&lt;/strong&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Isotropic layer update equation:
$$
h_{i}^{\ell+1} =  \sigma \Big( W_1^{\ell} \ h_{i}^{\ell} + \sum_{j \in \mathcal{N}_i} W_2^{\ell} \ h_{j}^{\ell} \Big)
$$&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Anisotropic layer update equation:
$$
h_{i}^{\ell+1} =  \sigma \Big( W_1^{\ell} \ h_{i}^{\ell} + \sum_{j \in \mathcal{N}_i} \eta_{ij} W_2 h_{j}^{\ell} \Big)
$$&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As per the above equations, GCN, GraphSage and GIN are isotropic GCNs whereas GAT, MoNet and GatedGCN are anisotropic GCNs.&lt;/p&gt;
&lt;p&gt;Our benchmark experiments reveal that the &lt;strong&gt;anisotropic mechanism is an architectural improvement&lt;/strong&gt; in GCNs which give consistently impressive results. Note that sparse and dense attention mechanisms (in GAT and GatedGCN respectively) are examples anisotropic components in a GNN.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4. There are underlying challenges for training the theoretically powerful WL-GNNs&lt;/strong&gt;: We observe a high standard deviation of performance scores on the WL-GNNs. (Recall that we report every performance of 4 runs with different seeds). This reveals &lt;strong&gt;the problem in training&lt;/strong&gt; these models.&lt;/p&gt;
&lt;p&gt;Universal training procedures like batched training and batch normalization are not used in WL-GNNs since they operate on dense rank-2 tensors.&lt;/p&gt;
&lt;p&gt;To describe this clearly, the batching approach for GCNs in leading graph machine learning libraries which operate on sparse rank-2 tensors involves preparing a &lt;strong&gt;sparse block diagonal adjacency matrix&lt;/strong&gt; for a batch of graphs.&lt;/p&gt;


&lt;figure&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;batching.png&#34; data-caption=&#34;Fig 7: Mini-batch graph represented with one sparse block-diagonal matrix. Source&#34;&gt;
&lt;img data-src=&#34;batching.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; &gt;&lt;/a&gt;


  &lt;figcaption&gt;
    Fig 7: Mini-batch graph represented with one sparse block-diagonal matrix. &lt;a href=&#34;https://github.com/tkipf/gcn#graph-classification&#34;&gt;Source&lt;/a&gt;
  &lt;/figcaption&gt;


&lt;/figure&gt;

&lt;p&gt;The WL-GNNs that operate on dense rank-2 tensors, have components which compute information at/from every position in the dense tensor. Therefore, the same approach (Fig 7) is not applicable as it would make the entire block diagonal matrix dense and would break sparsity.&lt;/p&gt;
&lt;p&gt;GCNs leverage batched training and hence batch normalization for stable and fast training. Besides, WL-GNNs, with the current design, are not suitable for single large graphs, eg. OGBL-COLLAB. We failed to fit the dense tensor of this large size on both GPU and CPU memory.&lt;/p&gt;
&lt;p&gt;Hence, our benchmark suggests the need for &lt;strong&gt;re-thinking&lt;/strong&gt; better design approaches for WL-GNNs which can leverage sparsity, batching, normalization schemes, etc. that have become universal ingredients in deep learning.&lt;/p&gt;
&lt;!-- **5. 3WL-GNNs perform the best among their class**: Among the models in the WL-GNN class, 3WL-GNN provides better results than its similar counterpart RingGNN. 3WL-GNN is the most theoretically powerful model of all the GNNs currently considered in our benchmarking framework. It is as powerful as 3-WL for isomorphism. The GIN models while being less expressive is able to scale better and provides overall good performance. --&gt;
&lt;hr&gt;
&lt;h3 id=&#34;more-reading&#34;&gt;More reading&lt;/h3&gt;
&lt;p&gt;With this introduction and usefulness of a GNN benchmarking framework, we conlcude this blog post, but there is more reading left if you&amp;rsquo;re interested in this work.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Particularly&lt;/strong&gt;, we investigate anisotropy and edge representations for link prediction in more detail in the paper and propose a new approach for improving low-structurally expressive GCNs. &lt;em&gt;We shall discuss these in future blog posts separately for clear understanding&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;If this benchmarking framework comes to use in your research, please use the following bibtex in your work. For discussions, hit us with a query on &lt;a href=&#34;https://github.com/graphdeeplearning/benchmarking-gnns/issues&#34;&gt;GitHub Issues&lt;/a&gt;. We would love to discuss and improve the benchmark for steering more meaningful research in graph neural networks.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@article{dwivedi2020benchmarkgnns,
  title={Benchmarking Graph Neural Networks},
  author={Dwivedi, Vijay Prakash and Joshi, Chaitanya K and Laurent, Thomas and Bengio, Yoshua and Bresson, Xavier},
  journal={arXiv preprint arXiv:2003.00982},
  year={2020}
}
&lt;/code&gt;&lt;/pre&gt;
&lt;section class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;By this, we do not mean the ideas are not useful and/or the work put by the authors is not meaningful. Every effort equally contributes to the advance of this field. &lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:2&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;As examples, you may refer &lt;a href=&#34;https://arxiv.org/pdf/2006.07846.pdf&#34;&gt;to&lt;/a&gt; &lt;a href=&#34;https://github.com/lukecavabarrett/pna&#34;&gt;these&lt;/a&gt; &lt;a href=&#34;https://github.com/AITRICS/mol_reliable_gnn&#34;&gt;works&lt;/a&gt; that leverage our framework to conveniently work on their research idea. It indicates the effectiveness of having such a framework. &lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:3&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;Note that we do not aim to develop a software library, but to come up with a coding framework where each component is simple and transparent to as many users as possible. &lt;a href=&#34;#fnref:3&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;
</description>
    </item>
    
    <item>
      <title>Learning TSP Requires Rethinking Generalization</title>
      <link>https://graphdeeplearning.github.io/publication/joshi-2020-learning/</link>
      <pubDate>Fri, 12 Jun 2020 00:00:00 +0000</pubDate>
      <guid>https://graphdeeplearning.github.io/publication/joshi-2020-learning/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Benchmarking Graph Neural Networks</title>
      <link>https://graphdeeplearning.github.io/project/benchmark/</link>
      <pubDate>Tue, 03 Mar 2020 22:20:35 +0800</pubDate>
      <guid>https://graphdeeplearning.github.io/project/benchmark/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Benchmarking Graph Neural Networks</title>
      <link>https://graphdeeplearning.github.io/publication/dwivedi-2020-benchmark/</link>
      <pubDate>Mon, 02 Mar 2020 00:00:00 +0000</pubDate>
      <guid>https://graphdeeplearning.github.io/publication/dwivedi-2020-benchmark/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Transformers are Graph Neural Networks</title>
      <link>https://graphdeeplearning.github.io/post/transformers-are-gnns/</link>
      <pubDate>Wed, 12 Feb 2020 16:08:39 +0800</pubDate>
      <guid>https://graphdeeplearning.github.io/post/transformers-are-gnns/</guid>
      <description>&lt;p&gt;Engineer friends often ask me: Graph Deep Learning sounds great, but are there any big commercial success stories? Is it being deployed in practical applications?&lt;/p&gt;
&lt;p&gt;Besides the obvious ones&amp;ndash;recommendation systems at &lt;a href=&#34;https://medium.com/pinterest-engineering/pinsage-a-new-graph-convolutional-neural-network-for-web-scale-recommender-systems-88795a107f48&#34;&gt;Pinterest&lt;/a&gt;, &lt;a href=&#34;https://arxiv.org/abs/1902.08730&#34;&gt;Alibaba&lt;/a&gt; and &lt;a href=&#34;https://blog.twitter.com/en_us/topics/company/2019/Twitter-acquires-Fabula-AI.html&#34;&gt;Twitter&lt;/a&gt;&amp;ndash;a slightly nuanced success story is the &lt;a href=&#34;https://arxiv.org/abs/1706.03762&#34;&gt;&lt;strong&gt;Transformer architecture&lt;/strong&gt;&lt;/a&gt;, which has &lt;a href=&#34;https://openai.com/blog/better-language-models/&#34;&gt;taken&lt;/a&gt; &lt;a href=&#34;https://www.blog.google/products/search/search-language-understanding-bert/&#34;&gt;the&lt;/a&gt; &lt;a href=&#34;https://www.microsoft.com/en-us/research/project/large-scale-pretraining-for-response-generation/&#34;&gt;NLP&lt;/a&gt; &lt;a href=&#34;https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/&#34;&gt;industry&lt;/a&gt; &lt;a href=&#34;https://blog.einstein.ai/introducing-a-conditional-transformer-language-model-for-controllable-generation/&#34;&gt;by&lt;/a&gt; &lt;a href=&#34;https://nv-adlr.github.io/MegatronLM&#34;&gt;storm&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Through this post, I want to establish links between &lt;a href=&#34;https://graphdeeplearning.github.io/project/spatial-convnets/&#34;&gt;Graph Neural Networks (GNNs)&lt;/a&gt; and Transformers.
I&amp;rsquo;ll talk about the intuitions behind model architectures in the NLP and GNN communities, make connections using equations and figures, and discuss how we could work together to drive progress.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s start by talking about the purpose of model architectures&amp;ndash;&lt;em&gt;representation learning&lt;/em&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;representation-learning-for-nlp&#34;&gt;Representation Learning for NLP&lt;/h3&gt;
&lt;p&gt;At a high level, all neural network architectures build &lt;em&gt;representations&lt;/em&gt; of input data as vectors/embeddings, which encode useful statistical and semantic information about the data.
These &lt;em&gt;latent&lt;/em&gt; or &lt;em&gt;hidden&lt;/em&gt; representations can then be used for performing something useful, such as classifying an image or translating a sentence.
The neural network &lt;em&gt;learns&lt;/em&gt; to build better-and-better representations by receiving feedback, usually via error/loss functions.&lt;/p&gt;
&lt;p&gt;For Natural Language Processing (NLP), conventionally, &lt;strong&gt;Recurrent Neural Networks&lt;/strong&gt; (RNNs) build representations of each word in a sentence in a sequential manner, &lt;em&gt;i.e.&lt;/em&gt;, &lt;strong&gt;one word at a time&lt;/strong&gt;.
Intuitively, we can imagine an RNN layer as a conveyor belt, with the words being processed on it &lt;em&gt;autoregressively&lt;/em&gt; from left to right.
At the end, we get a hidden feature for each word in the sentence, which we pass to the next RNN layer or use for our NLP tasks of choice.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I highly recommend Chris Olah&amp;rsquo;s legendary blog for recaps on &lt;a href=&#34;http://colah.github.io/posts/2015-08-Understanding-LSTMs/&#34;&gt;RNNs&lt;/a&gt; and &lt;a href=&#34;http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/&#34;&gt;representation learning&lt;/a&gt; for NLP.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;figure&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;rnn-transf-nlp.jpg&#34; &gt;
&lt;img data-src=&#34;rnn-transf-nlp.jpg&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; &gt;&lt;/a&gt;


&lt;/figure&gt;

&lt;p&gt;Initially introduced for machine translation, &lt;strong&gt;Transformers&lt;/strong&gt; have gradually replaced RNNs in mainstream NLP.
The architecture takes a fresh approach to representation learning: Doing away with recurrence entirely, Transformers build features of each word using an &lt;a href=&#34;https://distill.pub/2016/augmented-rnns/&#34;&gt;attention&lt;/a&gt; &lt;a href=&#34;https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html&#34;&gt;mechanism&lt;/a&gt; to figure out how important &lt;strong&gt;all the other words&lt;/strong&gt; in the sentence are w.r.t. to the aforementioned word.
Knowing this, the word&amp;rsquo;s updated features are simply the sum of linear transformations of the features of all the words, weighted by their importance.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Back in 2017, this idea sounded very radical, because the NLP community was so used to the sequential&amp;ndash;one-word-at-a-time&amp;ndash;style of processing text with RNNs. The title of the paper probably added fuel to the fire!&lt;/p&gt;
&lt;p&gt;For a recap, Yannic Kilcher made an excellent &lt;a href=&#34;https://www.youtube.com/watch?v=iDulhoQ2pro&#34;&gt;video overview&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;h3 id=&#34;breaking-down-the-transformer&#34;&gt;Breaking down the Transformer&lt;/h3&gt;
&lt;p&gt;Let&amp;rsquo;s develop intuitions about the architecture by translating the previous paragraph into the language of mathematical symbols and vectors.
We update the hidden feature $h$ of the $i$&#39;th word in a sentence $\mathcal{S}$ from layer $\ell$ to layer $\ell+1$ as follows:&lt;/p&gt;
&lt;p&gt;$$
h_{i}^{\ell+1} = \text{Attention} \left( Q^{\ell} h_{i}^{\ell} \ , K^{\ell} h_{j}^{\ell} \ , V^{\ell} h_{j}^{\ell} \right),
$$&lt;/p&gt;
&lt;p&gt;$$
i.e.,\ h_{i}^{\ell+1} = \sum_{j \in \mathcal{S}} w_{ij} \left( V^{\ell} h_{j}^{\ell} \right),
$$&lt;/p&gt;
&lt;p&gt;$$
\text{where} \ w_{ij} = \text{softmax}_j \left( Q^{\ell} h_{i}^{\ell} \cdot  K^{\ell} h_{j}^{\ell} \right),
$$&lt;/p&gt;
&lt;p&gt;where $j \in \mathcal{S}$ denotes the set of words in the sentence and $Q^{\ell}, K^{\ell}, V^{\ell}$ are learnable linear weights (denoting the &lt;strong&gt;Q&lt;/strong&gt;uery, &lt;strong&gt;K&lt;/strong&gt;ey and &lt;strong&gt;V&lt;/strong&gt;alue for the attention computation, respectively).
The attention mechanism is performed parallelly for each word in the sentence to obtain their updated features in &lt;em&gt;one shot&lt;/em&gt;&amp;ndash;another plus point for Transformers over RNNs, which update features word-by-word.&lt;/p&gt;
&lt;p&gt;We can understand the attention mechanism better through the following pipeline:&lt;/p&gt;


&lt;figure&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;attention-block.jpg&#34; &gt;
&lt;img data-src=&#34;attention-block.jpg&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;50%&#34; &gt;&lt;/a&gt;


&lt;/figure&gt;

&lt;blockquote&gt;
&lt;p&gt;Taking in the features of the word $h_{i}^{\ell}$ and the set of other words in the sentence ${ h_{j}^{\ell} ;\ \forall j \in \mathcal{S} }$, we compute the attention weights $w_{ij}$ for each pair $(i,j)$ through the dot-product, followed by a softmax across all $j$&#39;s. Finally, we produce the updated word feature $h_{i}^{\ell+1}$ for word $i$ by summing over all ${ h_{j}^{\ell} }$&#39;s weighted by their corresponding $w_{ij}$. Each word in the sentence parallelly undergoes the same pipeline to update its features.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;h3 id=&#34;multi-head-attention-mechanism&#34;&gt;Multi-head Attention mechanism&lt;/h3&gt;
&lt;p&gt;Getting this dot-product attention mechanism to work proves to be tricky&amp;ndash;bad random initializations can de-stabilize the learning process.
We can overcome this by parallelly performing multiple &amp;lsquo;heads&amp;rsquo; of attention and concatenating the result (with each head now having separate learnable  weights):&lt;/p&gt;
&lt;p&gt;$$
h_{i}^{\ell+1} = \text{Concat} \left( \text{head}_1, \ldots, \text{head}_K \right) O^{\ell},
$$
$$
\text{head}_k = \text{Attention} \left(  Q^{k,\ell} h_{i}^{\ell} \ , K^{k, \ell} h_{j}^{\ell} \ , V^{k, \ell} h_{j}^{\ell} \right),
$$&lt;/p&gt;
&lt;p&gt;where $Q^{k,\ell}, K^{k,\ell}, V^{k,\ell}$ are the learnable weights of the $k$&#39;th attention head and $O^{\ell}$ is a down-projection to match the dimensions of $h_i^{\ell+1}$ and $h_i^{\ell}$ across layers.&lt;/p&gt;
&lt;p&gt;Multiple heads allow the attention mechanism to essentially &amp;lsquo;hedge its bets&amp;rsquo;, looking at different transformations or aspects of the hidden features from the previous layer.
We&amp;rsquo;ll talk more about this later.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;scale-issues-and-the-feed-forward-sub-layer&#34;&gt;Scale issues and the Feed-forward sub-layer&lt;/h3&gt;
&lt;p&gt;A key issue motivating the final Transformer architecture is that the features for words &lt;em&gt;after&lt;/em&gt; the attention mechanism might be at &lt;strong&gt;different scales&lt;/strong&gt; or &lt;strong&gt;magnitudes&lt;/strong&gt;:
(1) This can be due to some words having very sharp or very distributed attention weights $w_{ij}$ when summing over the features of the other words.
(2) At the individual feature/vector entries level, concatenating across multiple attention heads&amp;ndash;each of which might output values at different scales&amp;ndash;can lead to the entries of the final vector $h_{i}^{\ell+1}$ having a wide range of values.
Following conventional ML wisdom, it seems reasonable to add a &lt;a href=&#34;https://nealjean.com/ml/neural-network-normalization/&#34;&gt;normalization layer&lt;/a&gt; into the pipeline.&lt;/p&gt;
&lt;p&gt;Transformers overcome issue (2) with &lt;a href=&#34;https://arxiv.org/abs/1607.06450&#34;&gt;&lt;strong&gt;LayerNorm&lt;/strong&gt;&lt;/a&gt;, which normalizes and learns an affine transformation at the feature level.
Additionally, &lt;strong&gt;scaling the dot-product&lt;/strong&gt; attention by the square-root of the feature dimension helps counteract issue (1).&lt;/p&gt;
&lt;p&gt;Finally, the authors propose another &amp;lsquo;trick&amp;rsquo; to control the scale issue: &lt;strong&gt;a position-wise 2-layer MLP&lt;/strong&gt; with a special structure.
After the multi-head attention, they project $h_i^{\ell+1}$ to a (absurdly) higher dimension by a learnable weight, where it undergoes the ReLU non-linearity, and is then projected back to its original dimension followed by another normalization:&lt;/p&gt;
&lt;p&gt;$$
h_i^{\ell+1} = \text{LN} \left( \text{MLP} \left( \text{LN} \left( h_i^{\ell+1} \right) \right) \right)
$$&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To be honest, I&amp;rsquo;m not sure what the exact intuition behind the over-parameterized feed-forward sub-layer was and nobody seems to be asking questions about it, too! I suppose LayerNorm and scaled dot-products didn&amp;rsquo;t completely solve the issues highlighted, so the big MLP is a sort of hack to re-scale the feature vectors independently of each other.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;mailto:chaitanya.joshi@ntu.edu.sg&#34;&gt;Email me&lt;/a&gt; if you know more!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;p&gt;The final picture of a Transformer layer looks like this:&lt;/p&gt;


&lt;figure&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;transformer-block.png&#34; &gt;
&lt;img data-src=&#34;transformer-block.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;60%&#34; &gt;&lt;/a&gt;


&lt;/figure&gt;

&lt;p&gt;The Transformer architecture is also extremely amenable to very deep networks, enabling the NLP community to &lt;em&gt;&lt;a href=&#34;https://arxiv.org/abs/1910.10683&#34;&gt;scale&lt;/a&gt; &lt;a href=&#34;https://arxiv.org/abs/2001.08361&#34;&gt;up&lt;/a&gt;&lt;/em&gt; in terms of both model parameters and, by extension, data.
&lt;strong&gt;Residual connections&lt;/strong&gt; between the inputs and outputs of each multi-head attention sub-layer and the feed-forward sub-layer are key for stacking Transformer layers (but omitted from the diagram for clarity).&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;gnns-build-representations-of-graphs&#34;&gt;GNNs build representations of graphs&lt;/h3&gt;
&lt;p&gt;Let&amp;rsquo;s take a step away from NLP for a moment.&lt;/p&gt;
&lt;p&gt;Graph Neural Networks (GNNs) or Graph Convolutional Networks (GCNs) build representations of nodes and edges in graph data.
They do so through &lt;strong&gt;neighbourhood aggregation&lt;/strong&gt; (or message passing), where each node gathers features from its neighbours to update its representation of the &lt;em&gt;local&lt;/em&gt; graph structure around it.
Stacking several GNN layers enables the model to propagate each node&amp;rsquo;s features over the entire graph&amp;ndash;from its neighbours to the neighbours&amp;rsquo; neighbours, and so on.&lt;/p&gt;


&lt;figure&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;gnn-social-network.jpg&#34; &gt;
&lt;img data-src=&#34;gnn-social-network.jpg&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; &gt;&lt;/a&gt;


&lt;/figure&gt;

&lt;blockquote&gt;
&lt;p&gt;Take the example of this emoji social network: The node features produced by the GNN can be used for predictive tasks such as identifying the most influential members or proposing potential connections.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In their most basic form, GNNs update the hidden features $h$ of node $i$ (for example, 😆) at layer $\ell$ via a non-linear transformation of the node&amp;rsquo;s own features $h_i^{\ell}$ added to the aggregation of features $h_j^{\ell}$ from each neighbouring node $j \in \mathcal{N}(i)$:&lt;/p&gt;
&lt;p&gt;$$
h_{i}^{\ell+1} =  \sigma \Big( U^{\ell} h_{i}^{\ell} + \sum_{j \in \mathcal{N}(i)} \left( V^{\ell} h_{j}^{\ell} \right)  \Big),
$$&lt;/p&gt;
&lt;p&gt;where $U^{\ell}, V^{\ell}$ are learnable weight matrices of the GNN layer and $\sigma$ is a non-linearity such as ReLU.
In the example, $\mathcal{N}$(😆) $=$ { 😘, 😎, 😜, 🤩 }.&lt;/p&gt;
&lt;p&gt;The summation over the neighbourhood nodes $j \in \mathcal{N}(i)$ can be replaced by other input size-invariant &lt;strong&gt;aggregation functions&lt;/strong&gt; such as simple mean/max or something more powerful, such as a weighted sum via an &lt;a href=&#34;https://petar-v.com/GAT/&#34;&gt;&lt;strong&gt;attention mechanism&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Does that sound familiar?&lt;/p&gt;
&lt;p&gt;Maybe a pipeline will help make the connection:&lt;/p&gt;


&lt;figure&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;gnn-block.jpg&#34; &gt;
&lt;img data-src=&#34;gnn-block.jpg&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;50%&#34; &gt;&lt;/a&gt;


&lt;/figure&gt;

&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    If we were to do multiple parallel heads of neighbourhood aggregation and replace summation over the neighbours $j$ with the attention mechanism, &lt;em&gt;i.e.&lt;/em&gt;, a weighted sum, we&amp;rsquo;d get the &lt;b&gt;Graph Attention Network&lt;/b&gt; (GAT). Add normalization and the feed-forward MLP, and voila, we have a &lt;b&gt;Graph Transformer&lt;/b&gt;!
  &lt;/div&gt;
&lt;/div&gt;
&lt;hr&gt;
&lt;h3 id=&#34;sentences-are-fully-connected-word-graphs&#34;&gt;Sentences are fully-connected word graphs&lt;/h3&gt;
&lt;p&gt;To make the connection more explicit, consider a sentence as a fully-connected graph, where each word is connected to every other word.
Now, we can use a GNN to build features for each node (word) in the graph (sentence), which we can then perform NLP tasks with.&lt;/p&gt;


&lt;figure&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;gnn-nlp.jpg&#34; &gt;
&lt;img data-src=&#34;gnn-nlp.jpg&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;90%&#34; &gt;&lt;/a&gt;


&lt;/figure&gt;

&lt;p&gt;Broadly, this is what Transformers are doing: they are &lt;strong&gt;GNNs with multi-head attention&lt;/strong&gt; as the neighbourhood aggregation function.
Whereas standard GNNs aggregate features from their local neighbourhood nodes $j \in \mathcal{N}(i)$,
Transformers for NLP treat the entire sentence $\mathcal{S}$ as the local neighbourhood, aggregating features from each word $j \in \mathcal{S}$ at each layer.&lt;/p&gt;
&lt;p&gt;Importantly, various problem-specific tricks&amp;ndash;such as position encodings, causal/masked aggregation, learning rate schedules and extensive pre-training&amp;ndash;are essential for the success of Transformers but seldom seem in the GNN community.
At the same time, looking at Transformers from a GNN perspective could inspire us to get rid of a lot of the &lt;em&gt;bells and whistles&lt;/em&gt; in the architecture.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;what-can-we-learn-from-each-other&#34;&gt;What can we learn from each other?&lt;/h3&gt;
&lt;p&gt;Now that we&amp;rsquo;ve established a connection between Transformers and GNNs, let me throw some ideas around&amp;hellip;&lt;/p&gt;
&lt;h4 id=&#34;are-fully-connected-graphs-the-best-input-format-for-nlp&#34;&gt;Are fully-connected graphs the best input format for NLP?&lt;/h4&gt;
&lt;p&gt;Before statistical NLP and ML, linguists like Noam Chomsky focused on developing fomal theories of &lt;a href=&#34;https://en.wikipedia.org/wiki/Syntactic_Structures&#34;&gt;linguistic structure&lt;/a&gt;, such as &lt;strong&gt;syntax trees/graphs&lt;/strong&gt;.
&lt;a href=&#34;https://arxiv.org/abs/1503.00075&#34;&gt;Tree LSTMs&lt;/a&gt; already tried this, but maybe Transformers/GNNs are better architectures for bringing the world of linguistic theory and statistical NLP closer?&lt;/p&gt;


&lt;figure&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;syntax-tree.png&#34; &gt;
&lt;img data-src=&#34;syntax-tree.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;40%&#34; &gt;&lt;/a&gt;


&lt;/figure&gt;

&lt;h4 id=&#34;how-to-learn-long-term-dependencies&#34;&gt;How to learn long-term dependencies?&lt;/h4&gt;
&lt;p&gt;Another issue with fully-connected graphs is that they make learning very long-term dependencies between words difficult.
This is simply due to how the number of edges in the graph &lt;strong&gt;scales quadratically&lt;/strong&gt; with the number of nodes, &lt;em&gt;i.e.&lt;/em&gt;, in an $n$ word sentence, a Transformer/GNN would be doing computations over $n^2$ pairs of words. Things get out of hand for very large $n$.&lt;/p&gt;
&lt;p&gt;The NLP community&amp;rsquo;s perspective on the long sequences and dependencies problem is interesting:
Making the attention mechanism &lt;a href=&#34;https://openai.com/blog/sparse-transformer/&#34;&gt;sparse&lt;/a&gt; or &lt;a href=&#34;https://ai.facebook.com/blog/making-transformer-networks-simpler-and-more-efficient/&#34;&gt;adaptive&lt;/a&gt; in terms of input size, adding &lt;a href=&#34;https://ai.googleblog.com/2019/01/transformer-xl-unleashing-potential-of.html&#34;&gt;recurrence&lt;/a&gt; or &lt;a href=&#34;https://deepmind.com/blog/article/A_new_model_and_dataset_for_long-range_memory&#34;&gt;compression&lt;/a&gt; into each layer,
and using &lt;a href=&#34;https://www.pragmatic.ml/reformer-deep-dive/&#34;&gt;Locality Sensitive Hashing&lt;/a&gt; for efficient attention
are all promising new ideas for better Transformers.&lt;/p&gt;
&lt;p&gt;It would be interesting to see ideas from the GNN community thrown into the mix, &lt;em&gt;e.g.&lt;/em&gt;, &lt;a href=&#34;https://arxiv.org/abs/1911.04070&#34;&gt;Binary Partitioning&lt;/a&gt; for sentence &lt;strong&gt;graph sparsification&lt;/strong&gt; seems like another exciting approach.&lt;/p&gt;


&lt;figure&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;long-term-depend.png&#34; &gt;
&lt;img data-src=&#34;long-term-depend.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;80%&#34; &gt;&lt;/a&gt;


&lt;/figure&gt;

&lt;h4 id=&#34;are-transformers-learning-neural-syntax&#34;&gt;Are Transformers learning &amp;lsquo;neural syntax&amp;rsquo;?&lt;/h4&gt;
&lt;p&gt;There have been &lt;a href=&#34;https://pair-code.github.io/interpretability/bert-tree/&#34;&gt;several&lt;/a&gt; &lt;a href=&#34;https://arxiv.org/abs/1905.05950&#34;&gt;interesting&lt;/a&gt; &lt;a href=&#34;https://arxiv.org/abs/1906.04341&#34;&gt;papers&lt;/a&gt; from the NLP community on what Transformers might be learning.
The basic premise is that performing attention on all word pairs in a sentence&amp;ndash;with the purpose of identifying which pairs are the most interesting&amp;ndash;enables Transformers to learn something like a &lt;strong&gt;task-specific syntax&lt;/strong&gt;.
Different heads in the multi-head attention might also be &amp;lsquo;looking&amp;rsquo; at different syntactic properties.&lt;/p&gt;
&lt;p&gt;In graph terms, by using GNNs on full graphs, can we recover the most important edges&amp;ndash;and what they might entail&amp;ndash;from how the GNN performs neighbourhood aggregation at each layer?
I&amp;rsquo;m &lt;a href=&#34;https://arxiv.org/abs/1909.07913&#34;&gt;not so convinced&lt;/a&gt; by this view yet.&lt;/p&gt;


&lt;figure&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;attention-heads.png&#34; &gt;
&lt;img data-src=&#34;attention-heads.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; &gt;&lt;/a&gt;


&lt;/figure&gt;

&lt;h4 id=&#34;why-multiple-heads-of-attention-why-attention&#34;&gt;Why multiple heads of attention? Why attention?&lt;/h4&gt;
&lt;p&gt;I&amp;rsquo;m more sympathetic to the optimization view of the multi-head mechanism&amp;ndash;having multiple attention heads &lt;strong&gt;improves learning&lt;/strong&gt; and overcomes &lt;strong&gt;bad random initializations&lt;/strong&gt;.
For instance, &lt;a href=&#34;https://lena-voita.github.io/posts/acl19_heads.html&#34;&gt;these&lt;/a&gt; &lt;a href=&#34;https://arxiv.org/abs/1905.10650&#34;&gt;papers&lt;/a&gt; showed that Transformer heads can be &amp;lsquo;pruned&amp;rsquo; or removed &lt;em&gt;after&lt;/em&gt; training without significant performance impact.&lt;/p&gt;
&lt;p&gt;Multi-head neighbourhood aggregation mechanisms have also proven effective in GNNs, &lt;em&gt;e.g.&lt;/em&gt;, GAT uses the same multi-head attention and &lt;a href=&#34;https://arxiv.org/abs/1611.08402&#34;&gt;MoNet&lt;/a&gt; uses multiple &lt;em&gt;Gaussian kernels&lt;/em&gt; for aggregating features.
Although invented to stabilize attention mechanisms, could the multi-head trick become standard for squeezing out extra model performance?&lt;/p&gt;
&lt;p&gt;Conversely, GNNs with simpler aggregation functions such as sum or max do not require multiple aggregation heads for stable training.
Wouldn&amp;rsquo;t it be nice for Transformers if we didn&amp;rsquo;t have to compute pair-wise compatibilities between each word pair in the sentence?&lt;/p&gt;
&lt;p&gt;Could Transformers benefit from ditching attention, altogether? Yann Dauphin and collaborators&amp;rsquo; &lt;a href=&#34;https://arxiv.org/abs/1705.03122&#34;&gt;recent&lt;/a&gt; &lt;a href=&#34;https://arxiv.org/abs/1901.10430&#34;&gt;work&lt;/a&gt; suggests an alternative &lt;strong&gt;ConvNet architecture&lt;/strong&gt;.
Transformers, too, might ultimately be doing &lt;a href=&#34;http://jbcordonnier.com/posts/attention-cnn/&#34;&gt;something&lt;/a&gt; &lt;a href=&#34;https://twitter.com/ChrSzegedy/status/1232148457810538496&#34;&gt;similar&lt;/a&gt; to ConvNets!&lt;/p&gt;


&lt;figure&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;attention-conv.png&#34; &gt;
&lt;img data-src=&#34;attention-conv.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; &gt;&lt;/a&gt;


&lt;/figure&gt;

&lt;h4 id=&#34;why-is-training-transformers-so-hard&#34;&gt;Why is training Transformers so hard?&lt;/h4&gt;
&lt;p&gt;Reading new Transformer papers makes me feel that training these models requires something akin to &lt;em&gt;black magic&lt;/em&gt; when determining the best &lt;strong&gt;learning rate schedule, warmup strategy&lt;/strong&gt; and &lt;strong&gt;decay settings&lt;/strong&gt;.
This could simply be because the models are so huge and the NLP tasks studied are so challenging.&lt;/p&gt;
&lt;p&gt;But &lt;a href=&#34;https://arxiv.org/abs/1906.01787&#34;&gt;recent&lt;/a&gt; &lt;a href=&#34;https://arxiv.org/abs/1910.06764&#34;&gt;results&lt;/a&gt; &lt;a href=&#34;https://arxiv.org/abs/2002.04745&#34;&gt;suggest&lt;/a&gt; that it could also be due to the specific permutation of normalization and residual connections within the architecture.&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;I enjoyed reading the new &lt;a href=&#34;https://twitter.com/DeepMind?ref_src=twsrc%5Etfw&#34;&gt;@DeepMind&lt;/a&gt; Transformer paper, but why is training these models such dark magic? &amp;quot;For word-based LM we used 16, 000 warmup steps with 500, 000 decay steps and sacrifice 9,000 goats.&amp;quot;&lt;a href=&#34;https://t.co/dP49GTa4ze&#34;&gt;https://t.co/dP49GTa4ze&lt;/a&gt; &lt;a href=&#34;https://t.co/1K3Fx4s3M8&#34;&gt;pic.twitter.com/1K3Fx4s3M8&lt;/a&gt;&lt;/p&gt;&amp;mdash; Chaitanya Joshi (@chaitjo) &lt;a href=&#34;https://twitter.com/chaitjo/status/1229335421806501888?ref_src=twsrc%5Etfw&#34;&gt;February 17, 2020&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;	
&lt;p&gt;At this point I&amp;rsquo;m ranting, but this makes me sceptical: Do we really need multiple heads of expensive pair-wise attention, overparameterized MLP sub-layers, and complicated learning schedules?&lt;/p&gt;
&lt;p&gt;Do we really need massive models with &lt;a href=&#34;https://www.technologyreview.com/s/613630/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes/&#34;&gt;massive carbon footprints&lt;/a&gt;?&lt;/p&gt;
&lt;p&gt;Shouldn&amp;rsquo;t architectures with good &lt;a href=&#34;https://arxiv.org/abs/1806.01261&#34;&gt;inductive biases&lt;/a&gt; for the task at hand be easier to train?&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;further-reading&#34;&gt;Further Reading&lt;/h3&gt;
&lt;p&gt;To dive deep into the Transformer architecture from an NLP perspective, check out these amazing blog posts: &lt;a href=&#34;http://jalammar.github.io/illustrated-transformer/&#34;&gt;The Illustrated Transformer&lt;/a&gt; and &lt;a href=&#34;http://nlp.seas.harvard.edu/2018/04/03/attention.html&#34;&gt;The Annotated Transformer&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Also, this blog isn&amp;rsquo;t the first to link GNNs and Transformers:
Here&amp;rsquo;s &lt;a href=&#34;https://ipam.wistia.com/medias/1zgl4lq6nh&#34;&gt;an excellent talk&lt;/a&gt; by Arthur Szlam on the history and connection between Attention/Memory Networks, GNNs and Transformers.
Similarly, DeepMind&amp;rsquo;s &lt;a href=&#34;https://arxiv.org/abs/1806.01261&#34;&gt;star-studded position paper&lt;/a&gt; introduces the &lt;em&gt;Graph Networks&lt;/em&gt; framework, unifying all these ideas.
For a code walkthrough, the DGL team has &lt;a href=&#34;https://docs.dgl.ai/en/latest/tutorials/models/4_old_wines/7_transformer.html&#34;&gt;a nice tutorial&lt;/a&gt; on seq2seq as a graph problem and building Transformers as GNNs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;In our next post, we&amp;rsquo;ll be doing the reverse: using GNN architectures as Transformers for NLP (based on the Transformers library by &lt;a href=&#34;https://github.com/huggingface/transformers&#34;&gt;🤗 HuggingFace&lt;/a&gt;).&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Finally, we wrote &lt;a href=&#34;https://graphdeeplearning.github.io/publication/xu-2019-multi/&#34;&gt;a recent paper&lt;/a&gt; applying Transformers to sketch graphs. Do check it out!&lt;/p&gt;
&lt;hr&gt;
&lt;h4 id=&#34;updates&#34;&gt;Updates&lt;/h4&gt;
&lt;p&gt;The post is also available on &lt;a href=&#34;https://medium.com/@chaitjo/transformers-are-graph-neural-networks-bca9f75412aa?source=friends_link&amp;amp;sk=c54de873b2cec3db70166a6cf0b41d3e&#34;&gt;Medium&lt;/a&gt;, and has been translated to &lt;a href=&#34;https://mp.weixin.qq.com/s/DABEcNf1hHahlZFMttiT2g&#34;&gt;Chinese&lt;/a&gt; and &lt;a href=&#34;https://habr.com/ru/post/491576/&#34;&gt;Russian&lt;/a&gt;.
Do join the discussion on &lt;a href=&#34;https://twitter.com/chaitjo/status/1233220586358181888?s=20&#34;&gt;Twitter&lt;/a&gt;, &lt;a href=&#34;https://www.reddit.com/r/MachineLearning/comments/fb86mo/d_transformers_are_graph_neural_networks_blog/&#34;&gt;Reddit&lt;/a&gt; or &lt;a href=&#34;https://news.ycombinator.com/item?id=22518263&#34;&gt;HackerNews&lt;/a&gt;!&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Transformers are a special case of Graph Neural Networks. This may be obvious to some, but the following blog post does a good job at explaining these important concepts. &lt;a href=&#34;https://t.co/H8LT2F7LqC&#34;&gt;https://t.co/H8LT2F7LqC&lt;/a&gt;&lt;/p&gt;&amp;mdash; Oriol Vinyals (@OriolVinyalsML) &lt;a href=&#34;https://twitter.com/OriolVinyalsML/status/1233783593626951681?ref_src=twsrc%5Etfw&#34;&gt;February 29, 2020&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;
</description>
    </item>
    
    <item>
      <title>Free-hand Sketches</title>
      <link>https://graphdeeplearning.github.io/project/sketches/</link>
      <pubDate>Tue, 14 Jan 2020 16:19:27 +0800</pubDate>
      <guid>https://graphdeeplearning.github.io/project/sketches/</guid>
      <description>&lt;h2 id=&#34;representation-learning-for-sketches&#34;&gt;Representation Learning for Sketches&lt;/h2&gt;
&lt;p&gt;Human beings have been creating free-hand sketches, &lt;em&gt;i.e.&lt;/em&gt;, drawings without precise instruments, since &lt;a href=&#34;https://en.wikipedia.org/wiki/Cave_painting&#34;&gt;time immemorial&lt;/a&gt;.
Due to the popularity of touchscreen interfaces, machine learning using sketches has emerged as an interesting problem with a myriad of applications:
If we consider sketches as 2D images, we can throw them into off-the-shelf &lt;a href=&#34;https://arxiv.org/abs/1501.07873&#34;&gt;Convolutional Neural Networks (CNNs)&lt;/a&gt;.
While CNNs are designed for &lt;em&gt;static&lt;/em&gt; collections of pixels with &lt;em&gt;dense&lt;/em&gt; colors and textures,
sketches are usually an extremely &lt;em&gt;sparse&lt;/em&gt; sequences of strokes which capture high-level abstractions and ideas. &lt;a href=&#34;https://ai.googleblog.com/2017/04/teaching-machines-to-draw.html&#34;&gt;Recurrent Neural Networks (RNNs)&lt;/a&gt; stick out as a natural architecture for capturing this temporal nature of sketches.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Structure vs. temporal order: can we have the best of both worlds?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id=&#34;sketches-as-graphs&#34;&gt;Sketches as Graphs&lt;/h2&gt;
&lt;p&gt;We are working on a novel representation of free-hand sketches as &lt;strong&gt;sparsely-connected graphs&lt;/strong&gt;.
We assume that sketches are sets of curves and strokes, which are discretized by a set of points representing the graph nodes.
Each node encodes spatial, temporal and semantic information.
Thus, representing sketches with graphs offers a universal representation that can make use of both the sketch structure (like images) as well as temporal information (like stroke sequences).
To exploit these graph structures, we are developing &lt;strong&gt;Graph Neural Networks (GNNs)&lt;/strong&gt; based on the Transformer model &lt;a href=&#34;https://arxiv.org/abs/1706.03762&#34;&gt;[&lt;em&gt;Vaswani et al.&lt;/em&gt;, 2017]&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Multi-Graph Transformer for Free-Hand Sketch Recognition</title>
      <link>https://graphdeeplearning.github.io/publication/xu-2019-multi/</link>
      <pubDate>Tue, 24 Dec 2019 00:00:00 +0000</pubDate>
      <guid>https://graphdeeplearning.github.io/publication/xu-2019-multi/</guid>
      <description></description>
    </item>
    
    <item>
      <title>A Two-Step Graph Convolutional Decoder for Molecule Generation</title>
      <link>https://graphdeeplearning.github.io/publication/bresson-2019-two/</link>
      <pubDate>Sun, 01 Dec 2019 00:00:00 +0000</pubDate>
      <guid>https://graphdeeplearning.github.io/publication/bresson-2019-two/</guid>
      <description></description>
    </item>
    
    <item>
      <title>On Learning Paradigms for the Travelling Salesman Problem</title>
      <link>https://graphdeeplearning.github.io/publication/joshi-2019-learning/</link>
      <pubDate>Sun, 01 Dec 2019 00:00:00 +0000</pubDate>
      <guid>https://graphdeeplearning.github.io/publication/joshi-2019-learning/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Graph Neural Networks for the Travelling Salesman Problem</title>
      <link>https://graphdeeplearning.github.io/talk/informs-oct2019/</link>
      <pubDate>Tue, 22 Oct 2019 00:00:00 +0000</pubDate>
      <guid>https://graphdeeplearning.github.io/talk/informs-oct2019/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Graph Convolutional Neural Networks for Molecule Generation</title>
      <link>https://graphdeeplearning.github.io/talk/ipam-sept2019/</link>
      <pubDate>Mon, 23 Sep 2019 00:00:00 +0000</pubDate>
      <guid>https://graphdeeplearning.github.io/talk/ipam-sept2019/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Combinatorial Optimization</title>
      <link>https://graphdeeplearning.github.io/project/combinatorial-optimization/</link>
      <pubDate>Tue, 17 Sep 2019 22:20:35 +0800</pubDate>
      <guid>https://graphdeeplearning.github.io/project/combinatorial-optimization/</guid>
      <description>&lt;h2 id=&#34;operations-research-and-combinatorial-problems&#34;&gt;Operations Research and Combinatorial Problems&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Operations_research&#34;&gt;Operations Research (OR)&lt;/a&gt;&lt;/strong&gt; started in the first world war as an initiative to use mathematics and computer science to assist military planners in their decisions.
Today, combinatorial optimization algorithms developed in the OR community form the backbone of the most important modern industries including transportation, logistics, scheduling, finance and supply chains.&lt;/p&gt;
&lt;p&gt;OR Problems are formulated as integer constrained optimization, &lt;em&gt;i.e.&lt;/em&gt;, with integral or binary variables (called decision variables).
While not all such problems are hard to solve (&lt;em&gt;e.g.&lt;/em&gt;, finding the shortest path between two locations), we concentrate on &lt;strong&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Combinatorial_optimization&#34;&gt;Combinatorial (NP-Hard) problems&lt;/a&gt;&lt;/strong&gt;.
NP-Hard problems are &lt;em&gt;impossible&lt;/em&gt; to solve optimally at large scales as exhaustively searching for their solutions is beyond the limits of modern computers.
The &lt;a href=&#34;https://en.wikipedia.org/wiki/Travelling_salesman_problem&#34;&gt;Travelling Salesman Problem (TSP)&lt;/a&gt; and the &lt;a href=&#34;https://en.wikipedia.org/wiki/Minimum_spanning_tree&#34;&gt;Minimum Spanning Tree Problem (MST)&lt;/a&gt; are two of the most popular examples for such problems defined using graphs.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;tsp-gif.gif&#34; alt=&#34;TSP GIF&#34;&gt;
&lt;em&gt;TSP asks the following question: Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city and returns to the origin city? Formally, given a graph, one needs to search the space of permutations to find an optimal sequence of nodes, called a tour, with minimal total edge weights (tour length).&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;neural-combinatorial-optimization&#34;&gt;Neural Combinatorial Optimization&lt;/h2&gt;
&lt;p&gt;Solvers and heuristic algorithms developed in the OR community are able to solve classical problems such as TSP with up to millions of variables.
However, designing powerful and robust optimization algorithms requires significant &lt;strong&gt;specialized knowledge&lt;/strong&gt; and years of &lt;strong&gt;trial-and-error&lt;/strong&gt;, especially for understudied but high-impact problems arising in &lt;a href=&#34;https://arxiv.org/abs/2003.11755&#34;&gt;scientific discovery&lt;/a&gt; or &lt;a href=&#34;https://arxiv.org/abs/1911.05289&#34;&gt;computer architecture&lt;/a&gt;.
The state-of-the-art TSP solver, &lt;a href=&#34;http://www.math.uwaterloo.ca/tsp/concorde.html&#34;&gt;Concorde&lt;/a&gt;, leverages &lt;a href=&#34;https://www.youtube.com/watch?v=q8nQTNvCrjE&#34;&gt;over 50 years of research&lt;/a&gt; on linear programming, cutting plane algorithms and branch-and-bound.&lt;/p&gt;
&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    At our lab, we&amp;rsquo;re working on &lt;strong&gt;automating&lt;/strong&gt; and &lt;strong&gt;augmenting&lt;/strong&gt; such expert intuition through Machine Learning [&lt;a href=&#34;https://arxiv.org/abs/1811.06128&#34;&gt;Bengio &lt;em&gt;et al.&lt;/em&gt;, 2018&lt;/a&gt;].
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Since most problems are highly structured, heuristics take the form of rules or policies to make sequential decisions, &lt;em&gt;e.g.&lt;/em&gt;, determine the TSP tour one city at a time.
Our research uses deep neural networks to parameterize these policies and train them directly from problem instances.
In particular, &lt;strong&gt;Graph Neural Networks&lt;/strong&gt; are the perfect fit for the task because they naturally operate on the graph structure of these problems.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;pipeline.png&#34; alt=&#34;End-to-end pipeline&#34;&gt;
&lt;em&gt;A generic five-stage pipeline for end-to-end learning of combinatorial problems on graphs&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;why-study-tsp-in-particular&#34;&gt;Why study TSP in particular?&lt;/h2&gt;
&lt;p&gt;(1) The problem has an amazing history of serving as an &lt;strong&gt;engine of discovery&lt;/strong&gt; for applied mathematics, with several &lt;a href=&#34;https://en.wikipedia.org/wiki/John_von_Neumann&#34;&gt;legendary&lt;/a&gt; &lt;a href=&#34;https://en.wikipedia.org/wiki/Richard_E._Bellman&#34;&gt;computer&lt;/a&gt; &lt;a href=&#34;https://en.wikipedia.org/wiki/George_Dantzig&#34;&gt;scientists&lt;/a&gt; and &lt;a href=&#34;https://en.wikipedia.org/wiki/Edsger_W._Dijkstra&#34;&gt;mathematicians&lt;/a&gt; having a crack at it. Here&amp;rsquo;s an &lt;a href=&#34;https://www.youtube.com/watch?v=q8nQTNvCrjE&#34;&gt;amazing talk&lt;/a&gt; by William Cook, the co-inventor of the current state-of-the-art Concorde TSP solver.&lt;/p&gt;
&lt;p&gt;(2) TSP has been the focus of intense research in the combinatorial optimization community. If you come up with a new solver, &lt;em&gt;e.g.&lt;/em&gt;, a learning-driven solver, you &lt;em&gt;need&lt;/em&gt; to benchmark it on TSP. TSP’s &lt;strong&gt;multi-scale nature&lt;/strong&gt; makes it a challenging graph task which requires reasoning about both local node neighborhoods as well as global graph structure.&lt;/p&gt;
&lt;p&gt;(3) Learning-based approaches for heuristic algorithms have the potential to be a breakthrough for OR if they are able to  learn efficiently on small scale problems and then generalize robustly to larger instances.
However, such &lt;em&gt;scale-invariant&lt;/em&gt; generalization is an exciting and unsolved challenge, not just for TSP, but for machine learning as a whole.
&lt;strong&gt;Update:&lt;/strong&gt; We explore this in our &lt;a href=&#34;https://arxiv.org/abs/2006.07054&#34;&gt;latest paper&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://imgs.xkcd.com/comics/travelling_salesman_problem.png&#34; alt=&#34;XKCD:TSP&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;At the same time, the more &lt;em&gt;profound&lt;/em&gt; motivation of using deep learning for combinatorial optimization is not to outperform classical approaches on well-studied problems.
Neural networks can be used as a general tool for tackling previously &lt;em&gt;un-encountered&lt;/em&gt; NP-hard problems, especially those that are &lt;strong&gt;non-trivial to design heuristics for&lt;/strong&gt; [&lt;a href=&#34;https://arxiv.org/abs/1611.09940&#34;&gt;Bello &lt;em&gt;et al.&lt;/em&gt;, 2016&lt;/a&gt;].
We are excited about recent applications of neural combinatorial optimization for &lt;a href=&#34;https://deepmind.com/blog/article/AlphaFold-Using-AI-for-scientific-discovery&#34;&gt;accelarating drug discovery&lt;/a&gt;, &lt;a href=&#34;https://arxiv.org/abs/1910.01578&#34;&gt;optimizing operating systems&lt;/a&gt; and &lt;a href=&#34;https://arxiv.org/abs/2004.10746&#34;&gt;designing computer chips&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;P.S.&lt;/strong&gt; XB is organizing an exciting workshop at IPAM titled &lt;a href=&#34;http://www.ipam.ucla.edu/programs/workshops/deep-learning-and-combinatorial-optimization/&#34;&gt;&amp;ldquo;Deep Learning and Combinatorial Optimization&amp;rdquo;&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Quantum Chemistry</title>
      <link>https://graphdeeplearning.github.io/project/chemistry/</link>
      <pubDate>Tue, 17 Sep 2019 22:20:35 +0800</pubDate>
      <guid>https://graphdeeplearning.github.io/project/chemistry/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Spatial Graph ConvNets</title>
      <link>https://graphdeeplearning.github.io/project/spatial-convnets/</link>
      <pubDate>Tue, 17 Sep 2019 22:20:35 +0800</pubDate>
      <guid>https://graphdeeplearning.github.io/project/spatial-convnets/</guid>
      <description>&lt;h2 id=&#34;non-euclidean-and-graph-structured-data&#34;&gt;Non-Euclidean and Graph-structured Data&lt;/h2&gt;
&lt;p&gt;Classic deep learning architectures such as &lt;a href=&#34;http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf&#34;&gt;Convolutional Neural Networks (CNNs)&lt;/a&gt; and &lt;a href=&#34;https://www.bioinf.jku.at/publications/older/2604.pdf&#34;&gt;Recurrent Neural Networks (RNNs)&lt;/a&gt; require the input data domain to be regular, such as 2D or 3D Euclidean grids for Computer Vision and 1D lines for Natural Language Processing.&lt;/p&gt;
&lt;p&gt;However, real-world data beyond images and language tends to an underlying structure that is &lt;strong&gt;non-Euclidean&lt;/strong&gt;.
Such complex data commonly occurs in science and engineering, and can be modelled intuitively by &lt;strong&gt;heterogeneous graphs&lt;/strong&gt;.
Prominent examples include graphs of molecules, 3D meshes in computer graphics, social networks and biological networks.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;graph-data.png&#34; alt=&#34;Graph structured data&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;graph-neural-networks&#34;&gt;Graph Neural Networks&lt;/h2&gt;
&lt;p&gt;Obtaining insights from large and complex graph-structured datasets leads to an interesting challenge for machine learning architectures:
The popular CNN and RNN models need to be redesigned for handling non-Euclidean data, as they cannot leverage familiar regularities such as coordinate systems, vector space structure, or shift invariance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Graph/Geometric Deep Learning&lt;/strong&gt; is an umbrella term for emerging techniques attempting to generalize deep neural networks to non-Euclidean domains such as graphs and manifolds [&lt;a href=&#34;https://arxiv.org/abs/1611.08097&#34;&gt;Bronstein &lt;em&gt;et al.&lt;/em&gt;, 2017&lt;/a&gt;].&lt;/p&gt;
&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    We are interested to designing neural networks for arbitrary graphs in order to solve &lt;em&gt;generic&lt;/em&gt; graph problems, such as vertex classification, graph classification and graph generation.
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;These Graph Neural Network (GNN) architectures are used as backbones for challenging domain-specific applications in a myriad of domains, including &lt;a href=&#34;https://arxiv.org/abs/1704.01212&#34;&gt;chemistry&lt;/a&gt;, &lt;a href=&#34;https://arxiv.org/abs/1902.06673&#34;&gt;social networks&lt;/a&gt;, &lt;a href=&#34;https://arxiv.org/abs/1806.01973&#34;&gt;recommendations&lt;/a&gt; and &lt;a href=&#34;https://arxiv.org/abs/1611.08402&#34;&gt;computer graphics&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;basic-formalism&#34;&gt;Basic Formalism&lt;/h2&gt;
&lt;p&gt;Each GNN layer computes $d$-dimensional representations for the nodes/edges of the graph through recursive neighborhood diffusion (&lt;em&gt;a.k.a.&lt;/em&gt; message passing), where each graph node gathers features from its neighbors to represent local graph structure.
Stacking $L$ GNN layers allows the network to build node representations from the &lt;strong&gt;$L$-hop neighborhood&lt;/strong&gt; of each node.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;gnn-layer.png&#34; alt=&#34;GNN Layer&#34;&gt;&lt;/p&gt;
&lt;p&gt;Let $h_i^{\ell}$ denote the feature vector at layer $\ell$ associated with node $i$.
The updated features $h_i^{\ell+1}$ at the next layer $\ell+1$ are obtained by applying non-linear transformations to the central feature vector $h_i^{\ell}$ and the feature vectors $h_{j}^{\ell}$ for all nodes $j$ in the neighborhood of node $i$ (defined by the graph structure).
This guarantees the transformation to build local reception fields, such as in standard ConvNets for computer vision, and be invariant to both graph size and vertex re-indexing.&lt;/p&gt;
&lt;p&gt;Thus, the most generic version of a feature vector $h_i^{\ell+1}$ at vertex $i$ at the next layer in the GNN is:
\begin{equation}
h_{i}^{\ell+1} =  f \left( \ h_i^{\ell} \  , \ { h_{j}^{\ell}: j \rightarrow i }  \ \right) ,
\end{equation}
where ${ j \rightarrow i }$ denotes the set of neighboring nodes $j$ pointed to node $i$, which can be replaced by ${ j \in \mathcal{N}_i }$, the set of neighbors of node $i$, if the graph is undirected.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;classes-of-gnn-architectures&#34;&gt;Classes of GNN Architectures&lt;/h2&gt;
&lt;p&gt;In other words, a GNN is defined by a mapping $f$ taking as input a vector $h_i^{\ell}$&amp;ndash;the feature vector of the center vertex&amp;ndash;as well as an un-ordered set of vectors {${ h_{j}^{\ell}}$}&amp;ndash;the feature vectors of all neighboring vertices.
The arbitrary choice of the mapping $f$ defines an instantiation of a class of GNNs, &lt;i&gt;e.g.&lt;/i&gt;, &lt;a href=&#34;https://arxiv.org/abs/1609.02907&#34;&gt;GCN&lt;/a&gt;, &lt;a href=&#34;https://arxiv.org/abs/1706.02216&#34;&gt;GraphSage&lt;/a&gt;, &lt;a href=&#34;https://arxiv.org/abs/1810.00826&#34;&gt;GIN&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As an illustration, here&amp;rsquo;s a simple-yet-effective Graph ConvNet from &lt;a href=&#34;https://arxiv.org/abs/1605.07736&#34;&gt;Sukhbaatar &lt;em&gt;et al.&lt;/em&gt;, 2016&lt;/a&gt;:
\begin{equation}
h_{i}^{\ell+1} = \text{ReLU} \Big( U^{\ell} h_{i}^{\ell} + \sum_{j \in \mathcal{N}_i} V^{\ell} h_{j}^{\ell} \Big),
\end{equation}
where $U^{\ell}, V^{\ell} \in \mathbb{R}^{d \times d}$ are the learnable parameters.&lt;/p&gt;
&lt;p&gt;In a &lt;a href=&#34;https://graphdeeplearning.github.io/publication/dwivedi-2020-benchmark/&#34;&gt;recent paper&lt;/a&gt; on benchmarking GNN architectures, we introduced &lt;strong&gt;block diagrams&lt;/strong&gt; to intuitively describe feature update equations such as the one above:&lt;/p&gt;


&lt;figure&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;gcn-block.png&#34; &gt;
&lt;img data-src=&#34;gcn-block.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;45%&#34; &gt;&lt;/a&gt;


&lt;/figure&gt;

&lt;hr&gt;
&lt;h2 id=&#34;anisotropic-gnns&#34;&gt;Anisotropic GNNs&lt;/h2&gt;
&lt;p&gt;As graphs have no specific orientations (like up, down, left, right directions in images), message-passing layers such as Sukhbaatar&amp;rsquo;s Graph ConvNet are &lt;strong&gt;isotropic&lt;/strong&gt;, treating all neighbors as equally important.
However, this may not be true in general, &lt;em&gt;e.g.&lt;/em&gt;, in social network graphs, neighbors in the same community share different relationships and information compared to neighbors from separate communities.&lt;/p&gt;
&lt;p&gt;Isotropic GNNs can be upgraded to make the diffusion process &lt;strong&gt;anisotropic&lt;/strong&gt; through mechanisms which learn to weigh neighbors based on their relative importance.
For example, &lt;a href=&#34;https://arxiv.org/abs/1703.04826&#34;&gt;Marchegiani and Titov, 2017&lt;/a&gt; upgrade Graph ConvNets by introducing &lt;strong&gt;edge gating&lt;/strong&gt; for learning information flow on the graph structure for the task at hand:
\begin{equation}
h_{i}^{\ell+1} = \text{ReLU} \Big( U^{\ell} h_{i}^{\ell} + \sum_{j \in \mathcal{N}_i} \eta_{ij} \odot V^{\ell} h_{j}^{\ell} \Big),
\quad \text{where } \eta_{ij} = \sigma \big( A^{\ell} h_i^{\ell} + B^{\ell} h_j^{\ell} \big),
\end{equation}
$U^{\ell}, V^{\ell}, A^{\ell}, B^{\ell} \in \mathbb{R}^{d \times d}$ are the learnable parameters, $\sigma$ is the sigmoid function, $\odot$ is the element-wise product, and $\eta_{ij}$ act as edge gates.&lt;/p&gt;


&lt;figure&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;gated-gcn-block.png&#34; &gt;
&lt;img data-src=&#34;gated-gcn-block.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;60%&#34; &gt;&lt;/a&gt;


&lt;/figure&gt;

&lt;p&gt;Other prominent approaches to introduce anisotropy into GNNs include &lt;a href=&#34;https://arxiv.org/abs/1710.10903&#34;&gt;GAT&lt;/a&gt;, which uses the &lt;a href=&#34;https://arxiv.org/abs/1706.03762&#34;&gt;attention mechanism&lt;/a&gt; from NLP, as well as &lt;a href=&#34;https://arxiv.org/abs/1611.08402&#34;&gt;MoNet&lt;/a&gt;, which relies on gaussian mixture models of graph connectivity.
&lt;strong&gt;Through &lt;a href=&#34;(https://graphdeeplearning.github.io/publication/dwivedi-2020-benchmark/)&#34;&gt;our benchmark&lt;/a&gt;, we found anisotropic aggregation to be a key property of powerful GNNs.&lt;/strong&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Graph Convolutional Neural Networks for Molecule Generation and Travelling Salesman Problem</title>
      <link>https://graphdeeplearning.github.io/talk/ipam-may2019/</link>
      <pubDate>Tue, 21 May 2019 00:00:00 +0000</pubDate>
      <guid>https://graphdeeplearning.github.io/talk/ipam-may2019/</guid>
      <description></description>
    </item>
    
    <item>
      <title>An Efficient Graph Convolutional Network Technique for the Travelling Salesman Problem</title>
      <link>https://graphdeeplearning.github.io/publication/joshi-2019-efficient/</link>
      <pubDate>Tue, 01 Jan 2019 00:00:00 +0000</pubDate>
      <guid>https://graphdeeplearning.github.io/publication/joshi-2019-efficient/</guid>
      <description></description>
    </item>
    
    <item>
      <title>GraphTSNE: A Visualization Technique for Graph-Structured Data</title>
      <link>https://graphdeeplearning.github.io/publication/leow-2019-graphtsne/</link>
      <pubDate>Tue, 01 Jan 2019 00:00:00 +0000</pubDate>
      <guid>https://graphdeeplearning.github.io/publication/leow-2019-graphtsne/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Reading Group</title>
      <link>https://graphdeeplearning.github.io/reading-group/</link>
      <pubDate>Thu, 28 Jun 2018 00:00:00 +0000</pubDate>
      <guid>https://graphdeeplearning.github.io/reading-group/</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Being able to articulate and explain ideas is the true test of having learned something.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Starting 2020, the Graph Deep Learning Lab will host a weekly/bi-weekly paper discussion and reading group.
We&amp;rsquo;ll cover the latest and greatest papers from the Graph Neural Networks community as well as general machine learning.
We plan to keep things simple: talk about the key contributions and results, followed by discussions; no need for beautiful slides or long talks.&lt;/p&gt;
&lt;p&gt;The primary aim of this reading group is to share ideas, blabber about your favorite papers, and get to know others with common interests.
Its also an opportunity for lab members to keep each other up-to-date and get early feedback on their projects.&lt;/p&gt;
&lt;p&gt;We may have snacks/coffee! ;)&lt;/p&gt;
&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    &lt;h3 id=&#34;details&#34;&gt;Details&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;When&lt;/strong&gt;: Weekly/bi-weekly, &lt;a href=&#34;mailto:chaitanya-joshi@ntu.edu.sg?subject=%5BGDL%20Reading%20Group%5D&#34;&gt;join the mailing list&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Where&lt;/strong&gt;: MICL Lab Meeting Room, Block N4, N4-B1C-17, SCSE, NTU&lt;br&gt;
&lt;strong&gt;Duration&lt;/strong&gt;: Usually 30-45 minutes&lt;br&gt;
&lt;strong&gt;Contact&lt;/strong&gt;: &lt;a href=&#34;https://graphdeeplearning.github.io/authors/chaitanya-joshi/&#34;&gt;Chaitanya Joshi&lt;/a&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;(This page will be regularly updated with content from past sessions.)&lt;/em&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Convolutional Neural Networks on Graphs</title>
      <link>https://graphdeeplearning.github.io/talk/ipam-feb2018/</link>
      <pubDate>Wed, 07 Feb 2018 00:00:00 +0000</pubDate>
      <guid>https://graphdeeplearning.github.io/talk/ipam-feb2018/</guid>
      <description></description>
    </item>
    
    <item>
      <title>An Experimental Comparison of Text Classification Techniques</title>
      <link>https://graphdeeplearning.github.io/publication/lakhotia-2018-experimental/</link>
      <pubDate>Mon, 01 Jan 2018 00:00:00 +0000</pubDate>
      <guid>https://graphdeeplearning.github.io/publication/lakhotia-2018-experimental/</guid>
      <description></description>
    </item>
    
    <item>
      <title>An Experimental Study of Neural Networks for Variable Graphs</title>
      <link>https://graphdeeplearning.github.io/publication/bresson-2018-experimental/</link>
      <pubDate>Mon, 01 Jan 2018 00:00:00 +0000</pubDate>
      <guid>https://graphdeeplearning.github.io/publication/bresson-2018-experimental/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Residual Gated Graph ConvNets</title>
      <link>https://graphdeeplearning.github.io/publication/bresson-2017-residual/</link>
      <pubDate>Sun, 01 Jan 2017 00:00:00 +0000</pubDate>
      <guid>https://graphdeeplearning.github.io/publication/bresson-2017-residual/</guid>
      <description></description>
    </item>
    
  </channel>
</rss>