The article seems to be a bit light on details for an "overview" of GNNs.
Has anyone else, on first being introduced to neural networks some decades ago, noticed the connected layers and wondered whether a more generalized topology such as a directed graph could be applied instead, then figured they couldn't possibly be the first to notice, concluded that the layered topology must therefore have some mathematical superiority over the generalized form, and never found a concrete answer as to why?
There's been some work in this area. In practice, finding a good topology is hard, and graph or sparse matrix operations require a pretty significant level of sparsity before they're more efficient than just including every weight in a dense matrix and setting some parameters to zero.
Any DAG you choose is equivalent to a network with some number of dense layers with some of the weights zeroed, so you aren't losing any modeling capability by sticking with dense networks. The current trend of massively overparameterizing networks and training them for a long time in the zero-error regime (the so-called "double descent" behavior) exploits this idea a little. With sufficient regularization, you bias the network toward a _simple_ explanation -- one where most weights are near zero, effectively the same as having chosen an optimal topology from the beginning.
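To make the "dense layers with zeroed weights" point concrete, here's a minimal sketch (the tiny DAG, sizes, and weights are all invented for illustration): a sparse DAG is just a masked dense network.

```python
import numpy as np

# Hypothetical toy DAG: inputs n0, n1; hidden nodes n2, n3; output n4.
# Its only edges are n0->n2, n1->n3, n2->n4, n3->n4, so it is exactly a
# dense 2-2-1 MLP with two first-layer weights pinned to zero.
rng = np.random.default_rng(0)

W1 = rng.normal(size=(2, 2))        # dense (n0, n1) -> (n2, n3)
mask1 = np.array([[1.0, 0.0],       # n0 feeds n2 only
                  [0.0, 1.0]])      # n1 feeds n3 only
W2 = rng.normal(size=(2, 1))        # dense (n2, n3) -> n4

def forward(x):
    h = np.maximum(x @ (W1 * mask1), 0.0)  # masked entries = edges the DAG lacks
    return h @ W2

print(forward(np.array([[0.5, -1.2]])))
```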
If you're talking about cyclic directed graphs, those are implemented in places too, but they're extremely finicky to get right. You start having to worry about a time component to any signal propagation, about feedback loops and unbounded signals, about models that are harder to get to converge during training, and so on. Afaik there isn't a solid theoretical reason why you might want to add cycles, since the layered approach can already handle arbitrary problems (not that we shouldn't keep investigating -- I'm sure some people know more than me on the topic, and there's no definitive proof that cycles are always worse, so it may well be worth exploring from a practical point of view too).
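For the cyclic case, here is a rough sketch of why a time component appears (the two-node graph, weights, and input are made up and don't correspond to any particular architecture): with an edge looping back, a node's value at step t depends on values at step t-1, so you have to iterate and hope the signal stays bounded.

```python
import numpy as np

# Two nodes with a cycle: n0 -> n1 and n1 -> n0. Weights are arbitrary.
W = np.array([[0.0, 0.6],   # W[0, 1]: n0 -> n1
              [0.3, 0.0]])  # W[1, 0]: n1 -> n0 (the feedback edge)
x = np.array([1.0, 0.0])    # external input injected at n0 every step

state = np.zeros(2)
for t in range(10):
    state = np.tanh(x + state @ W)  # tanh keeps the circulating signal bounded
    print(t, state)
```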
> Isn't it just that backpropagation on the layered topology is relatively straightforward? That's not to say you can't write a backpropagation on an arbitrary digraph...
Moreover, any acyclic digraph can be expressed as a layered topology (possibly with a lot of 0-weights). Since there's no fundamental difference, you might as well work with whatever's easiest to compute with.
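One way to see the "layered topology with a lot of 0-weights" claim for an acyclic digraph is to assign each node a layer equal to its longest distance from a source node. A rough sketch with a made-up toy graph (note the relaxation below would not terminate on a graph with cycles):

```python
from collections import defaultdict

# Toy DAG, invented for illustration.
edges = [("a", "c"), ("b", "c"), ("a", "d"), ("c", "d"), ("d", "e")]

layer = defaultdict(int)    # layer index = longest path from any source node
changed = True
while changed:              # simple relaxation; fine for a small acyclic graph
    changed = False
    for u, v in edges:
        if layer[v] < layer[u] + 1:
            layer[v] = layer[u] + 1
            changed = True

# Edges that skip layers become pass-through units / zero weights in the
# equivalent layered network.
print(sorted(layer.items()))  # [('a', 0), ('b', 0), ('c', 1), ('d', 2), ('e', 3)]
```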
I was, and still am, intrigued by the idea of Boltzmann machines, which form a complete graph between a number of "neurons". With the right weights they could be shaped into any architecture. They could be shaped into a fancy recurrent network, or a multi-layer linear network, or anything in between. Indeed, with the right training algorithm the computer could learn how to structure the "neurons" it is given. I don't think we know any such training algorithm though.
They're also not very efficient, because every "feed forward" step would have to use an N*N matrix (where N is the number of neurons), which is the worst case. Maybe with sparse matrices it could be reasonably efficient if most weights were zero, but I don't think sparse matrices are used much in machine learning currently.
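To make that N*N cost concrete, here is a minimal sketch of one Gibbs-style update sweep over a fully connected Boltzmann machine (the size, weights, and states are all made up; no biases, no training):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8                                   # number of binary units
W = rng.normal(scale=0.1, size=(N, N))
W = (W + W.T) / 2                       # symmetric weights (complete graph)
np.fill_diagonal(W, 0.0)                # no self-connections
s = rng.integers(0, 2, size=N).astype(float)

def gibbs_sweep(s):
    for i in range(N):                          # each unit looks at all N others,
        p = 1.0 / (1.0 + np.exp(-(W[i] @ s)))   # so one sweep is ~N*N multiplies
        s[i] = float(rng.random() < p)
    return s

print(gibbs_sweep(s))
```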
These are my thoughts as a machine learning novice.
The vanishing gradient problem makes this hard. ResNets and LSTMs use additional connections to help with this (and, ironically, that makes them more graph-like).
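A minimal sketch of the kind of "additional connection" meant here, assuming a plain fully connected residual block (the layer sizes are arbitrary): the input is added back to the output, giving gradients a short path around the transformation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.body(x)  # the skip connection: identity plus residual

x = torch.randn(4, 32)
print(ResidualBlock(32)(x).shape)  # torch.Size([4, 32])
```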
It should be noted that the graph-embedding tasks described are only a small subset of the tasks that GNNs solve. Many (if not most) graph learning techniques focus on more "local" tasks like node classification or edge prediction.
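For those "local" tasks, a rough sketch of a single GCN-style propagation step (the toy graph, feature size, and weights are invented; a real model would stack a few of these and put a classifier on top):

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[0, 1, 1, 0],            # adjacency matrix of a 4-node toy graph
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)                   # add self-loops
D_inv = np.diag(1.0 / A_hat.sum(1))     # row-normalize by degree
X = rng.normal(size=(4, 8))             # node features
W = rng.normal(size=(8, 8))             # learned weights (random stand-ins here)

# Each node averages its neighbours' features, then mixes them through W.
H = np.maximum(D_inv @ A_hat @ X @ W, 0.0)
print(H.shape)  # (4, 8): per-node embeddings, ready for a classification head
```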
it's not conference season
Weird ^^
Imagine that I'm a scientist and that I made a big discovery X during the winter.
But for audience/visibility reasons I only want to publish my results at conference Y during the summer.
Somebody, right before summer, makes the same discovery as mine.
How do I protect the fact that I was the first to discover it if I publish after the second person?
> Also were there any new real SOTA on any NLP tasks since last summer? I feel like accuracy progress has frozen..
Just in the last week there have been two papers claiming SOTA on different tasks.
Microsoft released Turing-NLG (https://www.microsoft.com/en-us/research/blog/turing-nlg-a-1...) recently, which claims SOTA on a couple of tasks. It seems like the same transformer architecture, but with more layers and parameters, made feasible by training-efficiency improvements. The biggest one seems to be partitioning the model's training state across different processes instead of replicating it in each, which significantly reduces communication and memory overhead and improves training speed.
DeepMind released the Compressive Transformer (https://deepmind.com/blog/article/A_new_model_and_dataset_fo...), which claims SOTA on two other "long-range" benchmarks. My understanding of the improvement here is that instead of discarding older states, as a traditional attention layer would, the Compressive Transformer compresses them and learns which information to keep and which to throw away.
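A loose, hypothetical illustration of that compressive-memory idea (not DeepMind's code; shapes and the compression rate are made up): instead of dropping the oldest attention states, squeeze them into a smaller buffer, e.g. by pooling, and keep attending over both the recent and compressed memories.

```python
import torch
import torch.nn.functional as F

old_memories = torch.randn(1, 64, 512)  # (batch, 64 oldest timesteps, hidden)
rate = 4                                 # compression rate (made up)

# Compress 64 old states down to 16 by average pooling along the time axis;
# a real compression function might be learned, this is just the simplest stand-in.
compressed = F.avg_pool1d(old_memories.transpose(1, 2), kernel_size=rate).transpose(1, 2)
print(compressed.shape)  # torch.Size([1, 16, 512])
```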
I think these are two good examples of paper archetypes -- one where SOTA is achieved through more layers/training data/neurons (and the more interesting contribution is the improvements to model parallelism in training), and one where SOTA is achieved through a new/improved model architecture.
I wonder how useful either paper is for most industrial practitioners, though. The Microsoft paper helps for training billion-parameter models, but most won't train a model that large; the DeepMind paper helps for training models over very long sequences, but most people aren't using book-length sequences.*
* I remember reading somewhere that attention mechanisms tend to only remember around 5 states (would love to see a source or study on this), which is pretty short, so it would be interesting to try this model and see if the compressive transformer/attention mechanism can remember longer sequences.
Yes, true. There has been progress on NLP SOTA, just not from graph NN architectures. I have thought about using graph NNs where I am currently using RNNs, but they seem fairly immature / difficult to train over very large datasets, so I have not made much progress.
On the contrary, I've found Graph NNs extraordinarily useful and very easy to train!
I'm not sure what kind of problems you are trying to use them for, but generally they are really useful for node classification or recommendation type tasks.
So for example in the knowledge graph context, you can do things like give it a tiger, jaguar and panther and it will find nodes like lions and leopards.
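A hypothetical sketch of what that kind of query often reduces to once you have node embeddings: nearest-neighbour search around the query nodes. The embeddings below are random stand-ins, not a trained knowledge graph, so the ranking is meaningless; it only shows the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
nodes = ["tiger", "jaguar", "panther", "lion", "leopard", "bicycle"]
emb = rng.normal(size=(len(nodes), 16))            # stand-in node embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows

query = emb[:3].mean(axis=0)    # average of tiger / jaguar / panther
scores = emb @ query            # cosine-style similarity to every node
for i in np.argsort(-scores)[:4]:
    print(nodes[i], round(float(scores[i]), 3))
```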
I spent some time experimenting with https://github.com/facebookresearch/PyTorch-BigGraph and a few other libraries, but I ran into some challenges given that my dataset was very large (O(100M) edges) and very sparse, and didn't pursue it further.
The papers you linked are very interesting, I will have to dig further! One more recent writeup: https://eng.uber.com/uber-eats-graph-learning/ -- a real production use case that seems promising to explore further.
Graph neural networks are currently used a lot in the drug discovery space. They significantly beat equivalent RNN and CNN baselines (and more complex variants) on the same datasets (Tox21, QM9, etc.).
It's an area I've recently been researching and they do seem to be gaining a significant amount of traction. If anyone is interested in additional reading material, I can suggest the very recent GNNs: Models and Applications (slide deck available on the website) [0].
There is also a fairly comprehensive GitHub repo [1], though I personally haven't given it a detailed look yet.
[0] http://cse.msu.edu/~mayao4/tutorials/aaai2020/
[1] https://github.com/benedekrozemberczki/awesome-graph-classif...