The article seems to be a bit light on details for an "overview" of GNNs.
Has anyone else, on first being introduced to neural networks some decades ago, noticed the connected layers and wondered whether a more generalized topology such as a directed graph could be applied instead, then figured they couldn't possibly be the first to notice, concluded that the layered topology must therefore have some mathematical superiority over the generalized form, and never found a concrete answer as to why?
There's been some work in this area. In practice, finding a good topology is hard, and graph or sparse matrix operations require a pretty significant level of sparsity before they're more efficient than just including every weight in a dense matrix and setting some parameters to zero.
Any DAG you choose is equivalent to a network with some number of dense layers with some of the weights zeroed, so you aren't losing any modeling capability by sticking with dense networks. The current trend of massively overparameterizing networks and training them for a long time in the zero-error regime (the so-called "double descent" behavior) exploits this idea a little. With sufficient regularization, you bias the network toward a _simple_ explanation -- one where most weights are near zero, effectively the same as having chosen an optimal topology from the beginning.
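To make the "dense layers with zeroed weights" point concrete, here's a minimal sketch (the tiny DAG, sizes, and weights are all invented for illustration): a sparse DAG is just a masked dense network.

```python
import numpy as np

# Hypothetical toy DAG: inputs n0, n1; hidden nodes n2, n3; output n4.
# Its only edges are n0->n2, n1->n3, n2->n4, n3->n4, so it is exactly a
# dense 2-2-1 MLP with two first-layer weights pinned to zero.
rng = np.random.default_rng(0)

W1 = rng.normal(size=(2, 2))        # dense (n0, n1) -> (n2, n3)
mask1 = np.array([[1.0, 0.0],       # n0 feeds n2 only
                  [0.0, 1.0]])      # n1 feeds n3 only
W2 = rng.normal(size=(2, 1))        # dense (n2, n3) -> n4

def forward(x):
    h = np.maximum(x @ (W1 * mask1), 0.0)  # masked entries = edges the DAG lacks
    return h @ W2

print(forward(np.array([[0.5, -1.2]])))
```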
If you're talking about cyclic directed graphs, those are implemented in places too, but they're extremely finicky to get right. You start having to worry about a time component to any signal propagation, about feedback loops and unbounded signals, about models that are harder to get to converge during training, and so on. Afaik there isn't a solid theoretical reason why you might want to add cycles, since the layered approach can already handle arbitrary problems (not that we shouldn't keep investigating -- I'm sure some people know more than me on the topic, and there's no definitive proof that cycles are always worse, so it may well be worth exploring from a practical point of view too).
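For the cyclic case, here is a rough sketch of why a time component appears (the two-node graph, weights, and input are made up and don't correspond to any particular architecture): with an edge looping back, a node's value at step t depends on values at step t-1, so you have to iterate and hope the signal stays bounded.

```python
import numpy as np

# Two nodes with a cycle: n0 -> n1 and n1 -> n0. Weights are arbitrary.
W = np.array([[0.0, 0.6],   # W[0, 1]: n0 -> n1
              [0.3, 0.0]])  # W[1, 0]: n1 -> n0 (the feedback edge)
x = np.array([1.0, 0.0])    # external input injected at n0 every step

state = np.zeros(2)
for t in range(10):
    state = np.tanh(x + state @ W)  # tanh keeps the circulating signal bounded
    print(t, state)
```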
> Isn't it just that backpropagation on the layered topology is relatively straightforward? That's not to say you can't write a backpropagation on an arbitrary digraph...
Moreover, any acyclic digraph can be expressed as a layered topology (possibly with a lot of 0-weights). Since there's no fundamental difference, you might as well work with whatever's easiest to compute with.
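One way to see the "layered topology with a lot of 0-weights" claim for an acyclic digraph is to assign each node a layer equal to its longest distance from a source node. A rough sketch with a made-up toy graph (note the relaxation below would not terminate on a graph with cycles):

```python
from collections import defaultdict

# Toy DAG, invented for illustration.
edges = [("a", "c"), ("b", "c"), ("a", "d"), ("c", "d"), ("d", "e")]

layer = defaultdict(int)    # layer index = longest path from any source node
changed = True
while changed:              # simple relaxation; fine for a small acyclic graph
    changed = False
    for u, v in edges:
        if layer[v] < layer[u] + 1:
            layer[v] = layer[u] + 1
            changed = True

# Edges that skip layers become pass-through units / zero weights in the
# equivalent layered network.
print(sorted(layer.items()))  # [('a', 0), ('b', 0), ('c', 1), ('d', 2), ('e', 3)]
```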
I was, and still am, intrigued by the idea of Boltzmann machines, which form a complete graph between a number of "neurons". With the right weights they could be shaped into any architecture. They could be shaped into a fancy recurrent network, or a multi-layer linear network, or anything in between. Indeed, with the right training algorithm the computer could learn how to structure the "neurons" it is given. I don't think we know any such training algorithm though.
They're also not very efficient, because every "feed forward" step would have to use an N*N matrix (where N is the number of neurons), which is the worst case. Maybe with sparse matrices it could be reasonably efficient if most weights were zero, but I don't think sparse matrices are used much in machine learning currently.
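To make that N*N cost concrete, here is a minimal sketch of one Gibbs-style update sweep over a fully connected Boltzmann machine (the size, weights, and states are all made up; no biases, no training):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8                                   # number of binary units
W = rng.normal(scale=0.1, size=(N, N))
W = (W + W.T) / 2                       # symmetric weights (complete graph)
np.fill_diagonal(W, 0.0)                # no self-connections
s = rng.integers(0, 2, size=N).astype(float)

def gibbs_sweep(s):
    for i in range(N):                          # each unit looks at all N others,
        p = 1.0 / (1.0 + np.exp(-(W[i] @ s)))   # so one sweep is ~N*N multiplies
        s[i] = float(rng.random() < p)
    return s

print(gibbs_sweep(s))
```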
These are my thoughts as a machine learning novice.
The vanishing gradient problem makes this hard. ResNets and LSTMs use additional connections to help with this (and, ironically, that makes them more graph-like).
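A minimal sketch of the kind of "additional connection" meant here, assuming a plain fully connected residual block (the layer sizes are arbitrary): the input is added back to the output, giving gradients a short path around the transformation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.body(x)  # the skip connection: identity plus residual

x = torch.randn(4, 32)
print(ResidualBlock(32)(x).shape)  # torch.Size([4, 32])
```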
It should be noted that the graph-embedding tasks described are only a small subset of the tasks that GNNs solve. Many (if not most) graph learning techniques focus on more "local" tasks like node classification or edge prediction.
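For those "local" tasks, a rough sketch of a single GCN-style propagation step (the toy graph, feature size, and weights are invented; a real model would stack a few of these and put a classifier on top):

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[0, 1, 1, 0],            # adjacency matrix of a 4-node toy graph
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)                   # add self-loops
D_inv = np.diag(1.0 / A_hat.sum(1))     # row-normalize by degree
X = rng.normal(size=(4, 8))             # node features
W = rng.normal(size=(8, 8))             # learned weights (random stand-ins here)

# Each node averages its neighbours' features, then mixes them through W.
H = np.maximum(D_inv @ A_hat @ X @ W, 0.0)
print(H.shape)  # (4, 8): per-node embeddings, ready for a classification head
```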
it's not conference season
Weird ^^
Imagine that I'm a scientist and that I made a big discovery X during the winter.
But for audience/visibility reasons I only want to publish my results at conference Y during the summer.
Somebody, right before summer, makes the same discovery as mine.
How do I protect the fact that I was the first to discover it if I publish after the second person?
> Also were there any new real SOTA on any NLP tasks since last summer? I feel like accuracy progress has frozen..
Just in the last week there have been two papers claiming SOTA on different tasks.
Microsoft released Turing-NLG (https://www.microsoft.com/en-us/research/blog/turing-nlg-a-1...) recently, which claims SOTA on a couple of tasks. It seems like the same transformer architecture, but with more layers and parameters, made feasible by training-efficiency improvements. The biggest one seems to be partitioning the model's training state across different processes instead of replicating it in each, which significantly reduces communication and memory overhead and improves training speed.
DeepMind released the Compressive Transformer (https://deepmind.com/blog/article/A_new_model_and_dataset_fo...), which claims SOTA on two other "long-range" benchmarks. My understanding of the improvement here is that instead of discarding older states, as a traditional attention layer would, the Compressive Transformer compresses them and learns which information to keep and which to throw away.
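A loose, hypothetical illustration of that compressive-memory idea (not DeepMind's code; shapes and the compression rate are made up): instead of dropping the oldest attention states, squeeze them into a smaller buffer, e.g. by pooling, and keep attending over both the recent and compressed memories.

```python
import torch
import torch.nn.functional as F

old_memories = torch.randn(1, 64, 512)  # (batch, 64 oldest timesteps, hidden)
rate = 4                                 # compression rate (made up)

# Compress 64 old states down to 16 by average pooling along the time axis;
# a real compression function might be learned, this is just the simplest stand-in.
compressed = F.avg_pool1d(old_memories.transpose(1, 2), kernel_size=rate).transpose(1, 2)
print(compressed.shape)  # torch.Size([1, 16, 512])
```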
I think these are two good examples of paper archetypes -- one where SOTA is achieved through more layers/training data/neurons (and the more interesting contribution is the improvements to model parallelism in training), and one where SOTA is achieved through a new/improved model architecture.
I wonder how useful either paper is for most industrial practitioners, though. The Microsoft paper helps for training billion-parameter models, but most won't train a model that large; the DeepMind paper helps for training models over very long sequences, but most people aren't using book-length sequences.*
* I remember reading somewhere that attention mechanisms tend to only remember around 5 states (would love to see a source or study on this), which is pretty short, so it would be interesting to try this model and see if the compressive transformer/attention mechanism can remember longer sequences.
Yes, true. There has been progress on NLP SOTA, just not from graph NN architectures. I have thought about using graph NNs where I am currently using RNNs, but they seem fairly immature / difficult to train over very large datasets, so I have not made much progress.
On the contrary, I've found Graph NNs extraordinarily useful and very easy to train!
I'm not sure what kind of problems you are trying to use them for, but generally they are really useful for node classification or recommendation type tasks.
So for example in the knowledge graph context, you can do things like give it a tiger, jaguar and panther and it will find nodes like lions and leopards.
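A hypothetical sketch of what that kind of query often reduces to once you have node embeddings: nearest-neighbour search around the query nodes. The embeddings below are random stand-ins, not a trained knowledge graph, so the ranking is meaningless; it only shows the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
nodes = ["tiger", "jaguar", "panther", "lion", "leopard", "bicycle"]
emb = rng.normal(size=(len(nodes), 16))            # stand-in node embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows

query = emb[:3].mean(axis=0)    # average of tiger / jaguar / panther
scores = emb @ query            # cosine-style similarity to every node
for i in np.argsort(-scores)[:4]:
    print(nodes[i], round(float(scores[i]), 3))
```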
I spent some time experimenting with https://github.com/facebookresearch/PyTorch-BigGraph and a few other libraries, but I ran into some challenges given that my dataset was very large (O(100M) edges) and very sparse, and didn't pursue it further.
The papers you linked are very interesting, I will have to dig further! One more recent writeup: https://eng.uber.com/uber-eats-graph-learning/ -- a real production use case that seems promising to explore further.
Graph neural networks are currently used a lot in the drug discovery space. They significantly beat equivalent RNN and CNN baselines (and more complex variants) on the same datasets (Tox21, QM9, etc.).
It's an area I've recently been researching and they do seem to be gaining a significant amount of traction. If anyone is interested in additional reading material, I can suggest the very recent GNNs: Models and Applications (slide deck available on the website) [0].
There is also a fairly comprehensive GitHub repo [1], though I personally haven't given it a detailed look yet.
[0] http://cse.msu.edu/~mayao4/tutorials/aaai2020/
[1] https://github.com/benedekrozemberczki/awesome-graph-classif...