
A few considerations come to mind:

1. The O(N^2 * d) computation cost of the attention layers. For large graphs (millions of nodes) this quickly becomes too costly. And in some of the social-network problems, the more data you feed in, the better the inference gets on average (on a roughly log scale), so you really do want to scale up.
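To make the scaling concrete, here's a toy NumPy sketch of one dense attention layer over all nodes (sizes and projections are made up, just to show where the N x N score matrix comes from):

    import numpy as np

    N, d = 2_000, 64                        # N nodes, d-dim features (illustrative sizes)
    X = np.random.randn(N, d)               # node features
    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    scores = Q @ K.T / np.sqrt(d)           # N x N score matrix -- the quadratic part
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over all N nodes
    out = weights @ V                       # O(N^2 * d) multiply

    # At N = 1e6 nodes the score matrix alone is ~4 TB in float32,
    # before gradients or multiple heads/layers.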

2. As the paper suggests, in some cases the graph structure carries important information. If you flatten everything out, or rather fully connect the nodes (a more accurate description of what an attention layer does in this scenario), that structure is lost. The structural information can be reintroduced as a positional encoding -- see this paper (edited: fixed link): https://arxiv.org/pdf/2207.02505.pdf So remember to do that if you attempt the attention solution.
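One common way to do this (used in several graph-transformer papers, though not necessarily the exact scheme in the linked one) is to concatenate Laplacian eigenvector positional encodings to the node features before the attention layers. A rough SciPy sketch, assuming an undirected graph given as a sparse adjacency matrix:

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import eigsh

    def laplacian_pe(adj, k=8):
        """k smallest non-trivial eigenvectors of the normalized Laplacian.
        Assumes adj is a symmetric (undirected) scipy.sparse matrix."""
        deg = np.asarray(adj.sum(axis=1)).ravel()
        d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
        lap = sp.eye(adj.shape[0]) - d_inv_sqrt @ adj @ d_inv_sqrt
        vals, vecs = eigsh(lap, k=k + 1, which="SM")   # smallest eigenvalues
        return vecs[:, 1:k + 1]                        # drop the trivial constant eigenvector

    # node_features: (N, d) array, adj: (N, N) sparse adjacency
    # pe = laplacian_pe(adj, k=8)
    # tokens = np.concatenate([node_features, pe], axis=1)  # feed this to the attention stack

(Eigenvector signs are arbitrary, so in practice people usually randomize the signs during training.)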

3. Then there is overfitting, already a big issue in GNNs. Fully connecting every node with attention weakens the "inductive bias", if you will, that the graph structure provides. Not sure how much it matters ...
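If you do go the attention route, one way to keep some of that inductive bias is to mask the attention scores to the graph's edges, so each node only attends to its neighbors (loosely in the spirit of GAT-style neighborhood attention). A minimal sketch, assuming a dense 0/1 adjacency that includes self-loops:

    import numpy as np

    def masked_attention(Q, K, V, adj):
        """adj: (N, N) 0/1 matrix with self-loops; non-edges are masked out."""
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)
        scores = np.where(adj > 0, scores, -np.inf)    # only attend along graph edges
        scores -= scores.max(axis=1, keepdims=True)    # numerical stability
        weights = np.exp(scores)                       # exp(-inf) = 0 for non-edges
        weights /= weights.sum(axis=1, keepdims=True)
        return weights @ V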
