Transformers are GNNs where every node is connected to every node (well, in the decoder you have masks, but you see my point).
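Here's a quick sketch of what I mean (my own toy numpy code, not from the paper; the function and variable names are just for illustration): self-attention is message passing on a fully connected graph over the tokens, and the decoder's causal mask just deletes the "future" edges.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, causal=False):
    """X: (n_nodes, d) node features; returns updated node features."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # edge weights on the dense graph
    if causal:
        # decoder-style mask: drop edges from node i to any "future" node j > i
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -np.inf)
    A = softmax(scores, axis=-1)                   # row i = attention over i's neighbours (all nodes)
    return A @ V                                   # aggregate messages from neighbours

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out_full = self_attention(X, Wq, Wk, Wv)                 # fully connected "encoder" graph
out_masked = self_attention(X, Wq, Wk, Wv, causal=True)  # masked "decoder" graph
```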
If the problem is that the graph is not sparse enough / not a graph at all, adding more connections doesn't help.
edit: Why doesn't it help? They address this in the paper. 1. There is a computational infeasibility problem. 2. The transformer decoder can't be a regular graph if you include masking.
If those connections don't actually help, then the attention weights for them would tend to 0, right? Then it becomes a problem of overfitting, which we have a large arsenal of techniques to combat.
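Toy illustration of what I mean (my own numbers, nothing from the paper): if the model learns a very negative compatibility score for an edge, its softmax attention weight goes to ~0, so the "extra" connection is effectively pruned rather than actively harmful.

```python
import numpy as np

scores = np.array([2.0, 1.5, -8.0])       # last edge judged useless by the model
weights = np.exp(scores) / np.exp(scores).sum()
print(weights)                            # -> roughly [0.62, 0.38, 3e-5]
```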