One thing I was wondering about regarding transformers, which perhaps someone more knowledgeable can explain: as far as I understand, the attention heads are essentially two-dimensional structures in which values derived from the tokens are compared pairwise in a matrix. Has anyone tried to generalize this and make the dimension of the attention heads higher than two?
I'm not really in this space, but I like to read the papers, and as I understand it, the typical dimensionality is far higher than 2. For example, in the original "Attention Is All You Need" paper, the example configuration uses 64 dimensions per head. These are vectors, so even though they might be drawn as a matrix, each value is a coordinate along a different dimension.
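A shape-level sketch of what those 64 dimensions refer to (the sizes are taken from the paper, everything else is just illustration): 64 is the length of each per-token query/key/value vector inside one head, not the order of any tensor, and the projection that produces it is an ordinary 2-D matrix.

```python
import numpy as np

# Sizes from "Attention Is All You Need": d_model = 512, and each of the
# 8 heads projects down to d_k = d_v = 64 dimensions per token.
d_model, d_k = 512, 64
n_tokens = 10                              # example sequence length

X = np.random.randn(n_tokens, d_model)     # one embedding vector per token
W_Q = np.random.randn(d_model, d_k)        # the projection is an ordinary 2-D matrix

Q = X @ W_Q
print(Q.shape)                             # (10, 64): one 64-dim query vector per token
```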
I'm talking about the matrices W_Q, W_K and W_V. My question is why these are matrices (i.e. order-2 tensors) and not tensors of order higher than 2.
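For concreteness, here's a minimal single-head sketch in NumPy (shapes and names are illustrative, not from any particular implementation). Each of W_Q, W_K, W_V is an ordinary 2-D matrix, and the two-dimensional structure from the original question is the n × n score matrix Q Kᵀ, which holds one score per pair of tokens:

```python
import numpy as np

def single_head_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product attention for one head (illustrative sketch)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V              # each W_* is a 2-D matrix
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (n, n): one score per PAIR of tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted mix of value vectors per token

n, d_model, d_k = 10, 512, 64
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))
print(single_head_attention(X, W_Q, W_K, W_V).shape)   # (10, 64)
```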
My thinking goes like this: a matrix can represent a graph (each entry may correspond to an edge between two nodes), but an order-3 tensor may correspond to a hypergraph where each entry is a 3-hyperedge, so you could talk not just about the relation between two tokens but also about the relation among three tokens (in language this could be e.g. subject, object and indirect object/dative).
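To make the hypergraph idea concrete, here is a purely hypothetical sketch (not something from the literature, just an illustration of the shapes involved): replace the order-2 weight matrices with a single order-3 weight tensor and contract it with three copies of the token embeddings, so that each entry of the result scores a triple of tokens rather than a pair.

```python
import numpy as np

def hyperedge_scores(X, W3):
    """Hypothetical 3-way attention scores: one score per TRIPLE of tokens.

    W3 is an order-3 weight tensor of shape (d, d, d); contracting it with
    three copies of the token embeddings gives an (n, n, n) score tensor,
    the analogue of 3-hyperedges, instead of the usual (n, n) pairwise
    matrix produced by Q @ K.T with order-2 weights.
    """
    # scores[i, j, k] = sum_{a,b,c} X[i, a] * X[j, b] * X[k, c] * W3[a, b, c]
    return np.einsum('ia,jb,kc,abc->ijk', X, X, X, W3, optimize=True)

n, d = 8, 32
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))          # one d-dim embedding per token
W3 = rng.standard_normal((d, d, d))      # order-3 weight tensor (d**3 parameters)
print(hyperedge_scores(X, W3).shape)     # (8, 8, 8): one entry per triple of tokens
```

Note that the score tensor alone has n³ entries, so compute and memory grow cubically with sequence length instead of quadratically, which is presumably a large part of why the pairwise form is the standard one.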