That is definitely the next thing I would try :) The main reason I started with a BiLSTM is that it's much easier to implement and debug. Also, afaik the time complexity of RNNs with respect to sequence length is O(N), whereas it's O(N^2) for attention-based models like the Transformer, although that probably doesn't matter much at the scale of the SST-2 dataset.
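For context, here's roughly the kind of model I mean, as a minimal PyTorch sketch (not my actual code; the vocab size and layer dimensions are placeholders):

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Minimal bidirectional LSTM sentence classifier (e.g. for SST-2)."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # A single BiLSTM layer: runtime scales linearly with sequence length,
        # unlike full self-attention, which is quadratic in it.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer tensor
        embedded = self.embedding(token_ids)
        _, (h_n, _) = self.lstm(embedded)
        # Concatenate the final forward and backward hidden states
        # to get a fixed-size sentence representation.
        sentence_repr = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.classifier(sentence_repr)

# Toy usage: a batch of 4 sentences, 20 tokens each, vocab of 10k.
model = BiLSTMClassifier(vocab_size=10_000)
logits = model(torch.randint(1, 10_000, (4, 20)))
print(logits.shape)  # torch.Size([4, 2])
```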
Afaik, the more you tighten a bottleneck, the more accuracy you lose, and you lose it much faster than you gain interpretability. My guess is that such abstractions would require very powerful "priors" (as in, knowledge stored in the network as opposed to being stored in the representation) that humans gained through evolution and that today's models don't possess.