The Transformer was really about making it feasible to parallelize training over large context windows. It's more about using large compute resources effectively when they're available than about saving compute. This is also why it looks quite ad hoc compared to simpler models like RNNs or LSTMs, which can be trained on large batches of data (hence still exploiting parallelism to some extent) but have to be serial along the context dimension.
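
A minimal NumPy sketch of the contrast (all names, shapes, and the single-head/no-mask simplifications here are illustrative assumptions, not any specific implementation): self-attention produces outputs for every position with a few matmuls over the whole sequence, while a vanilla RNN must step through positions one at a time.

    import numpy as np

    T, d = 8, 16  # context length, model dimension (illustrative)
    rng = np.random.default_rng(0)
    x = rng.standard_normal((T, d))

    # Self-attention: every position attends to every other position
    # in one batched matmul, so the context dimension is parallel.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d)                 # (T, T), computed at once
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    attn_out = weights @ V                        # all T outputs in parallel

    # Vanilla RNN: h_t depends on h_{t-1}, so the loop over the
    # context dimension cannot be parallelized within a sequence.
    Wh, Wx = rng.standard_normal((d, d)), rng.standard_normal((d, d))
    h = np.zeros(d)
    rnn_out = []
    for t in range(T):                            # serial along context
        h = np.tanh(h @ Wh + x[t] @ Wx)
        rnn_out.append(h)

During training, the attention path above lets the hardware see the whole (T, T) score computation as one large, parallel operation, whereas the RNN's serial loop is exactly the bottleneck the comment describes; batching over sequences recovers some parallelism for the RNN, but not along the context dimension itself.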

