Sorry, maybe I should have added more explanation. One way to think about attention, which is the main distinguishing element of a transformer, is as an adaptable matrix. A feedforward layer is a matrix with static entries that do not change at inference time (they only change during training). The attention mechanism offers a way to have weight matrices that adapt to the input at inference time (this is implemented using three matrices, K, Q and V, called key, query and value, in case you want to dig deeper).
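
Here's a tiny NumPy sketch of what I mean. It's just an illustration, not a real transformer implementation, and the names (`d_model`, `seq_len`, `W_ff`, `W_q`, etc.) are my own choices. The point is that the feedforward matrix is the same for every input, while the attention weights are recomputed from each input.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len = 8, 5

# Feedforward layer: W_ff is fixed after training; the same matrix is
# applied to every input at inference time.
W_ff = rng.standard_normal((d_model, d_model))

def feedforward(x):
    # x: (seq_len, d_model) -> static weights, identical for any x
    return x @ W_ff

# Attention: W_q, W_k, W_v are also fixed after training, but the matrix
# that actually mixes the token representations (the attention weights A)
# is recomputed from the input itself.
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    # x: (seq_len, d_model)
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    # A is the "adaptable matrix": it depends on x, so it changes per input
    A = softmax(Q @ K.T / np.sqrt(d_model))   # (seq_len, seq_len)
    return A @ V

x1 = rng.standard_normal((seq_len, d_model))
x2 = rng.standard_normal((seq_len, d_model))
# feedforward(x1) and feedforward(x2) use the exact same W_ff;
# attention(x1) and attention(x2) build different mixing matrices A.
```

So the learned parameters (W_q, W_k, W_v) are still static, but the effective matrix applied to the values is input-dependent, which is what I meant by "adaptable at inference time".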