Other than stating there was one, they didn't show a benefit of this over something like a Wide & Deep model, a DCNv2 model, or even a vanilla NN. Transformers make sense when you need to use something N items back as context (as in text) and N is large. But in their example, any model that takes the last ~5 interactions as input should be able to pick up contextual user preferences just as quickly.
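To make that concrete, here is a rough sketch of the kind of vanilla NN I have in mind (my own illustration, not the paper's architecture; class and parameter names are made up): embed the last few interactions, concatenate them, and score candidates with an MLP.

```python
import torch
import torch.nn as nn

class LastNInteractionsModel(nn.Module):
    """Sketch: vanilla NN over the user's last N interactions."""
    def __init__(self, num_items: int, emb_dim: int = 64, n_recent: int = 5):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(n_recent * emb_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_items),  # logits over candidate items
        )

    def forward(self, recent_item_ids: torch.Tensor) -> torch.Tensor:
        # recent_item_ids: (batch, n_recent) item IDs, most recent last
        x = self.item_emb(recent_item_ids)   # (batch, n_recent, emb_dim)
        x = x.flatten(start_dim=1)           # (batch, n_recent * emb_dim)
        return self.mlp(x)                   # (batch, num_items)
```

A Wide & Deep or DCNv2 variant would add wide/crossing layers on top, but the input signal (embeddings of the last handful of interactions) is the same.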
A transformer may also be larger than their baseline, but you still need to justify how those parameters are allocated.
Empirically, I have found that user action sequences are a good way to model user behavior because they capture activity at several different scales as well as specific behaviors. Interest tracking captures what a user generally likes, and the last few actions help the model see what the user is listening to right now. But with a full sequence, you can start to model all of this at once: what the user is listening to right now, what they've been listening to recently, what they tend to listen to at this time of day, how much of a change in genre they might enjoy, and so on.
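As an illustration of those scales (again a sketch with invented names, not any particular production system): a long-term interest vector pooled over the full history, the last few actions, and a time-of-day feature can all be fed into one scoring head.

```python
import torch
import torch.nn as nn

class MultiScaleUserModel(nn.Module):
    """Sketch: combine long-term interests, recent actions, and time-of-day context."""
    def __init__(self, num_items: int, emb_dim: int = 64, n_recent: int = 5):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, emb_dim)
        self.hour_emb = nn.Embedding(24, 8)  # coarse time-of-day signal
        self.head = nn.Sequential(
            nn.Linear(emb_dim + n_recent * emb_dim + 8, 256),
            nn.ReLU(),
            nn.Linear(256, num_items),
        )

    def forward(self, full_history: torch.Tensor, recent_ids: torch.Tensor,
                hour_of_day: torch.Tensor) -> torch.Tensor:
        # full_history: (batch, seq_len) item IDs -> long-term "interest tracking"
        long_term = self.item_emb(full_history).mean(dim=1)       # (batch, emb_dim)
        # recent_ids: (batch, n_recent) -> what the user is doing right now
        recent = self.item_emb(recent_ids).flatten(start_dim=1)   # (batch, n_recent * emb_dim)
        hour = self.hour_emb(hour_of_day)                         # (batch, 8)
        return self.head(torch.cat([long_term, recent, hour], dim=-1))
```

The point is that each input covers a different timescale; a sequence model can learn these jointly, but you can also get a long way by engineering them explicitly.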