I've been working through [0]. Like a lot of math, the notation is daunting, but once you become familiar with it, it really is a nice tool for thought.
This! It's the best resource I've found for explaining transformers, and the one that finally made them clear to me. I wish all deep learning papers were written like this, using pseudocode.
[0]: https://arxiv.org/abs/2207.09238
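To give a flavor of the kind of algorithm the paper spells out, here is a minimal sketch of single-head scaled dot-product attention in NumPy. This is my own translation, not the paper's notation; the variable names and shapes are assumptions for illustration.

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a sequence.

    X:  (seq_len, d_in) token representations
    Wq, Wk, Wv: (d_in, d_attn) projection matrices (hypothetical names)
    Returns: (seq_len, d_attn) attended values.
    """
    Q = X @ Wq  # queries
    K = X @ Wk  # keys
    V = X @ Wv  # values
    # similarity of every query to every key, scaled by sqrt(d_attn)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # row-wise softmax (max-subtraction for numerical stability)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # each output row is a convex combination of the value rows
    return weights @ V
```

Each row of `weights` sums to 1, so every output token is a weighted average of the projected values, which is exactly the step the paper's algorithms make explicit.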