I think it's helpful to categorize the things that go into an ML accelerator into two buckets: big-picture architectural choices (memory bandwidth and sizes, support for big operations like transposition, etc.) and fixed-function optimizations. In all of these systems there's a compiler responsible for taking higher-level operations and compiling them down to those low-level primitives. That includes the derivatives used in backprop: they just get mapped to the same primitives plus a few more. So while there are a few extra fixed functions you need for loss functions and some derivatives, probably the largest difference is that you need to support transpose, and that you need all that extra memory and bandwidth to keep the intermediate products around so you can backprop through them.
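To make that concrete, here's a minimal sketch in JAX of a single dense layer (the sizes and names are made up for illustration): the backward pass is just the same matmul primitive with transposed operands, and the weight gradient needs the saved forward activation.

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (8, 16))   # batch of forward activations
W = jax.random.normal(key, (16, 4))   # dense-layer weights

def layer(x, W):
    return x @ W                      # forward: one matmul

# Upstream gradient flowing back from the loss (same shape as the output).
dy = jnp.ones((8, 4))

# What autodiff produces...
y, vjp = jax.vjp(layer, x, W)
dx_auto, dW_auto = vjp(dy)

# ...is the same matmul primitive, just with transposes:
dx_manual = dy @ W.T                  # gradient w.r.t. the input
dW_manual = x.T @ dy                  # gradient w.r.t. the weights; note it
                                      # needs x, the saved forward activation

print(jnp.allclose(dx_auto, dx_manual))  # True
print(jnp.allclose(dW_auto, dW_manual))  # True
```

That's why the extra memory and bandwidth matter: every layer's x has to stick around from the forward pass until its gradient is computed.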
Look for the section "CHALLENGES AND OPPORTUNITIES OF BUILDING ML HARDWARE"
But things change more when you want to start supporting embeddings, so since TPUv2 Google's TPUs have included a "SparseCore" to handle those separately: the lookup and memory-access patterns are drastically different from the dense matrix operations used in non-embedding layers. https://arxiv.org/pdf/2304.01433.pdf
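A toy contrast of the two access patterns, again in JAX (table size and indices are made up, and this isn't how a SparseCore is programmed, just a sketch of the workload shape): the embedding step is a gather that touches a handful of scattered rows in a big table, while the dense step streams the whole weight matrix through a matmul.

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
vocab, dim = 100_000, 64                     # large, sparsely-touched table
table = jax.random.normal(key, (vocab, dim))

ids = jnp.array([3, 99_999, 42, 7])          # a few arbitrary row indices

# Embedding "layer": a gather, reads only len(ids) scattered rows of `table`.
emb = jnp.take(table, ids, axis=0)           # shape (4, 64)

# Dense layer: a matmul, reads the entire (64, 64) weight matrix every step.
W = jax.random.normal(key, (dim, dim))
dense_out = emb @ W                          # shape (4, 64)

print(emb.shape, dense_out.shape)
```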