I hear people say that a lot, but is that really how people at the leading edge of research do this? Those I know who are coming up with genuinely new stuff, rather than new applications of old architectures, are either building loosely on animal models or designing around a traditional algorithm while leaving room for training to exploit complex interactions the traditional algorithm can't capture.
There have been huge advances in the mathematics of neural networks from Greg Yang (formerly of Microsoft Research), notably the maximal-update parametrization ("μP") and μTransfer. That work is what allowed training hyperparameters to transfer predictably from smaller versions of GPT-4, where they could actually be tuned, to the final large model.
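To give a concrete sense of what "hyperparameter transfer" means in practice, here's a rough sketch of the flavor of the recipe: tune on a narrow proxy model, then rescale with width instead of re-sweeping at full size. The exact scaling rules below (Adam learning rate and output multiplier shrinking like 1/width, init std like 1/sqrt(fan_in)) are a simplification of the published tables, and all the names are mine, not his API.

```python
from dataclasses import dataclass

@dataclass
class MuPHyperparams:
    lr_hidden: float        # Adam LR for hidden (matrix-like) weights
    lr_vector: float        # Adam LR for biases / vector-like params (kept width-independent here)
    init_std_hidden: float  # init std for hidden weights
    output_multiplier: float  # scale on the readout layer

def transfer(base: MuPHyperparams, base_width: int, target_width: int) -> MuPHyperparams:
    """Rescale hyperparameters tuned at base_width for a model of target_width."""
    ratio = target_width / base_width
    return MuPHyperparams(
        lr_hidden=base.lr_hidden / ratio,                    # matrix LRs shrink roughly like 1/width
        lr_vector=base.lr_vector,                            # vector-like params keep their LR
        init_std_hidden=base.init_std_hidden / ratio ** 0.5, # std scales like 1/sqrt(fan_in)
        output_multiplier=base.output_multiplier / ratio,    # readout scaled down roughly like 1/width
    )

if __name__ == "__main__":
    # Hyperparameters found by sweeping a small 256-wide proxy model...
    tuned_small = MuPHyperparams(lr_hidden=1e-3, lr_vector=1e-3,
                                 init_std_hidden=256 ** -0.5, output_multiplier=1.0)
    # ...reused directly on an 8192-wide model without another sweep.
    print(transfer(tuned_small, base_width=256, target_width=8192))
```

The point is that under the right parametrization the optimal settings become (approximately) width-independent once rescaled, so the expensive sweep only ever happens on the cheap model.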
He has proofs and theorems characterizing the frontier of maximal feature learning, beyond which training degenerates into the equivalent of a kernel method, and more: a whole body of breakthrough math making deep links with random matrix theory.
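The "kernel method" endpoint he's contrasting against is the lazy/NTK limit; roughly (this framing comes from the neural tangent kernel literature, not from his papers specifically):

```latex
% In the lazy regime the network barely moves from its initialization \theta_0,
% so it is well approximated by its linearization in the parameters:
f_\theta(x) \;\approx\; f_{\theta_0}(x) + \nabla_\theta f_{\theta_0}(x)^\top (\theta - \theta_0)
% Gradient descent then reduces to kernel regression with the fixed kernel
K(x, x') \;=\; \big\langle \nabla_\theta f_{\theta_0}(x),\, \nabla_\theta f_{\theta_0}(x') \big\rangle
```

μP sits at the other end of that spectrum: parametrized so the features themselves keep evolving as width grows, rather than freezing into a fixed kernel at initialization.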