That's the thing with transformers, right? The model doesn't actually "know" anything about its inputs.
The embeddings are learned: they start out as random vectors and only pick up meaning through training.
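To make that concrete, here's a rough sketch in plain Python of what "learned, initialized to random" looks like. Everything in it is made up for illustration: the toy vocabulary, the helper names (`init_embeddings`, `lookup`, `sgd_update`), the 0.02 init scale, and the hand-supplied gradient. A real model gets its gradients from backpropagation.

```python
import random

def init_embeddings(vocab_size, dim, seed=0):
    """Build a vocab_size x dim table of small random values (assumed init scale 0.02)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]

def lookup(table, token_ids):
    """Map each token ID to its row; these vectors are all the model sees of the input."""
    return [table[t] for t in token_ids]

def sgd_update(table, token_id, grad, lr=0.1):
    """Nudge one token's row against a gradient (supplied by hand in this sketch)."""
    table[token_id] = [w - lr * g for w, g in zip(table[token_id], grad)]

# Hypothetical toy vocabulary: the IDs are arbitrary and carry no meaning themselves.
vocab = {"cat": 0, "dog": 1, "car": 2}
table = init_embeddings(len(vocab), dim=4)

before = list(table[vocab["cat"]])  # random noise at initialization
sgd_update(table, vocab["cat"], grad=[1.0, 0.0, 0.0, 0.0])
after = table[vocab["cat"]]         # only this row moved, and only in dimension 0
```

At initialization "cat" is no closer to "dog" than to "car"; any structure like that only shows up after many such updates.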