I’m still confused. Does it treat the input tokens as a sampled waveform?
I mean, say I have some text file in ASCII. Do I then just pretend it’s a raw wav and do an FFT on it? I guess that can give me some useful information (like whether it looks like a particular natural language or just random noise; this is sometimes used in cryptanalysis of simple substitution ciphers). It feels surprising that an inverse FFT can produce coherent output after fiddling with the distribution.
As I understand it, the token embedding stream would be equivalent to multi-channel sampled waveforms. The model either needs to learn the embeddings by back-propagating through FFT and IFFT, or use some suitable tokenization scheme which the paper doesn't discuss (?).
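If the paper is FNet-style token mixing (my guess, I may be wrong about which paper this is), the "multi-channel waveform" analogy is roughly this toy sketch: the embedding matrix is treated as seq_len samples of a d_model-channel signal, and self-attention is replaced by a 2D FFT whose real part is kept, with no inverse FFT at all:

```python
import numpy as np

# Toy sketch of FNet-style mixing (an assumption about the paper,
# not taken from this thread). x is a sequence of token embeddings,
# shape (seq_len, d_model): seq_len "samples" of d_model "channels".
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))

# FNet applies an FFT along the sequence axis and another along the
# hidden axis (equivalently a 2D FFT), then keeps only the real part.
# The frequency-domain mixing itself replaces attention; there is no
# IFFT back to the "time" domain.
mixed = np.fft.fft2(x).real

print(mixed.shape)  # (8, 4)
```

So the embeddings are still learned by ordinary backpropagation; the FFT is just a fixed, differentiable linear map sitting where attention would be, and no special tokenization is required.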
No. The FFT is an operation on a discrete domain; it is not the continuous FT. Just as when audio waveforms are processed by an FFT, you bucket the signal into frequency bins, which is conceptually a vector. Once you have a vector, you do machine learning like you would with any vector (except that here some of the processing is itself an FT; I haven’t read the paper).