Ok but I have to point out something important here. Presumably, the model you'r...

Difwif · 2025-08-25T15:34:46 1756136086

Seems short sighted to me. LLMs could have any data in their training set encoded as tokens. Either new specialized tokens are explicitly included (e.g: Vision models) or the language encoded version of everything that usually exists (e.g: the research paper and the csv with the data).

To improve next token prediction performance on these datasets and generalize requires a much richer latent space. I think it could theoretically lead to better results from cross-domain connections (ex: being fluent in a specific area of advanced mathematics, quantum mechanics, and materials engineering is key to a particular breakthrough)

anthk · 2025-08-25T21:05:45 1756155945

Then you are now implementing a parser.