LLMs need to compress information in order to predict the next token across as many contexts as possible.
Chess moves are simply tokens like any others.
Given enough chess training data, it would make sense for part of the network to specialize in chess rather than merely memorizing lists of moves and follow-ups. The result would be a general-purpose sub-network trained on chess.
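A minimal sketch of the "moves are just tokens" point: a PGN-style game record is an ordinary character sequence, so a tokenizer treats it like any other text, and next-token prediction over it is the same objective as for prose. The whitespace tokenizer here is a toy stand-in; real LLM tokenizers (BPE and the like) split text differently, but the principle is the same.

```python
# Hypothetical illustration: chess moves in a PGN-style string become
# ordinary tokens in the training stream.
pgn = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6"

# Toy whitespace tokenizer (real tokenizers use subword schemes like BPE).
tokens = pgn.split()
print(tokens)

# Next-token prediction framing: given a prefix of the game, the model is
# trained to emit a plausible continuation -- exactly as with plain text.
prefix, target = tokens[:-1], tokens[-1]
print(prefix, "->", target)
```

Nothing about this setup is chess-specific, which is the point: any specialization for chess has to emerge inside the network from the statistics of the data, not from the input format.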