Yes, one way to fix it is to use fixed sequence lengths, but it'll still be a tad off. Packing, say, 1000 small sequences to fill one large sequence length will still incur the same issue, since the denominator will be off by 1000, but yes, the problem is much less pronounced.
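Roughly what that off-by-1000 looks like, as a toy sketch (made-up shapes, and assuming each packed document's final position is masked out rather than predicting into the next document):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy packed batch: 1000 short documents packed into one long sequence.
n_docs, doc_len, vocab = 1000, 8, 32
packed_len = n_docs * doc_len                       # nominal length: 8000

logits = torch.randn(packed_len, vocab)
labels = torch.randint(0, vocab, (packed_len,))

# Assumption: the final position of each packed document is masked (ignore_index),
# since its "next token" would be the first token of an unrelated document.
labels[doc_len - 1::doc_len] = -100

loss_sum = F.cross_entropy(logits, labels, ignore_index=-100, reduction="sum")

naive_mean   = loss_sum / packed_len                # divides by 8000
correct_mean = loss_sum / (labels != -100).sum()    # divides by 7000 -- off by n_docs

print(naive_mean.item(), correct_mean.item())
```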
Not sure what you meant here; of course one still needs to correctly estimate the cross-entropy loss in the end (to keep their sanity, or to compare against runs with a different total batch size), but each mini-batch term has the same relative contribution to the entropy.
Edit: oh, I guess you probably meant that the code was previously not averaging correctly even when the total mini-batch lengths are the same... I haven't looked at this code.
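For context, here's a toy sketch of the averaging mismatch being discussed (made-up token counts, not the actual trainer code): averaging each mini-batch by its own token count and then averaging the step losses only matches the true per-token mean when the token counts are equal.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab = 32

def fake_per_token_losses(n_tokens):
    # Stand-in for the per-token cross-entropy of one accumulation step.
    logits = torch.randn(n_tokens, vocab)
    labels = torch.randint(0, vocab, (n_tokens,))
    return F.cross_entropy(logits, labels, reduction="none")

# Two accumulation steps with very different non-padded token counts.
steps = [fake_per_token_losses(10), fake_per_token_losses(1000)]

# Mean-of-means: each step averaged by its own length, then averaged with equal weight.
mean_of_means = sum(s.mean() for s in steps) / len(steps)

# True mean: one sum over all tokens, divided by the total token count.
true_mean = torch.cat(steps).mean()

print(mean_of_means.item(), true_mean.item())  # these agree only if the token counts match
```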
Yes, so making the sequences all the same length, i.e. by packing them into one, still introduces issues: the packed sequence still ends with a token that has no label to predict, due to the autoregressive nature of the training process.
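A minimal sketch of that off-by-one from the shift (toy tensors, standard causal-LM label construction assumed):

```python
import torch

# Toy packed input: three documents of four tokens each, concatenated.
input_ids = torch.arange(12).unsqueeze(0)         # shape (1, 12)

# Standard causal-LM shift: position t predicts token t+1,
# so 12 input tokens only ever yield 11 targets...
shift_labels = input_ids[:, 1:]

# ...and if document boundaries are masked as well, each packed document
# drops one more target on top of that.
print(input_ids.shape[1], shift_labels.shape[1])  # 12 vs 11
```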
Not sure what you mean. There is always an end to a batch. It doesn't have to be the end of a document entry, otherwise the model might get lazy and learn something related to the position of the text (i.e. look into the position encoding and call it a day).