Yes, one way to fix it is to use fixed sequence lengths, but it'll still be a tad off. Packing, say, 1000 small sequences to fill one large sequence length will still incur the same issue, since the denominator will be off by 1000, but yes, the problem is much less pronounced.
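Roughly what that off-by-1000 looks like, as a toy sketch (made-up shapes, and assuming each packed document's final position is masked out rather than predicting into the next document):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy packed batch: 1000 short documents packed into one long sequence.
n_docs, doc_len, vocab = 1000, 8, 32
packed_len = n_docs * doc_len                       # nominal length: 8000

logits = torch.randn(packed_len, vocab)
labels = torch.randint(0, vocab, (packed_len,))

# Assumption: the final position of each packed document is masked (ignore_index),
# since its "next token" would be the first token of an unrelated document.
labels[doc_len - 1::doc_len] = -100

loss_sum = F.cross_entropy(logits, labels, ignore_index=-100, reduction="sum")

naive_mean   = loss_sum / packed_len                # divides by 8000
correct_mean = loss_sum / (labels != -100).sum()    # divides by 7000 -- off by n_docs

print(naive_mean.item(), correct_mean.item())
```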
Not sure what you meant here; of course one still needs to correctly estimate the cross-entropy loss in the end (to keep their sanity, or to compare against runs with a different total batch size), but each mini-batch term has the same relative contribution to the entropy.
Edit: oh, I guess you probably meant that the code was previously not averaging correctly even when the total mini-batch lengths are the same... I haven't looked at this code.
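For context, here's a toy sketch of the averaging mismatch being discussed (made-up token counts, not the actual trainer code): averaging each mini-batch by its own token count and then averaging the step losses only matches the true per-token mean when the token counts are equal.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab = 32

def fake_per_token_losses(n_tokens):
    # Stand-in for the per-token cross-entropy of one accumulation step.
    logits = torch.randn(n_tokens, vocab)
    labels = torch.randint(0, vocab, (n_tokens,))
    return F.cross_entropy(logits, labels, reduction="none")

# Two accumulation steps with very different non-padded token counts.
steps = [fake_per_token_losses(10), fake_per_token_losses(1000)]

# Mean-of-means: each step averaged by its own length, then averaged with equal weight.
mean_of_means = sum(s.mean() for s in steps) / len(steps)

# True mean: one sum over all tokens, divided by the total token count.
true_mean = torch.cat(steps).mean()

print(mean_of_means.item(), true_mean.item())  # these agree only if the token counts match
```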
Yes, so making the sequences all the same length, i.e. by packing them into one, still introduces issues: the packed sequence still ends with a token that has no label to predict, due to the autoregressive nature of the training process.
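A minimal sketch of that off-by-one from the shift (toy tensors, standard causal-LM label construction assumed):

```python
import torch

# Toy packed input: three documents of four tokens each, concatenated.
input_ids = torch.arange(12).unsqueeze(0)         # shape (1, 12)

# Standard causal-LM shift: position t predicts token t+1,
# so 12 input tokens only ever yield 11 targets...
shift_labels = input_ids[:, 1:]

# ...and if document boundaries are masked as well, each packed document
# drops one more target on top of that.
print(input_ids.shape[1], shift_labels.shape[1])  # 12 vs 11
```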
Not sure what you mean. There is always an end to a batch. It doesn't have to be the end of a document entry, otherwise the model might get lazy and learn something related to the position of the text (i.e. look into the position encoding and call it a day).