
Yes, so making the sequences all the same length, i.e. by packing them into one long sequence, still introduces issues: the packed sequence still ends with an unpadded token, because of the autoregressive nature of the training process.



Not sure what you mean. There is always an end to a batch. It doesn't have to be the end of a document entry; otherwise the model might get lazy and learn something tied to the position of the text (i.e. look at the position encoding and call it a day).



