
Yes, so making the sequences all the same length, i.e. by packing them into one long sequence, still introduces issues: the packed sequence still ends with an unpadded token, because of the autoregressive nature of the training process.



Not sure what you mean. There is always an end to a batch. It doesn't have to be the end of a document entry; otherwise the model might get lazy and learn something tied to the position of the text (i.e. look at the position encoding and call it a day).



