It also highlights the main disadvantage of the Transformers codebase's copy-paste approach to models: this fix needs to be applied to every single model separately.
Unfortunately transformers is a general library covering many models, so there are tonnes of different architectures. Copy-pasting and changing parts of the arch is the only feasible way in the meantime.
This feels too complex to tackle with PyCharm structural find and replace; even a more powerful structural find and replace like https://comby.dev/ feels underpowered here.
Sourcegraph batch changes? That solves broadcasting the change but doesn’t help with capturing the change to make.
OpenRewrite? The Python implementation is in its early stages, not prod-ready as I understand it. Plus this change is too complex for Refaster templates even if we could use ORW, so you'd be debugging a fairly involved method visitor, which in this case is probably orders of magnitude more time-consuming than just making the changes manually.
Ye a complete change was necessary for now - HF had to isolate the cross entropy loss and make another class for it, and it had to be applied to all model archs.
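Roughly the shape of it (a hand-wavy sketch, not the actual transformers code): a shared causal LM loss helper that sums instead of averaging and can take the true token count for the whole accumulated batch:

    import torch.nn.functional as F

    def causal_lm_loss(logits, labels, num_items_in_batch=None, ignore_index=-100):
        # Shift so tokens < n predict token n (standard causal LM setup).
        logits = logits[:, :-1, :].contiguous()
        labels = labels[:, 1:].contiguous()
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            labels.view(-1),
            ignore_index=ignore_index,
            reduction="sum",  # sum instead of per-micro-batch mean
        )
        if num_items_in_batch is not None:
            # Normalise by the token count of the *whole* accumulated batch,
            # so grad accum matches full-batch training.
            return loss / num_items_in_batch
        return loss / (labels != ignore_index).sum()

Models then call this one helper instead of each carrying its own mean-reduced cross entropy, which is why the change still had to touch every architecture.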
Oh hey! :)
TLDR: naive gradient accumulation was over-weighting short sequence lengths in LLM finetuning and training runs, and under-weighting long sequence lengths.
E.g. two texts with sequence lengths of [1, 100] would have each token scaled by 1/(100+1) in full batch training, but grad accum of 2 would weight the tokens of [1] as 1/1 * 1/2 = 1/2, whilst those of [100] as 1/100 * 1/2 = 1/200 (the 1/2 since grad accum needs to divide by the # of grad accum steps).
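In plain Python (toy numbers, just to show the arithmetic, not the trainer code):

    seq_lens = [1, 100]
    accum_steps = len(seq_lens)          # one sequence per micro-batch
    total_tokens = sum(seq_lens)

    # Full batch training: every token gets the same weight.
    full_batch_weight = 1 / total_tokens                      # 1/101 per token

    # Naive grad accum: mean over each micro-batch, then divide by # of steps.
    naive_weights = [1 / n / accum_steps for n in seq_lens]   # [1/2, 1/200]

    print(full_batch_weight)   # ~0.0099
    print(naive_weights)       # [0.5, 0.005] -> the short sequence dominates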
Unfortunately it's not an Unsloth issue but a general issue affecting nearly all trainers which use grad accum. We worked with Huggingface, so their trainers should now be fixed in the main branch.
Look at it from a different point of view: this is a feature, not a bug. With this, every example has equal weight, while with the fix, every token has equal weight.
That makes it sound like it’s a choice, which it isn’t really. The way to look at it is from a probabilistic perspective: with the fix, you maximise the probability of the data. Without the fix, you fairly arbitrarily raise some probabilities to a power greater than one, and some to a power less than one.
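To spell that out (my notation, not from the writeup): with G accumulation steps and per-micro-batch token counts n_1, ..., n_G,

    \mathcal{L}_{\text{full}}  = -\frac{1}{\sum_b n_b}\sum_{b=1}^{G}\sum_{t=1}^{n_b} \log p_\theta(x_{b,t}\mid x_{b,<t})

    \mathcal{L}_{\text{naive}} = -\frac{1}{G}\sum_{b=1}^{G}\frac{1}{n_b}\sum_{t=1}^{n_b} \log p_\theta(x_{b,t}\mid x_{b,<t})

Minimising the naive form maximises the product over b of p_\theta(x_b)^{1/(G n_b)}, i.e. relative to the full-batch objective each sequence's probability is effectively raised to the power N/(G n_b) with N the total token count: greater than one for shorter-than-average sequences, less than one for longer ones.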
Although there may be uses for such a modified loss, based on the tone of the writeup it feels like this was an unintended bug in their training code. Training LLMs with variable max sequence lengths on different GPUs is a recipe for inefficient training anyway, so careful optimization of MFU at scale, or a fixed max sequence length per batch, would have avoided this "bug".
Ye one way to fix it is to use fixed sequence lengths, but it'll still be a tad bit off. Packing, say, 1000 small sequences to fit a large sequence length will still incur the same issue since the denominator will be off by 1000, but yes, the problem is much less pronounced.
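Toy numbers for that off-by-1000 (my own sketch, assuming labels are masked at document boundaries):

    num_docs, doc_len = 1000, 10
    packed_len = num_docs * doc_len            # 10_000 tokens in the packed sequence
    trained_labels = num_docs * (doc_len - 1)  # each doc loses its last prediction -> 9_000
    naive_denominator = packed_len             # what a per-sequence mean would use
    print(naive_denominator - trained_labels)  # 1000, i.e. off by the number of packed docs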
Not sure what you meant here; of course one still needs to correctly estimate the cross-entropy loss in the end (in order to keep one's sanity, or to compare against runs with a different total batch size), but each mini-batch term has the same relative contribution to the entropy.
Edit: oh, I guess you probably meant that the code was previously not averaging correctly even for the case of same total mini-batch length... I haven't looked at this code.
Yes, so making the sequences all the same length, i.e. by packing them into one, still introduces issues, since the packed sequence has an unpadded token at the end due to the autoregressive nature of the training process.
Not sure what you mean. There is always an end to a batch. It doesn't have to be the end of a document entry, otherwise the model might get lazy and learn something related to the position of the text (i.e. look into the position encoding and call it a day).
Yes you're correct, but in normal full batch training without gradient accumulation, all tokens are weighted equally. Standard grad accum does not weight them equally, and so the "fix" makes grad accum and full batch training finally mathematically equivalent.
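A quick sanity check of that equivalence (hypothetical toy model, not actual trainer code): with the fix, each micro-batch's loss is a token sum divided by the total token count across all accumulation steps, and the accumulated gradient matches the full-batch one.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    vocab, dim = 50, 16
    model = torch.nn.Linear(dim, vocab)

    # Two "sequences" of very different lengths (one feature vector and one label per token).
    seqs = [(torch.randn(1, dim), torch.randint(vocab, (1,))),
            (torch.randn(100, dim), torch.randint(vocab, (100,)))]
    total_tokens = sum(x.shape[0] for x, _ in seqs)

    def grads(accumulate):
        model.zero_grad()
        if accumulate:
            # Fixed grad accum: per-micro-batch token sum / total tokens, one backward each.
            for x, y in seqs:
                (F.cross_entropy(model(x), y, reduction="sum") / total_tokens).backward()
        else:
            # Full batch: plain mean over all tokens.
            x = torch.cat([x for x, _ in seqs])
            y = torch.cat([y for _, y in seqs])
            F.cross_entropy(model(x), y, reduction="mean").backward()
        return model.weight.grad.clone()

    print(torch.allclose(grads(False), grads(True), atol=1e-6))  # True with the fix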