>All the costs and complexity are tied up in authoring good training data, not compute.
No, it's compute. Post-training with human reinforcement is not necessary. Anthropic employs RLAIF and it works just fine. Some don't bother with reinforcement learning at all and just leave it at fine-tuning on instruction-response pairs.
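For context, here's a minimal sketch of what plain supervised fine-tuning on instruction-response pairs looks like; the model name, prompt template, and toy dataset are placeholders, not anyone's actual pipeline:

```python
# Minimal supervised fine-tuning on instruction-response pairs (no RL step).
# The base model, prompt format, and data below are stand-ins for illustration.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for whatever base model you are tuning
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

pairs = [
    {"instruction": "Summarize: The cat sat on the mat.", "response": "A cat sat on a mat."},
    {"instruction": "Translate to French: Hello.", "response": "Bonjour."},
]

def to_batch(examples):
    # Concatenate prompt and response; the model learns next-token prediction over both.
    texts = [f"### Instruction:\n{e['instruction']}\n### Response:\n{e['response']}"
             for e in examples]
    return tokenizer(texts, return_tensors="pt", padding=True,
                     truncation=True, max_length=512)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loader = DataLoader(pairs, batch_size=2, collate_fn=to_batch)

model.train()
for batch in loader:
    # Standard causal LM loss; a real run would also mask padding tokens
    # in the labels (set them to -100) instead of training on them.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```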
The work being done on pre- or post-training data is insignificant in cost by comparison.
You don't need 100k-context instruct-tuning examples. The vast majority of instruct-tuning data comes nowhere near maxing out even a 4k context.
Pre-training runs at 100k context would probably be very helpful, but the thing stopping that from happening, even in domains that regularly match or exceed that context (fiction, law, code, etc.), is the ridiculous compute it would require to train with that much context.
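A rough back-of-the-envelope of why long-context training gets so expensive: self-attention FLOPs grow quadratically with sequence length, so the attention term alone blows up by ~625x going from 4k to 100k tokens. The model dimensions below are hypothetical, and this ignores the linear-in-length MLP/projection terms and implementation tricks like FlashAttention:

```python
# Back-of-the-envelope: how attention cost scales with context length.
# Illustrative only; assumes vanilla self-attention and ignores MLP/projection
# FLOPs (which scale linearly with length) and kernel-level optimizations.

def attention_flops(seq_len: int, d_model: int, n_layers: int) -> float:
    # QK^T and the attention-weighted sum over V each cost ~2 * seq_len^2 * d_model
    # FLOPs per layer, hence the factor of 2 * 2.
    return n_layers * 2 * 2 * seq_len**2 * d_model

d_model, n_layers = 2560, 32          # hypothetical small-model scale
short = attention_flops(4_000, d_model, n_layers)
long = attention_flops(100_000, d_model, n_layers)

print(f"4k-context attention FLOPs per sequence:   {short:.2e}")
print(f"100k-context attention FLOPs per sequence: {long:.2e}")
print(f"ratio: {long / short:.0f}x")  # (100k / 4k)^2 = 625x on the attention term
```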
I don't think it's fair to say compute is the blanket bottleneck. Meta has shown that they have plenty of compute to throw at problems but they're still resorting to generating synthetic data to try and improve the quality of their models, i.e. data is the bottleneck and they can afford to burn compute to get ahead there.
Phi-2 has shown that a very reasonable compute budget can produce a phenomenal model if the data is pristine and well chosen, and the model is trained well.
Yes. But consider this scenario: You get some cloud provider to offer you $90m in credits that are convertible to equity. Then you go to an investor and say, you've raised $90m in cash so far, because the bottleneck is compute and compute is expensive. Then they're like, "Oh well a $10m cash investment is reasonable then. Compute is the bottleneck." You go out into the world and say you have a $100m raise on a $500m post for your idea to train an LLM on some niche.
Never mind that if the cloud provider is willing to forgo $90m in hard cash to give somebody else those GPUs "for free," it must not be that expensive in actuality. I mean, from their point of view, $90m in cloud credits might cost them only $1m to provide in your first year, when you get around to using just a little of them.
I guess I am saying that there are a ton of people in startups right now, in this gold rush, for whom "compute is the bottleneck" is an essential narrative to their survival / reason for investing cash. It's not just the chip vendors and the cloud providers. My scenario illustrates how "compute is the bottleneck" turns bad napkin ideas into $10m slush funds. So 4/5 actors in Elad's chart benefit from this being the case.
It's the foundation-model people, like Meta as you say, who are not bottlenecked by compute in either fiction or reality. I'd argue that not only is hardly anyone bottlenecked by compute, but that the people claiming they are bottlenecked by compute are either not informed enough about their own problems or making up the story that results in the easiest-to-obtain investment dollars. Imagine the counterfactual: Anthropic goes and pitches a synthetic-data thing. The tech community mocks that, even if it makes a ton of sense and is responsible for huge demos like Sora.
I'm not saying data isn't important or even that it's not the most important thing.
I'm making a statement about the current economics. Most of the cost of building an LLM comes from compute. The changes we make to the data, while meaningful, are dwarfed in cost by the compute required to train.
That's not true in the case of the phi-2 model though, and if the approach of phi-2 was scaled up to a class of hyper-efficient LLMs I think it would continue to not be true while also producing SOTA results.
>That's not true in the case of the phi-2 model though
Isn't it? Genuinely, what did phi-2 do exactly that is economically more expensive than compute?
From what I read, phi-2's training data came from:
1. Synthetic Data Generation from more performant models.
phi-2 is essentially not possible without spending the compute on that better model in the first place. And inference is much cheaper than training.
2. Selecting for text data of a certain kind, e.g. textbooks.
Meanwhile, it still took 14 days to train this tiny model on 96 A100 GPUs.
I'm sure the man-hour cost of dataset curation on Phi-2 was higher than its training compute cost. Regarding synthetic data, I think that could technically be classified as an asset that is not consumed in the training process, and thus you could amortize its compute cost over quite a bit (or even monetize it). Given that, even if the compute for synthetic data put the total compute cost over curation costs, I don't think it's on an order that would contradict my point.
> I'm sure the man-hour cost of dataset curation on Phi-2 was higher than its training compute cost.
There's no indication of human curation.
From the first paper
> We annotate the quality of a small subset of these files (about 100k samples) using GPT-4: given a code snippet, the model is prompted to "determine its educational value for a student whose goal is to learn basic coding concepts". We then use this annotated dataset to train a random forest classifier that predicts the quality of a file/sample using its output embedding from a pretrained codegen model as features. We note that unlike GPT-3.5, which we use extensively to generate synthetic content (discussed below), we use GPT-4 minimally only for annotations on the quality of a small subset of The Stack and StackOverflow samples.
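For what it's worth, the pipeline described there (LLM-annotated quality labels feeding a cheap classifier over pretrained-model embeddings) is inexpensive to stand up. Here's a minimal sketch with made-up snippets and labels, and an off-the-shelf embedding model standing in for the codegen embeddings the paper mentions:

```python
# Minimal sketch of "annotate a small subset, then train a cheap classifier
# over embeddings" data filtering, as described in the quote above.
# The embedding model, snippets, and labels are placeholders, not the actual phi pipeline.
from sklearn.ensemble import RandomForestClassifier
from sentence_transformers import SentenceTransformer

# Pretend these labels came from prompting GPT-4 on a small annotated subset.
snippets = [
    "def add(a, b):\n    return a + b  # simple, well-commented example",
    "x=[(lambda q:q*q)(i)for i in range(9)];print(x)  # dense one-liner",
]
gpt4_quality_labels = [1, 0]  # 1 = educational, 0 = not

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model
X = embedder.encode(snippets)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, gpt4_quality_labels)

# The cheap classifier can now score the full corpus without further GPT-4 calls.
new_files = ["for i in range(3):\n    print(i)  # loop basics"]
scores = clf.predict_proba(embedder.encode(new_files))[:, 1]
print(scores)
```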
>Given that, even if the compute for synthetic data put the total compute cost over curation costs I don't think it's on an order that would contradict my point.
Don't worry, that's not even necessary. 96 A100 GPUs for 2 weeks straight is a ~$50k USD endeavor. It'll be somewhat less for the likes of Microsoft, but so will inference on GPT-4 and GPT-3.5.
Even if we forget about compute costs of those more powerful models, inference costs of the generated tokens won't be anywhere near that amount.
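A quick sanity check on that ~$50k figure; the hourly rate below is an assumption on my part, and actual cloud or internal rates vary a lot:

```python
# Back-of-the-envelope check of the "~$50k for 96 A100s over 2 weeks" claim.
# The hourly rate is assumed; real on-demand/reserved/internal pricing differs.
gpus = 96
days = 14
usd_per_gpu_hour = 1.50  # assumed blended A100 rate

gpu_hours = gpus * days * 24
cost = gpu_hours * usd_per_gpu_hour
print(f"{gpu_hours:,} GPU-hours -> ~${cost:,.0f}")
# 32,256 GPU-hours -> ~$48,384, i.e. roughly the ~$50k ballpark.
```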
Compute isn't the majority of costs at all. It's primarily memory capacity and memory bandwidth. Tenstorrent's Grayskull chip has 384 TOPS but only a paltry 8GB of memory capacity with roughly 120GB/s of memory bandwidth. The compute-to-memory ratio is so heavily skewed towards compute it's ridiculous: 3,200 OPS per byte/s of memory bandwidth, 48,000 OPS per byte of memory capacity. How can compute possibly ever be a bottleneck?
The moment you start thinking about connecting these together to form a 100 TB system, you need 12.5k nodes or $10 million. A billion dollar budget means you can afford multiple one petabyte systems. Admittedly, I am ignoring the cost of the data center itself, but given enough money, there is no bottleneck on the compute or memory side at all!
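The arithmetic behind those ratios and the node count, for anyone who wants to check it; the per-board price is inferred from the $10 million figure rather than an official quote:

```python
# Arithmetic behind the Grayskull compute-to-memory ratios and the 100 TB build-out.
# The per-board price is implied by the $10M total, not an official price.
tops = 384e12                  # 384 TOPS
mem_capacity_bytes = 8e9       # 8 GB per board
mem_bandwidth_bps = 120e9      # ~120 GB/s per board

print(tops / mem_bandwidth_bps)   # 3200.0 OPS per byte/s of bandwidth
print(tops / mem_capacity_bytes)  # 48000.0 OPS per byte of capacity

target_capacity = 100e12                       # 100 TB system
nodes = target_capacity / mem_capacity_bytes   # 12,500 boards
price_per_node = 10e6 / nodes                  # ~$800 per board implied
print(nodes, price_per_node)
```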
How exactly are you going to find enough data to feed these monstrous petabyte scale models? You won't! Your models will stay small and become cheap over time!
Memory is part of compute. I have no idea why you've separated them.
>How exactly are you going to find enough data to feed these monstrous petabyte scale models? You won't! Your models will stay small and become cheap over time!
People have this idea that we're stretching the limit of tokens to feed to models. Not even close.
It's just web scraped, which means all the writing that is largely inaccessible on the web (swaths of fiction, textbooks, papers, etc.) and in books is not part of it. It doesn't contain content in some of the most written languages on earth (e.g. Chinese; you can easily get trillions of tokens of Chinese alone).